Calculating percentile rank of sample dataset compared to reference dataset in R
Hi,
sapply(c("iron", "nitrate"), function(x) round(approx(y =
1:nrow(df_ref), x = df_ref[, x], xout = df_sample[, x])$y/10))
should do the trick, with base R's approx() as the workhorse.
You need to replace the /10 with a value corresponding to the length of
your reference dataset (e.g. if there are only 500 rows, divide by 5).
The results differ slightly from Ákos's solution: a value of 0.2651 is
assigned percentile rank 27 instead of 26.
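
A minimal sketch of the same idea with the divisor derived from the reference column instead of hard-coded. The helper name percentile_rank_approx and the seed are assumptions made for illustration, and the reference values are assumed to be distinct and free of NAs. Sample values outside the range of the reference come back as NA, because approx() does not extrapolate by default.

```r
# Example data as in the original question (the seed is an assumption)
set.seed(42)
df_ref <- data.frame(iron = seq(0, 1, length.out = 1000),
                     nitrate = seq(0, 10, length.out = 1000))
df_sample <- data.frame(site = letters[1:10],
                        iron = runif(10, min = 0, max = 1),
                        nitrate = runif(10, min = 0, max = 10))

# Hypothetical helper: interpolate the position of each sample value within
# the sorted reference values, then rescale that position to a 0-100 scale.
percentile_rank_approx <- function(ref, x) {
  ref <- sort(ref)
  position <- approx(x = ref, y = seq_along(ref), xout = x)$y
  round(position / (length(ref) / 100))
}

# One column of percentile ranks per parameter, rows in the order of df_sample
ranks <- sapply(c("iron", "nitrate"), function(v) {
  percentile_rank_approx(df_ref[[v]], df_sample[[v]])
})
ranks
```

With a 1000-row reference the divisor is 10, and with 500 rows it is 5, which matches the manual adjustment described above.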
Cheers!
On Thu, 22 Aug 2019 at 08:29, Glatthorn, Jonas <jglatth at gwdg.de> wrote:
Dear Matt,

I believe the ecdf() function can also do what you are looking for:

ref_ecdf <- sapply(df_ref, FUN = ecdf)

and then apply each function in ref_ecdf to the corresponding column in df_sample, either with a for loop or (my preference) using functionals:

df_sample_rank <- purrr::map2_dfc(ref_ecdf, purrr::map(df_sample[-1], list), do.call)

All the best,
Jonas

-----Original Message-----
From: R-sig-ecology <r-sig-ecology-bounces at r-project.org> On Behalf Of Bede-Fazekas Ákos
Sent: Thursday, 22 August 2019 08:01
To: r-sig-ecology at r-project.org
Subject: Re: [R-sig-eco] FW: Calculating percentile rank of sample dataset compared to reference dataset in R

Dear Matthew,

here is one, maybe not the fastest/shortest, solution:

percentiles <- apply(X = df_ref, MARGIN = 2, FUN = quantile,
    probs = seq(from = 0, to = 1, length.out = 101)[-1])
df_sample$percentile_rank <- vapply(X = colnames(df_sample)[-1],
    FUN.VALUE = numeric(nrow(df_sample)),
    FUN = function(variable_name)
        findInterval(x = df_sample[, variable_name, drop = TRUE],
                     vec = percentiles[, variable_name, drop = TRUE]))

HTH,
Ákos Bede-Fazekas
Hungarian Academy of Sciences

On 2019.08.22 at 0:54, Shank, Matthew wrote:
Hello R-sig-ecology mailing list,

I'm working on a multivariate water quality index where the concentration of parameter i at site j is normalized by calculating the percentile rank of the value using a much larger reference dataset.

As an example, I have generated a sample dataset of water quality parameters (df_sample) and a larger reference dataset (df_ref). I'd like to calculate the percentile rank of each parameter, at each site, using the reference dataset.

Example data is below. If anyone has a solution that avoids for loops, that would be preferred.

# generate sample data
df_sample <- data.frame(site = letters[1:10],
                        iron = runif(10, min = 0, max = 1),
                        nitrate = runif(10, min = 0, max = 10))
df_sample

# generate reference dataset
df_ref <- data.frame(iron = seq(0, 1, length.out = 1000),
                     nitrate = seq(0, 10, length.out = 1000))
df_ref

# now would like to calculate percentile rank of iron and nitrate at
# all sites (a:j) based on identical columns in df_ref and include as
# new columns in df_sample
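
For illustration, a minimal base R sketch of the requested result, using the ecdf() approach suggested in the replies but without purrr. The seed, the "_pct" column suffix, and the names params, ranks and df_result are assumptions, and the value reported is the percentage of reference values less than or equal to each sample value, which is only one of several possible definitions of percentile rank.

```r
# Example data as in the question (the seed is an assumption)
set.seed(1)
df_sample <- data.frame(site = letters[1:10],
                        iron = runif(10, min = 0, max = 1),
                        nitrate = runif(10, min = 0, max = 10))
df_ref <- data.frame(iron = seq(0, 1, length.out = 1000),
                     nitrate = seq(0, 10, length.out = 1000))

# Parameters to rank: every column of df_sample except the site label
params <- setdiff(names(df_sample), "site")

# For each parameter, build the empirical CDF of the reference column and
# evaluate it at the sample values; multiply by 100 to get a percentage.
ranks <- mapply(function(ref, x) 100 * ecdf(ref)(x),
                df_ref[params], df_sample[params])
colnames(ranks) <- paste0(params, "_pct")

# Attach the percentile ranks as new columns, one per parameter
df_result <- cbind(df_sample, ranks)
df_result
```

No explicit for loop is needed, since mapply() pairs each reference column with the sample column of the same name.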
Many thanks,
Matt
_______________________________________________ R-sig-ecology mailing list R-sig-ecology at r-project.org https://stat.ethz.ch/mailman/listinfo/r-sig-ecology