Calculating percentile rank of sample dataset compared to reference dataset in R

Thu, Aug 22, 2019 12:23 AM

Hi,

sapply(c("iron", "nitrate"), function(x) round(approx(y =
1:nrow(df_ref), x = df_ref[, x], xout = df_sample[, x])$y/10))

should do the trick, with base R's approx() as the workhorse.

You need to replace the /10 with a value corresponding to the number of
rows in your reference dataset (e.g. if there are only 500 rows, divide
by 5).
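
To avoid hard-coding the divisor, it can be derived from the reference data. The following is only a sketch, assuming the df_ref and df_sample objects from the original question; the set.seed() call and the name divisor are additions for illustration. Each reference column is sorted first so that the row position really is the rank, and tied reference values are assumed not to occur.

```r
# Example data as in the original question (seed added for reproducibility)
set.seed(1)
df_sample <- data.frame(site = letters[1:10],
                        iron = runif(10, min = 0, max = 1),
                        nitrate = runif(10, min = 0, max = 10))
df_ref <- data.frame(iron = seq(0, 1, length.out = 1000),
                     nitrate = seq(0, 10, length.out = 1000))

# Rows per percentile: 10 for 1000 reference rows, 5 for 500 rows, etc.
divisor <- nrow(df_ref) / 100

# Interpolate each sample value's position among the sorted reference
# values, then rescale that position to a 0..100 percentile rank.
ranks <- sapply(c("iron", "nitrate"), function(x)
  round(approx(x = sort(df_ref[, x]), y = seq_len(nrow(df_ref)),
               xout = df_sample[, x])$y / divisor))
ranks
```

Sample values outside the range of the reference column would come back as NA with approx()'s default settings.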

The results differ slightly from Ákos's solution: a value of 0.2651 is
assigned percentile rank 27 instead of 26.
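
The difference can be checked for that single value. This is a minimal sketch, assuming the iron column of df_ref from the original question; the variable names are illustrative. The interpolation approach rounds to the nearest percentile, whereas findInterval() counts how many percentile breakpoints lie at or below the value, which effectively rounds down.

```r
# Reference column as in the original question
df_ref <- data.frame(iron = seq(0, 1, length.out = 1000))
value <- 0.2651

# Interpolation approach: position in the reference, rescaled, then rounded
pos <- approx(x = df_ref$iron, y = seq_len(nrow(df_ref)), xout = value)$y
rank_approx <- round(pos / (nrow(df_ref) / 100))

# Quantile approach: number of 1st..100th percentiles at or below the value
percentiles <- quantile(df_ref$iron,
                        probs = seq(from = 0, to = 1, length.out = 101)[-1])
rank_interval <- findInterval(x = value, vec = percentiles)

c(approx = rank_approx, findInterval = rank_interval)
```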


Cheers!

On Thu, 22 Aug 2019 at 08:29, Glatthorn, Jonas <jglatth at gwdg.de> wrote:

Dear Matt,

I believe the ecdf() function can also do what you are looking for:

ref_ecdf <- sapply(df_ref, FUN = ecdf)

and then apply each function in ref_ecdf to the corresponding column in
df_sample, either with a for loop or (my preference) using functionals:

df_sample_rank <- purrr::map2_dfc(ref_ecdf, purrr::map(df_sample[-1],
list), do.call)
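
For readers without purrr, the same idea can be sketched in base R with Map(). This assumes the df_ref and df_sample objects from the original question; the names vars, rank_cols and the set.seed() call are additions for illustration. ecdf() returns proportions between 0 and 1, so the sketch multiplies by 100 to obtain percentile ranks.

```r
# Example data as in the original question (seed added for reproducibility)
set.seed(1)
df_sample <- data.frame(site = letters[1:10],
                        iron = runif(10, min = 0, max = 1),
                        nitrate = runif(10, min = 0, max = 10))
df_ref <- data.frame(iron = seq(0, 1, length.out = 1000),
                     nitrate = seq(0, 10, length.out = 1000))

# One empirical cumulative distribution function per reference column
ref_ecdf <- lapply(df_ref, ecdf)
vars <- names(ref_ecdf)

# Apply each ECDF to the matching sample column (matched by name)
rank_cols <- Map(function(f, x) 100 * f(x), ref_ecdf, df_sample[vars])

# Keep the site labels next to the percentile ranks
df_sample_rank <- cbind(df_sample["site"], as.data.frame(rank_cols))
df_sample_rank
```

Selecting the sample columns by name rather than by position guards against the two data frames listing their parameters in a different order.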

all the best

Jonas

-----Original Message-----
From: R-sig-ecology <r-sig-ecology-bounces at r-project.org> On Behalf Of
Bede-Fazekas Ákos
Sent: Thursday, 22 August 2019 08:01
To: r-sig-ecology at r-project.org
Subject: Re: [R-sig-eco] FW: Calculating percentile rank of sample dataset
compared to reference dataset in R

Dear Matthew,

here is one solution, maybe not the fastest/shortest:

percentiles <- apply(X = df_ref, MARGIN = 2, FUN = quantile,
    probs = seq(from = 0, to = 1, length.out = 101)[-1])
df_sample$percentile_rank <- vapply(X = colnames(df_sample)[-1],
    FUN.VALUE = numeric(nrow(df_sample)),
    FUN = function(variable_name) findInterval(
        x = df_sample[, variable_name, drop = TRUE],
        vec = percentiles[, variable_name, drop = TRUE]))
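
Note that vapply() returns a matrix here, so percentile_rank is stored as a single matrix-valued column. If one ordinary column per parameter is preferred, a variant could look like the sketch below. It assumes the df_ref and df_sample objects from the original question; the "_percentile_rank" suffix, the name vars and the set.seed() call are made up for illustration.

```r
# Example data as in the original question (seed added for reproducibility)
set.seed(1)
df_sample <- data.frame(site = letters[1:10],
                        iron = runif(10, min = 0, max = 1),
                        nitrate = runif(10, min = 0, max = 10))
df_ref <- data.frame(iron = seq(0, 1, length.out = 1000),
                     nitrate = seq(0, 10, length.out = 1000))

# Parameters to rank: every sample column except the site label
vars <- setdiff(names(df_sample), "site")

# 1st to 100th percentile of each reference column (100 x 2 matrix)
percentiles <- sapply(df_ref[vars], quantile, probs = (1:100) / 100)

# For each parameter, count the percentiles at or below each sample value
ranks <- sapply(vars, function(v)
  findInterval(x = df_sample[[v]], vec = percentiles[, v]))

# Append the ranks as separate, clearly named columns
colnames(ranks) <- paste0(vars, "_percentile_rank")
df_sample <- cbind(df_sample, ranks)
df_sample
```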

HTH,
Ákos Bede-Fazekas
Hungarian Academy of Sciences

On 2019.08.22 at 0:54, Shank, Matthew wrote:

Hello R-sig-ecology mailing list,

I'm working on a multivariate water quality index where the
concentration of parameter i at site j is normalized by calculating the
percentile rank of the value using a much larger reference dataset.

As an example, I have generated a sample dataset of water quality
parameters (df_sample) and a larger reference dataset (df_ref). I'd like to
calculate the percentile rank of each parameter, at each site, using a
reference dataset of a much larger size.

Example data is below. If anyone has a solution that avoids for loops,
that would be preferred.

# generate sample data
df_sample <- data.frame(site = letters[1:10], iron = runif(10, min = 0,
max = 1), nitrate = runif(10, min = 0, max = 10))
df_sample

# generate reference dataset
df_ref <- data.frame(iron = seq(0, 1, length.out = 1000), nitrate =
seq(0, 10, length.out = 1000))
df_ref

# now would like to calculate percentile rank of iron and nitrate at
# all sites (a:j) based on identical columns in df_ref and include as a
# new column in df_sample



Many thanks,
Matt



_______________________________________________
R-sig-ecology mailing list
R-sig-ecology at r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-ecology