I have a fairly simple problem--I have about 80,000 values (call them y) that I am using as an empirical distribution and I want to find the p-value (never mind the multiple testing issues here, for the time being) of 130,000 points (call them x) from the empirical distribution. I typically do that (for one-sided test) something like loop over i in x p.val[i] = sum(y>x[i])/length(y) and repeat for all i. However, length(x) is large here as is length(y), so this process takes quite a long time. Any suggestions? Thanks, Sean
Rank-based p-value on large dataset
6 messages · Deepayan Sarkar, Adaikalavan Ramasamy, Sean Davis
On Thursday 03 March 2005 16:22, Sean Davis wrote:
I have a fairly simple problem--I have about 80,000 values (call them y) that I am using as an empirical distribution and I want to find the p-value (never mind the multiple testing issues here, for the time being) of 130,000 points (call them x) from the empirical distribution. I typically do that (for one-sided test) something like loop over i in x p.val[i] = sum(y>x[i])/length(y) and repeat for all i. However, length(x) is large here as is length(y), so this process takes quite a long time. Any suggestions?
The obvious thing to do would be p.val = 1 - ecdf(x)(y) wouldn't it? On a 1.1 GHz Athlon, I get
x <- rnorm(130000) y <- rnorm(80000) system.time(p.val <- 1 - ecdf(y)(x))
[1] 1.03 0.03 1.06 0.00 0.00 -Deepayan
On Thursday 03 March 2005 16:32, Deepayan Sarkar wrote:
On Thursday 03 March 2005 16:22, Sean Davis wrote:
I have a fairly simple problem--I have about 80,000 values (call them y) that I am using as an empirical distribution and I want to find the p-value (never mind the multiple testing issues here, for the time being) of 130,000 points (call them x) from the empirical distribution. I typically do that (for one-sided test) something like loop over i in x p.val[i] = sum(y>x[i])/length(y) and repeat for all i. However, length(x) is large here as is length(y), so this process takes quite a long time. Any suggestions?
The obvious thing to do would be p.val = 1 - ecdf(x)(y)
or rather: p.val = 1 - ecdf(y)(x)
wouldn't it? On a 1.1 GHz Athlon, I get
x <- rnorm(130000) y <- rnorm(80000) system.time(p.val <- 1 - ecdf(y)(x))
[1] 1.03 0.03 1.06 0.00 0.00
One solution is to cut() 'x' according to the breaks defined by 'y'.
Using cut with labels=FALSE is really fast. See a simulation below.
However the accuracy depends on the number of ties you have in your
"empirical" distribution. I have tried to simulate with the round()
function below.
# simulate data
y <- round(rnorm(80000), 5) # empirical distribution with some ties
x <- round(rnorm(130000), 5) # observed values with some ties
# current way
system.time({
p1 <- numeric( length(x) )
for(i in 1:length(x)){ p1[i] <- sum( x[i] < y )/length(y) }
})
[1] 484.67 25.08 512.04 0.00 0.00
# suggested solution
system.time({
break.points <- c(-Inf, unique(sort(y)), Inf)
p2 <- cut( x, breaks=break.points, labels=FALSE )
p2 <- 1 - p2/length(break.points)
})
[1] 0.27 0.01 0.28 0.00 0.00
head( cbind( p1, p2 ) )
p1 p2
[1,] 0.658375 0.65813482
[2,] 0.144000 0.14533705
[3,] 0.815500 0.81436898
[4,] 0.412500 0.41269640
[5,] 0.553625 0.55334516
[6,] 0.044500 0.04510897
...
cor(p1, p2)
[1] 0.9999987
The difference arises mainly because I had to use a unique breakpoints
in cut(). You may want to investigate further if you are interested.
Please let me know if you find anything good or bad about this
suggestion as I am using it as part of my codes too. Thank you.
Regards, Adai
On Thu, 2005-03-03 at 17:22 -0500, Sean Davis wrote:
I have a fairly simple problem--I have about 80,000 values (call them y) that I am using as an empirical distribution and I want to find the p-value (never mind the multiple testing issues here, for the time being) of 130,000 points (call them x) from the empirical distribution. I typically do that (for one-sided test) something like loop over i in x p.val[i] = sum(y>x[i])/length(y) and repeat for all i. However, length(x) is large here as is length(y), so this process takes quite a long time. Any suggestions? Thanks, Sean
______________________________________________ R-help at stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
On 3/3/05 19:04, "Adaikalavan Ramasamy" <ramasamy at cancer.org.uk> wrote:
One solution is to cut() 'x' according to the breaks defined by 'y'.
Using cut with labels=FALSE is really fast. See a simulation below.
However the accuracy depends on the number of ties you have in your
"empirical" distribution. I have tried to simulate with the round()
function below.
# simulate data
y <- round(rnorm(80000), 5) # empirical distribution with some ties
x <- round(rnorm(130000), 5) # observed values with some ties
# current way
system.time({
p1 <- numeric( length(x) )
for(i in 1:length(x)){ p1[i] <- sum( x[i] < y )/length(y) }
})
[1] 484.67 25.08 512.04 0.00 0.00
# suggested solution
system.time({
break.points <- c(-Inf, unique(sort(y)), Inf)
p2 <- cut( x, breaks=break.points, labels=FALSE )
p2 <- 1 - p2/length(break.points)
})
[1] 0.27 0.01 0.28 0.00 0.00
head( cbind( p1, p2 ) )
p1 p2
[1,] 0.658375 0.65813482
[2,] 0.144000 0.14533705
[3,] 0.815500 0.81436898
[4,] 0.412500 0.41269640
[5,] 0.553625 0.55334516
[6,] 0.044500 0.04510897
...
cor(p1, p2)
[1] 0.9999987
The difference arises mainly because I had to use a unique breakpoints
in cut(). You may want to investigate further if you are interested.
Please let me know if you find anything good or bad about this
suggestion as I am using it as part of my codes too. Thank you.
Regards, Adai
Adai, All I have to say is, "WOW!". I'll give it a try tomorrow and let you know. Thanks, Sean
On 3/3/05 17:40, "Deepayan Sarkar" <deepayan at stat.wisc.edu> wrote:
On Thursday 03 March 2005 16:32, Deepayan Sarkar wrote:
On Thursday 03 March 2005 16:22, Sean Davis wrote:
I have a fairly simple problem--I have about 80,000 values (call them y) that I am using as an empirical distribution and I want to find the p-value (never mind the multiple testing issues here, for the time being) of 130,000 points (call them x) from the empirical distribution. I typically do that (for one-sided test) something like loop over i in x p.val[i] = sum(y>x[i])/length(y) and repeat for all i. However, length(x) is large here as is length(y), so this process takes quite a long time. Any suggestions?
The obvious thing to do would be p.val = 1 - ecdf(x)(y)
or rather: p.val = 1 - ecdf(y)(x)
Deepayan, Thanks (and to Martin, also). This works wonderfully. I didn't expect such a function to exist, but knowing of it will simplify matters significantly for me. Sean