
Rank-based p-value on large dataset

One solution is to cut() 'x' at the breakpoints defined by 'y'.
Using cut() with labels=FALSE is very fast; see the simulation below.

However, the accuracy depends on the number of ties in your
"empirical" distribution. I have simulated ties with the round()
function below.

# simulate data
y <- round(rnorm(80000),  5) # empirical distribution with some ties
x <- round(rnorm(130000), 5) # observed values with some ties

# current way (slow: one full pass over y for every element of x)
system.time({
  p1 <- numeric( length(x) )
  for(i in seq_along(x)){ p1[i] <- sum( x[i] < y )/length(y) }
})
[1] 484.67  25.08 512.04   0.00   0.00

# suggested solution
system.time({
  break.points <- c(-Inf, unique(sort(y)), Inf)
  p2 <- cut( x, breaks=break.points, labels=FALSE )
  p2 <- 1 - p2/length(break.points)
})
[1] 0.27 0.01 0.28 0.00 0.00
 

head( cbind( p1, p2 ) )
           p1         p2
[1,] 0.658375 0.65813482
[2,] 0.144000 0.14533705
[3,] 0.815500 0.81436898
[4,] 0.412500 0.41269640
[5,] 0.553625 0.55334516
[6,] 0.044500 0.04510897
...

cor(p1, p2)
[1] 0.9999987


The difference arises mainly because I had to use unique breakpoints
in cut(). You may want to investigate further if you are interested.
Please let me know if you find anything good or bad about this
suggestion, as I am using it in my own code too. Thank you.
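As an aside, a sketch of a tie-exact alternative: findInterval() on the
sorted (not unique'd) 'y' counts, for each x[i], how many y values are
<= x[i], which reproduces the slow loop exactly while staying vectorized.
This is my own suggestion, not tested beyond the toy check below.

```r
# simulate data as before (assumed setup, any seed will do)
set.seed(1)
y <- round(rnorm(80000),  5)  # empirical distribution with ties
x <- round(rnorm(130000), 5)  # observed values with ties

# findInterval(x, sort(y)) = number of y values <= each x[i],
# so 1 - count/length(y) = proportion of y strictly greater than x[i],
# which is exactly p1[i] <- sum( x[i] < y )/length(y)
p3 <- 1 - findInterval(x, sort(y)) / length(y)
```

Unlike the cut() version, this needs no unique() step, so the ties in
'y' are counted rather than collapsed.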

Regards, Adai
On Thu, 2005-03-03 at 17:22 -0500, Sean Davis wrote: