
Find number of elements less than some number: Elegant/fast solution needed

9 messages · Kevin Ummel, Richard M. Heiberger, Marc Schwartz +5 more

#
On Apr 14, 2011, at 2:34 PM, Kevin Ummel wrote:

I started working on a solution to your problem above and then noted the one below.

Here is one approach to the above:
[1] 0 0 0 0 1 2 2 3 3 4
Something like the following might work, if I correctly understand the problem:
[1]  1 NA  2  3 NA NA  4  5 NA  6
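Only the printed results of Marc's two snippets survive above; here is a hypothetical reconstruction that reproduces both outputs (the inputs x, y, and z are guesses for illustration, not Marc's originals):

```r
## Hypothetical inputs -- Marc's actual data did not survive the archive.
x <- 1:10
y <- c(4, 5, 7, 8)

## First problem: count the elements of y strictly less than each x.
sapply(x, function(i) sum(y < i))
#> [1] 0 0 0 0 1 2 2 3 3 4

## Second problem: number the non-NA entries of a vector, keeping NAs in place.
z <- c(10, NA, 20, 30, NA, NA, 40, 50, NA, 60)
ifelse(is.na(z), NA, cumsum(!is.na(z)))
#> [1]  1 NA  2  3 NA NA  4  5 NA  6
```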


HTH,

Marc Schwartz
#
This might be a bit quicker with larger vectors:

f <- function(x, y) sum(x > y)
vf <- Vectorize(f, "x")
vf(x, y)
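A quick check on toy data (values assumed for illustration). Note that Vectorize() is a wrapper around mapply(), so this is still an all-pairs comparison, not an algorithmic speedup:

```r
f <- function(x, y) sum(x > y)   # for scalar x: how many y fall below it
vf <- Vectorize(f, "x")          # vectorize over x only; y is passed whole

x <- c(3, 7, 1)                  # assumed toy data
y <- c(2, 2, 5, 8)
vf(x, y)
#> [1] 2 3 0
```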
On Thu, Apr 14, 2011 at 5:37 PM, Marc Schwartz <marc_schwartz at me.com> wrote:
#
If y is sorted, findInterval(x, y) would do it for <=, but you
want strict <.  Try
  f2 <- function(x, y) length(y) - findInterval(-x, rev(-sort(y)))
where your version is
  f0 <- function(x, y) sapply(x, FUN = function(i) length(which(y < i)))
E.g.,
  > x <- sort(sample(1e6, size=1e5))
  > y <- sample(1e6, size=1e4, replace=TRUE)
  > system.time(r0 <- f0(x,y))
     user  system elapsed
    7.900   0.310   8.211
  > system.time(r2 <- f2(x,y))
     user  system elapsed
    0.000   0.000   0.004
  > identical(r0, r2)
  [1] TRUE
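The negation trick is what turns findInterval()'s "<=" counting into strict "<": findInterval(-x, sort(-y)) counts how many -y are <= -x, i.e. how many y are >= x, and subtracting that from length(y) leaves the count of y strictly below each x. A worked check on small data with ties (values assumed for illustration):

```r
f2 <- function(x, y) length(y) - findInterval(-x, rev(-sort(y)))
f0 <- function(x, y) sapply(x, function(i) sum(y < i))

x <- c(2, 5, 5, 9)    # assumed toy data, with ties against y
y <- c(1, 5, 5, 7)

f2(x, y)              # the ties at 5 are correctly excluded
#> [1] 1 1 1 4
identical(f2(x, y), f0(x, y))
#> [1] TRUE
```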

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com
#
My colleague Sunduz Keles once mentioned a similar problem to me.  She
had a large sample from a reference distribution and a test sample
(both real-valued in her case) and she wanted, for each element of the
test sample, the proportion of the reference sample that was less than
the element.  It's a type of empirical p-value calculation.

I forget the exact sizes of the samples (do you remember, Sunduz?) but
they were in the tens of thousands or larger.  Solutions in R tended
to involve comparing every element in the test sample to every element
in the reference sample but, of course, that is unnecessary.  If you
sort both samples then you can start the comparisons for a particular
element of the test sample at the element of the reference sample
where the last comparison failed.
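The sorted-sweep idea can be sketched in a few lines of plain R (an illustration of the algorithm only; the Rcpp version mentioned below was not preserved in this thread):

```r
## Sort both samples, then advance a single pointer through the
## reference sample instead of restarting the scan for every test
## value: O(n log n) for the sorts plus one linear sweep.
count_less <- function(test, ref) {
  ts <- sort(test)
  rs <- sort(ref)
  counts <- integer(length(ts))
  j <- 0L
  for (i in seq_along(ts)) {
    while (j < length(rs) && rs[j + 1L] < ts[i]) j <- j + 1L
    counts[i] <- j
  }
  counts[order(order(test))]   # restore the original order of `test`
}

count_less(c(5, 1, 7), c(2, 4, 4, 6))
#> [1] 3 0 4
```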

I was able to write a very short C++ function using the Rcpp package
that provided about a 1000-fold increase in speed relative to the best
I could do in R.  I don't have the script on this computer so I will
post it tomorrow when I am back on the computer at the office.

Apologies for cross-posting to the Rcpp-devel list but I am doing so
because this might make a good example of the usefulness of Rcpp and
inline.
On Thu, Apr 14, 2011 at 4:45 PM, William Dunlap <wdunlap at tibco.com> wrote:
#
Bill's code is insanely fast!

f2 <- function(x, y) length(y) - findInterval(-x, rev(-sort(y)))

n1 <- 1e7
n2 <- 10^(1:7)
tt <- rep(NA, 7)
x <- rnorm(n1)
for (i in seq_along(n2)) {
  y <- runif(n2[i])
  tt[i] <- system.time(a1 <- f2(x, y))[3]
}
tt
[1]  0.70  0.86  1.03  1.28  1.54  4.99 12.07
I would be surprised if this can be beaten even in C/C++.

Ravi.
#
On Thu, Apr 14, 2011 at 8:19 PM, Ravi Varadhan <rvaradhan at jhmi.edu> wrote:
Have you looked at the definition of the findInterval function,
especially the part where it calls the C function?

I still think that it can be done faster if the x vector is sorted in
addition to the y vector, although admittedly the speed gain will not
be as much.  And, of course, Bill's solution depends on knowing R or
S-PLUS well enough to remember that there is a findInterval function.
#
Ravi:

Well, that's a challenge!

Here is a solution which is faster than Bill's as long as length(y) << length(x).

f3 <- function(x, y) length(y)*ecdf(y)(x)

Then using your test on my workstation, I get for f2():
[1]  0.761  1.071  1.329  1.565  1.894  3.865 11.824

and for f3():
[1]  0.717  0.882  1.010  1.161  1.389  3.080 15.530

I suspect ecdf() is a little more intuitive than findInterval().
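One caveat worth noting: ecdf(y)(x) is the proportion of y that are <= x, so f3() counts y <= x while f2() counts y < x. The two agree on the continuous benchmark data above because ties occur with probability zero, but they diverge when values are shared (example data assumed):

```r
f2 <- function(x, y) length(y) - findInterval(-x, rev(-sort(y)))
f3 <- function(x, y) length(y) * ecdf(y)(x)

set.seed(42)                            # assumed seed, for reproducibility
x <- rnorm(1000); y <- runif(500)
isTRUE(all.equal(f2(x, y), f3(x, y)))   # no ties, so < and <= coincide
#> [1] TRUE

## With a tie the two differ:
f2(2, c(1, 2, 3))   # counts y <  2
#> [1] 1
f3(2, c(1, 2, 3))   # counts y <= 2
#> [1] 2
```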

Hope this helps,
Ray Brownrigg
On Fri, 15 Apr 2011, Ravi Varadhan wrote: