Find number of elements less than some number: Elegant/fast solution needed

An embedded and charset-unspecified text was scrubbed...
Name: not available
URL: <https://stat.ethz.ch/pipermail/r-help/attachments/20110414/f776b835/attachment.pl>
An embedded and charset-unspecified text was scrubbed...
Name: not available
URL: <https://stat.ethz.ch/pipermail/r-help/attachments/20110414/5468a5dd/attachment.pl>

Take vector x and a subset y:

x=1:10

y=c(4,5,7,9)

For each value in 'x', I want to know how many elements in 'y' are less than 'x'.

An example would be:

sapply(x,FUN=function(i) {length(which(y<i))})
[1] 0 0 0 0 1 2 2 3 3 4

But this solution is far too slow when x and y have lengths in the millions.

I'm certain an elegant (and computationally efficient) solution exists, but I'm in the weeds at this point.

Any help is much appreciated.

Kevin

University of Manchester

I started working on a solution to your problem above and then noted the one below.

Here is one approach to the above:
colSums(outer(y, x, "<"))
[1] 0 0 0 0 1 2 2 3 3 4
Take two vectors x and y, where y is a subset of x:

x=1:10

y=c(2,5,6,9)

If y is removed from x, the original x values now have a new placement (index) in the resulting vector (new): 

new=x[-y]

index=1:length(new)

The challenge is: How can I *quickly* and *efficiently* deduce the new 'index' value directly from the original 'x' value -- using only 'y' as an input?

In practice, I have very large matrices containing the 'x' values, and I need to convert them to the corresponding 'index' if the 'y' values are removed.
Something like the following might work, if I correctly understand the problem:
match(x, x[-y])
[1]  1 NA  2  3 NA NA  4  5 NA  6

HTH,

Marc Schwartz
This might be a bit quicker with larger vectors:

f <- function(x, y) sum(x > y)
vf <- Vectorize(f, "x")
vf(x, y)
On Apr 14, 2011, at 2:34 PM, Kevin Ummel wrote:

Take vector x and a subset y:

x=1:10

y=c(4,5,7,9)

For each value in 'x', I want to know how many elements in 'y' are less than 'x'.

An example would be:

sapply(x,FUN=function(i) {length(which(y<i))})
[1] 0 0 0 0 1 2 2 3 3 4

But this solution is far too slow when x and y have lengths in the millions.

I'm certain an elegant (and computationally efficient) solution exists, but I'm in the weeds at this point.

Any help is much appreciated.

Kevin

University of Manchester

I started working on a solution to your problem above and then noted the one below.

Here is one approach to the above:

colSums(outer(y, x, "<"))
?[1] 0 0 0 0 1 2 2 3 3 4

Take two vectors x and y, where y is a subset of x:

x=1:10

y=c(2,5,6,9)

If y is removed from x, the original x values now have a new placement (index) in the resulting vector (new):

new=x[-y]

index=1:length(new)

The challenge is: How can I *quickly* and *efficiently* deduce the new 'index' value directly from the original 'x' value -- using only 'y' as an input?

In practice, I have very large matrices containing the 'x' values, and I need to convert them to the corresponding 'index' if the 'y' values are removed.

Something like the following might work, if I correctly understand the problem:

match(x, x[-y])
?[1] ?1 NA ?2 ?3 NA NA ?4 ?5 NA ?6

HTH,

Marc Schwartz

______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

-----Original Message-----
From: r-help-bounces at r-project.org 
[mailto:r-help-bounces at r-project.org] On Behalf Of Kevin Ummel
Sent: Thursday, April 14, 2011 12:35 PM
To: r-help at r-project.org
Subject: [R] Find number of elements less than some number: 
Elegant/fastsolution needed

Take vector x and a subset y:

x=1:10

y=c(4,5,7,9)

For each value in 'x', I want to know how many elements in 
'y' are less than 'x'.

An example would be:

sapply(x,FUN=function(i) {length(which(y<i))})
 [1] 0 0 0 0 1 2 2 3 3 4

But this solution is far too slow when x and y have lengths 
in the millions.

I'm certain an elegant (and computationally efficient) 
solution exists, but I'm in the weeds at this point.
If x is sorted findInterval(x, y) would do it for <= but you
want strict <.  Try
  f2 <- function(x, y) length(y) - findInterval(-x, rev(-sort(y)))
where your version is
  f0 <- function(x, y) sapply(x,FUN=function(i) {length(which(y<i))})
E.g.,
  > x <- sort(sample(1e6, size=1e5))
  > y <- sample(1e6, size=1e4, replace=TRUE)
  > system.time(r0 <- f0(x,y))
     user  system elapsed
    7.900   0.310   8.211
  > system.time(r2 <- f2(x,y))
     user  system elapsed
    0.000   0.000   0.004
  > identical(r0, r2)
  [1] TRUE

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com
Any help is much appreciated.

Kevin

University of Manchester

Take two vectors x and y, where y is a subset of x:

x=1:10

y=c(2,5,6,9)

If y is removed from x, the original x values now have a new 
placement (index) in the resulting vector (new): 

new=x[-y]

index=1:length(new)

The challenge is: How can I *quickly* and *efficiently* 
deduce the new 'index' value directly from the original 'x' 
value -- using only 'y' as an input?

In practice, I have very large matrices containing the 'x' 
values, and I need to convert them to the corresponding 
'index' if the 'y' values are removed.

	[[alternative HTML version deleted]]

______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide 
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

My colleague Sunduz Keles once mentioned a similar problem to me.  She
had a large sample from a reference distribution and a test sample
(both real-valued in her case) and she wanted, for each element of the
test sample, the proportion of the reference sample that was less than
the element.  It's a type of empirical p-value calculation.

I forget the exact sizes of the samples (do you remember, Sunduz?) but
they were in the tens of thousands or larger.  Solutions in R tended
to involve comparing every element in the test sample to every element
in the reference sample but, of course, that is unnecessary.  If you
sort both samples then you can start the comparisons for a particular
element of the test sample at the element of the reference sample
where the last comparison failed.

I was able to write a very short C++ function using the Rcpp package
that provided about a 1000-fold increase in speed relative to the best
I could do in R.  I don't have the script on this computer so I will
post it tomorrow when I am back on the computer at the office.

Apologies for cross-posting to the Rcpp-devel list but I am doing so
because this might make a good example of the usefulness of Rcpp and
inline.
-----Original Message-----
From: r-help-bounces at r-project.org
[mailto:r-help-bounces at r-project.org] On Behalf Of Kevin Ummel
Sent: Thursday, April 14, 2011 12:35 PM
To: r-help at r-project.org
Subject: [R] Find number of elements less than some number:
Elegant/fastsolution needed

Take vector x and a subset y:

x=1:10

y=c(4,5,7,9)

For each value in 'x', I want to know how many elements in
'y' are less than 'x'.

An example would be:

sapply(x,FUN=function(i) {length(which(y<i))})
?[1] 0 0 0 0 1 2 2 3 3 4

But this solution is far too slow when x and y have lengths
in the millions.

I'm certain an elegant (and computationally efficient)
solution exists, but I'm in the weeds at this point.
If x is sorted findInterval(x, y) would do it for <= but you
want strict <. ?Try
?f2 <- function(x, y) length(y) - findInterval(-x, rev(-sort(y)))
where your version is
?f0 <- function(x, y) sapply(x,FUN=function(i) {length(which(y<i))})
E.g.,
?> x <- sort(sample(1e6, size=1e5))
?> y <- sample(1e6, size=1e4, replace=TRUE)
?> system.time(r0 <- f0(x,y))
? ? user ?system elapsed
? ?7.900 ? 0.310 ? 8.211
?> system.time(r2 <- f2(x,y))
? ? user ?system elapsed
? ?0.000 ? 0.000 ? 0.004
?> identical(r0, r2)
?[1] TRUE

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com

Any help is much appreciated.

Kevin

University of Manchester

Take two vectors x and y, where y is a subset of x:

x=1:10

y=c(2,5,6,9)

If y is removed from x, the original x values now have a new
placement (index) in the resulting vector (new):

new=x[-y]

index=1:length(new)

The challenge is: How can I *quickly* and *efficiently*
deduce the new 'index' value directly from the original 'x'
value -- using only 'y' as an input?

In practice, I have very large matrices containing the 'x'
values, and I need to convert them to the corresponding
'index' if the 'y' values are removed.

? ? ? [[alternative HTML version deleted]]

______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Bill's code is insanely fast!

f2 <- function(x, y) length(y) - findInterval(-x, rev(-sort(y)))

 n1 <- 1e07
n2 <- 10^c(1,2,3,4,5,6,7)
tt <- rep(NA, 7)
x <- rnorm(n1)
for (i in 1:length(n2)){
y <- runif(n2[i])
tt[i] <- system.time(a1 <- f2(x, y))[3]
}
tt
[1]  0.70  0.86  1.03  1.28  1.54  4.99 12.07

I would be surprised if this can be beaten even in C/C++.

Ravi.
My colleague Sunduz Keles once mentioned a similar problem to me.  She
had a large sample from a reference distribution and a test sample
(both real-valued in her case) and she wanted, for each element of the
test sample, the proportion of the reference sample that was less than
the element.  It's a type of empirical p-value calculation.

I forget the exact sizes of the samples (do you remember, Sunduz?) but
they were in the tens of thousands or larger.  Solutions in R tended
to involve comparing every element in the test sample to every element
in the reference sample but, of course, that is unnecessary.  If you
sort both samples then you can start the comparisons for a particular
element of the test sample at the element of the reference sample
where the last comparison failed.

I was able to write a very short C++ function using the Rcpp package
that provided about a 1000-fold increase in speed relative to the best
I could do in R.  I don't have the script on this computer so I will
post it tomorrow when I am back on the computer at the office.

Apologies for cross-posting to the Rcpp-devel list but I am doing so
because this might make a good example of the usefulness of Rcpp and
inline.

On Thu, Apr 14, 2011 at 4:45 PM, William Dunlap <wdunlap at tibco.com> wrote:
>> -----Original Message-----
>> From: r-help-bounces at r-project.org
>> [mailto:r-help-bounces at r-project.org] On Behalf Of Kevin Ummel
>> Sent: Thursday, April 14, 2011 12:35 PM
>> To: r-help at r-project.org
>> Subject: [R] Find number of elements less than some number:
>> Elegant/fastsolution needed
>>
>> Take vector x and a subset y:
>>
>> x=1:10
>>
>> y=c(4,5,7,9)
>>
>> For each value in 'x', I want to know how many elements in
>> 'y' are less than 'x'.
>>
>> An example would be:
>>
>> sapply(x,FUN=function(i) {length(which(y<i))})
>>  [1] 0 0 0 0 1 2 2 3 3 4
>>
>> But this solution is far too slow when x and y have lengths
>> in the millions.
>>
>> I'm certain an elegant (and computationally efficient)
>> solution exists, but I'm in the weeds at this point.
>
> If x is sorted findInterval(x, y) would do it for <= but you
> want strict <.  Try
>  f2 <- function(x, y) length(y) - findInterval(-x, rev(-sort(y)))
> where your version is
>  f0 <- function(x, y) sapply(x,FUN=function(i) {length(which(y<i))})
> E.g.,
>  > x <- sort(sample(1e6, size=1e5))
>  > y <- sample(1e6, size=1e4, replace=TRUE)
>  > system.time(r0 <- f0(x,y))
>     user  system elapsed
>    7.900   0.310   8.211
>  > system.time(r2 <- f2(x,y))
>     user  system elapsed
>    0.000   0.000   0.004
>  > identical(r0, r2)
>  [1] TRUE
>
> Bill Dunlap
> Spotfire, TIBCO Software
> wdunlap tibco.com
>
>>
>> Any help is much appreciated.
>>
>> Kevin
>>
>> University of Manchester
>>
>>
>>
>>
>>
>>
>>
>>
>> Take two vectors x and y, where y is a subset of x:
>>
>> x=1:10
>>
>> y=c(2,5,6,9)
>>
>> If y is removed from x, the original x values now have a new
>> placement (index) in the resulting vector (new):
>>
>> new=x[-y]
>>
>> index=1:length(new)
>>
>> The challenge is: How can I *quickly* and *efficiently*
>> deduce the new 'index' value directly from the original 'x'
>> value -- using only 'y' as an input?
>>
>> In practice, I have very large matrices containing the 'x'
>> values, and I need to convert them to the corresponding
>> 'index' if the 'y' values are removed.
>>
>>
>>
>>
>>       [[alternative HTML version deleted]]
>>
>> ______________________________________________
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>

______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Bill's code is insanely fast!

f2 <- function(x, y) length(y) - findInterval(-x, rev(-sort(y)))

?n1 <- 1e07
n2 <- 10^c(1,2,3,4,5,6,7)
tt <- rep(NA, 7)
x <- rnorm(n1)
for (i in 1:length(n2)){
y <- runif(n2[i])
tt[i] <- system.time(a1 <- f2(x, y))[3]
}

tt
[1] ?0.70 ?0.86 ?1.03 ?1.28 ?1.54 ?4.99 12.07

I would be surprised if this can be beaten even in C/C++.
Have you looked at the definition of the findInterval function,
especially the part where it calls the C function?

I still think that it can be done faster if the x vector is sorted in
addition to the y vector, although admittedly the speed gain will not
be as much.  And, of course, Bill's solution depends on knowing R or
S-PLUS well enough to remember that there is a findInterval function.
From: r-help-bounces at r-project.org [r-help-bounces at r-project.org] On Behalf Of Douglas Bates [bates at stat.wisc.edu]
Sent: Thursday, April 14, 2011 6:22 PM
To: William Dunlap
Cc: r-help at r-project.org; Sunduz Keles; rcpp-devel; Kevin Ummel
Subject: Re: [R] Find number of elements less than some number: Elegant/fastsolution needed

My colleague Sunduz Keles once mentioned a similar problem to me. ?She
had a large sample from a reference distribution and a test sample
(both real-valued in her case) and she wanted, for each element of the
test sample, the proportion of the reference sample that was less than
the element. ?It's a type of empirical p-value calculation.

I forget the exact sizes of the samples (do you remember, Sunduz?) but
they were in the tens of thousands or larger. ?Solutions in R tended
to involve comparing every element in the test sample to every element
in the reference sample but, of course, that is unnecessary. ?If you
sort both samples then you can start the comparisons for a particular
element of the test sample at the element of the reference sample
where the last comparison failed.

I was able to write a very short C++ function using the Rcpp package
that provided about a 1000-fold increase in speed relative to the best
I could do in R. ?I don't have the script on this computer so I will
post it tomorrow when I am back on the computer at the office.

Apologies for cross-posting to the Rcpp-devel list but I am doing so
because this might make a good example of the usefulness of Rcpp and
inline.

On Thu, Apr 14, 2011 at 4:45 PM, William Dunlap <wdunlap at tibco.com> wrote:
-----Original Message-----
From: r-help-bounces at r-project.org
[mailto:r-help-bounces at r-project.org] On Behalf Of Kevin Ummel
Sent: Thursday, April 14, 2011 12:35 PM
To: r-help at r-project.org
Subject: [R] Find number of elements less than some number:
Elegant/fastsolution needed

Take vector x and a subset y:

x=1:10

y=c(4,5,7,9)

For each value in 'x', I want to know how many elements in
'y' are less than 'x'.

An example would be:

sapply(x,FUN=function(i) {length(which(y<i))})
?[1] 0 0 0 0 1 2 2 3 3 4

But this solution is far too slow when x and y have lengths
in the millions.

I'm certain an elegant (and computationally efficient)
solution exists, but I'm in the weeds at this point.
If x is sorted findInterval(x, y) would do it for <= but you
want strict <. ?Try
?f2 <- function(x, y) length(y) - findInterval(-x, rev(-sort(y)))
where your version is
?f0 <- function(x, y) sapply(x,FUN=function(i) {length(which(y<i))})
E.g.,
?> x <- sort(sample(1e6, size=1e5))
?> y <- sample(1e6, size=1e4, replace=TRUE)
?> system.time(r0 <- f0(x,y))
? ? user ?system elapsed
? ?7.900 ? 0.310 ? 8.211
?> system.time(r2 <- f2(x,y))
? ? user ?system elapsed
? ?0.000 ? 0.000 ? 0.004
?> identical(r0, r2)
?[1] TRUE

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com

Any help is much appreciated.

Kevin

University of Manchester

Take two vectors x and y, where y is a subset of x:

x=1:10

y=c(2,5,6,9)

If y is removed from x, the original x values now have a new
placement (index) in the resulting vector (new):

new=x[-y]

index=1:length(new)

The challenge is: How can I *quickly* and *efficiently*
deduce the new 'index' value directly from the original 'x'
value -- using only 'y' as an input?

In practice, I have very large matrices containing the 'x'
values, and I need to convert them to the corresponding
'index' if the 'y' values are removed.

? ? ? [[alternative HTML version deleted]]

______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Ravi:

Well, that's a challenge!

Here is a solution which is faster than Bill's as long as length(y) << length(x).

f3 <- function(x, y) length(y)*ecdf(y)(x)

Then using your test on my workstation, I get for f2():
tt
[1]  0.761  1.071  1.329  1.565  1.894  3.865 11.824

and for f3():
tt
[1]  0.717  0.882  1.010  1.161  1.389  3.080 15.530

I suspect ecdf() is a little more intuitive than findInterval().

Hope this helps,
Ray Brownrigg
Bill's code is insanely fast!

f2 <- function(x, y) length(y) - findInterval(-x, rev(-sort(y)))

 n1 <- 1e07
n2 <- 10^c(1,2,3,4,5,6,7)
tt <- rep(NA, 7)
x <- rnorm(n1)
for (i in 1:length(n2)){
y <- runif(n2[i])
tt[i] <- system.time(a1 <- f2(x, y))[3]
}

tt
[1]  0.70  0.86  1.03  1.28  1.54  4.99 12.07

I would be surprised if this can be beaten even in C/C++.

Ravi.
________________________________________
From: r-help-bounces at r-project.org [r-help-bounces at r-project.org] On Behalf
Of Douglas Bates [bates at stat.wisc.edu] Sent: Thursday, April 14, 2011 6:22
PM
To: William Dunlap
Cc: r-help at r-project.org; Sunduz Keles; rcpp-devel; Kevin Ummel
Subject: Re: [R] Find number of elements less than some number:
Elegant/fastsolution needed

My colleague Sunduz Keles once mentioned a similar problem to me.  She
had a large sample from a reference distribution and a test sample
(both real-valued in her case) and she wanted, for each element of the
test sample, the proportion of the reference sample that was less than
the element.  It's a type of empirical p-value calculation.

I forget the exact sizes of the samples (do you remember, Sunduz?) but
they were in the tens of thousands or larger.  Solutions in R tended
to involve comparing every element in the test sample to every element
in the reference sample but, of course, that is unnecessary.  If you
sort both samples then you can start the comparisons for a particular
element of the test sample at the element of the reference sample
where the last comparison failed.

I was able to write a very short C++ function using the Rcpp package
that provided about a 1000-fold increase in speed relative to the best
I could do in R.  I don't have the script on this computer so I will
post it tomorrow when I am back on the computer at the office.

Apologies for cross-posting to the Rcpp-devel list but I am doing so
because this might make a good example of the usefulness of Rcpp and
inline.

On Thu, Apr 14, 2011 at 4:45 PM, William Dunlap <wdunlap at tibco.com> wrote:
-----Original Message-----
From: r-help-bounces at r-project.org
[mailto:r-help-bounces at r-project.org] On Behalf Of Kevin Ummel
Sent: Thursday, April 14, 2011 12:35 PM
To: r-help at r-project.org
Subject: [R] Find number of elements less than some number:
Elegant/fastsolution needed

Take vector x and a subset y:

x=1:10

y=c(4,5,7,9)

For each value in 'x', I want to know how many elements in
'y' are less than 'x'.

An example would be:

sapply(x,FUN=function(i) {length(which(y<i))})

 [1] 0 0 0 0 1 2 2 3 3 4

But this solution is far too slow when x and y have lengths
in the millions.

I'm certain an elegant (and computationally efficient)
solution exists, but I'm in the weeds at this point.

If x is sorted findInterval(x, y) would do it for <= but you
want strict <.  Try

 f2 <- function(x, y) length(y) - findInterval(-x, rev(-sort(y)))

where your version is

 f0 <- function(x, y) sapply(x,FUN=function(i) {length(which(y<i))})

E.g.,

 > x <- sort(sample(1e6, size=1e5))
 > y <- sample(1e6, size=1e4, replace=TRUE)
 > system.time(r0 <- f0(x,y))
 > 
    user  system elapsed

   7.900   0.310   8.211

 > system.time(r2 <- f2(x,y))
 > 
    user  system elapsed

   0.000   0.000   0.004

 > identical(r0, r2)

 [1] TRUE

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com

Any help is much appreciated.

Kevin

University of Manchester

Take two vectors x and y, where y is a subset of x:

x=1:10

y=c(2,5,6,9)

If y is removed from x, the original x values now have a new
placement (index) in the resulting vector (new):

new=x[-y]

index=1:length(new)

The challenge is: How can I *quickly* and *efficiently*
deduce the new 'index' value directly from the original 'x'
value -- using only 'y' as an input?

In practice, I have very large matrices containing the 'x'
values, and I need to convert them to the corresponding
'index' if the 'y' values are removed.

      [[alternative HTML version deleted]]