[Rcpp-devel] [R] Find number of elements less than some number: Elegant/fastsolution needed
I enclose some comparisons, admittedly on only one size of example (10 million in the reference distribution and 100,000 in the test sample). The versions using Rcpp and the std algorithms came out about 3 times as fast as the version using R's findInterval. What surprises me is that the sequential comparisons in C++ (version k) is slightly faster than the binary search versions (h and j). The unassisted binary search requires 30 probes for each element of ans and the sequential comparisons show take an average of 100 comparisons. I suppose the difference is sequential access versus random access to the elements of y.
On Fri, Apr 15, 2011 at 7:34 AM, Douglas Bates <bates at stat.wisc.edu> wrote:
On Thu, Apr 14, 2011 at 11:47 PM, Christian Gunning <xian at unm.edu> wrote:
On Thu, Apr 14, 2011 at 7:02 PM, <rcpp-devel-request at r-forge.wu-wien.ac.at> wrote:
I was able to write a very short C++ function using the Rcpp package that provided about a 1000-fold increase in speed relative to the best I could do in R. ?I don't have the script on this computer so I will post it tomorrow when I am back on the computer at the office. Apologies for cross-posting to the Rcpp-devel list but I am doing so because this might make a good example of the usefulness of Rcpp and inline.
And RcppArmadillo, as the case may be.
This is a cool little problem. ?In the examples given, I'd caution
people against comparing apples and durian. ?The sort(x) is a cost
that should be considered *within* each implementation. I used
Armadillo to sort (src, f4), and get another 100% worth of speedup
that I can't reproducing using R's sort (src1, f1-f3). ?If i modify
SEXP in-place (and this always confuses me, so I tend to avoid it),
I'm seeing an additional ~5-10% speed gain (src2, f5) -- the advantage
of this last seems to be primarily in memory-constrained applications.
On to the code!
src = '
NumericVector xx_(clone(x)), yy_(clone(y));
int nxx = xx_.size();
int nyy = yy_.size();
arma::vec xx(xx_), yy(yy_);
yy = sort(yy);
xx = sort(xx);
//
//
int j = 0; //gt index for yy
for (int i=0; i < nxx; i++) {
? ?while ((j < nyy) && ( xx(i) > yy(j) ) ) {
? ? ? ?j++;
? ?}
? ?xx_(i) = j;
? }
return (xx_);
'
src1 = '
NumericVector xx_(clone(x)), yy_(clone(y));
// assumes x & y are already sorted
arma::vec xx(xx_), yy(yy_);
int nxx = xx.n_elem;
int nyy = yy.n_elem;
int j = 0; //gt index for yy
for (int i=0; i < nxx; i++) {
? ?while ((j < nyy) && ( xx(i) > yy(j) ) ) {
? ? ? ?j++;
? ?}
? ?xx_(i) = j;
?}
return (xx_);
'
src2 = '
NumericVector xx_(x), yy_(y); ?//kinda scary
int nxx = xx_.size();
int nyy = yy_.size();
arma::vec xx(xx_.begin(), nxx, false), yy(yy_.begin(), nyy, false);
//really kinda scary
yy = sort(yy);
xx = sort(xx);
//
//
int j = 0; //gt index for yy
for (int i=0; i < nxx; i++) {
? ?while ((j < nyy) && ( xx(i) > yy(j) ) ) {
? ? ? ?j++;
? ?}
? ?xx_(i) = j;
}
return (xx_);
'
require(inline)
require(RcppArmadillo)
f1 <- function(x, y) { sort(length(y) - findInterval(-x, rev(-sort(y))));}
f2 <- function(x, y) {x = sort(x); length(y) - findInterval(-x, rev(-sort(y)))}
f3.a <- cxxfunction(signature(x="numeric", y="numeric"), src1,
plugin='RcppArmadillo')
f3 <- function(x,y) {
? ? ? ?x <- sort(x)
? ? ? ?y <- sort(y)
? ? ? ?return(f3.a(x,y))
}
f4 <- cxxfunction(signature(x="numeric", y="numeric"), src,
plugin='RcppArmadillo')
## ?danger -- violates R semantics
f5 <- cxxfunction(signature(x="numeric", y="numeric"), src2,
plugin='RcppArmadillo')
## this is a really ugly test. ygwypf, i suppose :)
for (i in 1:5) {
?x1 <- x <- rnorm(5e6)
?y1 <- y <- rnorm(5e6)
?print( cbind(
? ?r1=system.time(r1 <- f1(x,y)),
? ?r2=system.time(r2 <- f2(x,y)), r3=system.time(r3 <- f3(x1,y1)),
? ?r4 = system.time(r4 <- f4(x,y)), r5 = system.time(r5 <- f5(x,y))
?))
}
print(all.equal(r1, r2))
print(all.equal(r1, r3))
print(all.equal(r1, r4))
print(all.equal(r1, r5))
I agree that the cost of sorting should be taken into account but I don't think you need to go to the RcppArmadillo package to get a sort function. ?Why not use std::sort? Also, I did sequential comparisons as shown in your code but after reading Bill Dunlap's response and looking at the documentation for the findInterval ?function in R I smacked myself on the forehead and thought "Duh - binary search, of course". I haven't looked at the C code underlying the findInterval function yet so I don't know if Martin has clever tricks for sorted x and y. However the documentation for the std::upper_bound template at cplusplus.com shows how to use that for the case here. ?The best I can think of for sorted x and y is to pass the upper bound from x[i] as the first argument in the call to std::upper_bound for x[i+1]. Unfortunately I am staring at a series of deadlines today so implementations and comparisons may need to wait until tomorrow. P.S. to Christian: ?Check the archives for several of Dirk's posts to the rcpp-devel list where he has used the rbenchmark package to produce clean output from comparisons of ?implementations of algorithms.
-------------- next part -------------- R version 2.13.0 (2011-04-13) Copyright (C) 2011 The R Foundation for Statistical Computing ISBN 3-900051-07-0 Platform: x86_64-pc-linux-gnu (64-bit) R is free software and comes with ABSOLUTELY NO WARRANTY. You are welcome to redistribute it under certain conditions. Type 'license()' or 'licence()' for distribution details. Natural language support but running in an English locale R is a collaborative project with many contributors. Type 'contributors()' for more information and 'citation()' on how to cite R or R packages in publications. Type 'demo()' for some demos, 'help()' for on-line help, or 'help.start()' for an HTML browser interface to help. Type 'q()' to quit R.
require("inline")
Loading required package: inline
require("Rcpp")
Loading required package: Rcpp
require("rbenchmark")
Loading required package: rbenchmark
## I'm changing the rules a bit here because Sunduz wanted the
## counts in the original order of x
f <- function(x, y) length(y) - findInterval(-x, rev(-sort(y)))
## findInterval can take advantage of x being ordered
g <- function(x, y) {
+ ord <- order(x) + ans <- integer(length(x)) + ans[ord] <- length(y) - findInterval(-x[ord], rev(-sort(y))) + ans + }
## version using std::upper_bound for unordered x h <- cxxfunction(signature(x_="numeric", y_="numeric"), '
+ {
+ Rcpp::NumericVector x(x_),
+ y = clone(Rcpp::NumericVector(y_));
+ int n = x.size();
+ Rcpp::IntegerVector ans(n);
+ const Rcpp::NumericVector::iterator bb = y.begin(), ee = y.end();
+ // sort (a copy of) the y values for binary search
+ std::sort(bb, ee);
+
+ for (int i = 0; i < n; i++) {
+ ans[i] = std::upper_bound(bb, ee, x[i]) - bb;
+ }
+ return ans;
+ }
+ ', plugin="Rcpp")
## version using std::upper_bound when x is non-decreasing j1 <- cxxfunction(signature(x_="numeric", y_="numeric"), '
+ {
+ Rcpp::NumericVector x(x_),
+ y = clone(Rcpp::NumericVector(y_));
+ int n = x.size();
+ Rcpp::IntegerVector ans(n);
+ const Rcpp::NumericVector::iterator bb = y.begin(), ee = y.end();
+ Rcpp::NumericVector::iterator mm = y.begin();
+ // sort (a copy of) the y values for binary search
+ std::sort(bb, ee);
+
+ for (int i = 0; i < n; i++) {
+ mm = std::upper_bound(mm, ee, x[i]);
+ ans[i] = mm - bb;
+ }
+ return ans;
+ }
+ ', plugin="Rcpp")
j <- function(x, y) {
+ ord <- order(x) + ans <- integer(length(x)) + ans[ord] <- j1(x[ord], y) + ans + }
## version using sequential search k1 <- cxxfunction(signature(xs_="numeric", y_="numeric", ord_="integer"), '
+ {
+ Rcpp::NumericVector xs(xs_),
+ y = clone(Rcpp::NumericVector(y_));
+ int n = xs.size();
+ Rcpp::IntegerVector ord(ord_), ans(n);
+ const Rcpp::NumericVector::iterator bb = y.begin(), ee = y.end();
+ Rcpp::NumericVector::iterator yy = y.begin();
+ double *xp = xs.begin();
+ // sort (a copy of) the y values for binary search
+ std::sort(bb, ee);
+
+ for (int i = 0; i < n && yy < ee; i++, xp++) {
+ while (*xp >= *yy && yy < ee) yy++;
+ ans[ord[i] - 1] = yy - bb;
+ }
+ return ans;
+ }
+ ', plugin="Rcpp")
k <- function(x, y) {
+ ord <- order(x) + k1(x[ord], y, ord) + }
x <- rnorm(1e5) yy <- y <- rnorm(1e7) yy[1] <- y[1] # ensure y and yy are distinct aa <- f(x, y) all.equal(aa, g(x, y))
[1] TRUE
all.equal(aa, h(x, y))
[1] TRUE
all.equal(aa, j(x, y))
[1] TRUE
all.equal(aa, k(x, y))
[1] TRUE
all.equal(y, yy) # check that nothing has changed in y
[1] TRUE
benchmark(f(x,y), g(x,y), h(x,y), j(x,y), k(x,y),
+ order="relative", replications=10,
+ columns=c("test", "replications", "elapsed", "relative"))
test replications elapsed relative
5 k(x, y) 10 19.477 1.000000
4 j(x, y) 10 19.984 1.026031
3 h(x, y) 10 20.107 1.032346
2 g(x, y) 10 57.691 2.962006
1 f(x, y) 10 60.563 3.109462
proc.time()
user system elapsed 213.060 15.880 233.067 -------------- next part -------------- A non-text attachment was scrubbed... Name: bb.R Type: application/octet-stream Size: 2865 bytes Desc: not available URL: <http://lists.r-forge.r-project.org/pipermail/rcpp-devel/attachments/20110417/cc479b0f/attachment.obj>