[Rcpp-devel] Numeric comparison of two long vectors [was: quick question]
Hi Sunduz, Thanks for the question. As I mentioned when we spoke, this will make a good demonstration of the Rcpp interface to compiled code in C++ because it is the type of calculation that is going to be very slow in native R code but can be done quite quickly in C++. I am cc:'ing the Rcpp-devel list on this so that others can comment and perhaps compare methods. The gist of the question is that you have long vector, tobs, of observed values and a reference distribution represented by tperm. For each element, tobs[i], you want to know the proportion of the values in tperm less than tobs[i]. In the current form the problem is that you are doing N * B comparisons plus a lot of sums and all that. The first thing to do to speed it up is to sort tperm so you only need to compare to the first j satisfying tperm[j] > tobs[i]. If you also sort tobs then you can start at the point where the last comparison failed. For tobs you should use order, rather than sort which will allow you to associate the pvalues with the original ordering of tobs. The output I enclose is before assigning the pvalues back to the correct observations (and may have a few "infelicities", as Bill Venables calls them, still lurking around).
On Thu, Apr 29, 2010 at 1:40 PM, Sunduz Keles <keles at stat.wisc.edu> wrote:
Hi Doug,
we are trying to implement a simple permutation approach for getting a null
dist. Generating observations from the permutation null distribution seems
pretty fast but when it comes to getting the actual p-value which is just a
simple sum of number of permuted test stat which are smaller than or equal
to what we actually observe, things get pretty slow becaue we have a large
number of observed test stats.
Here is a piece of code from what we have
B = 6000000
N = 3000000
set.seed(1)
tperm = rnorm(B, 0, 1) #we get these from somewhere else but for simplicity
let's assume that we generate them like this.
tobs = rnorm(N, 0, 1)
tobs[sample(1:N, 10000, replace = F)] ?= rnorm(1000, 2, 1)
#This is super slow.
pvals = apply(as.matrix(tobs), 1, function(x, y, B){sum(x>=y)/B}, tperm, B)
Maybe there is really no way around this but I was wondering whether you had
any suggestions for replacing sum(x>=y). If we have for example pnorm
instead of sum(x>=y) that runs much much faster.
Thanks much in advance!
Sunduz
------------------------------------------------------------------------------------
Sunduz Keles
?http://www.stat.wisc.edu/~keles
Associate Professor
Department of Statistics ? ? ? ? ? ? ? ? ? ? ? ? ? ?tel: 608-263-4533 (1245B
?MSC)
Department of Biostatistics and Medical Informatics
University of Wisconsin-Madison
-----------------------------------------------------------------------------------
-------------- next part -------------- R version 2.10.1 (2009-12-14) Copyright (C) 2009 The R Foundation for Statistical Computing ISBN 3-900051-07-0 R is free software and comes with ABSOLUTELY NO WARRANTY. You are welcome to redistribute it under certain conditions. Type 'license()' or 'licence()' for distribution details. Natural language support but running in an English locale R is a collaborative project with many contributors. Type 'contributors()' for more information and 'citation()' on how to cite R or R packages in publications. Type 'demo()' for some demos, 'help()' for on-line help, or 'help.start()' for an HTML browser interface to help. Type 'q()' to quit R.
nperm <- 6000000 nobs <- 3000000 set.seed(1) tperm <- sort(rnorm(nperm)) # these can be sorted immediately tobs <- rnorm(nobs) ## contaminate the sample nbig <- 10000 tobs[sample(1:nobs, nbig, replace = FALSE)] <- rnorm(nbig, mean = 2) ord <- order(tobs) tord <- tobs[ord] library(inline) # allows you to compile and load C++ code cpp <- '
+ Rcpp::NumericVector tord(tordP), tperm(tpermP);
+ Rcpp::NumericVector ans(tord.size()); // result vector
+ double *to = tord.begin(), *tp = tperm.begin(), *aa = ans.begin();
+ int nobs = tord.size(), nperm = tperm.size();
+ for (int i = 0, j = 0; i < nobs; i++) {
+ while (tp[j] < to[i] && j < nperm) j++;
+ aa[i] = ((double)j)/((double)nperm);
+ }
+ return ans;
+ '
ff <- cfunction(signature(tordP = "numeric", tpermP = "numeric"), cpp, Rcpp = TRUE, )
Loading required package: Rcpp
str(ans <- ff(tord, tperm))
num [1:3000000] 1.67e-07 1.67e-07 8.33e-07 1.00e-06 1.67e-06 ...
proc.time()
user system elapsed 13.140 1.310 14.069