Fastest non-overlapping binning mean function out there?

On 10/3/2012 6:47 AM, Martin Morgan wrote:
I'll take my solution back. The problem specification says that x has 
10,000-millions of elements, so we need to use R-devel, with its long-vector 
support, and declare the lengths as

     R_xlen_t nx = Rf_xlength(x), nb = Rf_xlength(bx), i, j, n;

but there are two further issues. The first is that on my system

$ gcc --version
gcc (Ubuntu/Linaro 4.6.3-1ubuntu5) 4.6.3

I have __SIZEOF_SIZE_T__ 8 but

(a) the test in Rinternals.h:52 is of SIZEOF_SIZE_T, which is undefined, so I end 
up with typedef int R_xlen_t (this happens, e.g., with R CMD SHLIB, which I used 
instead of the inline package to avoid that level of uncertainty), and then
(b) because nx is then a signed int, a length greater than .Machine$integer.max 
is represented as a negative number, and the loop never iterates. So I'd have to 
be more clever if I wanted this to work on all platforms.
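
To illustrate (b) outside of R, here is a minimal sketch in plain C. The function names are made up for this example, and int64_t stands in for a properly defined R_xlen_t. Converting an out-of-range value to int is implementation-defined in C; gcc reduces it modulo 2^32, which is the behaviour assumed here.

```c
#include <stdint.h>

/* Hypothetical helper: returns 1 if the loop body would run at least once
   when the true length is stored in a plain int, 0 otherwise. */
int loop_runs_with_int_length(int64_t true_len)
{
    /* Out-of-range conversion is implementation-defined; with gcc a value
       such as 3e9 wraps to a negative int. */
    int n = (int) true_len;
    for (int i = 0; i < n; i++)
        return 1;   /* entered the loop */
    return 0;       /* n <= 0, so the loop never iterates */
}

/* Same check with a 64-bit signed length, as R_xlen_t is meant to be. */
int loop_runs_with_wide_length(int64_t true_len)
{
    int64_t n = true_len;
    for (int64_t i = 0; i < n; i++)
        return 1;
    return 0;
}
```

With a length of 3000000000 the first function should report that the loop is skipped under gcc, while the second enters it as expected.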

For what it's worth, Herve's solution is also problematic

 > xx = findInterval(bx, x)
Error in findInterval(bx, x) : long vector 'vec' is not supported

A different strategy for the problem at hand would seem to involve iteration 
over sequences of x, collecting sufficient statistics (n, sum) for each 
iteration, and calculating the mean at the end of the day. This might also 
result in better memory use and allow parallel processing.
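
A rough sketch of that strategy in plain C, not tied to the R API, might look like the following. The names bin_stat, bin_accumulate and bin_finalize are invented for illustration. It assumes x is sorted within each chunk, that y holds the values to be averaged, and that bx holds nbins + 1 sorted breakpoints with bin j covering [bx[j], bx[j+1]).

```c
#include <stddef.h>
#include <math.h>

/* Sufficient statistics for one bin: running sum and count. */
typedef struct {
    double sum;
    size_t n;
} bin_stat;

/* Fold one chunk of (x, y) pairs into the per-bin statistics.
   Assumes x is sorted in increasing order within the chunk and that
   bx has nbins + 1 sorted breakpoints. */
void bin_accumulate(const double *x, const double *y, size_t len,
                    const double *bx, size_t nbins, bin_stat *stats)
{
    size_t j = 0;
    for (size_t i = 0; i < len; i++) {
        if (x[i] < bx[0])
            continue;               /* below the first bin */
        while (j < nbins && x[i] >= bx[j + 1])
            j++;                    /* advance to the bin containing x[i] */
        if (j == nbins)
            break;                  /* past the last bin; the rest is too */
        stats[j].sum += y[i];
        stats[j].n++;
    }
}

/* Turn the accumulated statistics into means; empty bins give NAN. */
void bin_finalize(const bin_stat *stats, size_t nbins, double *means)
{
    for (size_t j = 0; j < nbins; j++)
        means[j] = stats[j].n ? stats[j].sum / (double) stats[j].n : NAN;
}
```

Each chunk needs only an index that fits comfortably in size_t, so no single call sees the full length of x. Since (n, sum) pairs from separate chunks can simply be added, the chunks could in principle be processed by separate workers and merged before the final division.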

Martin