
Fastest non-overlapping binning mean function out there?

6 messages · Henrik Bengtsson, Hervé Pagès, Martin Morgan

#
Hi,

I'm looking for a super-duper fast mean/sum binning implementation
available in R, and before implementing z = binnedMeans(x, y) in native
code myself, does anyone know of an existing function/package for
this?  I'm sure it already exists.  So, given data (x,y) and B bins
bx[1] < bx[2] < ... < bx[B] < bx[B+1], I'd like to calculate the
binned means (or sums) 'z' such that z[1] = mean(y[bx[1] <= x & x <
bx[2]]), z[2] = mean(y[bx[2] <= x & x < bx[3]]), ..., z[B].  Let's
assume there are no missing values and that 'x' and 'bx' are already
ordered.  The length of 'x' is in the order of 10,000-millions.  The
number of elements in each bin varies.

Thanks,

Henrik
#
Hi Henrik,
On 10/02/2012 05:19 PM, Henrik Bengtsson wrote:
You didn't say if you have a lot of bins or not. If you don't have a lot
of bins (e.g. < 10000), something like

   aggregate(x, by=list(bin=findInterval(x, bx)), FUN=mean)

might not be too bad:

   > x <- seq(0, 8, by=0.1)
   > bx <- c(2, 2.5, 4, 5.8)
   > aggregate(x, by=list(bin=findInterval(x, bx)), FUN=mean)
     bin    x
   1   0 0.95
   2   1 2.20
   3   2 3.20
   4   3 4.85
   5   4 6.90

I didn't try it on a 10,000-million-element vector though (and I've
no idea how I could do this).

H.

  
    
#
On 10/02/2012 06:11 PM, Hervé Pagès wrote:
Of course, if you have a lot of bins, using aggregate() is not optimal.
But you can replace it by your own optimized version e.g.:

   ## 'bin' must be a sorted vector of non-negative integers of the
   ## same length as 'x'.
   fast_aggregate_mean <- function(x, bin, nbins)
   {
     bin_count <- tabulate(bin + 1L, nbins=nbins)
     diff(c(0L, cumsum(x)[cumsum(bin_count)])) / bin_count
   }

Then:

   bin <- findInterval(x, bx)
   fast_aggregate_mean(x, bin, nbins=length(bx)+1L)

On my machine this is 100x faster or more than using aggregate() when
the number of bins is > 100k. Memory usage is also reduced a lot.
Another benefit of using fast_aggregate_mean() over aggregate() is
that all the bins are represented in the output (aggregate() ignores
empty bins).

Cheers,
H.

  
    
#
On 10/02/2012 05:19 PM, Henrik Bengtsson wrote:
since x and bx are ordered (sorting x would be expensive), the C code seems 
pretty straightforward and memory-efficient -- create a result vector as long 
as bx, then iterate through x, accumulating n and its sum until x[i] >= bx[j]. 
(I think R's implementation of mean() actually pays more attention to numerical 
error, but with this much data...)

library(inline)
binmean <- cfunction(signature(x="numeric", bx="numeric"),
"   int nx = Rf_length(x), nb = Rf_length(bx), i, j, n;
     SEXP ans = PROTECT(NEW_NUMERIC(nb));
     double sum, *xp = REAL(x), *bxp = REAL(bx), *ansp = REAL(ans);
     sum = j = n = 0;
     for (i = 0; i < nx; ++i) {
         /* close every bin whose upper bound has been passed */
         while (j < nb && xp[i] >= bxp[j]) {
              ansp[j++] = n > 0 ? sum / n : 0;
              sum = n = 0;
         }
         n += 1;
         sum += xp[i];
     }
     /* flush the last open bin and zero any trailing empty bins */
     if (j < nb)
         ansp[j++] = n > 0 ? sum / n : 0;
     while (j < nb)
         ansp[j++] = 0;
     UNPROTECT(1);
     return ans;
")

with a test case

nx <- 4e7
nb <- 1e3
x <- sort(runif(nx))
bx <- do.call(seq, c(as.list(range(x)), length.out=nb))

leading to

 > bx1 <- c(bx[-1], bx[nb] + 1)
 > system.time(res <- binmean(x, bx1))
    user  system elapsed
   0.052   0.000   0.050

Martin

  
    
#
On 10/3/2012 6:47 AM, Martin Morgan wrote:
I'll take my solution back. The problem specification says that x has 
10,000-millions of elements, so we need to use R-devel and

     R_xlen_t nx = Rf_xlength(x), nb = Rf_xlength(bx), i, j, n;

but there are two further issues. The first is that on my system

p$ gcc --version
gcc (Ubuntu/Linaro 4.6.3-1ubuntu5) 4.6.3

I have __SIZEOF_SIZE_T__ 8 but

(a) the test in Rinternals.h:52 is of SIZEOF_SIZE_T, which is undefined. I end 
up with typedef int R_xlen_t (e.g., after R CMD SHLIB, instead of using the 
inline package, to avoid that level of uncertainty) and then
(b) because nx is a signed type, and since nx > .Machine$integer.max is 
represented as a negative number, I don't ever iterate this loop. So I'd have to 
be more clever if I wanted this to work on all platforms.

For what it's worth, Hervé's solution is also problematic:

 > xx = findInterval(bx, x)
Error in findInterval(bx, x) : long vector 'vec' is not supported

A different strategy for the problem at hand would seem to involve iteration 
over sequences of x, collecting sufficient statistics (n, sum) for each 
iteration, and calculating the mean at the end of the day. This might also 
result in better memory use and allow parallel processing.

Martin

  
    
#
Thank you all - I appreciate all your suggestions.  My post was mainly
to check if there's already an existing and fast implementation out
there.  I've ended up adopting Martin's first proposal.  Something
like this:

#include <Rdefines.h>
#include <R.h>
#include <R_ext/Error.h>

SEXP binMeans(SEXP y, SEXP x, SEXP bx, SEXP retCount) {
  int ny = Rf_length(y), nx = Rf_length(x), nb = Rf_length(bx)-1;
  double *yp = REAL(y), *xp = REAL(x), *bxp = REAL(bx);
  SEXP ans = PROTECT(NEW_NUMERIC(nb));
  double *ansp = REAL(ans);
  int retcount = LOGICAL(retCount)[0];
  SEXP count = NULL;
  int *countp = NULL;
  int i = 0, j = 0, n = 0, iStart = 0;
  double sum = 0.0;

  // Assert same lengths of 'x' and 'y'
  if (ny != nx) {
    error("Arguments 'y' and 'x' are of different lengths: %d != %d", ny, nx);
  }

  if (retcount) {
    count = PROTECT(NEW_INTEGER(nb));
    countp = INTEGER(count);
  }

  // Skip to the first bin
  while (iStart < nx && xp[iStart] < bxp[0]) {
    ++iStart;
  }

  // For each x...
  for (i = iStart; i < nx; ++i) {
    // Beyond the last bin?
    if (xp[i] >= bxp[nb]) break;
    // Skip to a new bin?
    while (xp[i] >= bxp[j+1]) {
      // Update statistics of current bin...
      if (retcount) { countp[j] = n; }
      ansp[j] = n > 0 ? sum / n : 0;
      sum = 0.0;
      n = 0;
      // ...and move to the next
      ++j;
    }
    sum += yp[i];
    n += 1;
  }

  // Update statistics of the last visited bin...
  ansp[j] = n > 0 ? sum / n : 0;
  if (retcount) { countp[j] = n; }
  // ...and mark any remaining trailing bins as empty
  for (++j; j < nb; ++j) {
    ansp[j] = 0;
    if (retcount) { countp[j] = 0; }
  }

  if (retcount) {
    setAttrib(ans, install("count"), count);
    UNPROTECT(1); // 'count'
  }
  UNPROTECT(1); // 'ans'

  return ans;
} // binMeans()

BTW, "10,000-millions" was supposed to read as a range (10^4 to 10^6)
and not a scalar (10^10), but the misinterpretation got Martin to
point out some of the exciting work extending R core to support "very
large" integers.  And, as Hervé pointed out, I forgot to say that the
number of bins could be of that range as well, or maybe an order of
magnitude less (say ~1-100 data points per bin).

/Henrik
On Wed, Oct 3, 2012 at 10:50 AM, Martin Morgan <mtmorgan at fhcrc.org> wrote: