Fastest non-overlapping binning mean function out there?
On 10/02/2012 06:11 PM, Herv? Pag?s wrote:
Hi Henrik, On 10/02/2012 05:19 PM, Henrik Bengtsson wrote:
Hi, I'm looking for a super-duper fast mean/sum binning implementation available in R, and before implementing z = binnedMeans(x y) in native code myself, does any one know of an existing function/package for this? I'm sure it already exists. So, given data (x,y) and B bins bx[1] < bx[2] < ... < bx[B] < bx[B+1], I'd like to calculate the binned means (or sums) 'z' such that z[1] = mean(x[bx[1] <= x & x < bx[2]]), z[2] = mean(x[bx[2] <= x & x < bx[3]]), .... z[B]. Let's assume there are no missing values and 'x' and 'bx' is already ordered. The length of 'x' is in the order of 10,000-millions. The number of elements in each bin vary.
You didn't say if you have a lot of bins or not. If you don't have a lot of bins (e.g. < 10000), something like aggregate(x, by=list(bin=findInterval(x, bx)), FUN=mean) might not be too bad:
> x <- seq(0, 8, by=0.1) > bx <- c(2, 2.5, 4, 5.8) > aggregate(x, by=list(bin=findInterval(x, bx)), FUN=mean)
bin x 1 0 0.95 2 1 2.20 3 2 3.20 4 3 4.85 5 4 6.90
Of course, if you have a lot of bins, using aggregate() is not optimal.
But you can replace it by your own optimized version e.g.:
## 'bin' must be a sorted vector of non-negative integers of the
## same length as 'x'.
fast_aggregate_mean <- function(x, bin, nbins)
{
bin_count <- tabulate(bin + 1L, nbins=nbins)
diff(c(0L, cumsum(x)[cumsum(bin_count)])) / bin_count
}
Then:
bin <- findInterval(x, bx)
fast_aggregate_mean(x, bin, nbins=length(bx)+1L)
On my machine this is 100x faster or more than using aggregate() when
the number of bins is > 100k. Memory usage is also reduced a lot.
Another benefit of using fast_aggregate_mean() over aggregate() is
that all the bins are represented in the output (aggregate() ignores
empty bins).
Cheers,
H.
I didn't try it on a 10,000-millions-elements vector though (and I've no idea how I could do this). H.
Thanks, Henrik
______________________________________________ R-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Herv? Pag?s Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319