Fastest non-overlapping binning mean function out there?
6 messages · Henrik Bengtsson, Hervé Pagès, Martin Morgan

Hi, I'm looking for a super-duper fast mean/sum binning implementation available in R. Before implementing z = binnedMeans(x, y) in native code myself, does anyone know of an existing function/package for this? I'm sure it already exists. So, given data (x,y) and B bins bx[1] < bx[2] < ... < bx[B] < bx[B+1], I'd like to calculate the binned means (or sums) 'z' such that z[1] = mean(y[bx[1] <= x & x < bx[2]]), z[2] = mean(y[bx[2] <= x & x < bx[3]]), ..., z[B]. Let's assume there are no missing values and that 'x' and 'bx' are already ordered. The length of 'x' is in the order of 10,000-millions. The number of elements in each bin varies. Thanks, Henrik
Hi Henrik,
On 10/02/2012 05:19 PM, Henrik Bengtsson wrote:
You didn't say if you have a lot of bins or not. If you don't have a lot
of bins (e.g. < 10000), something like
aggregate(x, by=list(bin=findInterval(x, bx)), FUN=mean)
might not be too bad:
> x <- seq(0, 8, by=0.1)
> bx <- c(2, 2.5, 4, 5.8)
> aggregate(x, by=list(bin=findInterval(x, bx)), FUN=mean)
bin x
1 0 0.95
2 1 2.20
3 2 3.20
4 3 4.85
5 4 6.90
I didn't try it on a 10,000-millions-elements vector though (and I've
no idea how I could do this).
H.
______________________________________________ R-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319
On 10/02/2012 06:11 PM, Hervé Pagès wrote:
Of course, if you have a lot of bins, using aggregate() is not optimal.
But you can replace it with your own optimized version, e.g.:
## 'bin' must be a sorted vector of non-negative integers of the
## same length as 'x'.
fast_aggregate_mean <- function(x, bin, nbins)
{
    bin_count <- tabulate(bin + 1L, nbins=nbins)
    diff(c(0L, cumsum(x)[cumsum(bin_count)])) / bin_count
}
Then:
bin <- findInterval(x, bx)
fast_aggregate_mean(x, bin, nbins=length(bx)+1L)
On my machine this is 100x faster or more than using aggregate() when
the number of bins is > 100k. Memory usage is also reduced a lot.
Another benefit of using fast_aggregate_mean() over aggregate() is
that all the bins are represented in the output (aggregate() ignores
empty bins).
Cheers,
H.
On 10/02/2012 05:19 PM, Henrik Bengtsson wrote:
Since x and bx are ordered (sorting x would be expensive), the C code seems
pretty straightforward and memory-efficient -- create a result vector as long
as bx, then iterate through x, accumulating n and its sum until x[i] >= bx[j].
(I think R's implementation of mean() actually pays more attention to numerical
error, but with this much data...)
library(inline)
binmean <- cfunction(signature(x="numeric", bx="numeric"),
"   int nx = Rf_length(x), nb = Rf_length(bx), i, j, n;
    SEXP ans = PROTECT(NEW_NUMERIC(nb));
    double sum, *xp = REAL(x), *bxp = REAL(bx), *ansp = REAL(ans);
    sum = j = n = 0;
    for (i = 0; i < nx; ++i) {
        while (xp[i] >= bxp[j]) {
            ansp[j++] = n > 0 ? sum / n : 0;
            sum = n = 0;
        }
        n += 1;
        sum += xp[i];
    }
    ansp[j] = n > 0 ? sum / n : 0;
    /* zero any trailing bins the data never reached */
    while (++j < nb)
        ansp[j] = 0;
    UNPROTECT(1);
    return ans;
")
with a test case
nx <- 4e7
nb <- 1e3
x <- sort(runif(nx))
bx <- do.call(seq, c(as.list(range(x)), length.out=nb))
leading to
> bx1 <- c(bx[-1], bx[nb] + 1)
> system.time(res <- binmean(x, bx1))
user system elapsed
0.052 0.000 0.050
Martin
Computational Biology / Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: Arnold Building M1 B861 Phone: (206) 667-2793
On 10/3/2012 6:47 AM, Martin Morgan wrote:
I'll take my solution back. The problem specification says that x has
10,000-millions of elements, so we need to use R-devel and
R_xlen_t nx = Rf_xlength(x), nb = Rf_xlength(bx), i, j, n;
but there are two further issues. The first is that on my system
p$ gcc --version
gcc (Ubuntu/Linaro 4.6.3-1ubuntu5) 4.6.3
I have __SIZEOF_SIZE_T__ 8 but
(a) the test in Rinternals.h:52 is of SIZEOF_SIZE_T, which is undefined. I end
up with typedef int R_xlen_t (e.g., after R CMD SHLIB, instead of using the
inline package, to avoid that level of uncertainty) and then
    SEXP ans = PROTECT(NEW_NUMERIC(nb));
    double sum, *xp = REAL(x), *bxp = REAL(bx), *ansp = REAL(ans);
    sum = j = n = 0;
    for (i = 0; i < nx; ++i) {
(b) because nx is a signed type, and since nx > .Machine$integer.max is represented as a negative number, I don't ever iterate this loop. So I'd have to be more clever if I wanted this to work on all platforms.
For what it's worth, Hervé's solution is also problematic:
> xx = findInterval(bx, x)
Error in findInterval(bx, x) : long vector 'vec' is not supported
A different strategy for the problem at hand would seem to involve iteration over sequences of x, collecting sufficient statistics (n, sum) for each iteration, and calculating the mean at the end of the day. This might also result in better memory use and allow parallel processing.
Martin
Dr. Martin Morgan, PhD Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109
Thank you all - I appreciate all your suggestions. My post was mainly
to check if there's already an existing and fast implementation out
there. I've ended up adopting Martin's first proposal. Something
like this:
#include <Rdefines.h>
#include <R.h>
#include <R_ext/Error.h>

SEXP binMeans(SEXP y, SEXP x, SEXP bx, SEXP retCount) {
  int ny = Rf_length(y), nx = Rf_length(x), nb = Rf_length(bx)-1;
  double *yp = REAL(y), *xp = REAL(x), *bxp = REAL(bx);
  SEXP ans = PROTECT(NEW_NUMERIC(nb));
  double *ansp = REAL(ans);
  int retcount = LOGICAL(retCount)[0];
  SEXP count = NULL;
  int *countp = NULL;
  int i = 0, j = 0, n = 0, iStart = 0;
  double sum = 0.0;

  // Assert same lengths of 'x' and 'y'
  if (ny != nx) {
    error("Argument 'y' and 'x' are of different lengths: %d != %d", ny, nx);
  }

  if (retcount) {
    count = PROTECT(NEW_INTEGER(nb));
    countp = INTEGER(count);
  }

  // Skip to the first bin (guard against all values falling below bx[1])
  while (iStart < nx && xp[iStart] < bxp[0]) {
    ++iStart;
  }

  // For each x...
  for (i = iStart; i < nx; ++i) {
    // Since x is sorted, stop once we are past the last bin
    if (xp[i] >= bxp[nb]) break;
    // Skip to a new bin?
    while (xp[i] >= bxp[j+1]) {
      // Update statistic of current bin?
      if (retcount) { countp[j] = n; }
      ansp[j] = n > 0 ? sum / n : 0;
      sum = 0.0;
      n = 0;
      // ...and move to next
      ++j;
    }
    sum += yp[i];
    n += 1;
  }

  // Update statistic of last visited bin, and zero any trailing bins
  ansp[j] = n > 0 ? sum / n : 0;
  if (retcount) { countp[j] = n; }
  while (++j < nb) {
    ansp[j] = 0;
    if (retcount) { countp[j] = 0; }
  }

  if (retcount) {
    setAttrib(ans, install("count"), count);
    UNPROTECT(1); // 'count'
  }
  UNPROTECT(1); // 'ans'
  return ans;
} // binMeans()
BTW, "10,000-millions" was supposed to read as a range (10^4 to 10^6)
and not a scalar (10^10), but the misinterpretation got Martin to
point out some of the exciting work extending R core to support "very
large" integers. And, as Hervé pointed out, I forgot to say that the
number of bins could be of that range as well, or maybe an order of
magnitude less (say ~1-100 data points per bin).
/Henrik