Any interest in "merge" and "by" implementations specifically for so
Hi Tom,
Now, try sorting and using a loop:
idx <- order(i)
xs <- x[idx]
is <- i[idx]
res <- array(NA, 1e6)
idx <- which(diff(is) > 0)
startidx <- c(1, idx+1)
endidx <- c(idx, length(xs))
f1 <- function(x, startidx, endidx, FUN = sum) {
+ for (j in 1:length(res)) {
+ res[j] <- FUN(x[startidx[j]:endidx[j]])
+ }
+ res
+ }
unix.time(res1 <- f1(xs, startidx, endidx))
[1] 6.86 0.00 7.04 NA NA
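For anyone following along, the sort-and-find-boundaries step above may be easier to see on a handful of values. This is only a sketch on made-up toy data (the `x` and `i` below are hypothetical, not the vectors timed above), and it uses `vapply` in place of the explicit loop purely for brevity:

```r
# Toy data: values x and integer group labels i (hypothetical, for illustration only)
x <- c(5, 1, 4, 2, 3, 6)
i <- c(2L, 1L, 2L, 3L, 1L, 3L)

idx <- order(i)          # permutation that sorts the group labels
xs  <- x[idx]            # values reordered so each group is contiguous
is  <- i[idx]            # sorted labels: 1 1 2 2 3 3

brk      <- which(diff(is) > 0)    # last position of every group except the final one
startidx <- c(1, brk + 1)          # first position of each group
endidx   <- c(brk, length(xs))     # last position of each group

# Per-group sums computed from the start/end ranges
res <- vapply(seq_along(startidx),
              function(j) sum(xs[startidx[j]:endidx[j]]),
              numeric(1))

# Cross-check against tapply on the unsorted data
stopifnot(all.equal(res, as.vector(tapply(x, i, sum))))
```

The final check only confirms that the range-based result agrees with `tapply` on this small input; it says nothing about relative speed.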
I wonder how much time the sorting, reordering, and creation of startidx and endidx would add to this time? Either way, your code can nicely be used to quickly create the small integer factors I would need if the igroup functions get integrated. Thanks!
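One way to get a feel for that overhead is to wrap the preprocessing steps in `system.time`. The sketch below assumes randomly generated data (the original `x` and `i` are not shown in this message, so `n`, the seed, and the number of groups are all made up), and the timings will of course vary by machine:

```r
set.seed(1)                       # hypothetical data; sizes chosen arbitrarily
n <- 1e5
x <- rnorm(n)
i <- sample.int(n %/% 10, n, replace = TRUE)

# Time only the sorting, reordering and boundary construction
prep_time <- system.time({
  idx <- order(i)
  xs  <- x[idx]
  is  <- i[idx]
  brk <- which(diff(is) > 0)
  startidx <- c(1, brk + 1)
  endidx   <- c(brk, length(xs))
})
print(prep_time)
```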
For the case of sum (or averages), you can vectorize this using cumsum as follows. This won't work for median or max.
f2 <- function(x, startidx, endidx) {
+ cum <- cumsum(x)
+ res <- cum[endidx]
+ res[2:length(res)] <- res[2:length(res)] - cum[endidx[1:(length(res) - 1)]]
+ res
+ }
unix.time(res2 <- f2(xs, startidx, endidx))
[1] 0.20 0.00 0.21 NA NA
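The cumsum idea can be checked by hand on a few values. The sketch below uses toy data (hypothetical, not the timed vectors) and a slightly more compact `diff`-based formulation, which is an alternative spelling of the same subtraction rather than the code posted above. Group means follow by dividing by the group sizes:

```r
# Sorted values and the last position of each group (toy data, hypothetical)
xs     <- c(1, 3, 5, 4, 2, 6)
endidx <- c(2L, 4L, 6L)

cum  <- cumsum(xs)                 # running total: 1 4 9 13 15 21
sums <- diff(c(0, cum[endidx]))    # running total at each group end minus the previous one

sizes <- diff(c(0L, endidx))       # number of elements in each group
means <- sums / sizes              # per-group averages from the same running total
```

One caveat worth keeping in mind: because every group sum is a difference of two running totals, floating-point rounding can accumulate over very long vectors, so results may differ slightly from summing each group directly.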
Yes, that is quite a fast way to handle sums.
You can also use Luke Tierney's byte compiler (http://www.stat.uiowa.edu/~luke/R/compiler/) to speed up the loop for functions where you can't vectorize:
library(compiler)
f3 <- cmpfun(f1)
Note: local functions used: FUN
unix.time(res3 <- f3(xs, startidx, endidx))
[1] 3.84 0.00 3.91 NA NA
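For readers trying this today: a `compiler` package with `cmpfun` is bundled with current R releases, so a separate download should not be needed. The sketch below is a self-contained variant of the loop (the name `group_apply` and the toy data are made up for illustration) that checks the compiled copy returns the same values as the original. Note that recent R versions byte-compile functions just in time by default, so the gap between the two may be smaller than the timings quoted above.

```r
library(compiler)   # bundled with current R distributions

# Loop-based grouped reduction that takes everything it needs as arguments
group_apply <- function(x, startidx, endidx, FUN = sum) {
  res <- numeric(length(startidx))
  for (j in seq_along(startidx)) {
    res[j] <- FUN(x[startidx[j]:endidx[j]])
  }
  res
}

group_apply_cmp <- cmpfun(group_apply)   # byte-compiled copy of the same function

# Toy data (hypothetical): three contiguous groups of two values each
xs       <- c(1, 3, 5, 4, 2, 6)
startidx <- c(1, 3, 5)
endidx   <- c(2, 4, 6)

# Compilation should change speed only, never the result
stopifnot(identical(group_apply(xs, startidx, endidx, max),
                    group_apply_cmp(xs, startidx, endidx, max)))
```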
That looks interesting. Does it only work for specific operating systems and processors? I will give it a try.
Thanks,
Kevin