
SLOW split() function

thank you, everyone.  this was very helpful to my specific task and
understanding.  for the benefit of future googlers, I thought I would
post some experiments and results here.

ultimately, I need to do a by() on an irregular matrix, and I now know
how to speed up by() on a single core, and then again on a multi-core
machine.

library(data.table)
N <- 1000*1000
d <- data.table(data.frame(key = as.integer(runif(N, min=1, max=N/10)),
                           x = rnorm(N), y = rnorm(N)))  # irregular group sizes
setkey(d, "key"); gc() ## sort by key and force a garbage collection


cat("N=", N, ".  Size of d=", object.size(d)/1024/1024, "MB\n")

cat("\nStandard by() Function:\n")
print(system.time( all.1 <- by( d, d$key, function(d) coef(lm(y ~ x, data=d)))))


cat("\n\nPreSplit Function [aka Jim H]\n\t(a) Splitting Operation:\n")
print(system.time(si <- split(seq(nrow(d)), d$key)))
cat("\n\t(b) Regressions:\n")
print(system.time(all.2 <- lapply(si, function(.indx) {
  coef(lm(d$y[.indx] ~ d$x[.indx])) })))
print(system.time(all.2b <- lapply(si, function(.indx) {
  coef(lm(y ~ x, data=d[.indx,])) })))

cat("\n\nNaive Split Data Frame\n\t(a) Splitting Operation:\n")
print(system.time(ds <- split(d, d$key)))
cat("\n\t(b) Regressions:\n")
print(system.time(all.3a <- lapply(ds, function(ds) {
  coef(lm(ds$y ~ ds$x)) })))
print(system.time(all.3b <- lapply(ds, function(ds) {
  coef(lm(y ~ x, data=ds)) })))

the first and the last approaches (all.1 and all.3) are the "naive" ways
of doing this, and each takes about 400-500 seconds on a MacBook Air with
a Core i5. Jim's suggestion (all.2) cuts this roughly in half, because
splitting the row indices instead of the data frame takes almost no time.
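
for anyone who wants to convince themselves on a small example first, here is a minimal sketch (plain data.frame, toy sizes, made-up names like d.small; nothing here was timed) showing that splitting the row indices gives the same coefficients as by():

```r
set.seed(1)
n <- 2000
# toy data, much smaller than the benchmark above
d.small <- data.frame(key = sample.int(50, n, replace = TRUE),
                      x = rnorm(n), y = rnorm(n))

# split the row numbers by key, not the data frame itself
si.small <- split(seq_len(nrow(d.small)), d.small$key)

# one regression per group, subsetting rows only when needed
fit.idx <- lapply(si.small, function(.indx) {
  coef(lm(y ~ x, data = d.small[.indx, ]))
})

# reference result from the standard by() call
fit.by <- by(d.small, d.small$key, function(g) coef(lm(y ~ x, data = g)))
```

both results are ordered by key, so fit.idx[[1]] and fit.by[[1]] refer to the same group.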

and now,

library(multicore)
print(system.time(all.4 <- mclapply(si, function(.indx) {
  coef(lm(y ~ x, data=d[.indx,])) })))

on my dual-core (quad-thread) i5, all four pseudo cores become busy,
and the time roughly halves again from 230 seconds to 120 seconds.
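
a note for later readers: mclapply() is also available in the parallel package that ships with R (2.14.0 and later), so the same idea can be sketched without the multicore package. the toy data, the helper name fit.one, and the choice of 2 cores below are my assumptions for illustration, not part of the benchmark above:

```r
library(parallel)

set.seed(1)
n <- 2000
d.par <- data.frame(key = sample.int(50, n, replace = TRUE),
                    x = rnorm(n), y = rnorm(n))
si.par <- split(seq_len(nrow(d.par)), d.par$key)

# hypothetical helper: regression on the rows of one group
fit.one <- function(.indx) coef(lm(y ~ x, data = d.par[.indx, ]))

# forking is not available on Windows, so fall back to 1 core there
n.cores <- if (.Platform$OS.type == "windows") 1L else 2L

all.par <- mclapply(si.par, fit.one, mc.cores = n.cores)
all.seq <- lapply(si.par, fit.one)  # sequential reference
```

the parallel and sequential lists should agree element by element; only the wall-clock time differs.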


maybe the by() function should use Jim's approach, and multicore
should provide mcby().  of course, now that I know how to do this fast
by hand, this is not so important for me.  but it may help some
other novices.
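
to make the idea concrete, here is a rough sketch of what such an mcby() could look like. this function is hypothetical (it does not exist in multicore or parallel as far as I know), it only handles a single grouping vector, and it returns a plain list rather than a "by" object:

```r
library(parallel)

# hypothetical helper, not part of any package:
# split row indices by INDICES, then apply FUN to each row subset
mcby <- function(data, INDICES, FUN, ..., mc.cores = 1L) {
  si <- split(seq_len(nrow(data)), INDICES)
  mclapply(si, function(.indx) FUN(data[.indx, , drop = FALSE], ...),
           mc.cores = mc.cores)
}

# toy usage; mc.cores = 1L runs everywhere, raise it on unix-alikes
set.seed(1)
dd <- data.frame(key = sample.int(20, 1000, replace = TRUE),
                 x = rnorm(1000), y = rnorm(1000))
res <- mcby(dd, dd$key, function(g) coef(lm(y ~ x, data = g)))
ref <- by(dd, dd$key, function(g) coef(lm(y ~ x, data = g)))
```

res is a named list with one coefficient vector per key, in the same order as the groups in ref.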

thanks again everybody.

regards,

/iaw

----
Ivo Welch (ivo.welch at gmail.com)
On Mon, Oct 10, 2011 at 9:31 PM, William Dunlap <wdunlap at tibco.com> wrote: