-----Original Message-----
From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf Of Jim Holtman
Sent: Monday, October 10, 2011 7:29 PM
To: ivo welch
Cc: r-help
Subject: Re: [R] SLOW split() function
Instead of splitting the entire data frame, split the row indices and then use these to access your data:
try
system.time(s <- split(seq_len(nrow(d)), d$key))
This should be faster and less memory-intensive. You can then use the indices to access each subset:
result <- lapply(s, function(.indx) {
    # replace with your own per-group computation
    sum(d$someCol[.indx])
})
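A minimal, self-contained sketch of the same idea on a small stand-in data frame. The column names `key` and `val` are taken from the example in the original message; the sizes and the use of vapply() are assumptions made only to keep the check quick. It compares per-group sums obtained from split indices against those obtained by splitting the whole data frame:

```r
# Small stand-in for d; 'key' and 'val' mirror the columns in the example.
set.seed(1)
d <- data.frame(key = rep(1:100, each = 50), val = rnorm(100 * 50))

# Split row indices only: each list element is an integer vector of row numbers.
idx <- split(seq_len(nrow(d)), d$key)

# Per-group sum computed through the indices.
sums_idx <- vapply(idx, function(i) sum(d$val[i]), numeric(1))

# Same computation after splitting the whole data frame, for comparison.
sums_df <- vapply(split(d, d$key), function(g) sum(g$val), numeric(1))

# Both approaches should agree.
stopifnot(isTRUE(all.equal(sums_idx, sums_df)))
```

The index list holds only integer vectors, so no per-group data frames are built; the data are touched only when a group is actually used.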
On Oct 10, 2011, at 21:01, ivo welch <ivo.welch at gmail.com> wrote:
dear R experts: apologies for all my speed and memory questions. I
have a bet with my coauthors that I can make R reasonably efficient
through R-appropriate programming techniques. this is not just for
kicks, but for work. for benchmarking, my [3 year old] Mac Pro has
2.8GHz Xeons, 16GB of RAM, and R 2.13.1.
right now, it seems that 'split()' is why I am losing my bet. (split
is an integral component of *apply() and by(), so I need split() to be
fast. its resulting list can then be fed, e.g., to mclapply().) I
made up an example to illustrate my ills:
library(data.table)
N <- 1000
T <- N*10
d <- data.table(data.frame( key= rep(1:T, rep(N,T)), val=rnorm(N*T) ))
setkey(d, "key"); gc() ## force a garbage collection
cat("N=", N, ". Size of d=", object.size(d)/1024/1024, "MB\n")
print(system.time( s<-split(d, d$key) ))
My ordered input data table (or data frame; doesn't make a difference)
is 114MB in size. it takes about a second to create. split() only
needs to reshape it. this simple operation takes almost 5 minutes on
my computer.
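As a rough check on the 114MB figure, here is a back-of-the-envelope sketch. It assumes an integer key column at 4 bytes per element and a double val column at 8 bytes per element, and ignores list and attribute overhead; the names n_per_key and n_keys are stand-ins for N and T above:

```r
n_per_key <- 1000              # N in the example
n_keys    <- n_per_key * 10    # T in the example
n_rows    <- n_per_key * n_keys

# 4 bytes per integer key + 8 bytes per double val (assumed, overhead ignored)
bytes   <- n_rows * (4 + 8)
size_mb <- bytes / 1024 / 1024

cat("rows:", n_rows, " approx. size:", round(size_mb, 1), "MB\n")
```

So the table has 10 million rows in 10,000 groups of 1,000 rows each, and split() has to produce 10,000 separate pieces.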
with a data set that is larger, this explodes further.
am I doing something wrong? is there an alternative to split()?
sincerely,
/iaw
----
Ivo Welch (ivo.welch at gmail.com)