Splitting data.frame into a list of small data.frames given indices

Wed, Jun 29, 2016 2:16 AM

This is the inverse of the problem of merging a list of data.frames
into one large data.frame, just discussed in the
"performance of do.call("rbind")" thread.

I would like to split a data.frame into a list of data.frames
according to the first column.
This SEEMS to be easily possible with the function base::by. However,
as soon as the data.frame has a few million rows this function CAN NOT
BE USED (unless you have PLENTY OF TIME).
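
To make the goal concrete, here is a minimal sketch on a toy data.frame. The data and the column names (id, x, y) are made up for illustration and are not the real data; the by() call has the same shape as the one in the benchmark further down.

```r
# Toy stand-in for the real data (hypothetical values and column names)
toy <- data.frame(id = c(1, 1, 2, 3, 3),
                  x  = c(0.1, 0.2, 0.5, 0.7, 0.9),
                  y  = c(10, 20, 30, 40, 50))

# Group columns 2:3 by the values in column 1 and return each group unchanged
res <- by(toy[, 2:3], INDICES = list(toy[, 1]),
          FUN = function(x) x, simplify = FALSE)

length(res)   # one element per distinct value in column 1
res[["3"]]    # the rows of toy whose first column equals 3
```

Each element of the result is a small data.frame holding the rows for one value of the first column, which is the output wanted for the large data.frame as well.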

For 'by', the runtime grows roughly as nrow^2, i.e. O(n^2) (see benchmark below).

So basically I am looking for a similar function with better complexity.


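 > # peaks is my data.frame (not shown); column 1 holds the grouping index
 > nrows <- c(1e5,1e6,2e6,3e6,5e6)
 > timing <- list()
 > for(i in nrows){
 + dum <- peaks[1:i,]
 + timing[[length(timing)+1]] <- system.time(x <- by(dum[,2:3],
 +     INDICES=list(dum[,1]), FUN=function(x){x}, simplify = FALSE))
 + }
 > names(timing) <- nrows
 > timing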

$`1e+05`
   user  system elapsed
   0.05    0.00    0.05

$`1e+06`
   user  system elapsed
   1.48    2.98    4.46

$`2e+06`
   user  system elapsed
   7.25   11.39   18.65

$`3e+06`
   user  system elapsed
  16.15   25.81   41.99

$`5e+06`
   user  system elapsed
  43.22   74.72  118.09
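
As a rough check of the quadratic claim, the elapsed times printed above can be fit on a log-log scale: if elapsed ~ n^k, the slope of log(elapsed) against log(n) estimates k. This is only a sketch using the five numbers from the output above; dropping the 1e5 point is my own choice, since 0.05 s is too small to measure reliably.

```r
# Elapsed times copied from the benchmark output above
n       <- c(1e5, 1e6, 2e6, 3e6, 5e6)
elapsed <- c(0.05, 4.46, 18.65, 41.99, 118.09)

# Drop the smallest run: 0.05 s is close to the timer resolution
keep <- n >= 1e6

# Slope of the log-log fit estimates the exponent k in elapsed ~ n^k
fit   <- lm(log(elapsed[keep]) ~ log(n[keep]))
slope <- unname(coef(fit)[2])
print(slope)
```

A slope close to 2 is consistent with O(n^2); a function with linear cost would give a slope close to 1.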

Witold Eryk Wolski
