
Splitting data.frame into a list of small data.frames given indices

5 messages · Witold E Wolski, Rolf Turner, Ivan Calandra +1 more

#
It's the inverse problem to merging a list of data.frames into a large
data.frame just discussed in the "performance of do.call("rbind")"
thread

I would like to split a data.frame into a list of data.frames
according to first column.
This seems to be easily possible with the function base::by. However,
as soon as the data.frame has a few million rows, this function cannot
be used (unless you have plenty of time).

For 'by', runtime grows roughly as nrow^2, i.e. O(n^2) (see benchmark below).

So basically I am looking for a similar function with better complexity.


 > nrows <- c(1e5,1e6,2e6,3e6,5e6)
 > timing <- list()
 > for(i in nrows){
+ dum <- peaks[1:i,]
+ timing[[length(timing)+1]] <- system.time(x<- by(dum[,2:3],
+ INDICES=list(dum[,1]), FUN=function(x){x}, simplify = FALSE))
+ }
 > names(timing) <- nrows
 > timing
$`1e+05`
   user  system elapsed
   0.05    0.00    0.05

$`1e+06`
   user  system elapsed
   1.48    2.98    4.46

$`2e+06`
   user  system elapsed
   7.25   11.39   18.65

$`3e+06`
   user  system elapsed
  16.15   25.81   41.99

$`5e+06`
   user  system elapsed
  43.22   74.72  118.09
#
On 29/06/16 21:16, Witold E Wolski wrote:
I'm not sure that I follow what you're doing, and your example is not 
reproducible, since we have no idea what "peaks" is, but on a toy 
example with 5e6 rows in the data frame I got a timing result of

    user  system elapsed
   0.379   0.025   0.406

when I applied split().  Is this adequately fast? Seems to me that if 
you want to split something, split() would be a good place to start.
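
As an illustration, here is a minimal sketch of such a split() call on made-up data. The data frame `toy`, its column names and the choice of 100 distinct ids are assumptions, since "peaks" was not posted.

```r
# Toy stand-in for the unposted `peaks` data (an assumption):
# an id column followed by two value columns.
set.seed(1)
n <- 1e5
toy <- data.frame(id = sample(1:100, n, replace = TRUE),
                  x  = runif(n, 1, 1000),
                  y  = runif(n, 1, 1000))

# Split columns 2:3 into a list of data.frames,
# one per distinct value of the first column.
pieces <- split(toy[, 2:3], toy[[1]])

length(pieces)                           # number of groups
sum(sapply(pieces, nrow)) == nrow(toy)   # every row ends up in exactly one piece
```

The list is named by the values of the first column, so a single group can be pulled out with pieces[["17"]].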

cheers,

Rolf Turner
#
Hi,

Here is a complete example which shows that the complexity of split or
by is O(n^2)

nrows <- c(1e3,5e3, 1e4 ,5e4, 1e5 ,2e5)
res<-list()

for(i in nrows){
  dum <- data.frame(x = runif(i,1,1000), y=runif(i,1,1000))
  res[[length(res)+1]]<-(system.time(x<- split(dum, 1:nrow(dum))))
}
res <- do.call("rbind",res)
plot(nrows^2, res[,"elapsed"])

And I can't see a reason why this has to be so slow.


cheers
On 29 June 2016 at 12:00, Rolf Turner <r.turner at auckland.ac.nz> wrote:

  
    
#
Hi,

I don't really understand why you split every row... This makes it very 
slow. Try with a more realistic example (with a factor to split).
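
A sketch of what such a benchmark could look like: the same loop as in the previous message, but splitting on a factor with a fixed number of levels instead of on 1:nrow(dum). The 100 levels are an arbitrary choice for illustration, and no timings are claimed here.

```r
# Same row counts as in the earlier example, but the number of groups
# stays fixed at (at most) 100 while the number of rows grows.
nrows <- c(1e3, 5e3, 1e4, 5e4, 1e5, 2e5)
res <- list()

for (i in nrows) {
  dum <- data.frame(x = runif(i, 1, 1000), y = runif(i, 1, 1000))
  g <- factor(sample(1:100, i, replace = TRUE))  # grouping factor (assumed)
  res[[length(res) + 1]] <- system.time(x <- split(dum, g))
}

res <- do.call("rbind", res)
rownames(res) <- nrows
res[, "elapsed"]   # compare against the one-group-per-row timings
```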

Ivan

--
Ivan Calandra, PhD
Scientific Mediator
University of Reims Champagne-Ardenne
GEGENAA - EA 3795
CREA - 2 esplanade Roland Garros
51100 Reims, France
+33(0)3 26 77 36 89
ivan.calandra at univ-reims.fr
--
https://www.researchgate.net/profile/Ivan_Calandra
https://publons.com/author/705639/

On 29/06/2016 at 15:21, Witold E Wolski wrote:
#
I won't go into why splitting data.frames (or factors) uses time
proportional to the number of input rows times the number of
levels in the splitting factor, but you will get much better mileage
if you call split individually on each 'atomic' (numeric, character, ...)
variable and use mapply on the resulting lists.
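
A minimal sketch of that idea follows. The helper name split_by_column is hypothetical (it is not from any package), and the toy data is made up. Unlike split(), this version does not keep the original row names.

```r
# Hypothetical helper: split every column separately by the grouping
# variable, then rebuild one small data.frame per group with mapply.
split_by_column <- function(df, g) {
  g <- as.factor(g)
  cols <- lapply(df, split, g)   # per column: a list of per-group vectors
  build <- function(...) data.frame(..., stringsAsFactors = FALSE)
  # mapply walks the per-column lists in parallel, one group at a time
  do.call(mapply, c(list(FUN = build), cols, list(SIMPLIFY = FALSE)))
}

# Made-up example data
set.seed(42)
df <- data.frame(x = runif(50), y = runif(50))
g  <- sample(letters[1:5], 50, replace = TRUE)

out <- split_by_column(df, g)
names(out)    # one element per level of g
out[["a"]]    # the rows belonging to group "a"
```

Whether this beats split() on the whole data.frame will depend on the number of rows and levels, so it is worth timing on your own data.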

The plyr and dplyr packages were developed to deal with this
sort of problem.  Check them out.


Bill Dunlap
TIBCO Software
wdunlap tibco.com
On Wed, Jun 29, 2016 at 6:21 AM, Witold E Wolski <wewolski at gmail.com> wrote: