
aggregate / collapse big data frame efficiently

According to the way you have used 'aggregate', you are taking the
column means.  A couple of suggestions for faster processing:


1. use matrices instead of data.frames (I converted your example just
before using it)
2. use 'colMeans'

I created a 120 x 100000 matrix with 10 levels, and it does the
computation in less than 2 seconds:
int [1:120, 1:100000] 111 13 106 61 16 39 25 94 53 38 ...
 - attr(*, "dimnames")=List of 2
  ..$ : NULL
  ..$ : chr [1:100000] "col1" "col2" "col3" "col4" ...
> system.time({
+ # split the indices of rows for each level
+ x <- split(seq(nrow(df)), df$levels)
+ result <- sapply(x, function(a) colMeans(df.m[a, ]))
+ })
   user  system elapsed
   1.33    0.00    1.35
num [1:100000, 1:10] 57 57 57 57 57 57 57 57 57 57 ...
 - attr(*, "dimnames")=List of 2
  ..$ : chr [1:100000] "col1" "col2" "col3" "col4" ...
  ..$ : chr [1:10] "1" "2" "3" "4" ...
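The approach above (coerce the numeric columns to a matrix, split the row
indices by level, then take colMeans within each group) can be sketched
end-to-end at a small scale.  The object names (df, df.m) follow the
snippet above, but the sizes here are reduced for illustration:

```r
# small-scale sketch of the matrix + split + colMeans approach
set.seed(1)
n.rows <- 120; n.cols <- 5; n.levels <- 10

# a data.frame with one grouping column plus numeric columns
df <- data.frame(levels = rep(1:n.levels, length.out = n.rows),
                 matrix(rnorm(n.rows * n.cols), n.rows, n.cols,
                        dimnames = list(NULL, paste0("col", 1:n.cols))))

# convert the numeric part to a matrix before the computation
df.m <- as.matrix(df[, -1])

# split the row indices by level, then column means within each group
x <- split(seq(nrow(df)), df$levels)
result <- sapply(x, function(a) colMeans(df.m[a, ]))

str(result)  # an n.cols x n.levels matrix of per-level column means
```

The gain over aggregate comes from doing the arithmetic once per group
on a numeric matrix (colMeans is implemented in C) rather than
dispatching a generic mean over each column of a data.frame subset.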
On Tue, Dec 25, 2012 at 11:34 AM, Martin Batholdy
<batholdy at googlemail.com> wrote: