Skip to content
Prev 314185 / 398506 Next

aggregate / collapse big data frame efficiently

Hi,

Jim's method was found to be faster than data.table()

n <- 10000
?nLevels <- 10
?nRows <- 120
?Cols <- list(rep(list(sample(nRows)), n))
?df <- data.frame(levels = sample(nLevels, nRows, TRUE), Cols)
?colnames(df)[-1] <- paste0('col', 1:n) 
?# convert to matrix for faster processing
?df.m <- as.matrix(df[, -1])? # remove levels column 
?system.time({
?# split the indices of rows for each level
?x <- split(seq(nrow(df)), df$levels)
?result <- sapply(x, function(a) colMeans(df.m[a, ]))
?}) 
?# user? system elapsed 
# 0.056?? 0.000?? 0.056?? 


library(data.table)
df.dt<-data.table(df)
setkey(df.dt,levels)
?system.time({ result1<- df.dt[,lapply(.SD,mean),by=levels]})
#? user? system elapsed 
#? 7.756?? 0.000?? 7.771 
?system.time({result2<-df.dt[,list(Mean=colMeans(.SD)),by=levels]})
# user? system elapsed 
?# 2.188?? 0.000?? 2.193? 


A.K.



----- Original Message -----
From: jim holtman <jholtman at gmail.com>
To: Martin Batholdy <batholdy at googlemail.com>
Cc: "r-help at r-project.org" <r-help at r-project.org>
Sent: Tuesday, December 25, 2012 1:20 PM
Subject: Re: [R] aggregate / collapse big data frame efficiently

According to the way that you have used 'aggregate', you are taking
the column means.? Couple of suggestions for faster processing:


1. use matrices instead of data.frames ( i converted your example just
before using it)
2, use the 'colMeans'

I created a 120 x 100000 matrix with 10 levels and its does the
computation in less than 2 seconds:
int [1:120, 1:100000] 111 13 106 61 16 39 25 94 53 38 ...
- attr(*, "dimnames")=List of 2
? ..$ : NULL
? ..$ : chr [1:100000] "col1" "col2" "col3" "col4" ...
+ # split the indices of rows for each level
+ x <- split(seq(nrow(df)), df$levels)
+ result <- sapply(x, function(a) colMeans(df.m[a, ]))
+ })
?  user? system elapsed
?  1.33? ? 0.00? ? 1.35
num [1:100000, 1:10] 57 57 57 57 57 57 57 57 57 57 ...
- attr(*, "dimnames")=List of 2
? ..$ : chr [1:100000] "col1" "col2" "col3" "col4" ...
? ..$ : chr [1:10] "1" "2" "3" "4" ...
On Tue, Dec 25, 2012 at 11:34 AM, Martin Batholdy
<batholdy at googlemail.com> wrote: