aggregate / collapse big data frame efficiently
Hi,
Jim's method turned out to be much faster than data.table() on this example:
n <- 10000
nLevels <- 10
nRows <- 120
Cols <- list(rep(list(sample(nRows)), n))
df <- data.frame(levels = sample(nLevels, nRows, TRUE), Cols)
colnames(df)[-1] <- paste0('col', 1:n)

# convert to matrix for faster processing
df.m <- as.matrix(df[, -1])  # remove levels column

system.time({
  # split the indices of rows for each level
  x <- split(seq(nrow(df)), df$levels)
  result <- sapply(x, function(a) colMeans(df.m[a, ]))
})
#   user  system elapsed
#  0.056   0.000   0.056
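If only per-level column means are needed, base R's rowsum() is another option that works directly on the matrix; a minimal sketch (my addition, not timed here, reusing df.m and df$levels from above):

# group sums per level via rowsum(), then divide by group sizes to get means
sums   <- rowsum(df.m, group = df$levels)   # one row per level, one column per variable
counts <- as.vector(table(df$levels))       # number of rows in each level
means  <- sums / counts                     # per-level column means
# note: this is the transpose of 'result' above (levels in rows rather than columns)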
library(data.table)
df.dt <- data.table(df)
setkey(df.dt, levels)

system.time({ result1 <- df.dt[, lapply(.SD, mean), by = levels] })
#   user  system elapsed
#  7.756   0.000   7.771

system.time({ result2 <- df.dt[, list(Mean = colMeans(.SD)), by = levels] })
#   user  system elapsed
#  2.188   0.000   2.193
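Note that result2 stacks all the column means into a single Mean column per level; if a wide result like result1 (one column per variable) is wanted from the colMeans route, something along these lines should also work (my addition, not timed):

# spread the named vector returned by colMeans() into one column per variable
result3 <- df.dt[, as.list(colMeans(.SD)), by = levels]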
A.K.
----- Original Message -----
From: jim holtman <jholtman at gmail.com>
To: Martin Batholdy <batholdy at googlemail.com>
Cc: "r-help at r-project.org" <r-help at r-project.org>
Sent: Tuesday, December 25, 2012 1:20 PM
Subject: Re: [R] aggregate / collapse big data frame efficiently
According to the way that you have used 'aggregate', you are taking the column means. A couple of suggestions for faster processing:

1. use matrices instead of data.frames (I converted your example just before using it)
2. use 'colMeans'

I created a 120 x 100000 matrix with 10 levels, and it does the computation in less than 2 seconds:
n <- 100000
nLevels <- 10
nRows <- 120
Cols <- list(rep(list(sample(nRows)), n))
df <- data.frame(levels = sample(nLevels, nRows, TRUE), Cols)
colnames(df)[-1] <- paste0('col', 1:n)
# convert to matrix for faster processing
df.m <- as.matrix(df[, -1])  # remove levels column
str(df.m)
int [1:120, 1:100000] 111 13 106 61 16 39 25 94 53 38 ...
 - attr(*, "dimnames")=List of 2
  ..$ : NULL
  ..$ : chr [1:100000] "col1" "col2" "col3" "col4" ...
system.time({
+   # split the indices of rows for each level
+   x <- split(seq(nrow(df)), df$levels)
+   result <- sapply(x, function(a) colMeans(df.m[a, ]))
+ })
   user  system elapsed
   1.33    0.00    1.35
str(result)
num [1:100000, 1:10] 57 57 57 57 57 57 57 57 57 57 ...
 - attr(*, "dimnames")=List of 2
  ..$ : chr [1:100000] "col1" "col2" "col3" "col4" ...
  ..$ : chr [1:10] "1" "2" "3" "4" ...
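As a quick sanity check (my addition, not part of the original run), the split/colMeans result can be compared with aggregate() on just a few columns, where aggregate() is still cheap:

# compare the first five columns of 'result' against aggregate()
agg <- aggregate(df[, 2:6], list(levels = df$levels), mean)
all.equal(as.matrix(agg[, -1]), t(result)[, 1:5], check.attributes = FALSE)
# expected to return TRUE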
On Tue, Dec 25, 2012 at 11:34 AM, Martin Batholdy
<batholdy at googlemail.com> wrote:
Hi,

I need to aggregate rows of a data.frame by computing the mean for rows with the same factor level on one factor variable; here is the sample code:

x <- data.frame(rep(letters, 2), rnorm(52), rnorm(52), rnorm(52))
aggregate(x, list(x[, 1]), mean)

Now my problem is that the actual data set is much bigger (120 rows and approximately 100,000 columns), and it takes very, very long (at some point I just stopped it).

Is there anything that can be done to make the aggregate routine more efficient? Or is there a different approach that would work faster?

Thanks for any suggestions!
Jim Holtman
Data Munger Guru

What is the problem that you are trying to solve?
Tell me what you want to do, not how you want to do it.