Efficient cbind of elements from two lists
-----Original Message-----
From: r-help-bounces at r-project.org
[mailto:r-help-bounces at r-project.org] On Behalf Of Stephan Dlugosz
Sent: Thursday, November 19, 2009 7:03 AM
To: r-help at r-project.org
Subject: [R] Efficient cbind of elements from two lists
Hi!
I have a data.frame "data" and have split it:
data <- split(data, data[,1])
This is quite a slow procedure, and I do not want to do it again, so
unsplitting and "resplitting" is not an option for me.
However, I have to cbind "variables" to the split data from another list
that consists of vectors with matching sizes, so
for (i in seq_along(data)) {
  data[[i]] <- cbind(data[[i]], l[[i]])
}
works well, but very, very slowly.
The lapply solution:
data <- lapply(1:k, function(i) cbind(data[[i]], l[[i]]))
does not improve the situation, but allows for mclapply from the
multicore package...
Is there a more efficient way to combine elements from two lists?
Can you restructure your analysis so you don't need
to split the data.frame itself? I'm assuming the split
was slow because there are a lot of groups. Splitting
a data.frame into lots of pieces is considerably slower
than splitting a few numeric or character columns in it.
> df <- data.frame(group=rep(1:1e5, each=2), score=1:2e5)
> system.time(split(df, df$group)) # split entire data.frame into 1e5 parts
   user  system elapsed
 117.32   38.42  154.34
> system.time(split(df$score, df$group)) # split 2nd column into 1e5 parts
   user  system elapsed
   0.43    0.03    0.46
If R does things the way S+ does, this is because splitting
simple vectors is done in C code, but splitting data.frames
invokes the S-language [.data.frame function, which is
relatively slow when selecting rows from a data.frame.
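One way to act on this is to split only the row indices, which are a simple integer vector, and pull rows out of the data.frame when a group is actually needed. A minimal sketch, assuming a smaller data set than the one timed above; the names idx and g3 are made up for illustration:

```r
# Smaller illustrative data than in the timing above (assumed sizes)
df <- data.frame(group = rep(1:1000, each = 2), score = 1:2000)

# Split the row numbers instead of the data.frame itself;
# this is a split of a plain integer vector
idx <- split(seq_len(nrow(df)), df$group)

# Extract the rows of a single group only when they are needed
g3 <- df[idx[["3"]], ]
print(g3)
```

The list idx can be kept and reused, so the data.frame never has to be split or reassembled.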
I'd suggest using ave() (or a function from the plyr package),
working on columns from your data.frame and adding ave's
output as a column in your big data.frame. E.g., to compute
the average score in each group:
> system.time(df$meanScore <- ave(df$score, df$group, FUN=mean))
   user  system elapsed
   3.37    0.00    3.50
> df[1:6,]
  group score meanScore
1     1     1       1.5
2     1     2       1.5
3     2     3       3.5
4     2     4       3.5
5     3     5       5.5
6     3     6       5.5
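For the situation in the original question, where the per-group vectors already exist in a list, one possible route that avoids splitting the data.frame is to put them back into row order with unsplit() and add the result as a column. A minimal sketch, assuming the list has one vector per group, ordered like the sorted group values, with lengths equal to the group sizes; the data and the column name extra are made up for illustration:

```r
# Tiny illustrative data: three groups of two rows each
df <- data.frame(group = rep(1:3, each = 2), score = 1:6)

# Hypothetical per-group vectors, one list element per group,
# in the same order as the sorted group values
l <- list(c(10, 11), c(20, 21), c(30, 31))

# unsplit() reverses a split of a vector: it places each group's
# values at the row positions that belong to that group
df$extra <- unsplit(l, df$group)
print(df)
```

Whether this is fast enough for a very large number of groups would need to be timed on the real data.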
Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com
Thank you very much! Greetings, Stephan