-----Original Message-----
From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf Of William Dunlap
Sent: Wednesday, August 10, 2011 10:05 AM
To: Duncan Murdoch; Frederic F
Cc: r-help at r-project.org
Subject: Re: [R] How to quickly convert a data.frame into a structure of lists
I was going to suggest
> AB <- df[c("A","B")]
> ls2 <- array(split(df$C, AB), dim=sapply(AB, nlevels), dimnames=lapply(AB, levels))
which produces a matrix very similar to what Duncan's by() call produces
> ls1 <- by(df$C, df[,1:2], identity)
> ls1[["a","Y"]] # by assigns NULL to unoccupied slots
NULL
> ls2[["a","Y"]] # split gives the same type to all slots, copied from input
numeric(0)
They both are quick because they use split() to avoid the repeated
evaluations of
bigVector[ anotherBigVector == scalar ]
that your nested (not imbricated) loops do. If you really need to convert
the matrix to a list of lists, that will probably be a quick transformation.
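As a minimal sketch of that last step (not from the original thread): assuming the matrix is built as above and that A and B are factors, it can be turned into a list of lists by taking one row per level of A. The name nested is hypothetical, used only for illustration.

```r
# Rebuild the example; stringsAsFactors=TRUE so that A and B are factors
df <- data.frame(A=c("a","a","b","b"), B=c("X","X","Y","Z"), C=c(1,2,3,4),
                 stringsAsFactors=TRUE)
AB <- df[c("A","B")]
ls2 <- array(split(df$C, AB), dim=sapply(AB, nlevels), dimnames=lapply(AB, levels))

# One outer element per level of A; each row of the list-matrix becomes
# an inner list with one element per level of B
nested <- lapply(rownames(ls2), function(a) ls2[a, ])
names(nested) <- rownames(ls2)

nested$a$X
# [1] 1 2
nested$b$Z
# [1] 4
```

Unoccupied combinations such as nested$a$Y stay as numeric(0), because they come from split() rather than by().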
Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com
-----Original Message-----
From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf Of Duncan Murdoch
Sent: Wednesday, August 10, 2011 9:43 AM
To: Frederic F
Cc: r-help at r-project.org
Subject: Re: [R] How to quickly convert a data.frame into a structure of lists
On 10/08/2011 10:30 AM, Frederic F wrote:
Hello Duncan,
Here is a small example to illustrate what I am trying to do.
# Example data.frame
# A and B must be factors for levels() to work below
df <- data.frame(A=c("a","a","b","b"), B=c("X","X","Y","Z"), C=c(1,2,3,4), stringsAsFactors=TRUE)
# A B C
# 1 a X 1
# 2 a X 2
# 3 b Y 3
# 4 b Z 4
### First way of getting the list structure (ls1) using imbricated lapply loops:
# Get the structure and populate it:
ls1 <- lapply(levels(df$A), function(levelA) {
  lapply(levels(df$B), function(levelB) {
    df$C[df$A == levelA & df$B == levelB]
  })
})
# Apply the names:
names(ls1) <- levels(df$A)
for (i in seq_along(ls1)) {
  names(ls1[[i]]) <- levels(df$B)
}
# Result:
ls1$a$X
# [1] 1 2
ls1$b$Z
# [1] 4
The data.frame will always be 'complete', i.e., every row will have a value
in every column.
I want to produce a structure like this one quickly (I am aiming for something
below 10 seconds) for a dataset containing between 1 and 2 million rows.
I don't know what the timing would be like for your real data, but this
does look like by() would work:
ls1 <- by(df$C, df[,1:2], identity)
When I repeat the rows of df a million times each, this finishes in a
few seconds. It would definitely be slower if there were more levels of
A or B.
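A rough way to check this on your own machine is sketched below. It is not from the original thread: n and big are names chosen for illustration, n is kept well below a million so the example runs quickly, and the timing will depend on your hardware and on how many levels A and B have.

```r
df <- data.frame(A=c("a","a","b","b"), B=c("X","X","Y","Z"), C=c(1,2,3,4),
                 stringsAsFactors=TRUE)

n <- 1e5                                      # assumed repeat count; raise it to approach the real size
big <- df[rep(seq_len(nrow(df)), each=n), ]   # each row of df repeated n times

# Time the by() call on the enlarged data
system.time(res <- by(big$C, big[,1:2], identity))

# Rows 1 and 2 of df both fall in the (a, X) cell, so it holds 2 * n values
length(res[["a","X"]])
```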
Now ls1 will be a matrix whose entries are the subsets of C that you
want, so you can see your two results with slightly different syntax: