How to quickly convert a data.frame into a structure of lists

10 messages · Frederic F, Duncan Mackay, Duncan Murdoch +3 more

#
Hello,

This is my first project in R, so I'm trying to work 'the R way', but it
still feels awkward sometimes.

The problem that I'm facing right now is that I need to convert a data.frame
into a structure of lists. The data.frame has on the order of tens of columns
(I need to focus on only three of them) and on the order of millions of rows,
so it's quite a big dataset.
Let's say that the columns of interest are A, B and C. I need to take the
data.frame and construct a structure of lists where I have a list for every
level of A, each of those lists contains a list for every level of B, and the
'b-lists' contain all the values of C that match the corresponding levels
of A and B.
So, I should be able to write something like this:
> MyData@list_structure$x_level_of_A$y_level_of_B
and get a vector of the values of C that were on rows where A=x_level_of_A
and B=y_level_of_B.

My first attempt was to use two imbricated "lapply" functions running
something like this:

list_structure<-lapply(levels(A), function(x) {
  as.character(x) = lapply( levels(B), function(y) {
    as.character(y) = C[A==x & B==y]
  })
})

The real code was not quite as simple, but I managed to get it to work, and it
worked well on my first dataset (where A and B had only a few levels). I was
quite happy... but the imbricated loops killed me on a second dataset where
A had several thousand levels. So I tried something else.

My second attempt was to go through every row of the data.frame and append
the value to the appropriate vector. 

I first initialized a structure of lists ending with NULL vectors, then I did
something like this:

for (i in 1:nrow(DF)) {
  eval(
    substitute(
      append(MyData@list_structure$a_value$b_value, c_value),
      list(a_value=as.character(DF$A[i]), b_value=as.character(DF$B[i]),
           c_value=as.character(DF$C[i]))
    )
  )
}

This works... but way too slowly for my purpose. 

I would like to know if there is a better road to take to do this
transformation, or if there is a way of speeding up one of the two solutions
that I have tried.

Thank you very much for your help!

(And in your replies, please remember that this is my first project in R, so
don't hesitate to state the obvious if it seems like I am missing it!)

Frederic

--
View this message in context: http://r.789695.n4.nabble.com/How-to-quickly-convert-a-data-frame-into-a-structure-of-lists-tp3731746p3731746.html
Sent from the R help mailing list archive at Nabble.com.
#
Hi

Something to get you started
?as.list
A data.frame is itself a list of equal-length column vectors, so as.list() returns its columns:

df = data.frame(a=1:2,b=2:1,c=4:5,d=9:10)
as.list(df[,1:3])
$a
[1] 1 2

$b
[1] 2 1

$c
[1] 4 5

see also
http://cran.ms.unimelb.edu.au/doc/contrib/Burns-unwilling_S.pdf

Regards

Duncan


Duncan Mackay
Department of Agronomy and Soil Science
University of New England
ARMIDALE NSW 2351
Email: home mackay at northnet.com.au
#
I would use the tapply function (which is designed for the case in which 
data exists for most pairs of the levels of A and B) or the 
reshape::sparseby function, or something else in the reshape package. 
These won't give you exactly the structure you were asking for, but they 
will separate the data properly.
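
As a rough illustration of the tapply() suggestion, here is a minimal sketch on a made-up data frame. The name `toy` and its contents are assumptions for illustration only; `simplify = FALSE` is used so that every cell stays a vector instead of being collapsed when all groups happen to have length one.

```r
# Toy data (hypothetical); stringsAsFactors = TRUE makes A and B factors
toy <- data.frame(A = c("a", "a", "b", "b"),
                  B = c("X", "X", "Y", "Z"),
                  C = c(1, 2, 3, 4),
                  stringsAsFactors = TRUE)

# One cell per (A, B) pair; simplify = FALSE keeps each cell as a vector
cells <- tapply(toy$C, toy[c("A", "B")], function(x) x, simplify = FALSE)

dim(cells)          # levels of A by levels of B
cells[["a", "X"]]   # values of C where A == "a" and B == "X"
cells[["a", "Y"]]   # NULL: no rows for this pair
```

The result is a list-valued matrix indexed by the level names rather than a list of lists, which is the "not exactly the structure you were asking for" caveat above.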

By the way, it's a good idea when posting a question to post a simple 
example; then other solutions can be illustrated on the same example. 
It doesn't need to contain millions of rows.

Duncan Murdoch
#
To borrow shamelessly from one of the prominent helpers on this list:

"What is the problem you're trying to solve?"    (attribution: Jim Holtman)

I have the sense you want to do something over many subsets of your
data frame. If so, breaking things up into lists of lists of lists is
not necessarily productive, nor may it be necessary to use loops
explicitly, depending on the nature of what you want to do. If you're
more explicit about the nature of your task, it's entirely possible
that there may be a nice 'R way' to do it. Read the posting guide and
if at all possible, provide a small, reproducible example that
demonstrates what you want to accomplish.
(See ?dput to learn how to transmit data by e-mail.)
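
For readers who have not used dput(), a minimal sketch of the round trip follows. The data frame `example_df` is a made-up placeholder, not data from this thread's poster.

```r
# Hypothetical small data frame to share on the list
example_df <- data.frame(A = c("a", "a", "b", "b"),
                         B = c("X", "X", "Y", "Z"),
                         C = c(1, 2, 3, 4),
                         stringsAsFactors = TRUE)

# dput() prints an R expression that rebuilds the object;
# that text is what gets pasted into the e-mail
dput(example_df)

# A reader recreates the object by evaluating the pasted text
txt <- capture.output(dput(example_df))
df_copy <- eval(parse(text = txt))
all.equal(example_df, df_copy)
```

Because the output is plain R code, it survives plain-text mail and preserves column types such as factors.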

HTH,
Dennis
On Tue, Aug 9, 2011 at 5:58 PM, Frederic F <fournier.frederic at gmail.com> wrote:
#
Hello Duncan,  

Here is a small example to illustrate what I am trying to do.

# Example data.frame (stringsAsFactors=TRUE so that A and B are factors)
df=data.frame(A=c("a","a","b","b"), B=c("X","X","Y","Z"), C=c(1,2,3,4),
              stringsAsFactors=TRUE)
#   A B C
# 1 a X 1
# 2 a X 2
# 3 b Y 3
# 4 b Z 4

### First way of getting the list structure (ls1) using imbricated lapply loops:
# Get the structure and populate it:
ls1<-lapply(levels(df$A), function(levelA) {
      lapply(levels(df$B), function(levelB) {df$C[df$A==levelA & df$B==levelB]})
})
# Apply the names:
names(ls1)<-levels(df$A)
for (i in seq_along(ls1)) {names(ls1[[i]])<-levels(df$B)}

# Result:
ls1$a$X
# [1] 1 2
ls1$b$Z
# [1] 4

The data.frame will always be 'complete', i.e., there will be a value in
every row for every column.
I want to produce a structure like this one quickly (I am aiming for something
below 10 seconds) for a dataset containing between 1 and 2 million rows.

I hope that this helps clarify things.

Thanks for your help,

Frederic 

#
On 10/08/2011 10:30 AM, Frederic F wrote:
I don't know what the timing would be like for your real data, but this 
does look like by() would work:

ls1 <- by(df$C, df[,1:2], identity)

When I repeat the rows of df a million times each, this finishes in a 
few seconds.  It would definitely be slower if there were more levels of 
A or B.

Now ls1 will be a matrix whose entries are the subsets of C that you 
want, so you can see your two results with slightly different syntax:

 > ls1[["a", "X"]]
[1] 1 2
 > ls1[["b","Z"]]
[1] 4

Duncan Murdoch
#
Hi Frederic,
shouldn't there be a result for the 3rd row as well, e.g. ls1$b$Y?

Maybe this will do what you want?

dtf<-within(dtf,index<-factor(A:B))
tapply(dtf$C,dtf$index,list)

Hth.

On 10.08.2011 at 16:30, Frederic F wrote:
#
I was going to suggest
  > AB <- df[c("A","B")]
  > ls2 <- array(split(df$C, AB), dim=sapply(AB, nlevels), dimnames=lapply(AB, levels)) # lapply: dimnames must be a list
which produces a matrix very similar to what Duncan's by() call produces
  > ls1 <- by(df$C, df[,1:2], identity)
E.g.,
  > ls2[["a","X"]]
  [1] 1 2
  > ls1[["a","X"]]
  [1] 1 2
  > ls1[["a","Y"]] # by assigns NULL to unoccupied slots
  NULL
  > ls2[["a","Y"]] # split gives the same type to all slots, copied from input
  numeric(0)

They both are quick because they use split() to avoid the repeated
evaluations of
  bigVector[ anotherBigVector == scalar ]
that your nested (not imbricated) loops do.  If you really need to convert
the matrix to a list of lists that will probably be a quick transformation.
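
If the nested `$a$X` form is what is needed, one possible shortcut is to apply split() twice: once on A, then on B within each piece. This is only a sketch on the small example from earlier in the thread, not a method proposed by any poster; the name `nested` is an assumption.

```r
# Small example from the thread; A and B must be factors
df <- data.frame(A = c("a", "a", "b", "b"),
                 B = c("X", "X", "Y", "Z"),
                 C = c(1, 2, 3, 4),
                 stringsAsFactors = TRUE)

# Split the rows by A once, then split C by B inside each piece.
# Each piece keeps all levels of B, so every inner list has the same names.
nested <- lapply(split(df[c("B", "C")], df$A),
                 function(piece) split(piece$C, piece$B))

nested$a$X   # values of C where A == "a" and B == "X"
nested$b$Z   # values of C where A == "b" and B == "Z"
nested$a$Y   # empty numeric vector: no rows for this pair
```

As with array(split()), unoccupied slots come back as zero-length vectors of the input type rather than NULL.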

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com
#
Here is code to transform the matrix that by() or array(split())
produces, along with an example of the speed of the various
approaches.  Using split(), either directly or via by() or tapply(),
saves a lot of time.

f0 <- function(df) {
    # original code with typos fixed.
    list_structure <- lapply(levels(df$A), function(levelA) {
        lapply(levels(df$B), function(levelB) {df$C[df$A==levelA & df$B==levelB]})
    })
    # Apply the names:
    names(list_structure)<-levels(df$A)
    for (i in 1:length(list_structure)) {
        names(list_structure[[i]])<-levels(df$B)
    }
    list_structure
}

f0a <- function(df) {
    # slightly faster version of f0, taking repeated
    # calculations out of loops.
    A <- df$A
    B <- df$B
    C <- df$C
    levelsA <- structure(levels(A), names=levels(A))
    levelsB <- structure(levels(B), names=levels(B))
    lapply(levelsA, function(levelA) {
            tmpA <- A == levelA # this is responsible for most of the savings
            lapply(levelsB, function(levelB) {C[tmpA & B==levelB]})
    })
}

f1 <- function(df) {
    # DM's code
    by(df$C, df[,1:2], identity)
}

f2 <- function(df) {
    # WD's code
    AB<- df[c("A","B")]
    # lapply (not sapply) so dimnames stays a list even when A and B have equally many levels
    array(split(df$C, AB), dim=sapply(AB, nlevels), dimnames=lapply(AB, levels))
}

matrix2ListOfRows <- function(mat) {
    # convert a matrix to a list of its rows, converting dimnames to names.
    retval <- structure(as.vector(mat), names=rep(colnames(mat), each=nrow(mat)))
    retval <- split(retval, row(mat))
    names(retval) <- rownames(mat)
    retval
}

The test involves 10^5 rows of data with 26 levels for A and 200 for B.
+                  B=factor(sample(r200, size=1e5, replace=TRUE), levels=r200),
+                  C=1:1e5)
   user  system elapsed 
  74.08    2.34   76.60 
   user  system elapsed 
  43.09    0.44   43.73 
[1] TRUE
   user  system elapsed 
   0.09    0.02    0.11 
[1] TRUE
   user  system elapsed 
   0.69    0.00    0.69 
[1] TRUE
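
A test of this shape could be set up roughly as follows. This is a sketch under stated assumptions, not the commands that produced the timings above: the definition of `r200`, the seed, and the names `testDf` and `cells` are all hypothetical, and only a single split() call is timed.

```r
set.seed(42)  # assumed seed, only so the sketch is reproducible

# Hypothetical level names; the thread does not show how r200 was defined
r200 <- sprintf("b%03d", 1:200)

testDf <- data.frame(A = factor(sample(letters, size = 1e5, replace = TRUE), levels = letters),
                     B = factor(sample(r200, size = 1e5, replace = TRUE), levels = r200),
                     C = 1:1e5)

# Time one split() over both factors, the step shared by by() and array(split())
elapsed <- system.time(cells <- split(testDf$C, testDf[c("A", "B")]))[["elapsed"]]

length(cells)        # one cell per (A, B) pair: 26 * 200
sum(lengths(cells))  # every row lands in exactly one cell

# Spot-check one cell against the direct subsetting used in the nested loops
identical(cells[["a.b001"]], testDf$C[testDf$A == "a" & testDf$B == "b001"])
```

Timings will differ by machine and R version, so no elapsed value is quoted here.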


Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com