Fast nested List->data.frame

4 messages · Greg Hirson, Dieter Menne

#
I have very large data sets given in a format similar to d below. Converting
these to a data frame is a bottleneck in my application. My fastest version
is given below, but it looks clumsy to me.

Any ideas?

Dieter

# -----------------------
len = 100000
d = replicate(len, list(pH = 3,marker = TRUE,position = "A"),FALSE)
# Data are given as d

# preallocate vectors with the types the columns should end up with
pH = numeric(len)
marker = logical(len)
position = character(len)

system.time(
{
    for (i in 1:len)
    {
        d1 = d[[i]]
        # Assign to vectors
        pH[i] = d1[[1]]
        marker[i] = d1[[2]]
        position[i] = d1[[3]]
    }
    # combine vectors
    pHAll = data.frame(pH,marker,position)
}
)
#
Dieter,

I'd approach this by first making a matrix, then converting to a data 
frame with appropriate types. I'm sure there is a way to do it with 
structure in one step. Operations on matrices are usually faster than on 
dataframes.


len <- 100000
d <- replicate(len, list(pH = 3, marker = TRUE, position = "A"), FALSE)

toDF <- function(alist){
  # unlist() coerces everything to character, so this is a character matrix
  d.matrix <- matrix(unlist(alist), ncol = 3, byrow = TRUE)
  d.df <- as.data.frame(d.matrix)
  names(d.df) <- c('pH', 'marker', 'position')

  # go through as.character so a factor column yields its values, not its level codes
  d.df$pH <- as.numeric(as.character(d.df$pH))
  d.df$marker <- as.logical(d.df$marker)
  return(d.df)
}

on my system,
system.time(b<-toDF(d))

    user  system elapsed
   0.560   0.033   0.592

and

head(b)

   pH marker position
1  3   TRUE        A
2  3   TRUE        A
3  3   TRUE        A
4  3   TRUE        A
5  3   TRUE        A
6  3   TRUE        A

and

sapply(b, class)

        pH    marker  position
"numeric" "logical"  "factor"
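
A note on factor columns, as a minimal sketch: when a character matrix is converted with factors enabled (the default for as.data.frame before R 4.0.0), calling as.numeric directly on a factor returns the internal level codes rather than the printed values. The vector f below is made up for illustration and is not part of the data in this thread.

```r
# Hypothetical factor, standing in for a column built from character data
f <- factor(c("3", "3", "7.5"))

codes  <- as.numeric(f)                # internal level codes, not the labels
values <- as.numeric(as.character(f))  # go through the labels to recover the numbers

# Equivalent idiom: convert only the levels, then index them by the factor
values2 <- as.numeric(levels(f))[f]

print(codes)
print(values)
```

Converting the levels and indexing is usually a little cheaper on long vectors, because only the distinct labels are parsed.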


I hope this helps,

Greg

sessionInfo()   ##old, I know.
R version 2.9.0 (2009-04-17)
i386-apple-darwin8.11.1

locale:
en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base

other attached packages:
[1] cimis_0.1-3     RLastFM_0.1-4   RCurl_0.98-1    bitops_1.0-4.1  XML_2.5-3
[6] lattice_0.17-22

loaded via a namespace (and not attached):
[1] grid_2.9.0
On 1/4/10 11:43 PM, Dieter Menne wrote:

  
    
#
Greg Hirson wrote:
Well, I knew that matrices are faster for numerics, but I also "knew" that the
required conversion to character would be a show-stopper. That second piece of
wisdom was bogus. Your version is 30% faster on my computer.
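
The conversion to character mentioned here comes from unlist: a list that mixes numeric, logical and character elements is flattened to their common type, which is character, so each matrix column has to be converted back afterwards. A small sketch with a two-row list (the name d_small and its values are made up for illustration):

```r
# Hypothetical two-record input in the same shape as d
d_small <- list(
  list(pH = 3,   marker = TRUE,  position = "A"),
  list(pH = 4.5, marker = FALSE, position = "B")
)

flat <- unlist(d_small)                    # mixed types are coerced to character
m <- matrix(flat, ncol = 3, byrow = TRUE)  # one row per record

# Convert each column back to its intended type
pH       <- as.numeric(m[, 1])
marker   <- as.logical(m[, 2])
position <- m[, 3]

str(m)
```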

Dieter
#
This is better by a factor of 4:

len = 100000
d = replicate(len, list(pH = 3,marker = TRUE,position = "A"),FALSE)
system.time(
{
    pHAll = data.frame(
      pH = unlist(lapply(d,"[[",1)),
      marker = unlist(lapply(d,"[[",2)),
      position = unlist(lapply(d,"[[",3)))
}
)
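
The same column-wise idea can also be written with vapply, which checks that every extracted element has the expected type and length, and which can extract by name instead of by position. This is a sketch, not a benchmarked replacement: vapply needs R 2.10.0 or later, and the helper name list_to_df is made up for illustration.

```r
# Hypothetical helper: build one typed column per field, extracting by name
list_to_df <- function(x) {
  data.frame(
    pH       = vapply(x, "[[", numeric(1), "pH"),
    marker   = vapply(x, "[[", logical(1), "marker"),
    position = vapply(x, "[[", character(1), "position"),
    stringsAsFactors = FALSE  # keep position as character on any R version
  )
}

# Smaller input than in the thread, just to show the result
d <- replicate(1000, list(pH = 3, marker = TRUE, position = "A"), FALSE)
pHAll <- list_to_df(d)
str(pHAll)
```

If a record has a missing or wrongly typed field, vapply stops with an error instead of silently returning a column of a different type.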

Dieter