
fast version of split.data.frame or conversion from data.frame to list of its rows

5 messages · Brian Ripley, Antonio Piccolboni, Simon Urbanek +1 more

#
On 01/05/2012 00:28, Antonio Piccolboni wrote:
Unsurprising when you create three orders of magnitude more data frames, 
is it?  That's a list of 2000 data frames.  Try

system.time(for(i in 1:2000) data.frame(x = i, y = rnorm(1), id = paste0("x", i)))
You need to re-think your data structures: 1-row data frames are not 
sensible.
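
To make the comparison concrete, here is a minimal sketch that times both approaches side by side. The names n, df, rows, t_one and t_many are illustrative and not from the original post. Timings depend on the machine and R version, so none are quoted here.

```r
n <- 2000
set.seed(1)

# One data frame built from whole columns: the constructor runs once.
t_one <- system.time(
  df <- data.frame(x = seq_len(n), y = rnorm(n), id = paste0("x", seq_len(n)))
)

# n one-row data frames: the constructor overhead is paid n times.
t_many <- system.time(
  rows <- lapply(seq_len(n), function(i)
    data.frame(x = i, y = rnorm(1), id = paste0("x", i)))
)

# Elapsed seconds for each approach.
print(c(one_frame = t_one[["elapsed"]], many_frames = t_many[["elapsed"]]))
```

Both variants hold the same 2000 records; only the number of data.frame() calls differs.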

  
    
#
On May 1, 2012, at 1:26 PM, Antonio Piccolboni <antonio at piccolboni.info> wrote:

            
Just think about it -- data frames are lists of *columns* because the type of each column is fixed. Treating them row-wise is extremely inefficient, because you can't use any vector type to represent such a thing (other than a generic vector containing vectors of length 1).
See above - I think you are misunderstanding data frames - t() makes no sense for data frames.
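
A small sketch of the point about columns, using a made-up three-row data frame. The helper row_as_list is hypothetical and only for illustration. It shows that each column carries one type, that a row can only be held as a generic list of length-1 vectors, and that t() has to coerce everything to a single type.

```r
df <- data.frame(x = 1:3, y = c(0.5, 1.5, 2.5), id = c("a", "b", "c"),
                 stringsAsFactors = FALSE)

# A data frame is a list whose elements are the columns.
is.list(df)                        # TRUE
vapply(df, typeof, character(1))   # one fixed type per column

# A row mixes types, so it needs a generic list of length-1 vectors.
# row_as_list is a hypothetical helper, not part of base R.
row_as_list <- function(d, i) lapply(d, `[[`, i)
r2 <- row_as_list(df, 2)
str(r2)

# All rows as plain lists, without creating any one-row data frames.
rows <- lapply(seq_len(nrow(df)), function(i) row_as_list(df, i))

# t() goes through as.matrix(), so the mixed columns are coerced to one type.
typeof(t(df))
```

Extracting elements column by column avoids the data.frame constructor entirely, which is where most of the cost of the one-row approach comes from.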

Cheers,
Simon
2 days later
#
A bit late and possibly tangential. 

The mmap package has something called struct(), which is really a row-wise array of heterogeneous columns.

As Simon and others have pointed out, R has no way to handle this natively, but mmap does provide a very measurable performance gain by orienting rows together in memory (mapped memory, to be specific).  Since it is all "outside of R", so to speak, mmap even supports many non-native types, from bit vectors to 64-bit ints, with the applicable conversion caveats.

example(struct) shows some performance gains with this approach. 

There are even some crude methods to convert data.frames as-is to an mmap struct object directly (hint: as.mmap).

Again, likely not enough to shoehorn into your effort, but worth a look to see if it might be useful, and/or to see the C design underlying it.

Best,
Jeff

Jeffrey Ryan    |    Founder    |    jeffrey.ryan at lemnica.com

www.lemnica.com
On May 1, 2012, at 1:44 PM, Antonio Piccolboni <antonio at piccolboni.info> wrote: