Big Data packages
5 messages · Ashwin Kapur, Daniel Cegiełka, Andrew Piskorski +1 more

Just wondering if anyone has opinions on the various big data packages for R: ff vs. bigmemory vs. anything else. Is anyone working on, or is there already, a package for connecting HDF5 to R for handling huge datasets, as opposed to the hdf5 package, which basically just lets you read everything in an HDF5 file at once? I was also wondering what advantages ff/bigmemory may have over existing C libraries for the same purpose, the most prominent of which is probably HDF5. --Ashwin
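[For illustration only, a minimal sketch of chunk-wise HDF5 reads, assuming the Bioconductor rhdf5 package and a made-up file "big.h5" holding a 2-d dataset "X"; this reads slices on demand rather than loading the whole file the way hdf5load() does.]

  ## Sketch, not from the thread: file name, dataset name and sizes are assumptions.
  library(rhdf5)

  fname <- "big.h5"
  h5ls(fname)                               # inspect datasets and their dimensions

  ## Read only rows 1..1000 (all columns) of /X instead of the whole dataset.
  first_rows <- h5read(fname, "X", index = list(1:1000, NULL))

  ## Process the dataset in row blocks, accumulating column sums.
  nrows <- 1e6                              # assumed total row count of /X
  block <- 1e4
  csums <- NULL
  for (start in seq(1, nrows, by = block)) {
    rows  <- start:min(start + block - 1, nrows)
    chunk <- h5read(fname, "X", index = list(rows, NULL))
    csums <- if (is.null(csums)) colSums(chunk) else csums + colSums(chunk)
  }
  H5close()                                 # release open HDF5 handles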
On Wed, Mar 17, 2010 at 04:26:16PM -0400, Ashwin Kapur wrote:
Just wondering if anyone has opinions on the various big data packages for R, ff vs bigmemory vs anything else. Is anyone working on or is there
I don't really know. However, since both ff and bigmemory are intended for use with giant larger-than-RAM matrices via memory-mapped files on disk, back c. October 2009 I briefly tried out both in order to answer one question: is either package a straightforward drop-in replacement for EXISTING code manipulating large R matrices, in order to reduce R's massive (and probably quite inefficient) memory use in such cases?

The short answer is no, they're not. Neither one even really attempts to work transparently as a matrix in R. Both packages have major quirks and special behaviors which in practice seem to mean that you must write your code specifically for them. These range from smaller things, like is.na() or apply() not working, to conceptually bigger ones, like pass-by-reference rather than the pass-by-value R uses everywhere else. And if you're writing special-case code, then other tools, like RSQLite or perhaps even Metakit, also become options.

Note that I have no particular opinion on how useful ff or bigmemory are in general; I didn't even attempt to figure that out.

And finally, some other out-there technologies to keep an eye on for potential use in massive data manipulation in R (but unlike the packages above, these probably are not usable with R right now):

- If completed, Jean-Claude Wippler's Vlerq might well have been very useful for R, perhaps even as a unification of and upgrade to R's native matrix, array, and data frame data structures. Unfortunately that project is dead. It also sounded in some ways like what Kdb/Q do.

- MonetDB is interesting, but may be too server-like for embedded use from R.

- Alex van Ballegooij's "RAM" Relational Array Mapping extension for MonetDB sounds potentially relevant for R-like use of matrices, but it's not clear whether it actually worked for anything other than his PhD thesis. http://www.cwi.nl/en/2009/1026/New-array-database-technology-for-scientists

- If SciDB gets anywhere, it might end up useful as an out-of-core multi-dimensional matrix back-end for R, even though it is intended more as an RDBMS server than as a lightweight library.
Andrew Piskorski <atp at piskorski.com> http://www.piskorski.com/
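[A minimal sketch, not Andrew's code, of the quirks he describes, assuming bigmemory's filebacked.big.matrix() and ff's ff(); the file names and dimensions below are made up. It shows the pass-by-reference behaviour that distinguishes these objects from ordinary R matrices.]

  library(bigmemory)
  library(ff)

  ## A file-backed big.matrix lives on disk rather than in R's heap.
  x <- filebacked.big.matrix(nrow = 1e6, ncol = 10, type = "double",
                             backingfile = "x.bin", descriptorfile = "x.desc")
  x[, 1] <- rnorm(1e6)

  ## Pass-by-reference: 'y' is another handle to the SAME on-disk data,
  ## so modifying y also modifies x -- unlike ordinary R matrices.
  y <- x
  y[1, 1] <- 42
  x[1, 1]                     # also 42

  ## Many matrix idioms are not transparent: base apply() and is.na() do
  ## not behave as they would on a plain matrix, which is why code tends
  ## to be written specifically for these packages.

  ## ff: the same general idea, a disk-backed array with its own access rules.
  z <- ff(vmode = "double", dim = c(1e6, 10), filename = "z.ff")
  z[1:5, 1]                   # chunk-wise indexing works; whole-object idioms may not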
I think if you are looking for a matrix-like replacement, you should probably look at Jeff Ryan's (author of xts, quantmod, and others) 'indexing' package. It is very 'R-like' in its usage and subsetting, holding the 'index' in memory. It turns out to be faster than bigmemory for most types of access.

- Brian
Brian G. Peterson http://braverock.com/brian/ Ph: 773-459-4973 IM: bgpbraverock