Big Data packages
5 messages · Ashwin Kapur, Daniel Cegiełka, Andrew Piskorski +1 more

Just wondering if anyone has opinions on the various big data packages for R: ff vs. bigmemory vs. anything else. Is anyone working on, or is there already, a package for connecting HDF5 to R for handling huge datasets, as opposed to the hdf5 package, which basically just lets you read everything in an HDF5 file at once? I was also wondering what advantages ff/bigmemory may have over existing C libraries for the same purpose, the most prominent of which is probably HDF5. --Ashwin
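[For illustration only, a minimal sketch of chunk-wise HDF5 reads, assuming the Bioconductor rhdf5 package and a made-up file "big.h5" holding a 2-d dataset "X"; this reads slices on demand rather than loading the whole file the way hdf5load() does.]

  ## Sketch, not from the thread: file name, dataset name and sizes are assumptions.
  library(rhdf5)

  fname <- "big.h5"
  h5ls(fname)                               # inspect datasets and their dimensions

  ## Read only rows 1..1000 (all columns) of /X instead of the whole dataset.
  first_rows <- h5read(fname, "X", index = list(1:1000, NULL))

  ## Process the dataset in row blocks, accumulating column sums.
  nrows <- 1e6                              # assumed total row count of /X
  block <- 1e4
  csums <- NULL
  for (start in seq(1, nrows, by = block)) {
    rows  <- start:min(start + block - 1, nrows)
    chunk <- h5read(fname, "X", index = list(rows, NULL))
    csums <- if (is.null(csums)) colSums(chunk) else csums + colSums(chunk)
  }
  H5close()                                 # release open HDF5 handles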
On Wed, Mar 17, 2010 at 04:26:16PM -0400, Ashwin Kapur wrote:
Just wondering if anyone has opinions on the various big data packages for R, ff vs bigmemory vs anything else. Is anyone working on or is there
I don't really know. However, since both ff and bigmemory are intended for use with giant larger-than-RAM matrices via memory-mapped files on disk, back c. October 2009 I briefly tried out both in order to answer one question: is either package a straightforward drop-in replacement for EXISTING code manipulating large R matrices, in order to reduce R's massive (and probably quite inefficient) memory use in such cases?

The short answer is no, they're not. Neither one even really attempts to work transparently as a matrix in R. Both packages have major quirks and special behaviors which in practice seem to mean that you must write your code specifically for them. These range from smaller things, like is.na() or apply() not working, to conceptually bigger ones, like pass-by-reference rather than the pass-by-value R uses everywhere else. And if you're writing special-case code, then other tools, like RSQLite or perhaps even Metakit, also become options.

Note that I have no particular opinion on how useful ff or bigmemory are in general; I didn't even attempt to figure that out.

And finally, some other out-there technologies to keep an eye on for potential use in massive data manipulation in R (but unlike the packages above, these probably are not usable with R right now):

- If completed, Jean-Claude Wippler's Vlerq might well have been very useful for R, perhaps even as a unification of and upgrade to R's native matrix, array, and data frame data structures. Unfortunately that project is dead. It also sounded in some ways like what Kdb/Q do.

- MonetDB is interesting, but may be too server-like for embedded use from R.

- Alex van Ballegooij's "RAM" Relational Array Mapping extension for MonetDB sounds potentially relevant for R-like use of matrices, but it's not clear whether it actually worked for anything other than his PhD thesis. http://www.cwi.nl/en/2009/1026/New-array-database-technology-for-scientists

- If SciDB gets anywhere, it might end up useful as an out-of-core multi-dimensional matrix back-end for R, even though it is intended more as an RDBMS server than as a lightweight library.
Andrew Piskorski <atp at piskorski.com> http://www.piskorski.com/
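[A minimal sketch, not Andrew's code, of the quirks he describes, assuming bigmemory's filebacked.big.matrix() and ff's ff(); the file names and dimensions below are made up. It shows the pass-by-reference behaviour that distinguishes these objects from ordinary R matrices.]

  library(bigmemory)
  library(ff)

  ## A file-backed big.matrix lives on disk rather than in R's heap.
  x <- filebacked.big.matrix(nrow = 1e6, ncol = 10, type = "double",
                             backingfile = "x.bin", descriptorfile = "x.desc")
  x[, 1] <- rnorm(1e6)

  ## Pass-by-reference: 'y' is another handle to the SAME on-disk data,
  ## so modifying y also modifies x -- unlike ordinary R matrices.
  y <- x
  y[1, 1] <- 42
  x[1, 1]                     # also 42

  ## Many matrix idioms are not transparent: base apply() and is.na() do
  ## not behave as they would on a plain matrix, which is why code tends
  ## to be written specifically for these packages.

  ## ff: the same general idea, a disk-backed array with its own access rules.
  z <- ff(vmode = "double", dim = c(1e6, 10), filename = "z.ff")
  z[1:5, 1]                   # chunk-wise indexing works; whole-object idioms may not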
I think if you are looking for a matrix-like replacement, you should probably look at Jeff Ryan's (author of xts, quantmod, and others) 'indexing' package. It is very 'R-like' in its usage and subsetting, holding the 'index' in memory. It turns out to be faster than bigmemory for most types of access.

- Brian
Brian G. Peterson http://braverock.com/brian/ Ph: 773-459-4973 IM: bgpbraverock