Processing large datasets
Take a look at the High-Performance and Parallel Computing with R CRAN Task View, specifically the section labeled "Large memory and out-of-memory data":

http://cran.us.r-project.org/web/views/HighPerformanceComputing.html

Some R functionality has been implemented to enable out-of-memory operation, but not all of it. I believe that Revolution's commercial version of R has developed 'big data' functionality, but I would defer to them for additional details.

You can of course use a 64-bit version of R on a 64-bit OS to increase accessible RAM; however, there will still be object size limitations, because R uses 32-bit signed integers for indexing into objects. See ?"Memory-limits" for more information.

HTH,

Marc Schwartz
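As a rough illustration, here is a minimal sketch using the ff package (one of the packages listed in that task view section) to keep a large file on disk rather than in RAM; the file name and the "price" column are placeholders, not from Roman's data:

  library(ff)
  ## read the CSV into a disk-backed ffdf; rows stay on disk, not in RAM
  quotes <- read.csv.ffdf(file = "quotes_2011-05-24.csv", header = TRUE)
  nrow(quotes)                 # dimensions are known without loading all rows
  price <- quotes$price[]      # materialize one column as an ordinary vector
  mean(price); sd(price)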
On May 25, 2011, at 8:49 AM, Roman Naumenko wrote:
Thanks Jonathan. I'm already using RMySQL to load data for a couple of days. I wanted to know what the relevant R capabilities are if I want to process much bigger tables.

R always reads the whole set into memory, and this might be a limitation in the case of big tables, correct? Doesn't it use temporary files or something similar to deal with such amounts of data?

As an example, I know that SAS handles sas7bdat files up to 1TB on a box with 76GB of memory without noticeable issues.

--Roman

----- Original Message -----
In cases where I have to parse through large datasets that will not fit into R's memory, I will grab relevant data using SQL and then analyze said data using R. There are several packages designed to do this, like [1] and [2] below, that allow you to query a database using SQL and end up with that data in an R data.frame.
[1] http://cran.cnr.berkeley.edu/web/packages/RMySQL/index.html
[2] http://cran.cnr.berkeley.edu/web/packages/RSQLite/index.html
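A minimal sketch of that pattern with DBI/RMySQL follows; the connection details, table, and column names are invented for illustration:

  library(DBI)
  library(RMySQL)

  ## connection details below are placeholders
  con <- dbConnect(MySQL(), dbname = "market", host = "localhost",
                   user = "user", password = "pass")

  ## do the heavy filtering/aggregation in SQL; only the small
  ## result comes back into R as a data.frame
  daily <- dbGetQuery(con, "
    SELECT trade_date, FLOOR(ms_from_midnight / 60000) AS minute,
           AVG(price) AS mean_price, COUNT(*) AS n
    FROM quotes
    WHERE symbol = 'DELL'
    GROUP BY trade_date, minute")

  dbDisconnect(con)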
On Wed, May 25, 2011 at 12:29 AM, Roman Naumenko <roman at bestroman.com> wrote:
Hi R list,

I'm new to R software, so I'd like to ask about its capabilities. What I'm looking to do is to run some statistical tests on quite big tables which are aggregated quotes from a market feed.

This is a typical set of data. Each day contains millions of records (up to 10 million, unfiltered).

2011-05-24 750 Bid DELL 14130770 400 15.4800 BATS 35482391 Y 1 1 0 0
2011-05-24 904 Bid DELL 14130772 300 15.4800 BATS 35482391 Y 1 0 0 0
2011-05-24 904 Bid DELL 14130773 135 15.4800 BATS 35482391 Y 1 0 0 0

I'll need to filter it first based on some criteria. Since I keep it in a MySQL database, this can be done with a query; not super efficient, I checked that already.

Then I need to aggregate the dataset into different time frames (time is represented in ms from midnight, like 35482391). Again, this can be done with a database query, though I'm not sure which is going to be faster. The aggregated tables are going to be much smaller, like thousands of rows per observation day.

Then I calculate basic statistics: mean, standard deviation, sums, etc. After the stats are calculated, I need to perform some statistical hypothesis tests.

So, my question is: which tool is faster for data aggregation and filtering on big datasets, MySQL or R?

Thanks,
--Roman N.
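For reference, a minimal sketch of the aggregation step in base R, assuming the filtered quotes are already in a data.frame called quotes; the column names (ms, price) are assumptions, not from the data above:

  ## bucket quotes into one-minute bins (ms from midnight -> minute index)
  quotes$minute <- quotes$ms %/% 60000

  ## per-bin mean, standard deviation, and count
  agg <- aggregate(price ~ minute, data = quotes,
                   FUN = function(x) c(mean = mean(x), sd = sd(x), n = length(x)))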