Processing large datasets
Take a look at the High-Performance and Parallel Computing with R CRAN Task View, specifically the section labeled "Large memory and out-of-memory data":

http://cran.us.r-project.org/web/views/HighPerformanceComputing.html

Some R functionality has been implemented to enable out-of-memory operation, but not all of it. I believe that Revolution's commercial version of R has developed 'big data' functionality, but I would defer to them for additional details.

You can of course use a 64-bit version of R on a 64-bit OS to increase accessible RAM; however, there will still be object size limitations, because R uses 32-bit signed integers for indexing into objects. See ?"Memory-limits" for more information.

HTH,

Marc Schwartz
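As a rough illustration, here is a minimal sketch using the ff package (one of the packages listed in that task view section) to keep a large file on disk rather than in RAM; the file name and the "price" column are placeholders, not from Roman's data:

  library(ff)
  ## read the CSV into a disk-backed ffdf; rows stay on disk, not in RAM
  quotes <- read.csv.ffdf(file = "quotes_2011-05-24.csv", header = TRUE)
  nrow(quotes)                 # dimensions are known without loading all rows
  price <- quotes$price[]      # materialize one column as an ordinary vector
  mean(price); sd(price)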
On May 25, 2011, at 8:49 AM, Roman Naumenko wrote:
Thanks Jonathan. I'm already using RMySQL to load data for a couple of days. I wanted to know what the relevant R capabilities are if I want to process much bigger tables.

R always reads the whole set into memory, and this might be a limitation in the case of big tables, correct? Doesn't it use temporary files or something similar to deal with such amounts of data?

As an example, I know that SAS handles sas7bdat files up to 1TB on a box with 76GB of memory without noticeable issues.

--Roman

----- Original Message -----
In cases where I have to parse through large datasets that will not fit into R's memory, I will grab relevant data using SQL and then analyze said data using R. There are several packages designed to do this, like [1] and [2] below, that allow you to query a database using SQL and end up with that data in an R data.frame.
[1] http://cran.cnr.berkeley.edu/web/packages/RMySQL/index.html
[2] http://cran.cnr.berkeley.edu/web/packages/RSQLite/index.html
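A minimal sketch of that pattern with DBI/RMySQL follows; the connection details, table, and column names are invented for illustration:

  library(DBI)
  library(RMySQL)

  ## connection details below are placeholders
  con <- dbConnect(MySQL(), dbname = "market", host = "localhost",
                   user = "user", password = "pass")

  ## do the heavy filtering/aggregation in SQL; only the small
  ## result comes back into R as a data.frame
  daily <- dbGetQuery(con, "
    SELECT trade_date, FLOOR(ms_from_midnight / 60000) AS minute,
           AVG(price) AS mean_price, COUNT(*) AS n
    FROM quotes
    WHERE symbol = 'DELL'
    GROUP BY trade_date, minute")

  dbDisconnect(con)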
On Wed, May 25, 2011 at 12:29 AM, Roman Naumenko <roman at bestroman.com> wrote:
Hi R list,

I'm new to R software, so I'd like to ask about its capabilities. What I'm looking to do is to run some statistical tests on quite big tables which are aggregated quotes from a market feed.

This is a typical set of data. Each day contains millions of records (up to 10 million, unfiltered).

2011-05-24 750 Bid DELL 14130770 400 15.4800 BATS 35482391 Y 1 1 0 0
2011-05-24 904 Bid DELL 14130772 300 15.4800 BATS 35482391 Y 1 0 0 0
2011-05-24 904 Bid DELL 14130773 135 15.4800 BATS 35482391 Y 1 0 0 0

I'll need to filter it first based on some criteria. Since I keep it in a MySQL database, this can be done with a query; not super efficient, I checked that already.

Then I need to aggregate the dataset into different time frames (time is represented in ms from midnight, like 35482391). Again, this can be done with a database query, though I'm not sure which is going to be faster. The aggregated tables are going to be much smaller, like thousands of rows per observation day.

Then I calculate basic statistics: mean, standard deviation, sums, etc. After the stats are calculated, I need to perform some statistical hypothesis tests.

So, my question is: which tool is faster for data aggregation and filtering on big datasets, MySQL or R?

Thanks,
--Roman N.
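For reference, a minimal sketch of the aggregation step in base R, assuming the filtered quotes are already in a data.frame called quotes; the column names (ms, price) are assumptions, not from the data above:

  ## bucket quotes into one-minute bins (ms from midnight -> minute index)
  quotes$minute <- quotes$ms %/% 60000

  ## per-bin mean, standard deviation, and count
  agg <- aggregate(price ~ minute, data = quotes,
                   FUN = function(x) c(mean = mean(x), sd = sd(x), n = length(x)))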