Using huge datasets

3 messages · Fabien Fivaz, Roger D. Peng, Thomas Lumley

Hi,

Here is what I want to do. I have a dataset containing 4.2 *million* 
rows and about 10 columns and want to do some statistics with it, mainly 
using it as a prediction set for GAM and GLM models. I tried to load it 
from a csv file but, after filling up memory and part of the swap (1 gb 
each), I get a segmentation fault and R stops. I use R under Linux. Here 
are my questions:

1) Has anyone ever tried to use such a big dataset?
2) Do you think that it would be possible on a more powerful machine, such 
as a cluster of computers?
3) Finally, does R have some "memory limitation" or does it just depend on 
the machine I'm using?

Best wishes

Fabien Fivaz
By my calculation, your dataset should occupy less than 
400MB of RAM, so this is not a terribly large dataset (these 
days).  But that's not including any possible attributes 
(like row names) which often also take up a lot of memory. 
Considering that a function like read.csv() makes a copy of 
the dataset, your actual requirements are ~800MB, which for a 
1GB machine may be too big depending on what else the 
computer is doing.  I have successfully loaded *much* bigger 
datasets into R (2-4GB) without a problem.
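
Roger's figure can be checked with quick arithmetic, assuming all ten 
columns are stored as 8-byte doubles (integer columns would halve this):

```r
# Back-of-the-envelope size of a 4.2 million row x 10 column numeric dataset
rows <- 4.2e6
cols <- 10
bytes_per_double <- 8
mb <- rows * cols * bytes_per_double / 2^20
mb  # roughly 320 MB, before row names or other attributes
```

Doubling that for the copy made during read.csv() gives the ~800MB 
working-set estimate above.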

Some possible solutions are

1. Buy more RAM.
2. Use scan(), which doesn't make a copy of the dataset.
3. Use a 64-bit machine and buy even more RAM.
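
As a sketch of option 2, scan() can read a delimited file if you tell it 
the type of each field. The file name and column types below are 
illustrative stand-ins, not from the original post (a tiny file is 
generated so the example is self-contained):

```r
# Create a small stand-in CSV with 10 numeric columns and a header row.
tf <- tempfile(fileext = ".csv")
write.csv(matrix(rnorm(30), ncol = 10), tf, row.names = FALSE)

# what = rep(list(numeric()), 10) tells scan() to expect 10 numeric fields;
# skip = 1 drops the header line that scan() can't parse as numbers.
cols <- scan(tf, what = rep(list(numeric()), 10), sep = ",", skip = 1)
str(cols)  # a list of 10 numeric vectors, one per column
```

scan() returns a plain list of column vectors; wrapping it in a data frame 
afterwards would reintroduce a copy, so work with the list directly where 
possible.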

Using a cluster of computers doesn't really help in this 
situation because there's no easy way to spread a dataset 
across multiple machines.  So you will still be limited by 
the memory on a single machine.

As far as I know, R does not have a "memory limitation" -- 
the only limit is the memory installed on your computer.

-roger
On Wed, 4 Feb 2004, Roger D. Peng wrote:

The only practical limitation is the pointer size of your machine, so a
32-bit machine can't address more than 4GB, and R probably won't get all
of that.

Further out, R will run into problems if you try to have a vector with
more than 2^31 elements (since length() returns an integer), and probably
if you have more than 2^31 objects.  I would guess that there are tighter
limitations implied by the .rda save format.
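
The 2^31 figure follows from length() returning an R integer, which is a 
32-bit C int; any session can confirm the ceiling:

```r
# R integers are 32-bit C ints, so vector lengths top out at 2^31 - 1.
.Machine$integer.max               # 2147483647
.Machine$integer.max == 2^31 - 1   # TRUE
```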

	-thomas