Using huge datasets
3 messages · Fabien Fivaz, Roger D. Peng, Thomas Lumley

Hi,

Here is what I want to do. I have a dataset containing 4.2 *million* rows and about 10 columns, and I want to do some statistics with it, mainly using it as a prediction set for GAM and GLM models. I tried to load it from a CSV file but, after filling up memory and part of the swap (1 GB each), I get a segmentation fault and R stops. I use R under Linux. Here are my questions:

1) Has anyone ever tried to use such a big dataset?
2) Do you think it would be possible on a more powerful machine, such as a cluster of computers?
3) Finally, does R have some "memory limitation", or does it just depend on the machine I'm using?

Best wishes,
Fabien Fivaz
By my calculation, your dataset should occupy less than 400MB of RAM, so this is not a terribly large dataset (these days). But that does not include any possible attributes (like row names), which often also take up a lot of memory. Considering that a function like read.csv() makes a copy of the dataset, your actual requirement is ~800MB, which for a 1GB machine may be too big depending on what else the computer is doing. I have successfully loaded *much* bigger datasets into R (2-4GB) without a problem.

Some possible solutions:

1. Buy more RAM.
2. Use scan(), which doesn't make a copy of the dataset.
3. Use a 64-bit machine and buy even more RAM.

Using a cluster of computers doesn't really help in this situation because there's no easy way to spread a dataset across multiple machines, so you will still be limited by the memory on a single machine. As far as I know, R does not have a "memory limitation" -- the only limit is the memory installed on your computer.

-roger
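[Editor's note: a minimal sketch, not part of the original thread, of the arithmetic behind the ~400MB estimate and of reading the file with scan() as suggested above. The file name "bigdata.csv", the assumption of 10 numeric columns, the comma separator, and the single header line are all hypothetical.]

## Rough size estimate: 4.2 million rows x 10 numeric (double) columns x 8 bytes
4.2e6 * 10 * 8 / 1024^2      # ~320 MB, before row names and the extra
                             # copy that read.csv() makes

## Reading with scan() instead of read.csv(); the column layout is assumed
cols <- rep(list(numeric(0)), 10)   # expect 10 numeric fields per line
names(cols) <- paste0("V", 1:10)
big <- scan("bigdata.csv", what = cols, sep = ",", skip = 1)  # skip the header line
big <- as.data.frame(big)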
On Wed, 4 Feb 2004, Roger D. Peng wrote:
As far as I know, R does not have a "memory limitation" -- the only limit is the memory installed on your computer.
The only practical limitation is the pointer size of your machine: a 32-bit machine can't address more than 4GB, and R probably won't get all of that. Further out, R will run into problems if you try to have a vector with more than 2^31 elements (since length() returns an integer), and probably if you have more than 2^31 objects. I would guess that there are tighter limitations implied by the .rda save format.

-thomas
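[Editor's note: a small illustrative snippet, not from the thread, showing the two limits described above; values reflect a 32-bit build of R of that era.]

.Machine$integer.max       # 2147483647 = 2^31 - 1, the largest value
                           # length() can return as an integer
2^32 / 1024^3              # 4: a 32-bit pointer can address at most 4GB,
                           # and R only gets part of that address space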