A matrix of that size takes up just over 320MB to store in memory, so I'd imagine you can probably do it with 2GB of physical RAM (assuming your `columns' are all numeric variables; i.e., no factors).

However, a better approach than the brute-force, one-shot read may be to read the data in chunks and do the prediction piece by piece. You can use scan(), or open()/readLines()/close(), to do this fairly easily; see the sketch below.

My understanding of how (most) clusters work is that you need at least one node that can accommodate the memory load of the monolithic R process, so a cluster is probably not much help here. (I could very well be wrong about this. If so, I'd be very grateful for a correction.)

HTH,
Andy
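Something along these lines might work (an untested sketch; it assumes your data are in "bigdata.csv", that `fit` is an already-fitted glm/gam object whose predictors match the file's column names, and that all columns are numeric):

    con <- file("bigdata.csv", open = "r")
    header <- strsplit(readLines(con, n = 1), ",")[[1]]   # read the header line once
    chunk.size <- 100000                                  # rows per chunk
    preds <- numeric(0)
    repeat {
        chunk <- tryCatch(read.csv(con, header = FALSE, nrows = chunk.size,
                                   col.names = header, colClasses = "numeric"),
                          error = function(e) NULL)       # connection exhausted
        if (is.null(chunk) || nrow(chunk) == 0) break
        preds <- c(preds, predict(fit, newdata = chunk, type = "response"))
    }
    close(con)

(Growing preds with c() is slow for 4.2 million rows; preallocating a vector of the right length and filling it in would be better, but the idea is the same.)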
From: Fabien Fivaz

Hi,

Here is what I want to do. I have a dataset containing 4.2 *million* rows and about 10 columns, and I want to do some statistics with it, mainly using it as a prediction set for GAM and GLM models. I tried to load it from a csv file but, after filling up memory and part of the swap (1 GB each), I get a segmentation fault and R stops. I use R under Linux.

Here are my questions:
1) Has anyone ever tried to use such a big dataset?
2) Do you think it would be possible on a more powerful machine, such as a cluster of computers?
3) Finally, does R have some "memory limitation", or does it just depend on the machine I'm using?

Best wishes,
Fabien Fivaz