problems with large data

4 messages · Brian Ripley, PaTa PaTaS, Spencer Graves

#
Hello,
I experienced a problem with large data sets in R: I cannot import these data via the procedure "read.table" (insufficient memory), and some other functions fail with the same exception. Could you tell me how to handle large data sets in R?

Thank you. Pavel Vanecek
#
On Fri, 9 Jan 2004, PaTa PaTaS wrote:
We need more details.  Have you followed all the hints in ?read.table and
the Data Import/Export manual?  If you have, then probably your data set
is too large for the memory of your version of R, and the simplest
solution is to get more memory.

To be more helpful we would need full details of the dataset and of the 
commands you used and the environment you are using (OS, how much RAM and 
how much virtual memory at least).
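
The hints in ?read.table mentioned above mostly amount to telling read.table what it would otherwise have to guess. A minimal sketch (the file name "data.txt" and the all-integer, 2000-column layout are assumptions based on the poster's description):

```r
## Specifying colClasses and nrows up front avoids the memory
## overhead of read.table guessing column types and growing the
## result incrementally (see ?read.table for these hints).
x <- read.table("data.txt", header = TRUE,
                nrows = 5000,
                colClasses = rep("integer", 2000),
                comment.char = "")
```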
#
Thank you all for your help. The problem is not only with reading the data (5000 cases by 2000 integer variables, imported from either an SPSS or a TXT file) into my R 1.8.0, but also with the procedure I would like to use, "randomForest" from the library "randomForest". It is not possible to run it on such a data set (because of the insufficient-memory exception). Moreover, my data has factors with more than 32 classes, which causes another error.

Could you suggest any solution for my problem? Thank you a lot. 
#
If you can't get more memory, you could read portions of the file 
using "scan(..., skip = ..., nlines = ...)" and then compress the data 
somehow to reduce the size of the object you pass to "randomForest".  
You could run "scan" like this in a loop, each time processing, e.g., 10% 
of the data file. 
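
A rough sketch of that loop (the file name "data.txt" and a 500-row chunk size are assumptions; for the poster's data that is 10% of 5000 rows):

```r
## Read a 5000 x 2000 integer table in 10% portions using
## scan(skip = , nlines = ), processing each chunk in turn
## rather than holding the whole file in memory.
n.total <- 5000
n.chunk <- 500
for (i in seq(0, n.total - n.chunk, by = n.chunk)) {
  chunk <- scan("data.txt", what = integer(0),
                skip = i, nlines = n.chunk)
  chunk <- matrix(chunk, ncol = 2000, byrow = TRUE)
  ## ... compress or summarize 'chunk' here before
  ## accumulating whatever randomForest will be given ...
}
```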

      Alternatively, you could pass each portion to "randomForest" and 
compare the results from several calls to "randomForest".  This would 
produce a type of cross validation, which might be a wise thing to do, 
anyway. 
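
One way to sketch that (assuming the chunks read above have been collected into a list of data frames, here called 'chunks', each with a response column "y"; the randomForest package does provide a combine() function for merging forests):

```r
library(randomForest)
## Fit a small forest on each portion of the data separately,
## then merge them into one ensemble with combine().
forests <- lapply(chunks, function(d)
  randomForest(y ~ ., data = d, ntree = 100))
rf.all <- do.call(combine, forests)
```

Comparing the individual forests in 'forests' before combining them gives the informal cross-validation Spencer describes.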

      hope this helps. 
      spencer graves