Memory/data - last time I promise
On Tue, 24 Jul 2001, Micheall Taylor wrote:
I've seen several posts over the past 2-3 weeks about memory issues. I've tried to carefully follow the suggestions, but remain baffled as to why I can't load data into R. I hope that in revisiting this issue I don't exasperate the list.

The setting: 1 gig RAM, Linux machine; 10 Stata files of approximately 14 megs each. File contents appear at the end of this boorishly long email.

Purpose: load and combine in R for further analysis.

Questions:

1) I've placed memory queries in the command file to see what is going on. It appears that loading a 14 meg file consumes approximately 5 times this amount of memory - i.e. available memory declines by 70 megs when a 14 meg dataset is loaded. (Seen in Method 2 below.)
That's quite possible. A `14Mb dataset' is not too helpful to us. You seem to have one char (ca 2 chars) and 9 numeric variables per record. That's ca 75 bytes per record. An actual experiment using object.size gives 88 (there are row names too). So at 70Mb, that is about 0.8M rows. If that's not right, the data are not being read in correctly.

The main problem I see is that your machine seems unable to allocate more than about 450Mb to R, and it has surprisingly little swap space. (This 512Mb Linux machine has 1Gb of swap allocated, and happily allocates 800Mb to R when needed.)
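The size arithmetic above can be checked directly in R. A minimal sketch (the column layout is an assumption: one character column plus nine numerics, as described in the post; the names are illustrative):

```r
## Build a data frame shaped like the one described: one character
## column (stored as a factor by data.frame) plus 9 doubles per row.
n  <- 1000
df <- data.frame(id = sprintf("a%04d", seq_len(n)),
                 matrix(rnorm(n * 9), nrow = n))

object.size(df) / n     # roughly 80-90 bytes per row, row names included

## At ~88 bytes per row, a 70Mb increase in memory use corresponds to
70 * 1024^2 / 88        # about 0.83 million rows
```

If this estimate disagrees badly with the number of rows you expect, that is a sign the data are not being read in as the types you think they are.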
2) Ultimately I would like to replace Stata with R, but the Stata datasets I frequently use are in the 100s of megs, which work fine on this machine. Is R capable of this?
Probably not. R does require objects to be stored in memory.

As a serious statistical question: what can you usefully do with 8M rows on 9 continuous variables? Why would a 1% sample not already be far more than enough? My group regularly works with datasets in the 100s of Mb, but normally we either sample or we summarize in groups for further analysis. Our latest dataset is a 1.2Gb Oracle table, but it has structure (it's 60 experiments for a start).

[...]

BTW, rbind is inefficient, but adding a piece at a time is the least efficient way to use it. rbind(full1, full2, ..., full10) would be better. Allocating full and assigning to sub-sections would be better still.
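The rbind advice can be sketched as follows (full1 ... full10 stand for the ten pieces from the original post; everything else here is illustrative):

```r
## Least efficient: grow the result one piece at a time -- each call
## copies everything accumulated so far, so copying is roughly quadratic.
full <- full1
full <- rbind(full, full2)
## ... repeated up to full10

## Better: a single rbind call, one copy of the data.
full <- rbind(full1, full2, full3, full4, full5,
              full6, full7, full8, full9, full10)

## Better still: allocate the final object once, then assign each
## piece into its own sub-section of rows.
pieces <- list(full1, full2, full3, full4, full5,
               full6, full7, full8, full9, full10)
n    <- sum(sapply(pieces, nrow))
full <- pieces[[1]][rep(1, n), ]     # preallocate n rows
at   <- 1
for (p in pieces) {
  full[at:(at + nrow(p) - 1), ] <- p
  at <- at + nrow(p)
}
```

When the pieces are already in a list, do.call("rbind", pieces) is an equivalent way to write the single-call form.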
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel: +44 1865 272861 (self)
1 South Parks Road,               +44 1865 272860 (secr)
Oxford OX1 3TG, UK                Fax: +44 1865 272595