I've seen several posts over the past 2-3 weeks about memory issues. I've
tried to follow the suggestions carefully, but remain baffled as to why I
can't load data into R. I hope that in revisiting this issue I don't
exasperate the list.
The setting:
1 gig RAM, Linux machine
10 Stata files of approximately 14megs each
File contents appear at the end of this boorishly long email.
Purpose:
load and combine in R for further analysis
Question:
1) I've placed memory queries in the command file to see what is going on.
It appears that loading a 14meg file consumes approx 5 times this amount of
memory - i.e. available memory declines by 70megs when a 14 meg dataset is
loaded. (Seen in Method 2 below)
2) Ultimately I would like to replace Stata with R, but the Stata datasets
I frequently use are in the 100s of megs, which work fine on this machine.
Is R capable of this?
The command files:
I've attempted the process in two ways, each time both as a regular user
(ulimit=unlimited) and as root, to avoid OS restrictions.
The first method is as follows:
METHOD ONE
R --no-save --max-vsize=800M < QuickLook.R > QuickLook.log
======== QuickLook.log follows ================
library(foreign)
a <- Sys.time()
full <- read.dta('../off/off10yr1.dta')
gc()
used (Mb) gc trigger (Mb) limit (Mb)
Ncells 1018821 27.3 1166886 31.2 NA
Vcells 4456284 34.0 5070089 38.7 800
Error: cannot allocate vector of size 1645 Kb
Execution halted
THIRD METHOD
I combined the Stata files in Stata (same machine) and saved them as a
single file, thinking there could be an inefficiency with rbind(). Same
error code.
TO ASSURE YOU THAT I AM NOT CRAZY, THE FOLLOWING IS A SAMPLE DIRECTORY
LISTING OF THE FILES OF INTEREST
-rw-r--r-- 1 ctaylor econ 14M Jun 27 16:15 off10yr5.dta
-rw-r--r-- 1 ctaylor econ 14M Jun 27 17:53 off10yr6.dta
-rw-r--r-- 1 ctaylor econ 14M Jun 27 19:30 off10yr7.dta
-rw-r--r-- 1 ctaylor econ 14M Jun 27 21:08 off10yr8.dta
-rw-r--r-- 1 ctaylor econ 14M Jun 27 23:02 off10yr9.dta
DATA CONTENTS (IN TEXT FORM OF COURSE)
head off10yr1.out
scenario metcode yr ginv cons gocc abs dvac gmre gmer
1 "AA" 2001 .04 3384000 .047 3641000 -.006 .025 .028
1 "AA" 2002 .042 3657000 .046 3716000 -.004 .034 .035
1 "AA" 2003 .031 2816000 .047 3972000 -.015 .051 .056
1 "AA" 2004 .035 3271000 .046 4064000 -.01 .075 .078
1 "AA" 2005 .037 3636000 .037 3444000 0 .084 .084
1 "AA" 2006 .041 4183000 .035 3315000 .006 .118 .116
1 "AA" 2007 .043 4513000 .019 1915000 .021 .094 .086
1 "AA" 2008 .039 4320000 .034 3431000 .005 .068 .066
1 "AA" 2009 .034 3848000 .05 5262000 -.015 .057 .063
-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
r-help mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
Send "info", "help", or "[un]subscribe"
(in the "body", not the subject !) To: r-help-request at stat.math.ethz.ch
_._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._
Question:
1) I've placed memory queries in the command file to see what is going on.
It appears that loading a 14meg file consumes approx 5 times this amount of
memory - i.e. available memory declines by 70megs when a 14 meg dataset is
loaded. (Seen in Method 2 below)
That's quite possible. A `14Mb dataset' is not too helpful to us. You
seem to have one character variable (ca 2 chars) and 9 numeric variables
per record. That's ca 75 bytes per record. An actual experiment using
object.size() gives 88 (there are row names too). So at 70Mb, that is about
0.8M rows. If that's not right, the data are not being read in correctly.
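The arithmetic above can be checked with a short R sketch; the 16-byte per-row allowance for the character column and row name is an assumption chosen to match the object.size figure of 88 bytes:

```r
# Back-of-envelope row-count estimate from the observed memory growth.
# 9 numeric columns at 8 bytes each, plus ~16 bytes (assumed) for the
# character column and row name, matching object.size()'s 88 bytes/row.
bytes.per.row <- 9 * 8 + 16
rows <- 70 * 1024^2 / bytes.per.row  # available memory declined by ~70Mb
round(rows / 1e6, 1)                 # about 0.8 million rows
```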
The main problem I see is that your machine seems unable to allocate more
than about 450Mb to R, and it has surprisingly little swap space. (This
512Mb Linux machine has 1Gb of swap allocated, and happily allocates 800Mb
to R when needed.)
2) Ultimately I would like to replace Stata with R, but the Stata datasets
I frequently use are in the 100s of megs, which work fine on this machine.
Is R capable of this?
Probably not. R does require objects to be stored in memory.
As a serious statistical question: what can you usefully do with 8M rows
on 9 continuous variables? Why would a 1% sample not be already far more
than enough? My group regularly works with datasets in the 100s of Mb,
but normally we either sample or we summarize in groups for further
analysis. Our latest dataset is a 1.2Gb Oracle table, but it has
structure (it's 60 experiments for a start).
[...]
BTW, rbind is inefficient, and adding a piece at a time is the least
efficient way to use it. rbind(full1, full2, ..., full10) would be
better. Allocating full and assigning to sub-sections would be better
still.
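The three approaches can be sketched as follows; the small toy data frames here stand in for the ten Stata files:

```r
# Hypothetical stand-ins for the ten data frames to be combined.
n <- 100
pieces <- lapply(1:10, function(i) data.frame(x = rnorm(n), y = rnorm(n)))

# 1) Worst: growing the result one piece at a time copies it repeatedly.
full <- pieces[[1]]
for (i in 2:10) full <- rbind(full, pieces[[i]])

# 2) Better: a single rbind call over all the pieces.
full2 <- do.call("rbind", pieces)

# 3) Best: preallocate the full object, then assign into sub-sections,
#    so the big object is allocated exactly once.
full3 <- data.frame(x = numeric(10 * n), y = numeric(10 * n))
for (i in 1:10) full3[((i - 1) * n + 1):(i * n), ] <- pieces[[i]]
```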
Brian D. Ripley, ripley at stats.ox.ac.uk
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel: +44 1865 272861 (self)
1 South Parks Road, +44 1865 272860 (secr)
Oxford OX1 3TG, UK Fax: +44 1865 272595
-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
_._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._
The main problem I see is that your machine seems unable to allocate more
than about 450Mb to R, and it has surprisingly little swap space. (This
512Mb Linux machine has 1Gb of swap allocated, and happily allocates 800Mb
to R when needed.)
Well, this raises an interesting point for me: is there any advice on
how to configure a particular system for best R performance with
large datasets? I've looked into the R system guide and could not find
anything (that document is a bit obscure for me, I must admit).
Do you get the 800 Mb by starting R with a particular option?
(BTW, I do not remember how much swap I defined for my Linux
system; how could I check that? Apologies for the not-R question)
Agus
-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
_._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._
"Agustin" == Agustin Lobo <alobo at ija.csic.es> writes:
Agustin> On Tue, 24 Jul 2001, Prof Brian Ripley wrote:
BDR> The main problem I see is that your machine seems unable to
BDR> allocate more than about 450Mb to R, and it has surprisingly
BDR> little swap space. (This 512Mb Linux machine has 1Gb of swap
BDR> allocated, and happily allocates 800Mb to R when needed.)
Agustin> Well, this rises an interesting point for me: are there
Agustin> advices on how to configure a particular system for best R
Agustin> performance with large datasets? I've looked into the R system
Agustin> guide and could not find anything (that document is a bit
Agustin> obscure for me, must recognize).
which document? There's no ``R system guide''
Agustin> Do you get the 800 Mb by starting R with a particular option?
no; Brian said ``when needed'' which means that R allocates memory (almost
always) when it needs it.
Agustin> (BTW, I do not remember how much swap I defined for my linux
Agustin> system, how could I check that?. Excuses for the not-R question)
cat /proc/meminfo
Martin Maechler <maechler at stat.math.ethz.ch> http://stat.ethz.ch/~maechler/
Seminar fuer Statistik, ETH-Zentrum LEO D10 Leonhardstr. 27
ETH (Federal Inst. Technology) 8092 Zurich SWITZERLAND
phone: x-41-1-632-3408 fax: ...-1228 <><
-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
_._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._
2) Ultimately I would like to replace Stata with R, but the Stata
datasets I frequently use are in the 100s of megs, which work fine on
this machine. Is R capable of this?
Brian Ripley replied:
Probably not. R does require objects to be stored in memory.
I think Stata also requires that objects reside in memory. It is quite a
while since I last used Stata (I have v4 on my shelves) but I remember
that it was no good for cancer registry work (big datasets) as it needed
everything to be in memory and we only had 16MB Win 3.11 machines at
the time.
There is a storage difference between R and Stata. Stata has several
number types (byte, short integer, long integer, single precision float,
double precision float) whereas R has only integer (equivalent to the
long integer, I think) and real/numeric/double, which are double
precision floats. This means that R will often require more memory to
store objects than Stata. Your "14 MB" file could easily swell to many
times that size if Stata 'byte' types are being stored as double
precision numbers.
I am not aware of any plans to add different storage modes to R,
but doing so might be useful, particularly with large datasets.
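A quick way to see the cost in R (the reported sizes are approximate and version-dependent):

```r
# R stores numerics as 8-byte doubles; integer vectors use 4 bytes
# per element, so a double vector is roughly twice the size.
n <- 1e6
print(object.size(integer(n)))  # roughly 4 MB plus a small header
print(object.size(double(n)))   # roughly 8 MB plus a small header
# A Stata 'byte' variable (1 byte per value) read in as double
# therefore grows about eight-fold.
```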
Just my tuppence.
Mark
--
Mark Myatt
-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
_._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._
Agustin> Well, this rises an interesting point for me: are there
Agustin> advices on how to configure a particular system for best R
Agustin> performance with large datasets? I've looked into the R system
Agustin> guide and could not find anything (that document is a bit
Agustin> obscure for me, must recognize).
which document? There's no ``R system guide''
I mean R-admin.pdf
Agustin> Do you get the 800 Mb by starting R with a particular option?
no; Brian said ``when needed'' which means that R allocates memory (almost
always) when it needs it.
OK, so I understand that R allocates large amounts of memory without
any special startup option, limited only by the free swap space.
Thanks
Agus
-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
_._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._
Martin Maechler <maechler at stat.math.ethz.ch> writes:
Agustin> (BTW, I do not remember how much swap I defined for my linux
Agustin> system, how could I check that?. Excuses for the not-R question)
cat /proc/meminfo
free
is somewhat less verbose and some more specific details on swap are in
cat /proc/swaps
or (apparently an alias of the above)
/sbin/swapon -s
O__ ---- Peter Dalgaard Blegdamsvej 3
c/ /'_ --- Dept. of Biostatistics 2200 Cph. N
(*) \(*) -- University of Copenhagen Denmark Ph: (+45) 35327918
~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk) FAX: (+45) 35327907
-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
_._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._
The main problem I see is that your machine seems unable to allocate more
than about 450Mb to R, and it has surprisingly little swap space. (This
512Mb Linux machine has 1Gb of swap allocated, and happily allocates 800Mb
to R when needed.)
Well, this raises an interesting point for me: is there any advice on
how to configure a particular system for best R performance with
large datasets? I've looked into the R system guide and could not find
anything (that document is a bit obscure for me, I must admit).
Do you get the 800 Mb by starting R with a particular option?
No, just the standard options. Basically, under Unix/Linux
1) Make sure the ulimit/limit settings are suitable (look up your shell
documentation).
2) Make sure you do have ample swap space configured: disc space is
really cheap, and with the current non-moving-objects garbage collector,
currently unused large objects can be successfully swapped out. (That was
not true before 1.2.0.)
3) Start R without any options.
So there is no advice, as nothing special is needed.
On *Windows* there is an equivalent of the ulimit/limit setting, and you
are likely to be less successful in running large R workspaces. So far as
I understand it, this applies to the classic Mac port too.
Brian D. Ripley, ripley at stats.ox.ac.uk
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel: +44 1865 272861 (self)
1 South Parks Road, +44 1865 272860 (secr)
Oxford OX1 3TG, UK Fax: +44 1865 272595
-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
_._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._
Within a program, I subset a data matrix. Sometimes only a
single row (individual) remains after the subsetting.
The fact that the single row becomes
a vector with dim(x)=NULL instead of a matrix
with dim(x)=c(1,n) is inconvenient for further operations
in the program.
I thought that drop=F would solve this problem but...
Let a be a matrix; then dim(a[a[,1]<1,]) gives:
NULL
Is there no way to get dim(a[a[,1]<1,]) equal to c(1,3)?
Thanks
Agus
Dr. Agustin Lobo
Instituto de Ciencias de la Tierra (CSIC)
Lluis Sole Sabaris s/n
08028 Barcelona SPAIN
tel 34 93409 5410
fax 34 93411 0012
alobo at ija.csic.es
-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
_._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._
Within a program, I subset a data matrix. Sometimes only a
single row (individual) remains after the subsetting.
The fact that the single row becomes
a vector with dim(x)=NULL instead of a matrix
with dim(x)=c(1,n) is inconvenient for further operations
in the program.
I thought that drop=F would solve this problem but...
Be careful. It's FALSE not F. Indeed it solves the problem, but only if
used as specified on the help page....
NULL
No way to get dim(a[a[,1]<1,]) equal to c(1,3) ?
No way, *but* if you use this correctly you will get what you want.
Read ?Extract carefully. You have assumed that
x[i, j, ... , drop=TRUE]
can be contracted, but it does not say so (nor can it in S).
Try
a[a[,1]<1,, drop=FALSE]
[,1] [,2] [,3]
[1,] 0.3249816 1.184596 1.040875
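For completeness, a self-contained version of the fix (the matrix here is hypothetical, built so that exactly one row survives the test):

```r
# A single surviving row keeps its matrix shape with drop = FALSE.
a <- matrix(c(0.32, 2, 3,     # column 1: only the first entry is < 1
              1.18, 5, 6,     # column 2
              1.04, 8, 9),    # column 3
            nrow = 3)
sub  <- a[a[, 1] < 1, ]                 # drops to a vector: dim is NULL
keep <- a[a[, 1] < 1, , drop = FALSE]   # stays a 1 x 3 matrix
dim(keep)                               # c(1, 3)
```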
Brian D. Ripley, ripley at stats.ox.ac.uk
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel: +44 1865 272861 (self)
1 South Parks Road, +44 1865 272860 (secr)
Oxford OX1 3TG, UK Fax: +44 1865 272595
-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
_._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._
[1] 1 3
Paul Gilbert
-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
_._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._