
Memory/data -last time I promise

11 messages · Michaell Taylor, Brian Ripley, Martin Maechler +4 more

#
I've seen several posts over the past 2-3 weeks about memory issues.  I've
tried to carefully follow the suggestions, but remain baffled as to why I
can't load data into R.  I hope that in revisiting this issue I don't
exasperate the list.

The setting: 
1 gig RAM , Linux machine
10 Stata files of approximately 14megs each
File contents appear at the end of this boorishly long email.

Purpose: 
load and combine in R for further analysis

Question:

1) I've placed memory queries in the command file to see what is going on. 
It appears that loading a 14meg file consumes approx 5 times this amount of
memory - i.e. available memory declines by 70megs when a 14 meg dataset is
loaded. (Seen in Method 2 below)
2) Ultimately I would like to replace Stata with R, but the Stata datasets
I frequently use are in the 100s of megs, which work fine on this machine.
Is R capable of this?


The command files:

I've attempted the process in two ways, each time both as a regular user
(ulimit=unlimited) and as root on the system, to avoid OS restrictions.
The first method is as follows:

METHOD ONE

R --no-save --max-vsize=800M < QuickLook.R > QuickLook.log
========   QuickLook.log follows ================
used (Mb) gc trigger (Mb) limit (Mb)
Ncells 1018821 27.3    1166886 31.2         NA
Vcells 4456284 34.0    5070089 38.7        800
total:    used:    free:  shared: buffers:  cached:
Mem:  1073303552 696303616 376999936        0 21487616 36982784
Swap: 271392768 263294976  8097792
MemTotal:   1048148 kB
MemFree:     368164 kB
MemShared:        0 kB
Buffers:      20984 kB
Cached:       36116 kB
BigTotal:    131064 kB
BigFree:          0 kB
SwapTotal:   265032 kB
SwapFree:      7908 kB
+       fname1 <- paste('../off/off10yr',n,'.dta', sep="")
+       full <- rbind(read.dta(fname1),  full)
+       gc()
+       system('cat /proc/meminfo')
+       n
+       n <- n+1}
        total:    used:    free:  shared: buffers:  cached:
Mem:  1073303552 780275712 293027840        0 21487616 51609600
Swap: 271392768 263294976  8097792
MemTotal:   1048148 kB
MemFree:     286160 kB
MemShared:        0 kB
Buffers:      20984 kB
Cached:       50400 kB
BigTotal:    131064 kB
BigFree:          0 kB
SwapTotal:   265032 kB
SwapFree:      7908 kB
Error: cannot allocate vector of size 3291 Kb
Execution halted



SECOND METHOD
total:    used:    free:  shared: buffers:  cached:
Mem:  1073303552 637681664 435621888        0 21753856 31592448
Swap: 271392768 261148672 10244096
MemTotal:   1048148 kB
MemFree:     425412 kB
MemShared:        0 kB
Buffers:      21244 kB
Cached:       30852 kB
BigTotal:    131064 kB
BigFree:          0 kB
SwapTotal:   265032 kB
SwapFree:     10004 kB
used (Mb) gc trigger (Mb) limit (Mb)
Ncells 1018825 27.3    1166886 31.2         NA
Vcells 4456285 34.0    5070086 38.7        800
total:    used:    free:  shared: buffers:  cached:
Mem:  1073303552 707162112 366141440        0 21757952 45498368
Swap: 271392768 261148672 10244096
MemTotal:   1048148 kB
MemFree:     357560 kB
MemShared:        0 kB
Buffers:      21248 kB
Cached:       44432 kB
BigTotal:    131064 kB
BigFree:          0 kB
SwapTotal:   265032 kB
SwapFree:     10004 kB
used (Mb) gc trigger (Mb) limit (Mb)
Ncells 1861390 49.8    2105982 56.3         NA
Vcells 8879476 67.8    9315972 71.1        800
total:    used:    free:  shared: buffers:  cached:
Mem:  1073303552 777375744 295927808        0 21757952 59826176
Swap: 271392768 261148672 10244096
MemTotal:   1048148 kB
MemFree:     288992 kB
MemShared:        0 kB
Buffers:      21248 kB
Cached:       58424 kB
BigTotal:    131064 kB
BigFree:          0 kB
SwapTotal:   265032 kB
SwapFree:     10004 kB
used  (Mb) gc trigger  (Mb) limit (Mb)
Ncells  2703952  72.3    3708127  99.1         NA
Vcells 13302667 101.5   14190661 108.3        800
total:    used:    free:  shared: buffers:  cached:
Mem:  1073303552 847650816 225652736        0 21757952 74153984
Swap: 271392768 261148672 10244096
MemTotal:   1048148 kB
MemFree:     220364 kB
MemShared:        0 kB
Buffers:      21248 kB
Cached:       72416 kB
BigTotal:    131064 kB
BigFree:          0 kB
SwapTotal:   265032 kB
SwapFree:     10004 kB
used  (Mb) gc trigger  (Mb) limit (Mb)
Ncells  3546514  94.8    4953636 132.3         NA
Vcells 17725858 135.3   18735437 143.0        800
total:    used:    free:  shared: buffers:  cached:
Mem:  1073303552 917798912 155504640        0 21762048 88481792
Swap: 271392768 261148672 10244096
MemTotal:   1048148 kB
MemFree:     151860 kB
MemShared:        0 kB
Buffers:      21252 kB
Cached:       86408 kB
BigTotal:    131064 kB
BigFree:          0 kB
SwapTotal:   265032 kB
SwapFree:     10004 kB
used  (Mb) gc trigger  (Mb) limit (Mb)
Ncells  4389076 117.3    6193578 165.4         NA
Vcells 22149049 169.0   23279670 177.7        800
total:    used:    free:  shared: buffers:  cached:
Mem:  1073303552 988033024 85270528        0 21770240 102809600
Swap: 271392768 261148672 10244096
MemTotal:   1048148 kB
MemFree:      83272 kB
MemShared:        0 kB
Buffers:      21260 kB
Cached:      100400 kB
BigTotal:    131064 kB
BigFree:          0 kB
SwapTotal:   265032 kB
SwapFree:     10004 kB
used  (Mb) gc trigger  (Mb) limit (Mb)
Ncells  5231638 139.7    7700734 205.7         NA
Vcells 26572240 202.8   27312192 208.4        800
total:    used:    free:  shared: buffers:  cached:
Mem:  1073303552 1058263040 15040512        0 21774336 117137408
Swap: 271392768 261148672 10244096
MemTotal:   1048148 kB
MemFree:      14688 kB
MemShared:        0 kB
Buffers:      21264 kB
Cached:      114392 kB
BigTotal:    131064 kB
BigFree:          0 kB
SwapTotal:   265032 kB
SwapFree:     10004 kB
used  (Mb) gc trigger  (Mb) limit (Mb)
Ncells  6074200 162.2    8572058 228.9         NA
Vcells 30995431 236.5   31726362 242.1        800
total:    used:    free:  shared: buffers:  cached:
Mem:  1073303552 1069006848  4296704        0 21471232 72318976
Swap: 271392768 261148672 10244096
MemTotal:   1048148 kB
MemFree:       4196 kB
MemShared:        0 kB
Buffers:      20968 kB
Cached:       70624 kB
BigTotal:    131064 kB
BigFree:          0 kB
SwapTotal:   265032 kB
SwapFree:     10004 kB
Error: cannot allocate vector of size 1645 Kb
Execution halted


THIRD METHOD

I combined the Stata files in Stata (same machine) and saved them as a
single file, thinking there could be an inefficiency with rbind(). Same
error.


TO ASSURE YOU THAT I AM NOT CRAZY, THE FOLLOWING IS A SAMPLE DIRECTORY
LISTING OF THE FILES OF INTEREST

-rw-r--r--    1 ctaylor  econ          14M Jun 27 16:15 off10yr5.dta
-rw-r--r--    1 ctaylor  econ          14M Jun 27 17:53 off10yr6.dta
-rw-r--r--    1 ctaylor  econ          14M Jun 27 19:30 off10yr7.dta
-rw-r--r--    1 ctaylor  econ          14M Jun 27 21:08 off10yr8.dta
-rw-r--r--    1 ctaylor  econ          14M Jun 27 23:02 off10yr9.dt


DATA CONTENTS (IN TEXT FORM OF COURSE)

head off10yr1.out
scenario        metcode yr      ginv    cons    gocc    abs     dvac    gmre    gmer
1       "AA"    2001    .04     3384000 .047    3641000 -.006   .025    .028
1       "AA"    2002    .042    3657000 .046    3716000 -.004   .034    .035
1       "AA"    2003    .031    2816000 .047    3972000 -.015   .051    .056
1       "AA"    2004    .035    3271000 .046    4064000 -.01    .075    .078
1       "AA"    2005    .037    3636000 .037    3444000 0       .084    .084
1       "AA"    2006    .041    4183000 .035    3315000 .006    .118    .116
1       "AA"    2007    .043    4513000 .019    1915000 .021    .094    .086
1       "AA"    2008    .039    4320000 .034    3431000 .005    .068    .066
1       "AA"    2009    .034    3848000 .05     5262000 -.015   .057    .063




-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
r-help mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
Send "info", "help", or "[un]subscribe"
(in the "body", not the subject !)  To: r-help-request at stat.math.ethz.ch
_._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._
#
On Tue, 24 Jul 2001, Micheall Taylor wrote:

            
That's quite possible.  A `14Mb dataset' is not too helpful to us.  You
seem to have one char (ca 2 chars) and 9 numeric variables per record.
That's ca 75 bytes per record.  An actual experiment and using object.size
gives 88 (there are row names too).  So at 70Mb, that is about 0.8M rows.
If that's not right, the data are not being read in correctly.
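
The per-record estimate above can be checked along the following lines. This is a minimal sketch on synthetic data: the column names are taken from the sample listing in the original post, while the values and the row count are made up. The figure reported by object.size() depends on the R version, so it will not necessarily reproduce the 88 bytes quoted above.

```r
# Synthetic stand-in for one file: 1 character and 9 numeric columns.
# Column names follow the posted sample; the values are invented.
n <- 100000
d <- data.frame(
  scenario = rep(1, n),
  metcode  = rep("AA", n),
  yr   = rep(2001, n),
  ginv = runif(n), cons = runif(n), gocc = runif(n), abs = runif(n),
  dvac = runif(n), gmre = runif(n), gmer = runif(n),
  stringsAsFactors = FALSE
)

# Average in-memory footprint of one record.
bytes_per_row <- as.numeric(object.size(d)) / nrow(d)
bytes_per_row

# Number of rows implied by a 70 MB in-memory object at that footprint.
rows_in_70mb <- 70 * 2^20 / bytes_per_row
rows_in_70mb
```

Comparing rows_in_70mb with the actual row count of a file is one way to tell whether the data are being read in as expected.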

The main problem I see is that your machine seems unable to allocate more
than about 450Mb to R, and it has surprisingly little swap space.  (This
512Mb Linux machine has 1Gb of swap allocated, and happily allocates 800Mb
to R when needed.)
Probably not.  R does require objects to be stored in memory.

As a serious statistical question: what can you usefully do with 8M rows
on 9 continuous variables?  Why would a 1% sample not be already far more
than enough?  My group regularly works with datasets in the 100s of Mb,
but normally we either sample or we summarize in groups for further
analysis.  Our latest dataset is a 1.2Gb Oracle table, but it has
structure (it's 60 experiments for a start).
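
As an illustration of the two strategies mentioned here (sampling, and summarizing in groups), a rough sketch might look like the following. The data frame is synthetic and the choice of grouping variable (yr) is only an assumption for the example.

```r
set.seed(1)  # only so the sketch is reproducible

# Synthetic stand-in for the combined data set.
full <- data.frame(yr = rep(2001:2010, 1000), ginv = rnorm(10000))

# A 1% simple random sample of the rows.
idx  <- sample(nrow(full), size = ceiling(0.01 * nrow(full)))
samp <- full[idx, , drop = FALSE]
nrow(samp)

# Alternatively, summarize within groups and analyse the summaries.
by_year <- aggregate(ginv ~ yr, data = full, FUN = mean)
by_year
```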

[...]

BTW, rbind is inefficient, but adding a piece at time is the least
efficient way to use it.  rbind(full1, full2, ..., full10) would be
better.  Allocating full and assigning to sub-sections would be better
still.
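
A sketch of the two alternatives suggested above, using synthetic pieces in place of the ten Stata files. The helper make_piece() is hypothetical and exists only for this example; in the original setting each piece would presumably come from read.dta() in package foreign.

```r
# Hypothetical helper producing a small synthetic piece.
make_piece <- function(i, n = 1000) {
  data.frame(scenario = rep(as.numeric(i), n),
             yr       = rep(2001, n),
             ginv     = runif(n))
}
pieces <- lapply(1:10, make_piece)

# Option 1: one rbind call over all the pieces, instead of growing
# 'full' one piece at a time.
full <- do.call(rbind, pieces)

# Option 2: allocate the whole result once, then assign to sub-sections.
n_each <- vapply(pieces, nrow, integer(1))
ends   <- cumsum(n_each)
starts <- ends - n_each + 1
total  <- sum(n_each)
full2  <- data.frame(scenario = numeric(total),
                     yr       = numeric(total),
                     ginv     = numeric(total))
for (i in seq_along(pieces)) {
  full2[starts[i]:ends[i], ] <- pieces[[i]]
}

dim(full)
dim(full2)
```

Both options avoid the repeated copying that comes from calling rbind once per file inside a loop.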
#
On Tue, 24 Jul 2001, Prof Brian Ripley wrote:

            
Well, this raises an interesting point for me: is there any advice on
how to configure a particular system for best R performance with
large datasets? I've looked into the R system guide and could not find
anything (that document is a bit obscure for me, I must admit).
Do you get the 800 Mb by starting R with a particular option?

(BTW, I do not remember how much swap I defined for my Linux
system; how could I check that? Apologies for the non-R question.)
 
Agus

#

        
    Agustin> On Tue, 24 Jul 2001, Prof Brian Ripley wrote:

    BDR> The main problem I see is that your machine seems unable to
    BDR> allocate more than about 450Mb to R, and it has surprisingly
    BDR> little swap space.  (This 512Mb Linux machine has 1Gb of swap
    BDR> allocated, and happily allocates 800Mb to R when needed.)


    Agustin> Well, this raises an interesting point for me: is there
    Agustin> any advice on how to configure a particular system for best R
    Agustin> performance with large datasets? I've looked into the R system
    Agustin> guide and could not find anything (that document is a bit
    Agustin> obscure for me, I must admit).

which document?  There's no  ``R system guide''

    Agustin> Do you get the 800 Mb by starting R with a particular option?

no; Brian said ``when needed'' which means that R allocates memory (almost
always) when it needs it.


    Agustin> (BTW, I do not remember how much swap I defined for my Linux
    Agustin> system; how could I check that? Apologies for the non-R question.)
 
cat /proc/meminfo

Martin Maechler <maechler at stat.math.ethz.ch>	http://stat.ethz.ch/~maechler/
Seminar fuer Statistik, ETH-Zentrum  LEO D10	Leonhardstr. 27
ETH (Federal Inst. Technology)	8092 Zurich	SWITZERLAND
phone: x-41-1-632-3408		fax: ...-1228			<><
#
Michael Taylor wrote:

            
Brian Ripley replied:
I think Stata also requires that objects reside in memory. It is quite a
time since I have used Stata (I have v4 on my shelves) but I remember
that it was no good for cancer registry work (big datasets) as it needed
everything to be in memory and we only had 16MB Win 3.11 machines at
the time.

There is a storage difference between R and Stata. Stata has several
number types (byte, short integer, long integer, single precision float,
double precision float) whereas R has only the integer (equivalent to
long integers, I think) and real/numeric/double which are all double
precision floats. This means that R will often require more memory to
store objects than Stata. Your "14 MB" file could easily swell to many
times that size if Stata 'byte' types are being stored as double
precision numbers.
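
The storage difference described above can be seen directly in R. A minimal sketch, assuming only that Stata stores a 'byte' variable in one byte per value, as its name suggests:

```r
n <- 1e6

# Per-element storage of R's two numeric types.
size_double  <- as.numeric(object.size(numeric(n)))  # about 8 bytes each
size_integer <- as.numeric(object.size(integer(n)))  # about 4 bytes each
c(double = size_double / n, integer = size_integer / n)

# Assumption: a Stata 'byte' value occupies 1 byte. Held as an R double
# it occupies 8, so a column of bytes grows roughly eightfold.
stata_byte_bytes <- 1
growth_factor <- (size_double / n) / stata_byte_bytes
growth_factor
```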

I am not aware if there are plans to add different storage modes to R
but doing so might be useful particularly with large datasets.

Just my tuppence.

Mark


--
Mark Myatt


#
On Wed, 25 Jul 2001, Martin Maechler wrote:

            
I mean R-admin.pdf.
OK, so I understand that R allocates large amounts of memory with no need
for a starting option, limited only by the free swap space.

Thanks

Agus



#
Martin Maechler <maechler at stat.math.ethz.ch> writes:
free

is somewhat less verbose, and some more specific details on swap are in

cat /proc/swaps

or (apparently an alias of the above)

/sbin/swapon -s
#
On Wed, 25 Jul 2001, Agustin Lobo wrote:

            
No, just the standard options.  Basically, under Unix/Linux

1) Make sure the ulimit/limit settings are suitable (look up your shell
documentation).

2) Make sure you do have ample swap space configured: disc space is
really cheap, and with the current non-moving-objects garbage collector,
currently unused large objects can be successfully swapped out.  (That was
not true before 1.2.0.)

3) Start R without any options.

So there is no advice, as nothing special is needed.

On *Windows* there is an equivalent of ulimit/limit set and you are likely
to be less successful in running large R workspaces.  In so far as I
understand it, this applies to the classic Mac port too.
#
Within a program, I subset a data matrix. Sometimes
only a single row (individual) remains after the
subsetting. The fact that the single row becomes
a vector with dim(x)=NULL instead of a matrix
with dim(x)=c(1,n) is inconvenient for further operations
in the program.

I thought that drop=F would solve this problem but...

Let a be:
          [,1]     [,2]      [,3]
[1,] 0.3249816 1.184596 1.0408749
[2,] 1.4722996 1.408512 0.3768964
[3,] 1.2737683 1.811588 1.9108336
[4,] 1.8235127 1.260909 1.5995097

Then
[1] 0.3249816 1.1845962 1.0408749
NULL

But,
[1] 0.3249816 1.1845962 1.0408749
NULL

No way to get dim(a[a[,1]<1,]) equal to c(1,3) ?

Thanks

Agus


Dr. Agustin Lobo
Instituto de Ciencias de la Tierra (CSIC)
Lluis Sole Sabaris s/n
08028 Barcelona SPAIN
tel 34 93409 5410
fax 34 93411 0012
alobo at ija.csic.es


#
On Wed, 25 Jul 2001, Agustin Lobo wrote:

            
Be careful.  It's FALSE not F.  Indeed it solves the problem, but only if
used as specified on the help page....
No way, *but* if you use this correctly you will get what you want.

Read ?Extract carefully.  You have assumed that

     x[i, j, ... , drop=TRUE]

can be contracted, but it does not say so (nor can it in S).

Try
          [,1]     [,2]     [,3]
[1,] 0.3249816 1.184596 1.040875
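
For reference, a self-contained sketch of the behaviour under discussion, using the matrix values posted in the question. The variable names (keep, dropped, no_comma, kept) are introduced here only for illustration.

```r
# The 4 x 3 matrix from the question, filled column by column.
a <- matrix(c(0.3249816, 1.4722996, 1.2737683, 1.8235127,
              1.184596,  1.408512,  1.811588,  1.260909,
              1.0408749, 0.3768964, 1.9108336, 1.5995097),
            nrow = 4)

keep <- a[, 1] < 1          # TRUE for the first row only

# Default: a single selected row is dropped to a plain vector.
dropped <- a[keep, ]
dim(dropped)                # NULL

# Without the second comma this is vector-style indexing, so 'drop'
# has no effect and the result is still a plain vector.
no_comma <- a[keep, drop = FALSE]
dim(no_comma)               # NULL

# With both index positions present, drop = FALSE keeps the matrix shape.
kept <- a[keep, , drop = FALSE]
dim(kept)                   # 1 3
```

Writing FALSE rather than F is the safer habit, since F is an ordinary variable that can be reassigned.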
#
You forgot a comma:
[1] 1 3

Paul Gilbert
