Hi,
I have a few questions about how to handle large data sets in R.
What is the size of the largest matrix that R can comfortably deal with?
Is this size limit imposed by R itself, or does it depend on
the machine that one runs on?
How does one go about choosing reasonable values of vsize
and nsize?
I have a data set with about 1,000,000 rows and 30 columns
(1 character, 29 numeric), stored in a flat file.
When I run Splus-5 on a Solaris workstation, I can read this file quite easily with
myData <- read.table(file = "mydata.dat")
and manipulate the data without any problems.
On the other hand, when I try to do the same on a PC (128M RAM, 400MHz) running
Linux (Red Hat 6.1) with R version 0.90.0, I find that it is
impossible.
When I allocate (what I believe to be) the maximum amount of vsize
memory and a large amount of nsize memory:
R --vsize 200M --nsize 4000k
and then try to read the file in using read.table() or scan()
myData <- read.table(file = "mydata.dat")
or
myData <- scan(file = "myData.dat", what = list("",0,0,...,0))
(with 29 zeros)
I get kicked out of R.
More worrisome, I did succeed in reading in a subset of the data with 30,000 rows.
However, when I tried to plot one of the columns, my monitor began blinking
wildly, and the machine crashed. I had to reboot.
I tried to read the R help page on memory, but wasn't able to understand much of
what it was saying.
Thanks much for any help,
Jeff Miller
-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
r-help mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
Send "info", "help", or "[un]subscribe"
(in the "body", not the subject !) To: r-help-request at stat.math.ethz.ch
_._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._
Thread: size limits (5 messages: Jeff Miller, Peter Dalgaard, Martin Maechler, and Brian D. Ripley)
Jeff Miller <jdm at xnet.com> writes:
> On the other hand, when I try to do the same on a PC (128 M RAM, 400MHz), running
> Linux (Redhat 6.1), on R version 0.90.0, I find that it is
> impossible.
> When I allocate (what I believe to be) the maximum amount of vsize
> memory and a large amount of nsize memory
> R --vsize 200M --nsize 4000k,
> and then try to read the file in using read.table() or scan()
> myData <- read.table(file = "mydata.dat")
> or
> myData <- scan(file = "myData.dat", what = list("",0,0,...,0)) (with 29 zeros)
> I get kicked out of R.
> More worrisome, I did succeed in reading in a subset of the data with 30,000 rows.
> However, when I tried to plot one of the columns, my monitor began blinking
> wildly, and the machine crashed. I had to reboot.
You've probably come too close to the machine's capacity there. Linuxen are often run without user limits on process size, so if you eat too much memory, some random process will be killed, and with a bit of bad luck it will be something critical such as your X server...

Notice that 200M vsize + 4000k nodes (20 bytes each) is about 150M more than your physical memory, and with system processes easily taking up 60M you'd need 200M of swap to run. A quick calculation suggests that your data alone take ~240M, which suggests that you really need a bigger machine.
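Peter's arithmetic can be reproduced with a quick back-of-envelope sketch. The figures below assume 8-byte doubles for the 29 numeric columns and 20 bytes per cons cell, as in the calculation above; the total comes out slightly lower than Peter's ~240M, which presumably counts all 30 columns and decimal megabytes:

```shell
# Rough memory estimate for the data set and R invocation discussed above.
# Assumptions: 29 numeric columns of 8-byte doubles; 20-byte cons cells.
rows=1000000
numcols=29
data_mb=$(( rows * numcols * 8 / 1024 / 1024 ))   # numeric data alone
heap_mb=$(( 200 + 4000 * 20 / 1024 ))             # --vsize 200M plus 4000k nodes
echo "data: ~${data_mb}M, R heap: ~${heap_mb}M"   # prints: data: ~221M, R heap: ~278M
```

Either figure alone already exceeds the 128M of physical RAM on the PC in question, which is consistent with the crashes described.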
   O__  ---- Peter Dalgaard             Blegdamsvej 3
  c/ /'_ --- Dept. of Biostatistics     2200 Cph. N
 (*) \(*) -- University of Copenhagen   Denmark      Ph:  (+45) 35327918
~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk)             FAX: (+45) 35327907
{maybe somewhat technical; non R / Linux only}
"PD" == Peter Dalgaard BSA <p.dalgaard at biostat.ku.dk> writes:
PD> ..........
PD> Linuxen are often run without user limits on process size so
PD> if you eat too much memory, some random process will be killed and
PD> with a bit of bad luck it will be something critical such as your X
PD> server...
Setting ulimits on Linux seems to be possible only through bash's builtin "ulimit"
or through the C API. However, we are still "urged" to use tcsh rather than bash.
How would a Linux system administrator limit the process size for all normal
user processes? Sorry for my ignorance, but I assume that other R-help
readers are in the same boat...
Martin
Martin Maechler <maechler at stat.math.ethz.ch> writes:
> Setting ulimits on Linux seems to be only through bash's builtin "ulimit"
> or through the C API. Now, we are still "urged" to usually use the tcsh
> instead of the bash. How would a linux sys.administrator limit the process
> size for all normal user processes? Sorry for my ignorance, but I assume
> that other R-help readers are in the same boat...
I think it goes via a "ulimit -H something" in a startup file.
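A concrete sketch of what that startup-file line might look like, assuming bash and that the system-wide file is /etc/profile (the exact file varies by distribution, and the supported ulimit flags vary by bash version):

```shell
# In /etc/profile (read by login bash shells), set a *hard* limit that
# ordinary users cannot raise again.  Units for -v and -d are kilobytes;
# 262144 kB = 256 MB.  These are bash flag names; csh/tcsh users need
# the `limit' builtin instead.
ulimit -H -v 262144   # cap per-process virtual memory
ulimit -H -d 262144   # cap the data segment as well
```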
On Mon, 24 Jan 2000, Martin Maechler wrote:
> {maybe somewhat technical; non R / Linux only}
> "PD" == Peter Dalgaard BSA <p.dalgaard at biostat.ku.dk> writes:
> PD> ..........
> PD> Linuxen are often run without user limits on process size so
> PD> if you eat too much memory, some random process will be killed and
> PD> with a bit of bad luck it will be something critical such as your X
> PD> server...
> Setting ulimits on Linux seems to be only through bash's builtin "ulimit"
> or through the C API.
> Now, we are still "urged" to usually use the tcsh instead of the bash.
It is called `limit' on csh/tcsh, according to my systems. (And AFAIK it works on RH6.0.) There is also `unlimit'.
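For tcsh users, the equivalent lines would go in a system-wide startup file such as /etc/csh.cshrc. A sketch only: resource names (`vmemoryuse`, `datasize`) and `-h` support vary between csh implementations:

```shell
# tcsh syntax: `limit [-h] resource value'; with -h the hard limit is set.
limit -h vmemoryuse 256m    # per-process virtual memory
limit -h datasize   256m    # data segment
unlimit vmemoryuse          # and `unlimit' removes a (soft) limit again
```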
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272860 (secr)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595