Hello R-users,
First, my setup: R 1.8.1, compiled as a 64-bit application on 64-bit Solaris
5.8. The machine has 8 GB of RAM and I am its sole user, so essentially all
8 GB are available to R.
I am fairly new to R and I am having a hard time working with large data
sets, which make up over 90% of the analyses done here. The data set I
imported into R, from S+, has a little over 2,000,000 rows by roughly
60 variables, most of them factors but a few continuous. It is in fact a
subset of a larger data set used for analysis in S+. I know that some of
you will think I should sample, but that is not an option in the present
setting.
After first reading the data set into R -- which had its own challenges --
quitting R and saving the workspace takes over 5 minutes, and starting a
new session and loading the data set takes around 15 minutes.
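As an aside, a workaround I am experimenting with is to save and load only
the big data frame itself rather than the whole workspace; a minimal sketch
(the file name is made up):

save(qc.b3.sans.occ, file = "qc.b3.sans.occ.rda", compress = TRUE)
## later, in a fresh session started with R --no-restore:
load("qc.b3.sans.occ.rda")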
I am trying to build a model that I have already built in S+, so I can make
sure I am doing the right thing and can compare resource usage, but so far
I have had no luck! After 45 minutes or so R has used up all the available
memory and is swapping, which brings CPU usage close to nothing.
I am convinced there are settings I could use to optimize memory management
for such problems. help(Memory) tells me about the options
"--min-vsize=vl --max-vsize=vu --min-nsize=nl --max-nsize=nu", but it is
not clear whether and when they should be used. Further down the page it
says: "..., and since setting larger values of the minima will make R
slightly more efficient on large tasks."
On the other hand, searching the R site for memory-management clues, I
found this from Brian Ripley, dated 13 Nov. 2003: "But had you actually
read the documentation you would know it did not do that. That needs
--max-memory-size set." That was in response to someone who had increased
the value of "--min-vsize="; furthermore, I cannot find any
"--max-memory-size" option.
I am wondering whether someone with experience working with large data sets
would share the configuration and options they use. In case it matters,
here is the model I was trying to fit:
library(package = "statmod", pos = 2,
        lib.loc = "/home/jeg002/R-1.8.1/lib/R/R_LIBS")
qc.B3.tweedie <- glm(formula = pp20B3 ~ ageveh + anpol + categveh +
                         champion + cie + dossiera + faq13c + faq5a +
                         kmaff + kmprom + nbvt + rabprof + sexeprin +
                         newage,
                     family = tweedie(var.power = 1.577, link.power = 0),
                     etastart = log(rep(mean(qc.b3.sans.occ[, 'pp20B3']),
                                        nrow(qc.b3.sans.occ))),
                     weights = unsb3t1,
                     trace = TRUE,
                     data = qc.b3.sans.occ)
After one iteration (45+ minutes) R is thrashing through over 10 GB of
memory.
Thanks for any insights,
Gérald Jean
Consulting analyst (statistics), Actuarial
telephone: (418) 835-4900, ext. 7639
fax: (418) 835-6657
e-mail: gerald.jean at spgdag.ca
"In God we trust, all others must bring data" W. Edwards Deming
gerald.jean at dgag.ca writes:
> [model-fitting call quoted from above]
> After one iteration (45+ minutes) R is thrashing through over 10 GB
> of memory.
> Thanks for any insights,
Well, I don't know how much it helps; you are in somewhat uncharted
territory there. I suppose the data set comes to 0.5-1 GB all by itself?
One thing I note is that you have 60 variables but use only 15.
Perhaps it helps to remove some of them before the run?
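An untested sketch; the variable names are copied from your glm() call
(the response, the 14 predictors, and the weights):

vars <- c("pp20B3", "ageveh", "anpol", "categveh", "champion", "cie",
          "dossiera", "faq13c", "faq5a", "kmaff", "kmprom", "nbvt",
          "rabprof", "sexeprin", "newage", "unsb3t1")
qc.b3.small <- qc.b3.sans.occ[, vars]  # keep only what the model uses
object.size(qc.b3.small)               # compare with the full data set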
How large does the design matrix get? If some of those variables have a
lot of levels, it could explain the phenomenon. Any chance that a
continuous variable got recorded as a factor?
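A quick (untested) check for both points, using the data frame name from
your call:

## levels per factor: a factor with very many levels here (or a
## continuous variable that was read in as a factor) will make the
## design matrix enormous
sapply(qc.b3.sans.occ,
       function(x) if (is.factor(x)) nlevels(x) else NA)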
-p
   O__  ---- Peter Dalgaard             Blegdamsvej 3
  c/ /'_ --- Dept. of Biostatistics     2200 Cph. N
 (*) \(*) -- University of Copenhagen   Denmark      Ph:  (+45) 35327918
~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk)           FAX: (+45) 35327907