Hello R-users,
First, my setup: R 1.8.1, compiled as a 64-bit application on 64-bit Solaris
5.8. The machine has 8 GB of RAM and I am its sole user, so essentially all
8 GB are available to R.
I am fairly new to R and I am having a hard time working with large data
sets, which make up over 90% of the analyses done here. The data set I
imported into R, from S+, has a little over 2,000,000 rows by roughly
60 variables, most of them factors but a few continuous. It is in fact a
subset of a larger data set used for analysis in S+. I know that some of
you will think I should sample, but that is not an option in the present
setting.
After first reading the data set into R -- which had its own challenges --
quitting R and saving the workspace takes over 5 minutes, and starting a
new session and loading the data set takes around 15 minutes.
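As an aside, a workaround I am experimenting with is to save and load only
the big data frame itself rather than the whole workspace; a minimal sketch
(the file name is made up):

save(qc.b3.sans.occ, file = "qc.b3.sans.occ.rda", compress = TRUE)
## later, in a fresh session started with R --no-restore:
load("qc.b3.sans.occ.rda")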
I am trying to build a model that I have already built in S+, so I can make
sure I am doing the right thing and can compare resource usage, but so far
I have had no luck! After 45 minutes or so R has used up all the available
memory and is swapping, which brings CPU usage close to nothing.
I am convinced there are settings I could use to optimize memory management
for such problems. help(Memory) tells me about the options
"--min-vsize=vl --max-vsize=vu --min-nsize=nl --max-nsize=nu", but it is
not clear whether and when they should be used. Further down the page it
says: "..., and since setting larger values of the minima will make R
slightly more efficient on large tasks."
On the other hand, searching the R site for memory-management clues, I
found this from Brian Ripley, dated 13 Nov. 2003: "But had you actually
read the documentation you would know it did not do that. That needs
--max-memory-size set." That was in response to someone who had increased
the value of "--min-vsize="; furthermore, I cannot find any
"--max-memory-size" option.
I am wondering whether someone with experience working with large data sets
would share the configuration and options they use. In case it matters,
here is the model I was trying to fit:
library(package = "statmod", pos = 2,
        lib.loc = "/home/jeg002/R-1.8.1/lib/R/R_LIBS")
qc.B3.tweedie <- glm(formula = pp20B3 ~ ageveh + anpol + categveh +
                         champion + cie + dossiera + faq13c + faq5a +
                         kmaff + kmprom + nbvt + rabprof + sexeprin +
                         newage,
                     family = tweedie(var.power = 1.577, link.power = 0),
                     etastart = log(rep(mean(qc.b3.sans.occ[, 'pp20B3']),
                                        nrow(qc.b3.sans.occ))),
                     weights = unsb3t1,
                     trace = TRUE,
                     data = qc.b3.sans.occ)
After one iteration (45+ minutes) R is thrashing through over 10 GB of
memory.
Thanks for any insights,
Gérald Jean
Consulting analyst (statistics), Actuarial
telephone: (418) 835-4900, ext. 7639
fax: (418) 835-6657
e-mail: gerald.jean at spgdag.ca
"In God we trust, all others must bring data" W. Edwards Deming
gerald.jean at dgag.ca writes:
> [model-fitting call quoted from above]
> After one iteration (45+ minutes) R is thrashing through over 10 GB
> of memory.
> Thanks for any insights,
Well, I don't know how much it helps; you are in somewhat uncharted
territory there. I suppose the data set comes to 0.5-1 GB all by itself?
One thing I note is that you have 60 variables but use only 15.
Perhaps it helps to remove some of them before the run?
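An untested sketch; the variable names are copied from your glm() call
(the response, the 14 predictors, and the weights):

vars <- c("pp20B3", "ageveh", "anpol", "categveh", "champion", "cie",
          "dossiera", "faq13c", "faq5a", "kmaff", "kmprom", "nbvt",
          "rabprof", "sexeprin", "newage", "unsb3t1")
qc.b3.small <- qc.b3.sans.occ[, vars]  # keep only what the model uses
object.size(qc.b3.small)               # compare with the full data set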
How large does the design matrix get? If some of those variables have a
lot of levels, it could explain the phenomenon. Any chance that a
continuous variable got recorded as a factor?
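A quick (untested) check for both points, using the data frame name from
your call:

## levels per factor: a factor with very many levels here (or a
## continuous variable that was read in as a factor) will make the
## design matrix enormous
sapply(qc.b3.sans.occ,
       function(x) if (is.factor(x)) nlevels(x) else NA)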
-p
   O__  ---- Peter Dalgaard             Blegdamsvej 3
  c/ /'_ --- Dept. of Biostatistics     2200 Cph. N
 (*) \(*) -- University of Copenhagen   Denmark      Ph:  (+45) 35327918
~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk)           FAX: (+45) 35327907