large dataset

Tue, Jan 30, 2007 4:38 PM

I should have mentioned this in my earlier reply - please use version
0.9975-12 of lme4 when checking with lmer2.  I just uploaded this
version to CRAN and it should appear on the main site and the mirrors
in a day or two.  You can get it now from the SVN archive

https://svn.r-project.org/R-packages/trunk/lme4

In an earlier thread on this list Andrew Robinson described how he was
unable to run even the simplest examples of lmer2 on a FreeBSD system.
With his help we finally tracked down the dumb error that I had made
in the C function mer2_getPars and fixed it for the -12 release.
Under Linux the bug was not causing a memory error but it certainly
would use up much more memory than necessary during the iterations.

On 1/30/07, Douglas Bates <bates at stat.wisc.edu> wrote:

On 1/30/07, Dan Pemstein <dbp at uiuc.edu> wrote:

I'm attempting to fit a crossed random effects model to a rather large
data set.  This is EU parliament voting data (the response variable is
binary) from 574 legislators over 2123 votes.  EU parliamentarians
miss a lot of votes so there are ~700,000 total observations.  The
model also includes quite a few covariates---on the order of 30-50
(mostly fixed effects for country, party, etc), depending on the
particular specification.  I'm having some serious issues fitting a
crossed effects logit model to this data with lme4 without exhausting
system memory.  I have a quad-core Intel Linux machine with 8 gigs of
ram and a lot of swap to play with, but I'm still falling short.
Interestingly, I've successfully fit this model using HLM6 on a
machine with substantially less RAM.

My question is largely about feasibility.  I would like to use lme4 to
analyze this dataset because it provides a much better set of features
for checking model fit and generating predictions than HLM (one can't
even get the fixed effects variance-covariance matrix out of HLM6's
crossed effects routine).  Is this impossible?  Are there any ways to
reduce lmer's memory footprint that I might try?  Would one expect a
cross-classified logit model with 700,000 observations to require
upwards of 12 gigs of memory or have I uncovered a small memory leak
that isn't visible with smaller datasets?  The memory use creeps up
slowly over the course of a run which is at least consistent with a
memory leak, but, not knowing anything about the implementation, I'm
just speculating wildly here.  Obviously, I could sub-sample, but this
is already a sample of a larger dataset, so I'm loath to do that if I
can avoid it.

Could you try to fit the response with a linear mixed model using the
lmer2 function that is in versions 0.9975-11 and later of the lme4
package?  I know the model is inappropriate but I just want to get a
handle on whether the mer2 representation saves enough storage to make
working with such a data set and model feasible.

I shouldn't speculate without actually examining the model fit myself
but I think the memory hog may be the fixed-effects model matrix.
Currently that model matrix must be created as a dense matrix and it
must be created using all the rows.  When you say that you have 30-50
covariates (and I assume that some of them may be factors) then that
matrix could be the one that is breaking the bank.  In lmer2 the
fixed-effects model matrix is stored as a sparse matrix (although it
is initially created as a dense matrix).  The random-effects model
matrix is created as a sparse matrix and it usually isn't the problem
with memory usage.
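
As a rough back-of-the-envelope check on the storage argument above, the
following Python sketch estimates the size of one dense fixed-effects model
matrix and compares dense with compressed-column sparse storage for the
random-effects model matrix.  The observation, legislator and vote counts
come from the original question.  Everything else is an assumption made for
illustration: the helper names are hypothetical, the column counts of 30 and
50 ignore any expansion of factors into several columns, sparse indices are
taken to be 4-byte integers, and the random-effects structure is taken to be
one random intercept per legislator and one per vote, giving two nonzeros per
row.

```python
# Back-of-the-envelope storage estimates; all results are approximations.

BYTES_PER_DOUBLE = 8  # numeric values stored as 8-byte doubles
BYTES_PER_INDEX = 4   # assumption: 4-byte integer indices in sparse storage

# Counts taken from the original question.
N_OBS = 700_000
N_LEGISLATORS = 574
N_VOTES = 2123


def dense_bytes(n_rows, n_cols):
    """Bytes needed to store every entry of an n_rows x n_cols matrix."""
    return n_rows * n_cols * BYTES_PER_DOUBLE


def sparse_csc_bytes(n_nonzero, n_cols):
    """Bytes for compressed-column storage: values, row indices, column pointers."""
    values = n_nonzero * BYTES_PER_DOUBLE
    row_indices = n_nonzero * BYTES_PER_INDEX
    col_pointers = (n_cols + 1) * BYTES_PER_INDEX
    return values + row_indices + col_pointers


def as_mib(n_bytes):
    """Convert a byte count to mebibytes."""
    return n_bytes / 2**20


# Dense fixed-effects model matrix for the two ends of the stated range.
for n_covariates in (30, 50):
    size = dense_bytes(N_OBS, n_covariates)
    print(f"dense fixed effects, {n_covariates} columns: {as_mib(size):.0f} MiB")

# Random-effects model matrix under the assumed two-random-intercept structure.
n_re_cols = N_LEGISLATORS + N_VOTES
n_nonzero = 2 * N_OBS  # one indicator per grouping factor in every row
print(f"random effects, dense:  {as_mib(dense_bytes(N_OBS, n_re_cols)):.0f} MiB")
print(f"random effects, sparse: {as_mib(sparse_csc_bytes(n_nonzero, n_re_cols)):.0f} MiB")
```

These figures cover a single copy of each matrix only; intermediate copies
made while the model is being set up and fitted are not counted.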

If you do succeed in fitting a linear mixed model to these data using
lmer2 I would be interested in the sizes of some of the slots in the
fitted model.  I enclose a short transcript showing one way of
checking these sizes on an S4 object.

Regarding the possibility of a memory leak - I wouldn't be shocked if
I had managed to create a memory leak but the behavior that you
mention is consistent with the garbage collection.  At present the
optimization of the deviance for generalized linear mixed models goes
through the nlminb function in R which means that the deviance
evaluation must be an R function.  Thus there are R objects created
within the optimization that must be garbage collected.  I think I
know a way around this and it is on my "To Do" list to check it out
but that list is pretty long these days so I can't promise anything.

Thanks for writing to the list.  I'll be interested in whether it is
possible to work with such large data sets effectively.
