
large dataset

5 messages · Dan Pemstein, Douglas Bates

#
Hi all,

I'm attempting to fit a crossed random effects model to a rather large
data set.  This is EU parliament voting data (the response variable is
binary) from 574 legislators over 2123 votes.  EU parliamentarians
miss a lot of votes, so there are ~700,000 total observations.  The
model also includes quite a few covariates---on the order of 30-50
(mostly fixed effects for country, party, etc.), depending on the
particular specification.  I'm having serious trouble fitting a
crossed-effects logit model to these data with lme4 without exhausting
system memory.  I have a quad-core Intel Linux machine with 8 gigs of
RAM and a lot of swap to play with, but I'm still falling short.
Interestingly, I've successfully fit this model using HLM6 on a
machine with substantially less RAM.

My question is largely about feasibility.  I would like to use lme4 to
analyze this data set because it provides a much better set of features
for checking model fit and generating predictions than HLM (one can't
even get the fixed-effects variance-covariance matrix out of HLM6's
crossed-effects routine).  Is this simply infeasible?  Are there ways
to reduce lmer's memory footprint that I might try?  Would one expect
a cross-classified logit model with 700,000 observations to require
upwards of 12 gigs of memory, or have I uncovered a small memory leak
that isn't visible with smaller data sets?  The memory use creeps up
slowly over the course of a run, which is at least consistent with a
memory leak, but, not knowing anything about the implementation, I'm
just speculating wildly here.  Obviously, I could sub-sample, but this
is already a sample of a larger data set, so I'm loath to do that if I
can avoid it.

thanks,

Dan
#
On 1/30/07, Dan Pemstein <dbp at uiuc.edu> wrote:

Could you try to fit the response with a linear mixed model using the
lmer2 function that is in versions 0.9975-11 and later of the lme4
package?  I know the model is inappropriate, but I just want to get a
handle on whether the mer2 representation saves enough storage to make
working with such a data set and model feasible.
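
Something along these lines, say, where the data frame and variable
names are placeholders for your own (a sketch, not a tested call):

library(lme4)
## Crossed random intercepts for legislator and vote; 'yea',
## 'legislator', 'vote.id', and 'votes' are hypothetical names
fm <- lmer2(yea ~ 1 + (1 | legislator) + (1 | vote.id), data = votes)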

I shouldn't speculate without actually examining the model fit myself,
but I think the memory hog may be the fixed-effects model matrix.
Currently that model matrix must be created as a dense matrix, and it
must be created using all the rows.  When you say that you have 30-50
covariates (and I assume that some of them may be factors), that
matrix could be the one that is breaking the bank.  In lmer2 the
fixed-effects model matrix is stored as a sparse matrix (although it
is initially created as a dense matrix).  The random-effects model
matrix is created as a sparse matrix, and it usually isn't the problem
with memory usage.
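
To see the difference directly, one can compare a dense model matrix
with a sparse copy of it (a sketch; the formula and data frame name
are hypothetical):

library(Matrix)
## Dense storage costs 8 bytes per cell, zero or not; with
## dummy-coded factors most cells are zero
X <- model.matrix(~ country + party, data = votes)
object.size(X)
## A sparse copy stores only the nonzero entries
object.size(as(X, "sparseMatrix"))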

If you do succeed in fitting a linear mixed model to these data using
lmer2, I would be interested in the sizes of some of the slots in the
fitted model.  I enclose a short transcript showing one way of
checking these sizes on an S4 object.
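
The commands amount to something like this, for a fitted S4 model
object 'fm':

## total size of the fitted object
object.size(fm)
## size of each slot, in increasing order
sort(sapply(slotNames(fm), function(nm) object.size(slot(fm, nm))))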

Regarding the possibility of a memory leak: I wouldn't be shocked if I
had managed to create one, but the behavior that you mention is also
consistent with garbage collection.  At present the optimization of
the deviance for generalized linear mixed models goes through the
nlminb function in R, which means that the deviance evaluation must be
an R function.  Thus R objects are created within the optimization
that must be garbage collected.  I think I know a way around this, and
it is on my "To Do" list to check it out, but that list is pretty long
these days, so I can't promise anything.
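
Schematically (a toy objective for illustration, not the actual lme4
deviance function):

## nlminb's objective is an ordinary R function, called once per
## evaluation, so each iteration allocates R objects that the
## garbage collector must later reclaim
devfun <- function(pars) sum((pars - c(1, 2))^2)
opt <- nlminb(start = c(0, 0), objective = devfun)
opt$par  # converges to c(1, 2)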

Thanks for writing to the list.  I'll be interested in whether it is
possible to work with such large data sets effectively.
-------------- enclosed transcript --------------
[1] 16820760
      Gp    fixef       nc deviance     dims       ST   cnames     call
      72      176      384      528      560     1072     1928     3744
   terms    ranef  weights   offset    flist    frame     ZXyt        A
    6168   184024   196664   196664   978952  3191264  3547072  3594864
       L
 4914248
#
On 1/30/07, Douglas Bates <bates at stat.wisc.edu> wrote:

I should have mentioned this in my earlier reply - please use version
0.9975-12 of lme4 when checking with lmer2.  I just uploaded this
version to CRAN and it should appear on the main site and the mirrors
in a day or two.  You can get it now from the SVN archive:

https://svn.r-project.org/R-packages/trunk/lme4

In an earlier thread on this list, Andrew Robinson described how he
was unable to run even the simplest examples of lmer2 on a FreeBSD
system.  With his help we finally tracked down the dumb error that I
had made in the C function mer2_getPars and fixed it for the -12
release.  Under Linux the bug was not causing a memory error, but it
certainly would use up much more memory than necessary during the
iterations.
#
On Tue, Jan 30, 2007 at 06:38:09PM -0600, Douglas Bates wrote:

Thanks for replying so quickly.

It may take me a couple of days to get the new version of lme4
installed and to run the tests you're interested in.  In the
meantime, I ran (using lmer in 0.9975-11):

A model with only a fixed intercept and crossed random intercepts
  - Ran out of memory + swap.
A model with a single random intercept for votes plus all the covariates
  - Completed.  Topped out at around 6 gigs of memory; this peak
    occurred at the end of the run, after the verbose iteration output
    had completed.  I'm not sure whether my earlier full-model runs
    crashed at this point as well, but it is a distinct possibility.

Both these runs were fit using PQL.  One of my full runs used Laplace
and ran out of memory after 20-odd iterations and 12+ hours of
processor time.
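
For reference, the two earlier calls were of roughly this form
(placeholder variable and data frame names, using lmer's method
argument for GLMMs):

## PQL fit with a single random intercept for votes; 'yea',
## 'vote.id', 'legislator', and 'votes' are placeholder names
m1 <- lmer(yea ~ covariate1 + (1 | vote.id), data = votes,
           family = binomial, method = "PQL")
## Laplace fit with crossed random intercepts
m2 <- lmer(yea ~ covariate1 + (1 | vote.id) + (1 | legislator),
           data = votes, family = binomial, method = "Laplace")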

Here are the sizes for the single random intercept model:
[1] 748925256
       Gp        nc  deviance    status  gradComp   devComp       Xty       rXy
       48       256       320       384       496       728       960       960
    fixef     Omega      call    cnames       Zty       rZy     ranef     terms
      960      3552      6528      7776     17032     17032     17032     17096
     bVar    family       ZtZ         L       XtX       RXX       ZtX       RZX
    17456     31448     35616     69880    121824    121880   1962544   1962544
   RZXinv     flist         y       wts    wrkres        Zt   weights     frame
  1962544   2602096   4965032   4965032   4965032   9931408  39720128  69657216
        X
605738248

I'll post the results of an lmer2 run to the list once I have a
chance.
#
On 1/31/07, Dan Pemstein <dbp at uiuc.edu> wrote:
Thanks.  That result by itself can tell us where the problem lies.
Notice that the size of the X slot is about 600 MB out of the total of
about 750 MB.  By comparison, the other slots, such as XtX, ZtZ, and
ZtX, are much smaller.

If you still have that model fit available, could you check:

library(Matrix)
## X is an S4 slot, hence "@" rather than "$"
object.size(as(mod1@X, "sparseMatrix"))
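
If most of those 30-50 covariates are factors, the dense X is
dominated by zeros from the dummy coding, so the sparse copy should
come out dramatically smaller.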

The good news from this example is that it gives us hope for fitting
mixed models to large data sets.  The bad news is that doing so will
require a considerable amount of development.