large dataset
I should have mentioned this in my earlier reply - please use version 0.9975-12 of lme4 when checking with lmer2. I just uploaded this version to CRAN and it should appear on the main site and the mirrors in a day or two. You can get it now from the SVN archive https://svn.r-project.org/R-packages/trunk/lme4

In an earlier thread on this list Andrew Robinson described how he was unable to run even the simplest examples of lmer2 on a FreeBSD system. With his help we finally tracked down the dumb error that I had made in the C function mer2_getPars and fixed it for the -12 release. Under Linux the bug was not causing a memory error but it certainly would use up much more memory than necessary during the iterations.
On 1/30/07, Douglas Bates <bates at stat.wisc.edu> wrote:
On 1/30/07, Dan Pemstein <dbp at uiuc.edu> wrote:
I'm attempting to fit a crossed random effects model to a rather large data set. This is EU parliament voting data (the response variable is binary) from 574 legislators over 2123 votes. EU parliamentarians miss a lot of votes so there are ~700,000 total observations. The model also includes quite a few covariates, on the order of 30-50 (mostly fixed effects for country, party, etc.), depending on the particular specification. I'm having some serious issues fitting a crossed effects logit model to this data with lme4 without exhausting system memory. I have a quad-core Intel Linux machine with 8 gigs of RAM and a lot of swap to play with, but I'm still falling short. Interestingly, I've successfully fit this model using HLM6 on a machine with substantially less RAM.
My question is largely about feasibility. I would like to use lme4 to analyze this dataset because it provides a much better set of features for checking model fit and generating predictions than HLM (one can't even get the fixed effects variance-covariance matrix out of HLM6's crossed effects routine). Is this impossible? Are there any ways to reduce lmer's memory footprint that I might try? Would one expect a cross-classified logit model with 700,000 observations to require upwards of 12 gigs of memory or have I uncovered a small memory leak that isn't visible with smaller datasets? The memory use creeps up slowly over the course of a run which is at least consistent with a memory leak, but, not knowing anything about the implementation, I'm just speculating wildly here. Obviously, I could sub-sample, but this is already a sample of a larger dataset, so I'm loath to do that if I can avoid it.
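As a rough sense of scale, the storage for one dense, double-precision model matrix can be estimated from the figures quoted above. This is only back-of-the-envelope arithmetic in R. It assumes 8 bytes per entry (R's double storage), takes the quoted 30-50 covariates as one column each, and uses a helper name, dense_matrix_mib, that is made up for illustration.

```r
# Rough size of a dense numeric matrix in MiB (1 MiB = 2^20 bytes).
# Hypothetical helper: assumes 8 bytes per double and ignores object overhead.
dense_matrix_mib <- function(n_rows, n_cols, bytes_per_entry = 8) {
  n_rows * n_cols * bytes_per_entry / 2^20
}

n_obs  <- 700000      # approximate number of observations quoted in the post
n_cols <- c(30, 50)   # quoted range of covariates (assumed one column each)

est <- dense_matrix_mib(n_obs, n_cols)
names(est) <- paste(n_cols, "columns")
print(round(est))     # roughly 160 and 267 MiB
```

Under these assumptions a single copy comes to a few hundred MiB. The total footprint depends on how many copies and intermediate objects exist during fitting, which this arithmetic does not address.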
Could you try to fit the response with a linear mixed model using the lmer2 function that is in versions 0.9975-11 and later of the lme4 package? I know the model is inappropriate but I just want to get a handle on whether the mer2 representation saves enough storage to make working with such a data set and model feasible.

I shouldn't speculate without actually examining the model fit myself but I think the memory hog may be the fixed-effects model matrix. Currently that model matrix must be created as a dense matrix and it must be created using all the rows. When you say that you have 30-50 covariates (and I assume that some of them may be factors) then that matrix could be the one that is breaking the bank. In lmer2 the fixed-effects model matrix is stored as a sparse matrix (although it is initially created as a dense matrix). The random-effects model matrix is created as a sparse matrix and it usually isn't the problem with memory usage.

If you do succeed in fitting a linear mixed model to these data using lmer2 I would be interested in the sizes of some of the slots in the fitted model. I enclose a short transcript showing one way of checking these sizes on an S4 object.

Regarding the possibility of a memory leak - I wouldn't be shocked if I had managed to create a memory leak but the behavior that you mention is consistent with the garbage collection. At present the optimization of the deviance for generalized linear mixed models goes through the nlminb function in R which means that the deviance evaluation must be an R function. Thus there are R objects created within the optimization that must be garbage collected. I think I know a way around this and it is on my "To Do" list to check it out but that list is pretty long these days so I can't promise anything.

Thanks for writing to the list. I'll be interested in whether it is possible to work with such large data sets effectively.
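The transcript mentioned above is not reproduced here. One way of checking slot sizes on an S4 object might look like the following minimal sketch. It uses a toy class, ToyFit, invented purely for illustration rather than an actual lmer2 fit, whose slot names depend on the lme4 version. Only base R functions are used (setClass, slotNames, slot, object.size).

```r
library(methods)  # S4 support; attached by default in interactive R

# Toy S4 class standing in for a fitted model (hypothetical; not an lme4 class).
setClass("ToyFit",
         representation(X = "matrix", y = "numeric", label = "character"))

fit <- new("ToyFit",
           X = matrix(0, nrow = 1000, ncol = 50),  # stand-in for a model matrix
           y = numeric(1000),
           label = "example")

# Size in bytes of each slot, reported largest first.
slot_sizes <- sapply(slotNames(fit),
                     function(nm) as.numeric(object.size(slot(fit, nm))))
print(sort(slot_sizes, decreasing = TRUE))
```

The same sapply call could be pointed at a real fitted model by substituting it for fit. The largest entries then indicate which components dominate the object's memory use.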