Dealing with large datasets in lme/lmer
Again thanks a lot for the help, Dr. Bates! Now with the development version of lmer I got an error for the following model which works fine with the old lmer: >fit.lmer <- lmer(y ~ FA*FB*FC+weight+(1|Subj), Model); Error in validObject(.Object) : invalid class "lmer" object: dims slot not named or incorrect length What went wrong? Regarding compiling, it works fine on a Mac OS X 10.4 with with duo 2.7GHz processors, but failed on a Mac OS X 10.4.10 with a 2 GHz Intel Core Duo processor with the following error (does it have something to do with the Intel processor?): > R CMD INSTALL ./lme4 * Installing to library '/Library/Frameworks/R.framework/Resources/ library' * Installing *source* package 'lme4' ... ** libs ** arch - i386 gcc-4.0 -arch i386 -isysroot /Developer/SDKs/MacOSX10.4u.sdk -no-cpp- precomp -I/Library/Frameworks/R.framework/Resources/include -I/ Library/Frameworks/R.framework/Resources/include/i386 -msse3 -I"/ Library/Frameworks/R.framework/Resources/library/Matrix/include" - D__NO_MATH_INLINES -fPIC -g -O2 -std=gnu99 -march=nocona -c init.c - o init.o gcc-4.0 -arch i386 -isysroot /Developer/SDKs/MacOSX10.4u.sdk -no-cpp- precomp -I/Library/Frameworks/R.framework/Resources/include -I/ Library/Frameworks/R.framework/Resources/include/i386 -msse3 -I"/ Library/Frameworks/R.framework/Resources/library/Matrix/include" - D__NO_MATH_INLINES -fPIC -g -O2 -std=gnu99 -march=nocona -c lmer.c - o lmer.o lmer.c:356:48: error: macro "N_AS_CHM_DN" passed 3 arguments, but takes just 2 lmer.c: In function 'Mer_eta': lmer.c:356: error: 'N_AS_CHM_DN' undeclared (first use in this function) lmer.c:356: error: (Each undeclared identifier is reported only once lmer.c:356: error: for each function it appears in.) lmer.c:1448:40: error: macro "N_AS_CHM_DN" passed 3 arguments, but takes just 2 lmer.c: In function 'nglmer_condMode': lmer.c:1448: error: 'N_AS_CHM_DN' undeclared (first use in this function) lmer.c:1449:30: error: macro "N_AS_CHM_DN" passed 3 arguments, but takes just 2 lmer.c:1718:45: error: macro "N_AS_CHM_DN" passed 3 arguments, but takes just 2 lmer.c: In function 'lmer_MCMC_betab': lmer.c:1718: error: 'N_AS_CHM_DN' undeclared (first use in this function) make: *** [lmer.o] Error 1 chmod: /Library/Frameworks/R.framework/Versions/2.5/Resources/library/ lme4/libs/i386/*: No such file or directory ** arch - ppc gcc-4.0 -arch ppc -isysroot /Developer/SDKs/MacOSX10.4u.sdk - std=gnu99 -no-cpp-precomp -I/Library/Frameworks/R.framework/Resources/ include -I/Library/Frameworks/R.framework/Resources/include/ppc -I/ usr/local/include -I"/Library/Frameworks/R.framework/Resources/ library/Matrix/include" -fPIC -g -O2 -c init.c -o init.o gcc-4.0 -arch ppc -isysroot /Developer/SDKs/MacOSX10.4u.sdk - std=gnu99 -no-cpp-precomp -I/Library/Frameworks/R.framework/Resources/ include -I/Library/Frameworks/R.framework/Resources/include/ppc -I/ usr/local/include -I"/Library/Frameworks/R.framework/Resources/ library/Matrix/include" -fPIC -g -O2 -c lmer.c -o lmer.o lmer.c:356:48: error: macro "N_AS_CHM_DN" passed 3 arguments, but takes just 2 lmer.c: In function 'Mer_eta': lmer.c:356: error: 'N_AS_CHM_DN' undeclared (first use in this function) lmer.c:356: error: (Each undeclared identifier is reported only once lmer.c:356: error: for each function it appears in.) lmer.c:1448:40: error: macro "N_AS_CHM_DN" passed 3 arguments, but takes just 2 lmer.c: In function 'nglmer_condMode': lmer.c:1448: error: 'N_AS_CHM_DN' undeclared (first use in this function) lmer.c:1449:30: error: macro "N_AS_CHM_DN" passed 3 arguments, but takes just 2 lmer.c:1718:45: error: macro "N_AS_CHM_DN" passed 3 arguments, but takes just 2 lmer.c: In function 'lmer_MCMC_betab': lmer.c:1718: error: 'N_AS_CHM_DN' undeclared (first use in this function) make: *** [lmer.o] Error 1 chmod: /Library/Frameworks/R.framework/Versions/2.5/Resources/library/ lme4/libs/ppc/*: No such file or directory ERROR: compilation failed for package 'lme4' ** Removing '/Library/Frameworks/R.framework/Versions/2.5/Resources/ library/lme4' ** Restoring previous '/Library/Frameworks/R.framework/Versions/2.5/ Resources/library/lme4' Thanks, Gang ========= Gang Chen, Ph. D. National Institutes of Health, DHHS
On Sep 10, 2007, at 4:26 PM, Douglas Bates wrote:
The best way to obtain the package sources is with a subversion client. On Linux it is called svn. The call to check out a copy of the package sources is svn co https://svn.r-project.org/R-packages/branches/gappy-lmer ./lme4 You need the whole directory to build the package. Under Linux or Mac OS X I use R CMD INSTALL ./lme4 For Windows I use Uwe's win-builder at win-builder.R-project.org On 9/10/07, Gang Chen <gangchen at mail.nih.gov> wrote:
Dr. Bates, Thank you very much for the quick response and suggestions. I really want to try out the lmer development package, but could not figure out how to download it. I tried to run wget https://svn.r-project.org/R-packages/branches/gappy-lmer/ (Or wget --no-check-certificate -r -nv -m -np -nH https://svn.r- project.org/R-packages/branches/gappy-lmer/ ) but it didn't work. So how can I obtain the package? Do I only need the files under https://svn.r-project.org/R-packages/branches/gappy-lmer/R/ or do I need to compile/build the package with the source code somehow? Thanks, Gang ========= Gang Chen, Ph. D. National Institutes of Health, HHS On Sep 5, 2007, at 3:11 PM, Douglas Bates wrote:
On 9/4/07, Gang Chen <gangchen at mail.nih.gov> wrote:
Dear all,
I'm running mixed-effects analysis with large datasets in a loop like this:
for (i in 1:60) {
for (j in 1:60) {
for (k in 1:60) {
[...update y here in Model here...]
fit.lme <- lme(y ~ FA*FB*FC+weight, random = pdBlocked(list
(pdCompSymm(~FB-1), pdCompSymm(~FC-1), pdIdent(~1))),
weights=varIdent
(form=~1|FA), Model);
Stat[i, j, k,] <- anova(fit.lme)$F[-1];
}
}
}
Did you create the array Stat outside the loop? If not you will be doing a lot of copying of elements of that array.
This takes a little over 100 hours to finish on a Mac G5 with duo 2.7GHz processors and 4GB memory.
In the mixed-effects model
y = X*beta + Z*b + e
the fixed-effects nxp matrix X and random-effects matrix nxq Z are always the same for all the iterations in my case, and the only thing that differs is y (and the estimates of beta, b and e also differ of course). In my case n = 504 (large), p and q are moderate. I just read Dr. Douglas Bates's presentation during uerR! 2007 (very informative by the way):
Thank you.
It seems many components in the extended system matrix (equation (2) on page 22) for the Cholesky decomposition remain the same during the iterations. So there are a lot of repetitive computations on those same matrix operations in the above loop. How can I achieve a better efficiency? Someone suggested to me running lme/lmer with a two- dimensional response Y instead of one-dimensional y. My questions are:
(1) So far I have only seen people running lme/lmer with y in a format of one-dimensional array from a table. If I combine all those y's (indices i, j, k) into an two-dimensional array Y, is there a way I can run lme/lmer on Y instead of y? In other words, does lme/lmer take a two-dimensional array Y?
Not at present.
If so, do I have to save the huge array in a table in text file and then read in R before I run lme/ lmer?
No. There are many ways of getting data into R other than creating a text file and reading it. See the manual "R Data Import/Export" and also Martin Maechler's presentation at useR!2007. http://user2007.org/program/presentations/maechler.pdf
Also if that is the case, how can I label those many columns somehow associated with Y?
(2) A more serious concern is about memory. With the current looping approach it uses about 1GB. If I could possibly go with the matrix method described in (1), I'm worried that it might not be practically feasible with the current computers. Any thoughts?
Well first you are discussing the computational methods used in lmer but you want to fit a model with different residual variances for different groups. At present you can't do that in lmer. If you look at the lmer function in the development version of the lme4 package (currently at https://svn.r-project.org/R-packages/branches/gappy-lmer, soon to be at http://r-forge.r-project.org/projects/lme4 for some value of "soon") you will see that it follows the equations in my useR presentation fairly closely. The Xy array is n by (p + 1) with X in the first p columns and y in the p + 1st column. The object of class "lmer" has slots named y, Zt (Z-transpose), ZtXy (Zt %*% Xy), and XytXy (crossprod(Xy)). After fitting the model to the first simulated response, producing the object 'fm', the only operations needed to update the model are fm at y <- newy Xy <- cbind(fm at X, fm at y) fm at ZtXy <- fm at Zt %*% Xy fm at XytXy <- crossprod(Xy) lme4:::mer_finalize(fm, verbose) where 'verbose' is a logical scalar indicating if you want verbose output during the optimization phase. Once you get things working on a small example you would probably want to turn that off. Please note that this code applies to the development version of the lme4 package.