I want to run the *glmer* procedure on a "large" dataset (250,000
observations). The model includes 5 fixed effects, 2 interaction terms and
3 random effects. It takes more than 15 min to run on my laptop (recent
Intel Core i7, 4 GB RAM). The IT department of the University I am working
at has therefore set up an RStudio server running Ubuntu. My problem is
that 8 cores are available on this server, but when I run the *glmer*
procedure only 1 of them is used, and it takes more than 1 h to get the
results. How can I solve that problem and improve time efficiency? I found
on Google that I may have to use the parallel package, but (i) I am not at
all familiar with such computing procedures and they look a bit
complicated, and (ii) the code I found works with other functions in other
packages such as *kmeans{stats}*
(https://stackoverflow.com/questions/29998718/how-can-i-make-r-use-more-cpu-and-memory)
but with neither *lmer* nor *glmer*.
Can you please help with a simple procedure to tackle the problem?
Many thanks!
How can I make R use more than 1 core (8 available) on an Ubuntu RStudio server?
9 messages · Douglas Bates, Ben Bolker, Doran, Harold +2 more
The procedure is fairly simple - just rewrite the lme4 package from scratch. :-)
@DB, I thought you were retired :)

But, to the OP: lme4 functions already take advantage of many computational
methods that make fitting these models to large datasets faster than
(virtually) all other packages for estimating linear mixed models. The
packages you might come across for parallel processing won't necessarily
apply here. For example, the foreach package is fantastic, but it cannot be
applied to a glmer model.

Although, Doug, I do recall coming across some work, I think in the
Microsoft R distribution, that did some parallel computing for matrix
problems by default. I'm saying this from memory and cannot recall
specifics.

With that said, I'm not certain parallel processing is the right thing to
do for problems of this sort. Iteration t+1 depends on iteration t, and
when solutions to the problem live on different processors, the expense of
combining those pieces back together is not always faster; it can actually
be even more expensive and slower.
On Thu, Jan 18, 2018 at 2:07 PM Douglas Bates <bates at stat.wisc.edu> wrote:

> The procedure is fairly simple - just rewrite the lme4 package from
> scratch. :-)
On a less facetious note, you may find it worthwhile installing Julia (see
https://julialang.org/downloads) and the MixedModels package. The
MixedModels package itself is not multi-threaded, but most of the linear
algebra goes through the BLAS (Basic Linear Algebra Subprograms) and, by
default, Julia is compiled against OpenBLAS. You can also, in a reasonably
straightforward way as these things go, compile Julia against Intel's Math
Kernel Library (MKL), which helps accelerate the linear algebra further. An
accelerated BLAS is not likely to buy you much in lme4, because its linear
algebra goes through Eigen and SuiteSparse instead.
On Thu, Jan 18, 2018 at 2:16 PM Doran, Harold <HDoran at air.org> wrote:

> @DB, I thought you were retired :)

I am retired. I'm just not very good at it and keep coming in to the office
to work on various projects.

> But, to the OP, lme4 functions already take advantage of many
> computational methods that make fitting these models to large datasets
> faster than (virtually) all other packages for estimating linear mixed
> models.

The MixedModels package in Julia will usually perform at least as well as
lme4 and sometimes much better. Of course, using it entails learning a bit
of Julia. I would point out that with the RCall and RData packages for
Julia it is fairly straightforward to pass data back and forth between R
and Julia.

> The packages you might come across for parallel processing won't
> necessarily apply here. For example, the foreach package is fantastic,
> but could not be applied to a glmer model. Although, Doug, I do recall
> coming across some work I think in the Microsoft R distribution that did
> some parallel computing for matrix problems by default. I'm saying this
> from memory and cannot recall specifics.

The Microsoft R distribution (and, before that, Revolution R) uses the MKL
BLAS that I mentioned. Thanks for the reminder; it may be worthwhile trying
it with lme4. Those benchmarks are somewhat disingenuous because they only
benchmark some linear algebra operations, which is what MKL does very well.
Interestingly, the most important operation for statisticians - obtaining
least squares solutions - is not accelerated in the standard R
distribution.

> With that said, I'm not certain parallel processing is the right thing to
> do with problems of this sort. Iteration t+1 depends on iteration t and
> when solutions to the problem live on different processors, the expense
> of combining those things back together is not always faster, but can
> actually be even more expensive and slower.

Parallelizing model fitting code is very tricky.
Explaining a little bit more: unlike a lot of informatics/machine-learning
procedures, the algorithm underlying lme4 is not naturally
parallelizable. There are components that *could* be done in parallel,
but it's not simple.
If you need faster computation, you could either try Doug's
MixedModels.jl package for Julia, or the glmmTMB package (on CRAN),
which may scale better than glmer for problems with large numbers of
fixed-effect parameters (although my guess is that it's close to a tie
for the problem specs you quote below, unless your fixed effects are
factors with several levels).
Sometimes installing better-optimized linear algebra libraries or
better-optimized builds of R (an optimized BLAS, or Microsoft's
"R Open") can help, although likely not in the case of lme4.
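As a small aside (not from the thread): before swapping libraries around,
it is worth checking which BLAS/LAPACK your R build is actually linked
against. A minimal check in base R, assuming R >= 3.4 where sessionInfo()
reports the shared-library paths:

```r
## See which BLAS/LAPACK libraries this R session is linked against.
## An optimized build shows e.g. an OpenBLAS or MKL path instead of
## R's bundled reference BLAS.
sessionInfo()   # look for the "BLAS:" and "LAPACK:" lines
La_version()    # version string of the LAPACK in use
```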
My other comment is that a lot of the computational load of modeling
has to do with running lots of different models, not with how long a
single model takes. For example,
- likelihood profiling
- parametric bootstrapping
- model comparison and testing via likelihood ratio tests or
information criteria
- model selection (ugh)
are all procedures that can be easily parallelized (support for
parallel computation is built in for the first two).
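For concreteness, a minimal sketch of lme4's built-in parallel support for
the first two procedures, using the cbpp example data shipped with lme4
(the model itself is just an illustration, not the OP's model; "multicore"
uses forking and is not available on Windows):

```r
library(lme4)

## a small binomial GLMM on lme4's bundled cbpp dataset
fm <- glmer(cbind(incidence, size - incidence) ~ period + (1 | herd),
            data = cbpp, family = binomial)

## parametric bootstrap: 500 replicates spread across 8 cores
bb <- bootMer(fm, FUN = fixef, nsim = 500,
              parallel = "multicore", ncpus = 8)

## likelihood profiling can also run across cores
pp <- profile(fm, parallel = "multicore", ncpus = 8)
confint(pp)
```

So even though a single glmer fit stays on one core, the expensive
inference steps built around it can use all 8.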
cheers
Ben Bolker
A while back, I did run lmer on a very large model in Microsoft R vs.
standard R, and the timing was indeed faster for the same model on the same
computer. Not by any order of magnitude that would be life-changing, but
faster nonetheless.
Dear Mr Bates, Mr Bolker and Harold,

Thanks for your quick and enlightening answers! I will have a look at the
different solutions you proposed (Julia and glmmTMB) while waiting for you
to rewrite your marvelous package from scratch to break through this
limit! :-)

Cheers
1 day later
On Thu, Jan 18, 2018 at 03:36:08PM -0500, Ben Bolker wrote:
I'm currently analysing a few huge datasets. In one of the cases the
outcome was binary (in the other cases the outcome was count data, so I
used negative binomial models in glmmTMB), so I tried both glmer and
glmmTMB, and glmmTMB was faster. My model included about 11 fixed effects
without interactions and three random intercept terms.

However, I had problems getting clean convergence when I tried to fit the
model to the complete dataset, both with glmer and glmmTMB, and what I did
might help Nicolas Bédère too. I think the convergence problems in my case
were related to the fact that the outcome was very rare: only 11,221 cases
had the outcome (death), while 5,674,928 didn't (they were alive).

Anyway, I divided the dataset into 8 bins and fitted the same model to
each one, and since I had a 4-core CPU, 4 datasets could be fitted
independently in parallel. Then I took the estimates and applied Rubin's
rules to them to get pooled results. (In my particular case, I kept all
11,221 positive cases in each of the 8 datasets, while each negative case
appeared in only one of the 8 datasets.)

I consider what I did a kind of poor-man's bootstrapping, but I would like
some feedback on the validity of the results one gets with this method. If
it is valid, then it is one way of parallelising glmer.
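A hypothetical sketch of this split-fit-pool scheme (the data and variable
names `mydata`, `y`, `x1`, `x2`, `g` are placeholders, and whether Rubin's
rules - which were designed for multiple imputation - are statistically
valid for data splitting is exactly the open question here):

```r
library(lme4)
library(parallel)

## fit the model to one subset; return fixed effects and their variances
fit_one <- function(d) {
  m <- glmer(y ~ x1 + x2 + (1 | g), data = d, family = binomial)
  list(est = fixef(m), var = diag(vcov(m)))
}

## split the data into k disjoint bins and fit in parallel
## (mclapply forks, so this works on Unix-alikes, not Windows)
k <- 8
bins <- split(mydata, sample(rep(seq_len(k), length.out = nrow(mydata))))
fits <- mclapply(bins, fit_one, mc.cores = 4)

## Rubin's rules: pooled estimate = mean of the k estimates;
## pooled variance = mean within-fit variance + (1 + 1/k) * between-fit
## variance of the estimates
est <- sapply(fits, `[[`, "est")               # p x k matrix
v   <- sapply(fits, `[[`, "var")
pooled_est <- rowMeans(est)
pooled_var <- rowMeans(v) + (1 + 1/k) * apply(est, 1, var)
```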
Hans Ekbrand, Fil Dr
Epost/email: <hans.ekbrand at gu.se>
Telefon/phone: +46-31 786 47 73
Institutionen för sociologi och arbetsvetenskap, Göteborgs universitet
Department of Sociology and Work Science, University of Gothenburg