LMM with Big data using binary DV

Fri, Feb 10, 2012 4:20 AM

Dear AC (and perhaps Doug),

You may be well aware of this, but one way of substantially speeding up the estimation of models with binary data is to use "cbind", though the feasibility of this depends on the nature of the model (number of predictors and number of unique values for each predictor variable). The kind of code you need for this is below. You'll see that the fixed effects estimates, standard errors, and random effects variances turn out the same.

I've compared the speed of lme4 versus lme4Eigen, on a simulated dataset with 100,000 observations, using a 2GHz MacBook. Based on a handful of simulations, there doesn't appear to be much difference between the two packages in terms of speed (sometimes one is faster, sometimes the other). I have reported the results of one simulation here. The two packages generate identical results for this dataset.

Cheers,
Malcolm


N <- 100000
grps <- 100
dat <- data.frame(x1 = sample(1:10, N, replace=T), x2 = sample(18:23, N, replace=T), grp=rep(1:grps, each=N/grps))
dat$y <- rbinom(N, prob = plogis(-5 + 0.1*dat$x1 + 0.2*dat$x2 + rnorm(grps)[dat$grp]), size = 1)
failures <- by(dat, list(dat$x1, dat$x2, dat$grp), function(x) sum(x$y==0))
successes <- by(dat, list(dat$x1, dat$x2, dat$grp), function(x) sum(x$y==1))
dat2 <- expand.grid(x1=sort(unique(dat$x1)), x2=sort(unique(dat$x2)), grp=sort(unique(dat$grp)))
dat2$failures <- as.vector(failures)
dat2$successes <- as.vector(successes)
library(lme4)
system.time(glmer(y ~ x1 + x2 + (1 | grp), dat, family=binomial))
#   user  system elapsed 
# 22.918   0.660  24.441
system.time(glmer(cbind(successes, failures) ~ x1 + x2 + (1 | grp), dat2, family=binomial))
#   user  system elapsed 
#  1.833   0.017   1.855 
detach("package:lme4")
library(lme4Eigen)
system.time(glmer(y ~ x1 + x2 + (1 | grp), dat, family=binomial))
#   user  system elapsed 
# 24.824   1.811  26.773 
system.time(glmer(cbind(successes, failures) ~ x1 + x2 + (1 | grp), dat2, family=binomial))
#   user  system elapsed 
#  1.687   0.039   1.723

Date: Thu, 9 Feb 2012 14:13:24 -0600
From: Douglas Bates <bates at stat.wisc.edu>
To: Joshua Wiley <jwiley.psych at gmail.com>
Cc: AC Del Re <acdelre at stanford.edu>, r-sig-mixed-models at r-project.org
Subject: Re: [R-sig-ME] LMM with Big data using binary DV

On Wed, Feb 8, 2012 at 8:28 PM, Joshua Wiley <jwiley.psych at gmail.com> wrote:

Hi AC,

My personal preference would be glmer from the lme4 package. ?I prefer
the Laplace approximation for the likelihood over the quasilikelihood
in glmmPQL. ?To give some exemplary numbers, I simulated a dataset
with 2 million observations nested within 200 groups (10,000
observations per group). ?I then ran an random intercepts model using:

system.time(m <- glmer(Y ~ X + W + (1 | G), family = "binomial"))

where the matrices/vectors are of sizes: Y = [2 million, 1]; X = [2
million, 6]; W = [2 million, 3]; G = [2 million, 1]

This took around 481 seconds to fit on a 1.6ghz dual core laptop.
With the OS and R running, my system used ~ 6GB of RAM for the model
and went up to ~7GB to show the summary (copies of the data are
made---changed in the upcoming version of lme4).

So as long as you have plenty of memory, you should have no trouble
modelling your data using glmer(). ?To initially make sure all your
code works, I might use a subset of your data (say 10k), once you are
convinced you have the model you want, run it on the full data.

If you would have an opportunity to run that model fit or a comparable
on lme4Eigen::glmer we would appreciate information about speed,
accuracy and memory usage.

In lme4Eigen::glmer there are different levels of precision in the
approximation to the deviance being optimizer.  These are controlled
by the nAGQ argument to the function.  The default, nAGQ=1, uses the
Laplace approximation.  The special value nAGQ=0 also uses the Laplace
approximation but profiles out the fixed-effects parameters.  This
profiling is not exact but usually gets you close to the optimum that
you would get from nAGQ=1, but much, much faster.  In a model like
this you can also use nAGQ>1 and <= 25.  On the model fits we have
tried we don't see a lot of difference in timing between, say, nAGQ=9
and nAGQ=25 but on a model fit like this you might.

As a fallback, we would appreciate the code that you used to simulate
the response.  We could generate something ourselves, of course, but
it is easier to compare when you copy someone else's simulation.

On Wed, Feb 8, 2012 at 5:28 PM, AC Del Re <acdelre at stanford.edu> wrote:

Hi,

I have a huge dataset (2.5 million patients nested within ?> 100
facilities) and would like to examine variability across facilities in
program utilization (0=n, 1=y; utilization rates are low in general), along
with patient and facility predictors of utilization.

I have 3 questions:

1. What program and/or package(s) do you recommend for running LMMs with
big data (even if they are not R packages)?

2. Are there any clever work arounds (e.g., random sampling of subset of
data, etc) that would allow me to use only R packages to run this dataset
(assuming I need to use another program due to the size of the dataset)?

3. What type of LMM is recommended with a binary DV similar to the one I am
wanting to examine? I know of two potential options (family=binomial option
in lmer and the glmmPQL in the MASS package) but am not sure which is more
appropriate or what other R packages and functions are available for this
purpose?

Thank you,

AC

LMM with Big data using binary DV

Thread (4 messages)