Convergence problem example

Let me expand a bit on what Ben says here.

  Just a quick comment.

  As long-time readers of this forum may know (I think I've said it here
before), the convergence warnings provided by lme4 are often
over-sensitive.  They're based on *post hoc* attempts (after the
nonlinear optimizer thinks it has converged) to evaluate the KKT
criteria (in this case reducing to zero gradient, negative
positive-definite Hessian) by computing the finite-difference
approximations of the gradient and Hessian of the negative
log-likelihood.  The problem is that this computation is just as subject
to numeric error as the nonlinear optimization itself, so it doesn't
work very well. (Some of this is stated in ?lme4::convergence)

  Doug Bates feels that attempts to fix the problem are
"toast-scraping", i.e. attempts to fix something that was a bad idea in
the first place (i.e., I/we should throw away the metaphorical burned
toast and scrap the post-hoc convergence tests, relying on the optimizer
to tell us if it thinks it has converged successfully). The Catch-22 is
that (1) I'm afraid to throw a system that might catch a reasonable
fraction of truly bad convergence cases; (2) I don't want to flip-flop
(take the tests out, then decide that they were a good idea after all
and reintroduce them); (3) making an *informed* decision what to do
(e.g. throwing out the tests or setting a more appropriate threshold)
would require a lot of work generating test cases -- an in order to get
a really good handle on the problem, we'd have to generate not just
"nice" test cases but all kinds of pathological examples where there
really are convergence problems.

The "toast-scaping" is a reference Ed Deming's comments about trying to do
quality control by inspection after the fact.  He said this was a case of
"burning the toast and then scraping it" when the better course was not to
burn the toast in the first place.

Determining  ML or REML estimates for linear mixed-effects models can be a
difficult optimization problem.  The way we formulate it in lme4 is as a
constrained optimization problem but with rather simple constraints.  Some
of the components of the parameter vector must be non-negative. Sometimes
the problem can be very simple - optimize with respect to a single
parameter - and sometimes it can be relatively difficult.  Some algorithms
may work well on one type of problem and others work well on a different
problem.  My recent experience is that the NLopt (
https://nlopt.readthedocs.io/) implementation of the BOBYQA (Bounded
Optimization BY Quadratic Approximation) algorithm is both reliable and
fast.  It is available through the nloptr package for R.  In the Julia code
I use the NLopt.jl package (https://github.com/JuliaOpt/NLopt.jl).  I
haven't run into a case where another algorithm or implementation has
converged to a (substantially) better optimum than does this one.  (That
is, there may be a slightly better optimum from another algorithm but the
differences are negligible.)

My recommendation is to change the default algorithm to the nloptr
implementation of BOBYQA and drop the attempts to check convergence.  Some
of the problems with the convergence checks are:

- The conditions like the gradient being zero don't apply when convergence
is on the boundary

- The calculations are based on finite-difference derivatives and (I think)
finite differences of finite differences.  These are notoriously inaccurate.

- In any floating point calculation you can't expect an exact answer so
then you need to decide when you are close enough to, say, a gradient of
zero. This can be very difficult to determine.

The optimization for GLMMs can be even more difficult because, in the
second stage at least, the fixed-effects parameters are part of the general
optimization and the objective (deviance) function doesn't have a closed
form expression.  For some cases we can use adaptive Gauss-Hermite
quadrature, in other cases we use the Laplace approximation.   I was
responsible for the two-stage approach of doing an initial, fast,
optimization where the fixed-effects parameters were optimized in the PIRLS
(penalized iteratively reweighted least squares) evaluation followed by the
second, slower optimization with the fixed-effects parameters in the
general optimization.  The first stage may not even be necessary.  However,
I think that this again is a case where the optimization code - for both
stages - should default to the NLopt implementation of BOBYQA.

I would be quite happy to participate in studying different approaches to
the optimization using data sets from the literature and contributed data
and models.  I have started something like this in the benchmark section of
the MixedModels package for Julia.  At present it is just checking the
execution time for about 50 different linear mixed model fits but the code
could reasonably be modified to try different optimization methods and to
test GLMMs as well.  I realize that most readers of this list would want to
work in R but there are definite advantages in a Julia implementation.  For
one thing MixedModels is written completely in Julia whereas lme4 is an
exotic mixture of languages with all the delightful interfaces between
languages and packages.

  Besides being hard and tedious, this is time I don't really have.
Volunteers ... ?

  cheers
   Ben Bolker

On 2018-04-25 11:12 AM, Phillip Alday wrote:
Changing the optimizer seems to address the convergence warnings and has
no impact on the other aspects of the fit:

m1 <- update(m1, control=glmerControl(optimizer="bobyqa"))
summary(m1)
Generalized linear mixed model fit by maximum likelihood (Adaptive
  Gauss-Hermite Quadrature, nAGQ = 30) [glmerMod]
 Family: binomial  ( logit )
Formula: fscr ~ lncfs + ai + heifer + (1 | cow)
   Data: dairy
Control: glmerControl(optimizer = "bobyqa")

     AIC      BIC   logLik deviance df.resid
  4017.8   4047.9  -2003.9   4007.8     3022

Scaled residuals:
    Min      1Q  Median      3Q     Max
-1.8136 -0.7652 -0.6479  1.0420  1.6213

Random effects:
 Groups Name        Variance Std.Dev.
 cow    (Intercept) 0.3419   0.5847
Number of obs: 3027, groups:  cow, 1575

Fixed effects:
                  Estimate Std. Error z value Pr(>|z|)
(Intercept)       -1.55696    0.43481  -3.581 0.000343 ***
lncfs              0.52222    0.10007   5.218  1.8e-07 ***
aiai              -1.09560    0.12341  -8.878  < 2e-16 ***
heiferprimiparous -0.08259    0.09718  -0.850 0.395418
---
Signif. codes:  0 ?***? 0.001 ?**? 0.01 ?*? 0.05 ?.? 0.1 ? ? 1

Correlation of Fixed Effects:
            (Intr) lncfs  aiai
lncfs       -0.965
aiai        -0.193 -0.046
heifrprmprs -0.054  0.016 -0.051

Phillip

On 04/25/2018 04:56 PM, Paul Johnson wrote:
In the book Multilevel and Longitudinal Modeling using Stata,
Rabe-Hesketh and Skrondal have a lot of exercises and over the years
I've been trying to write Stata and R code to demonstrate
similarities/differences.

I've run into an example where Stata (either with builtin methods or
the addon gllamm) seems to think it gets estimates but lme4
diagnostics say there is a convergence failure.  I want to impose on
you, get your advice about it. We see convergence warnings quite often
with lme4, but they are usually the "rescale your variables" errors,
not as blunt as this.

The first part retrieves "dairy.dta" from the book website

library(foreign)
library(lme4)

fn <- "dairy"
if (!file.exists(paste0(fn, ".dta12"))) {
    download.file(paste0("http://www.stata-press.com/data/mlmus3/",
fn, ".dta"),
                  destfile = paste0(fn, ".dta12"))
}

dairy <- read.dta(paste0(fn, ".dta12"))

#1
m1 <- glmer(fscr ~ lncfs + ai + heifer + (1 | cow),
            family = binomial(link = "logit"), nAGQ = 30,
            data = dairy)
summary(m1)

From that, I see:
m1 <- glmer(fscr ~ lncfs + ai + heifer + (1 | cow),
+             family = binomial(link = "logit"), nAGQ = 30,
+             data = dairy)
Warning messages:
1: 'rBind' is deprecated.
 Since R version 3.2.0, base's rbind() should work fine with S4 objects
2: In checkConv(attr(opt, "derivs"), opt$par, ctrl =
control$checkConv,  :
  Model failed to converge with max|grad| = 0.00396932 (tol = 0.001,
component 1)
summary(m1)
Generalized linear mixed model fit by maximum likelihood (Adaptive
  Gauss-Hermite Quadrature, nAGQ = 30) [glmerMod]
 Family: binomial  ( logit )
Formula: fscr ~ lncfs + ai + heifer + (1 | cow)
   Data: dairy

     AIC      BIC   logLik deviance df.resid
  4017.8   4047.9  -2003.9   4007.8     3022

Scaled residuals:
    Min      1Q  Median      3Q     Max
-1.8136 -0.7652 -0.6479  1.0420  1.6213

Random effects:
 Groups Name        Variance Std.Dev.
 cow    (Intercept) 0.3419   0.5847
Number of obs: 3027, groups:  cow, 1575

Fixed effects:
                  Estimate Std. Error z value Pr(>|z|)
(Intercept)       -1.55693    0.43481  -3.581 0.000343 ***
lncfs              0.52221    0.10007   5.218 1.81e-07 ***
aiai              -1.09560    0.12341  -8.878  < 2e-16 ***
heiferprimiparous -0.08259    0.09718  -0.850 0.395410
---
Signif. codes:  0 ?***? 0.001 ?**? 0.01 ?*? 0.05 ?.? 0.1 ? ? 1

Correlation of Fixed Effects:
            (Intr) lncfs  aiai
lncfs       -0.965
aiai        -0.193 -0.046
heifrprmprs -0.054  0.016 -0.051
convergence code: 0
Model failed to converge with max|grad| = 0.00396932 (tol = 0.001,
component 1)

Stata output for comparison purposes, using gllamm:

. gllamm fscr lncfs ai heifer, i(cow) link(logit) fam(binom) adapt

Running adaptive quadrature
Iteration 0:    log likelihood = -2004.6011
Iteration 1:    log likelihood = -2003.9085
Iteration 2:    log likelihood = -2003.9069

Adaptive quadrature has converged, running Newton-Raphson
Iteration 0:   log likelihood = -2003.9069
Iteration 1:   log likelihood = -2003.9069
Iteration 2:   log likelihood = -2003.9064
Iteration 3:   log likelihood = -2003.9064

number of level 1 units = 3027
number of level 2 units = 1575

Condition Number = 47.123877

gllamm model

log likelihood = -2003.9064

------------------------------------------------------------------------------
        fscr |      Coef.   Std. Err.      z    P>|z|     [95% Conf.
Interval]

-------------+----------------------------------------------------------------
       lncfs |   .5222157   .1000693     5.22   0.000     .3260834
 .718348
          ai |  -1.095598   .1234099    -8.88   0.000    -1.337477
 -.8537193
      heifer |  -.0825878   .0971811    -0.85   0.395    -.2730593
.1078837
       _cons |  -1.556961   .4348008    -3.58   0.000    -2.409155
 -.7047673

------------------------------------------------------------------------------

Variances and covariances of random effects

------------------------------------------------------------------------------

***level 2 (cow)

    var(1): .34188062 (.1263136)

------------------------------------------------------------------------------

Using the newer meglm that is provided with Stata

. meglm fscr lncfs ai heifer || cow: , family(binom) link(logit)

Fitting fixed-effects model:

Iteration 0:   log likelihood = -2011.8253
Iteration 1:   log likelihood = -2009.1421
Iteration 2:   log likelihood = -2009.1412
Iteration 3:   log likelihood = -2009.1412

Refining starting values:

Grid node 0:   log likelihood = -2015.6021

Fitting full model:

Iteration 0:   log likelihood = -2015.6021
Iteration 1:   log likelihood =  -2006.709
Iteration 2:   log likelihood = -2003.9174
Iteration 3:   log likelihood = -2003.9065
Iteration 4:   log likelihood = -2003.9065

Mixed-effects GLM                               Number of obs     =
  3,027
Family:                binomial
Link:                     logit
Group variable:             cow                 Number of groups  =
  1,575
                                                Obs per group:
                                                              min =
      1
                                                              avg =
    1.9
                                                              max =
      5
Integration method: mvaghermite                 Integration pts.  =
      7
                                                Wald chi2(3)      =
 103.86
Log likelihood = -2003.9065                     Prob > chi2       =
 0.0000

------------------------------------------------------------------------------
        fscr |      Coef.   Std. Err.      z    P>|z|     [95% Conf.
Interval]

-------------+----------------------------------------------------------------
       lncfs |   .5222154   .1000694     5.22   0.000      .326083
.7183479
          ai |  -1.095597   .1234095    -8.88   0.000    -1.337476
 -.8537191
      heifer |  -.0825878   .0971811    -0.85   0.395    -.2730593
.1078836
       _cons |  -1.556961   .4348012    -3.58   0.000    -2.409155
 -.7047659

-------------+----------------------------------------------------------------
cow          |
   var(_cons)|   .3418776   .1263105                      .1657237
.7052721

------------------------------------------------------------------------------
LR test vs. logistic model: chibar2(01) = 10.47       Prob >= chibar2 =
0.0006

As far as I can see, parameter estimates are the same. I did not
compare conditional modes.

If lme4's converge warning is valid, then we have a concrete case
where Stata is reporting the wrong thing.  I can see some upside there
:)

R details

sessionInfo()
R version 3.4.4 (2018-03-15)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 17.10

Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] lme4_1.1-15    Matrix_1.2-14  foreign_0.8-69

loaded via a namespace (and not attached):
 [1] minqa_1.2.4     MASS_7.3-49     compiler_3.4.4  tools_3.4.4
 [5] Rcpp_0.12.15    splines_3.4.4   nlme_3.1-137    grid_3.4.4
 [9] nloptr_1.0.4    lattice_0.20-35

_______________________________________________
R-sig-mixed-models at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models

_______________________________________________
R-sig-mixed-models at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models