Why do extremely similar parametric bootstraps on lmer() objects yield (not only marginally) different results?
I ran some more tests and I can safely say that the culprit in Method 2 and Method 3 is the call to: dattemp <- mod@frame and subsequently using it as the data argument in the foreach() loop calls to lmer(). If I use the original data frame in Method 2 and Method 3 and set na.action = na.exclude when fitting the model, the (significant) difference to using refit() disappears. It seems that copying only the data frame of an S4 object and using this data frame when fitting/refitting a new model with lmer() causes some sort of "behaviour" (I refrain from calling it trouble; I do not fully grasp what is happening there).
Best,
Christian
Eberhard Karls Universität Tübingen
Mathematisch-Naturwissenschaftliche Fakultät - Faculty of Science
Evolutionary Cognition - Cognitive Science
Schleichstraße 4 · 72076 Tübingen · Germany
Telefon +49 7071 29-75643 · Telefax +49 7071 29-4721
christian.brauner at uni-tuebingen.de
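[For anyone who wants to see what gets carried along with a copied model frame, here is a minimal, self-contained inspection sketch. It uses the built-in sleepstudy data with one NA injected, not the data from this thread; the point is only that the "na.action" attribute travels with mod@frame:

```r
library(lme4)

## toy data: sleepstudy (180 rows) with a single NA injected
d <- sleepstudy
d$Reaction[1] <- NA

## default na.action silently drops the incomplete row
mod <- lmer(Reaction ~ Days + (Days | Subject), data = d, REML = FALSE)

dattemp <- mod@frame
attr(dattemp, "na.action")   # the dropped row is recorded as an attribute
nrow(dattemp)                # 179: the frame holds only the rows actually used
```

So a later lmer(data = dattemp) call receives a frame that still carries the na.action bookkeeping of the original fit, which may explain the "behaviour" above.]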
On Sat, May 03, 2014 at 09:38:30PM +0200, Christian Brauner wrote:
* is the difference between p=0.004 and p=0.006 _practically_ important?
No, I don't think that it is important at all. Mostly because I don't rely on p values. (But a lot of people still like to see them.) In this special case it is not important because we are looking at a p that is well below the specified alpha level; in this case the alpha level is the customary 0.05. Still, the results differ by at least 0.002. This phenomenon seems stable across different ways to treat NAs (as I've tried to show). I think this suggests that one of the methods has a systematic bias. But I don't know which method. Hence, I don't know if it is an anticonservative or a conservative bias. It was pure curiosity which of the two methods it is. Any hints on how to find out are welcome. (I saw you suggested looking at a single refit while setting the seed.)
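[Following up on that suggestion, a hedged sketch of the single-refit check (assuming mod_min and the data from this thread exist; not run here): simulate one response with a fixed seed, then compare refit() against a fresh lmer() call on mod_min@frame.

```r
library(lme4)

set.seed(101)
## one simulated response from the null model
ysim <- unlist(simulate(mod_min))

## route 1: refit() on the existing fitted object
fit_refit <- refit(mod_min, newresp = ysim)

## route 2: a fresh lmer() call on the copied model frame
fit_frame <- lmer(ysim ~ 1 + (1 + satz_typ | vp) + (1 + satz_typ | kontext),
                  data = mod_min@frame, REML = FALSE)

## if NA handling differs between the two routes, the number of
## observations and/or the log-likelihoods will disagree
c(nobs(fit_refit), nobs(fit_frame))
c(logLik(fit_refit), logLik(fit_frame))
```

If the two routes agree on a single simulated response, the 0.002 discrepancy has to come from somewhere else.]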
* for this example, it seems that (as you comment) almost any approach will give reasonable answers -- simple LRT (because the number of levels of the minimum-size grouping variable is >=40), Kenward-Roger (because this is a LMM). Parametric bootstrap feels like a waste of computation time.
Agreed. But the computation time didn't matter much to me since all these things are run on a grid at 5 in the morning and finish in 30 min. Usually nobody is doing anything on it at that time.
* I can't do it off the top of my head, but for a balanced/fully crossed random effect, one might even be able to compute the denominator df for the F distribution exactly.
I actually did not know that. If you have any papers, code or explanation that you could share with me I would greatly appreciate it!
* is there a reason you're not using refit() in method 2? It should be faster.
I am usually only comfortable calling functions which I have had the time to understand (reasonably) properly (e.g., by looking at their code) and of which I know how they do what they do. In this manner I can make sure that I can stand behind the things I calculated. I am not that acquainted with refit(), so up until now I have relied on doing things by hand. Furthermore, I like to do things by hand first, just to check that I know what I am doing.
* does PBmodcomp() work with the latest version of lme4, with the simulate/NA fix? Or do we or the pbkrtest maintainers still have work to do?
I just did a test with pbkrtest_0.3-8 and the github version of lme4_1.1-7. If
I specify my models:
mod8 <- lmer(formula = log(value)
~ passung
+ (1 + satz_typ | vp)
+ (1 + satz_typ | kontext),
data = wort12_lmm2_melt,
REML = FALSE)
mod_min <- lmer(formula = log(value)
~ 1
+ (1 + satz_typ | vp)
+ (1 + satz_typ | kontext),
data = wort12_lmm2_melt,
REML = FALSE)
and call:
pbmod_mod8_mod_min <- PBmodcomp(largeModel = mod8_nona,
smallModel = mod_min_nona,
nsim = 100,
cl = cl)
without removing the na.action attribute, I get:
pbmod_mod8_mod_min <- PBmodcomp(largeModel = mod8_nona,
+                               smallModel = mod_min_nona,
+                               nsim = 100,
+                               cl = cl)
Error in checkForRemoteErrors(lapply(cl, recvResult)) :
  one node produced an error: replacement has 10263 rows, data has 10713
* I'm a little bit worried by method 2 -- I can imagine (although haven't tested) that ysim may be a different length/may not match mod@frame, which will be NA-processed. However, I think it should work OK if the default na.action is 'na.omit' (I'd prefer that people _never_ used @ in their code ...)
That actually was a problem. If I fit the models with lmer(..., na.action = na.exclude) but leave the rest the way it is, I get a model.frame.default error message which has to do with differing lengths. That's why I redid Method 2 as Method 3, where I constructed a data frame with complete cases.
Thank you so much for taking a look at this!
Christian
On Sat, May 03, 2014 at 02:14:30PM -0400, Ben Bolker wrote:
On 14-05-03 07:59 AM, Christian Brauner wrote:
Hello all,
warning ahead: This will be quite a long post. But have faith, it is necessary.
I have some questions about the larger context here. I agree that it's important to understand the behaviour of the code, and to ferret out possible bugs in the code (I will say that handling all of the permutations of NA handling in the predict and simulate code is a big headache ...). However, I'm wondering about a few things:
* is the difference between p=0.004 and p=0.006 _practically_ important?
* for this example, it seems that (as you comment) almost any approach will give reasonable answers -- simple LRT (because the number of levels of the minimum-size grouping variable is >=40), Kenward-Roger (because this is a LMM). Parametric bootstrap feels like a waste of computation time.
* I can't do it off the top of my head, but for a balanced/fully crossed random effect, one might even be able to compute the denominator df for the F distribution exactly.
* is there a reason you're not using refit() in method 2? It should be faster.
* does PBmodcomp() work with the latest version of lme4, with the simulate/NA fix? Or do we or the pbkrtest maintainers still have work to do?
* I'm a little bit worried by method 2 -- I can imagine (although haven't tested) that ysim may be a different length/may not match mod@frame, which will be NA-processed. However, I think it should work OK if the default na.action is 'na.omit' (I'd prefer that people _never_ used @ in their code ...)
I run parametric bootstraps in order to compare two models; I bootstrap the likelihood ratio between the models. For anybody acquainted with this, the code should be self-explanatory. Model selection is more or less done, and these are exercises I like to perform with real data sets to get a feel for how different software packages treat the same models. Unfortunately, I am not in a position to provide the real data. A few words about the models though:
Both models are fit on a balanced data set which includes around 483 missing values. There are slightly above 11196 observations (within: 70 subjects). Every subject saw every item only one time. The dependent variable is reading times; sentence_type and matching are crossed two-level factors. Context and subject are treated as random effects. Context has 40 levels and (well, you guessed it) subject has 70 levels. (See this related post: http://stats.stackexchange.com/questions/94302/lmer-parametric-bootstrap-testing-for-fixed-effects)
Model formulae:
===============
# I only show the model formulae here because this piece stays fixed with
# every call to lmer(). The options change however, depending on the parametric
# bootstrap procedure used. Any change can be seen in the code:
mod_min := value ~ 1 + (1 + satz_typ | vp) + (1 + satz_typ | kontext)
mod8 := value ~ passung + (1 + satz_typ | vp) + (1 + satz_typ | kontext)
I set up different ways to perform parametric bootstraps. I will describe the different ways, explain what I changed every time, and show the results. The ultimate question will be: Why the different results? Especially in one of the ways to perform a parametric bootstrap.
Making my own summary:
1. pbkrtest, na.action attribute removed: p = 0.003
2. lmer refitting: p = 0.0064
3. fitted on complete cases: p = 0.0066
4. using na.exclude and refit: p = 0.0043
5. using complete cases and refit: p = 0.004
So I guess to boil it down, the difference is in the way that refit handles missing values / accesses the model frame. If I were going to pursue this farther, I would probably try to strip it down and look at the results of a _single_ refit with the random number seed set to be identical in every case ...
Method 1 (pbkrtest).
====================
1.1. lmer() model calls:
------------------------
mod_min <- lmer(formula = log(value)
~ 1
+ (1 + satz_typ | vp)
+ (1 + satz_typ | kontext),
data = wort12_lmm2_melt,
REML = FALSE)
mod8 <- lmer(formula = log(value)
~ passung
+ (1 + satz_typ | vp)
+ (1 + satz_typ | kontext),
data = wort12_lmm2_melt,
REML = FALSE)
1.2. Parametric Bootstrap:
--------------------------
# remove NAs, otherwise PBmodcomp() will complain as it calls the simulate()
# function directly without the argument na.action = na.exclude.
# Relevant: see Ben Bolker's comments on this:
# https://github.com/lme4/lme4/issues/189
# Dirty fix:
mod8_nona <- mod8
mod_min_nona <- mod_min
attr(x = mod8_nona@frame,
     which = "na.action") <- NULL # remove na.action attribute from the model
attr(x = mod_min_nona@frame,
     which = "na.action") <- NULL # remove na.action attribute from the model
library(pbkrtest)
pbmod_mod8_mod_min <- PBmodcomp(largeModel = mod8_nona,
smallModel = mod_min_nona,
nsim = 10000,
cl = cl)
Take home message PBmodcomp(): p.value = 0.003
1.3 Notable:
------------
na.action attribute has been removed from both models before simulate() call.
Method 2.
=========
2.1 lmer() model calls:
-----------------------
identical to Method 1 (pbkrtest).
2.2 Parametric Bootstrap:
-------------------------
mod <- mod8
modnull <- mod_min
# save the observed likelihood ratio test statistic
lrt_obs <- anova(modnull,
mod)$Chisq[2]
n_sim <- 10000
dattemp <- mod@frame
ptime <- system.time({
lrt_sim <- foreach (i = 1:n_sim,
.combine = "c",
.packages = "lme4") %dopar% {
ysim <- unlist(x = simulate(object = modnull))
modnullsim <- lmer(formula = ysim
~ 1
+ (1 + satz_typ | vp)
+ (1 + satz_typ | kontext),
data = dattemp,
REML = FALSE) # Fit the null model.
modaltsim <- lmer(formula = ysim
~ passung
+ (1 + satz_typ | vp)
+ (1 + satz_typ | kontext),
data = dattemp,
REML = FALSE) # Fit the alternative model.
anova(modnullsim,
modaltsim)$Chisq[2] # Save the likelihood ratio test statistic.
}
})[3]
(sum(lrt_sim >= lrt_obs) + 1)/(n_sim + 1) = 0.00639936 # p.value
Take home message Method 2: p.value = 0.00639936
2.3 Notable:
------------
na.action has not been altered!
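[As an aside, the p value above uses the standard add-one bootstrap estimator, (sum(lrt_sim >= lrt_obs) + 1)/(n_sim + 1), which counts the observed statistic among the draws and therefore can never return an impossible p = 0. A self-contained base-R sketch:

```r
boot_p <- function(lrt_sim, lrt_obs) {
  ## add-one estimator: the observed statistic counts as one of the draws
  (sum(lrt_sim >= lrt_obs) + 1) / (length(lrt_sim) + 1)
}

## toy checks on a tiny vector of simulated statistics:
boot_p(c(0.5, 1.2, 3.4), lrt_obs = 0)   # (3 + 1) / (3 + 1) = 1
boot_p(c(0.5, 1.2, 3.4), lrt_obs = 2)   # (1 + 1) / (3 + 1) = 0.5
```
]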
Method 3.
=========
3.1 lmer() model calls:
-----------------------
# differences explained in 3.3
mod_min <- lmer(formula = log(value)
~ 1
+ (1 + satz_typ | vp)
+ (1 + satz_typ | kontext),
data = wort12_lmm2_melt_nona,
REML = FALSE)
mod8 <- lmer(formula = log(value)
~ passung
+ (1 + satz_typ | vp)
+ (1 + satz_typ | kontext),
data = wort12_lmm2_melt_nona,
REML = FALSE)
3.2 Parametric Bootstrap:
-------------------------
identical to Method 2
Take home message Method 3: p.value = 0.00659934
3.3 Notable:
------------
The models were fit on the original dataset which only included complete cases.
That is, on a data set that was altered by the following code:
wort12_lmm2_melt_nona <-
wort12_lmm2_melt[complete.cases(wort12_lmm2_melt), ]
Method 4.
=========
4.1 lmer() model calls:
-----------------------
# differences explained in 4.3
mod_min_naexclude <- lmer(formula = log(value)
~ 1
+ (1 + satz_typ | vp)
+ (1 + satz_typ | kontext),
data = wort12_lmm2_melt,
REML = FALSE,
na.action = na.exclude)
mod8_naexclude <- lmer(formula = log(value)
~ passung
+ (1 + satz_typ | vp)
+ (1 + satz_typ | kontext),
data = wort12_lmm2_melt,
REML = FALSE,
na.action = na.exclude)
4.2 Parametric Bootstrap:
-------------------------
mod <- mod8_naexclude
modnull <- mod_min_naexclude
lrt_obs <- anova(modnull,
mod)$Chisq[2] # Save the observed likelihood ratio test statistic.
n_sim <- 10000
ptime <- system.time({
lrt_sim <- foreach (i = 1:n_sim,
.combine = "c",
.packages = "lme4") %dopar% {
ysim <- unlist(simulate(object = modnull))
modnullsim <- refit(object = modnull,
newresp = ysim) # Fit the null-model.
modaltsim <- refit(object = mod,
newresp = ysim) # Fit the alternative model.
anova(modnullsim,
modaltsim)$Chisq[2] # Save the likelihood ratio test statistic.
}
})[3]
Take home message Method 4: p.value = 0.00429957
4.3 Notable:
------------
The models were fit with the option na.action = na.exclude.
Method 5.
=========
5.1 lmer() model calls:
-----------------------
# differences explained in 5.3
mod_min_naexclude <- lmer(formula = log(value)
~ 1
+ (1 + satz_typ | vp)
+ (1 + satz_typ | kontext),
data = wort12_lmm2_melt_nona,
REML = FALSE)
mod8_naexclude <- lmer(formula = log(value)
~ passung
+ (1 + satz_typ | vp)
+ (1 + satz_typ | kontext),
data = wort12_lmm2_melt_nona,
REML = FALSE)
5.2 Parametric Bootstrap:
-------------------------
identical to Method 4
Take home message Method 5: p.value = 0.0039996
5.3 Notable:
------------
The models were fit on the original dataset which only included complete cases.
That is, on a data set that was altered by the following code:
wort12_lmm2_melt_nona <-
wort12_lmm2_melt[complete.cases(wort12_lmm2_melt), ]
Question:
=========
Why do Method 2 and Method 3 yield p values that deviate by about 0.002 from
all other methods?
(I suspect the explanation of why I do not see it is really simple: I have
just been looking too long at this code and I am missing the obvious.)
Two final remarks:
(1) I ran everything on a cluster. That's why I repeated Methods 2 and 3.
The results they yield seem to be numerically stable.
(2) I plotted the likelihood ratio test statistics for all methods, compared
them to a chi-squared distribution with df=1, and plotted their density.
Everything looks normal; the chi-squared assumption is correct for these
models.
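[For anyone wanting to reproduce that check, a base-R sketch; lrt_sim stands for the vector of bootstrapped LRT statistics, and the rchisq() draw here is only a placeholder for it:

```r
set.seed(42)
lrt_sim <- rchisq(10000, df = 1)  # placeholder; use your bootstrap output here

## overlay the empirical CDF on the chi-squared(1) CDF ...
plot(ecdf(lrt_sim), main = "Bootstrapped LRT vs chi-squared(1)")
curve(pchisq(x, df = 1), add = TRUE, col = "red")

## ... and back the eyeball check with a Kolmogorov-Smirnov test
ks.test(lrt_sim, pchisq, df = 1)
```

A large KS p value is consistent with the df=1 reference distribution; a small one would suggest the chi-squared approximation is off for these models.]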
Thank you so much for your patience!
Best,
Christian Brauner
_______________________________________________
R-sig-mixed-models at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models