Why do extremely similar parametric bootstraps on lmer() objects yield (not only marginally) different results?
I ran some more tests and I can safely say that the culprit in Method 2 and Method 3 is the call to: dattemp <- mod@frame and subsequently using it as the data argument in the foreach() loop calls to lmer(). If I use the original data frame in Method 2 and Method 3 and set na.action = na.exclude when fitting the model, the (significant) difference to using refit() disappears. It seems that copying only the data frame of an S4 object and using this data frame when fitting/refitting a new model with lmer() causes some sort of "behaviour" (I refrain from calling it trouble; I do not fully grasp what is happening there).
Best,
Christian
Eberhard Karls Universität Tübingen
Mathematisch-Naturwissenschaftliche Fakultät - Faculty of Science
Evolutionary Cognition - Cognitive Science
Schleichstraße 4 · 72076 Tübingen · Germany
Telefon +49 7071 29-75643 · Telefax +49 7071 29-4721
christian.brauner at uni-tuebingen.de
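[For anyone who wants to see what gets carried along with a copied model frame, here is a minimal, self-contained inspection sketch. It uses the built-in sleepstudy data with one NA injected, not the data from this thread; the point is only that the "na.action" attribute travels with mod@frame:

```r
library(lme4)

## toy data: sleepstudy (180 rows) with a single NA injected
d <- sleepstudy
d$Reaction[1] <- NA

## default na.action silently drops the incomplete row
mod <- lmer(Reaction ~ Days + (Days | Subject), data = d, REML = FALSE)

dattemp <- mod@frame
attr(dattemp, "na.action")   # the dropped row is recorded as an attribute
nrow(dattemp)                # 179: the frame holds only the rows actually used
```

So a later lmer(data = dattemp) call receives a frame that still carries the na.action bookkeeping of the original fit, which may explain the "behaviour" above.]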
On Sat, May 03, 2014 at 09:38:30PM +0200, Christian Brauner wrote:
* is the difference between p=0.004 and p=0.006 _practically_ important?
No, I don't think that it is important at all. Mostly because I don't rely on p values. (But a lot of people still like to see them.) In this special case it is not important because we are looking at a p that is well below the specified alpha level; in this case the alpha level is the customary 0.05. Still, the results differ by at least 0.002. This phenomenon seems stable across different ways to treat NAs (as I've tried to show). I think this suggests that one of the methods has a systematic bias. But I don't know which method. Hence, I don't know if it is an anticonservative or a conservative bias. It was pure curiosity which of the two methods it is. Any hints on how to find out are welcome. (I saw you suggested looking at a single refit while setting the seed.)
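[Following up on that suggestion, a hedged sketch of the single-refit check (assuming mod_min and the data from this thread exist; not run here): simulate one response with a fixed seed, then compare refit() against a fresh lmer() call on mod_min@frame.

```r
library(lme4)

set.seed(101)
## one simulated response from the null model
ysim <- unlist(simulate(mod_min))

## route 1: refit() on the existing fitted object
fit_refit <- refit(mod_min, newresp = ysim)

## route 2: a fresh lmer() call on the copied model frame
fit_frame <- lmer(ysim ~ 1 + (1 + satz_typ | vp) + (1 + satz_typ | kontext),
                  data = mod_min@frame, REML = FALSE)

## if NA handling differs between the two routes, the number of
## observations and/or the log-likelihoods will disagree
c(nobs(fit_refit), nobs(fit_frame))
c(logLik(fit_refit), logLik(fit_frame))
```

If the two routes agree on a single simulated response, the 0.002 discrepancy has to come from somewhere else.]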
* for this example, it seems that (as you comment) almost any approach will give reasonable answers -- simple LRT (because the number of levels of the minimum-size grouping variable is >=40), Kenward-Roger (because this is a LMM). Parametric bootstrap feels like a waste of computation time.
Agreed. But the computation time didn't matter much to me since all these things are run on a grid at 5 in the morning and finish in 30 min. Usually nobody is doing anything on it at that time.
* I can't do it off the top of my head, but for a balanced/fully crossed random effect, one might even be able to compute the denominator df for the F distribution exactly.
I actually did not know that. If you have any papers, code or explanation that you could share with me I would greatly appreciate it!
* is there a reason you're not using refit() in method 2? It should be faster.
I am usually only comfortable calling functions which I have had the time to understand (reasonably) properly (e.g., by looking at their code) and of which I know how they do what they do. In this manner I can make sure that I can stand behind the things I calculated. I am not that acquainted with refit(), so up until now I have relied on doing things by hand. Furthermore, I like to do things by hand first, just to check that I know what I am doing.
* does PBmodcomp() work with the latest version of lme4, with the simulate/NA fix? Or do we or the pbkrtest maintainers still have work to do?
I just did a test with pbkrtest_0.3-8 and the github version of lme4_1.1-7. If
I specify my models:
mod8 <- lmer(formula = log(value)
~ passung
+ (1 + satz_typ | vp)
+ (1 + satz_typ | kontext),
data = wort12_lmm2_melt,
REML = FALSE)
mod_min <- lmer(formula = log(value)
~ 1
+ (1 + satz_typ | vp)
+ (1 + satz_typ | kontext),
data = wort12_lmm2_melt,
REML = FALSE)
and call:
pbmod_mod8_mod_min <- PBmodcomp(largeModel = mod8_nona,
smallModel = mod_min_nona,
nsim = 100,
cl = cl)
without removing the na.action attribute, I get:
pbmod_mod8_mod_min <- PBmodcomp(largeModel = mod8_nona,
+                               smallModel = mod_min_nona,
+                               nsim = 100,
+                               cl = cl)
Error in checkForRemoteErrors(lapply(cl, recvResult)) :
  one node produced an error: replacement has 10263 rows, data has 10713
* I'm a little bit worried by method 2 -- I can imagine (although haven't tested) that ysim may be a different length/may not match mod@frame, which will be NA-processed. However, I think it should work OK if the default na.action is 'na.omit' (I'd prefer that people _never_ used @ in their code ...)
That actually was a problem. If I fit the models with lmer(..., na.action = na.exclude) but leave the rest the way it is, I get a model.frame.default error message which has to do with differing lengths. That's why I redid Method 2 as Method 3, where I constructed a data frame with complete cases.
Thank you so much for taking a look at this!
Christian
On Sat, May 03, 2014 at 02:14:30PM -0400, Ben Bolker wrote:
On 14-05-03 07:59 AM, Christian Brauner wrote:
Hello all,
warning ahead: This will be quite a long post. But have faith, it is necessary.
I have some questions about the larger context here. I agree that it's important to understand the behaviour of the code, and to ferret out possible bugs in the code (I will say that handling all of the permutations of NA handling in the predict and simulate code is a big headache ...). However, I'm wondering about a few things:
* is the difference between p=0.004 and p=0.006 _practically_ important?
* for this example, it seems that (as you comment) almost any approach will give reasonable answers -- simple LRT (because the number of levels of the minimum-size grouping variable is >=40), Kenward-Roger (because this is a LMM). Parametric bootstrap feels like a waste of computation time.
* I can't do it off the top of my head, but for a balanced/fully crossed random effect, one might even be able to compute the denominator df for the F distribution exactly.
* is there a reason you're not using refit() in method 2? It should be faster.
* does PBmodcomp() work with the latest version of lme4, with the simulate/NA fix? Or do we or the pbkrtest maintainers still have work to do?
* I'm a little bit worried by method 2 -- I can imagine (although haven't tested) that ysim may be a different length/may not match mod@frame, which will be NA-processed. However, I think it should work OK if the default na.action is 'na.omit' (I'd prefer that people _never_ used @ in their code ...)
I run parametric bootstraps in order to compare two models; I bootstrap the likelihood ratio between the models. For anybody acquainted with this, the code should be self-explanatory. Model selection is more or less done, and these are exercises I like to perform with real data sets to get a feel for how different software packages treat the same models. Unfortunately, I am not in a position to provide the real data. A few words about the models though:
Both models are fit on a balanced data set which includes around 483 missing values. There are slightly above 11196 observations (within: 70 subjects). Every subject saw every item only one time. The dependent variable is reading times; sentence_type and matching are crossed two-level factors. Context and subject are treated as random effects. Context has 40 levels and (well, you guessed it) subject has 70 levels. (See this related post: http://stats.stackexchange.com/questions/94302/lmer-parametric-bootstrap-testing-for-fixed-effects)
Model formulae:
===============
# I only show the model formulae here because this piece stays fixed with
# every call to lmer(). The options change however, depending on the parametric
# bootstrap procedure used. Any change can be seen in the code:
mod_min := value ~ 1 + (1 + satz_typ | vp) + (1 + satz_typ | kontext)
mod8 := value ~ passung + (1 + satz_typ | vp) + (1 + satz_typ | kontext)
I set up different ways to perform parametric bootstraps. I will describe the different ways, explain what I changed every time, and show the results. The ultimate question will be: Why the different results? Especially in one of the ways to perform a parametric bootstrap.
Making my own summary:
1. pbkrtest, na.action attribute removed: p = 0.003
2. lmer refitting: p = 0.0064
3. fitted on complete cases: p = 0.0066
4. using na.exclude and refit: p = 0.0043
5. using complete cases and refit: p = 0.004
So I guess to boil it down, the difference is in the way that refit handles missing values / accesses the model frame. If I were going to pursue this farther, I would probably try to strip it down and look at the results of a _single_ refit with the random number seed set to be identical in every case ...
Method 1 (pbkrtest).
====================
1.1. lmer() model calls:
------------------------
mod_min <- lmer(formula = log(value)
~ 1
+ (1 + satz_typ | vp)
+ (1 + satz_typ | kontext),
data = wort12_lmm2_melt,
REML = FALSE)
mod8 <- lmer(formula = log(value)
~ passung
+ (1 + satz_typ | vp)
+ (1 + satz_typ | kontext),
data = wort12_lmm2_melt,
REML = FALSE)
1.2. Parametric Bootstrap:
--------------------------
# remove NAs, otherwise PBmodcomp() will complain as it calls the simulate()
# function directly without the argument na.action = na.exclude.
# Relevant: see Ben Bolker's comments on this:
# https://github.com/lme4/lme4/issues/189
# Dirty fix:
mod8_nona <- mod8
mod_min_nona <- mod_min
attr(x = mod8_nona@frame,
     which = "na.action") <- NULL # remove na.action attribute from the model
attr(x = mod_min_nona@frame,
     which = "na.action") <- NULL # remove na.action attribute from the model
library(pbkrtest)
pbmod_mod8_mod_min <- PBmodcomp(largeModel = mod8_nona,
smallModel = mod_min_nona,
nsim = 10000,
cl = cl)
Take home message PBmodcomp(): p.value = 0.003
1.3 Notable:
------------
na.action attribute has been removed from both models before simulate() call.
Method 2.
=========
2.1 lmer() model calls:
-----------------------
identical to Method 1 (pbkrtest).
2.2 Parametric Bootstrap:
-------------------------
mod <- mod8
modnull <- mod_min
# save the observed likelihood ratio test statistic
lrt_obs <- anova(modnull,
mod)$Chisq[2]
n_sim <- 10000
dattemp <- mod@frame
ptime <- system.time({
lrt_sim <- foreach (i = 1:n_sim,
.combine = "c",
.packages = "lme4") %dopar% {
ysim <- unlist(x = simulate(object = modnull))
modnullsim <- lmer(formula = ysim
~ 1
+ (1 + satz_typ | vp)
+ (1 + satz_typ | kontext),
data = dattemp,
REML = FALSE) # Fit the null model.
modaltsim <- lmer(formula = ysim
~ passung
+ (1 + satz_typ | vp)
+ (1 + satz_typ | kontext),
data = dattemp,
REML = FALSE) # Fit the alternative model.
anova(modnullsim,
modaltsim)$Chisq[2] # Save the likelihood ratio test statistic.
}
})[3]
(sum(lrt_sim >= lrt_obs) + 1)/(n_sim + 1) = 0.00639936 # p.value
Take home message Method 2: p.value = 0.00639936
2.3 Notable:
------------
na.action has not been altered!
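[As an aside, the p value above uses the standard add-one bootstrap estimator, (sum(lrt_sim >= lrt_obs) + 1)/(n_sim + 1), which counts the observed statistic among the draws and therefore can never return an impossible p = 0. A self-contained base-R sketch:

```r
boot_p <- function(lrt_sim, lrt_obs) {
  ## add-one estimator: the observed statistic counts as one of the draws
  (sum(lrt_sim >= lrt_obs) + 1) / (length(lrt_sim) + 1)
}

## toy checks on a tiny vector of simulated statistics:
boot_p(c(0.5, 1.2, 3.4), lrt_obs = 0)   # (3 + 1) / (3 + 1) = 1
boot_p(c(0.5, 1.2, 3.4), lrt_obs = 2)   # (1 + 1) / (3 + 1) = 0.5
```
]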
Method 3.
=========
3.1 lmer() model calls:
-----------------------
# differences explained in 3.3
mod_min <- lmer(formula = log(value)
~ 1
+ (1 + satz_typ | vp)
+ (1 + satz_typ | kontext),
data = wort12_lmm2_melt_nona,
REML = FALSE)
mod8 <- lmer(formula = log(value)
~ passung
+ (1 + satz_typ | vp)
+ (1 + satz_typ | kontext),
data = wort12_lmm2_melt_nona,
REML = FALSE)
3.2 Parametric Bootstrap:
-------------------------
identical to Method 2
Take home message Method 3: p.value = 0.00659934
3.3 Notable:
------------
The models were fit on the original dataset which only included complete cases.
That is, on a data set that was altered by the following code:
wort12_lmm2_melt_nona <-
wort12_lmm2_melt[complete.cases(wort12_lmm2_melt), ]
Method 4.
=========
4.1 lmer() model calls:
-----------------------
# differences explained in 4.3
mod_min_naexclude <- lmer(formula = log(value)
~ 1
+ (1 + satz_typ | vp)
+ (1 + satz_typ | kontext),
data = wort12_lmm2_melt,
REML = FALSE,
na.action = na.exclude)
mod8_naexclude <- lmer(formula = log(value)
~ passung
+ (1 + satz_typ | vp)
+ (1 + satz_typ | kontext),
data = wort12_lmm2_melt,
REML = FALSE,
na.action = na.exclude)
4.2 Parametric Bootstrap:
-------------------------
mod <- mod8_naexclude
modnull <- mod_min_naexclude
lrt_obs <- anova(modnull,
mod)$Chisq[2] # Save the observed likelihood ratio test statistic.
n_sim <- 10000
ptime <- system.time({
lrt_sim <- foreach (i = 1:n_sim,
.combine = "c",
.packages = "lme4") %dopar% {
ysim <- unlist(simulate(object = modnull))
modnullsim <- refit(object = modnull,
newresp = ysim) # Fit the null-model.
modaltsim <- refit(object = mod,
newresp = ysim) # Fit the alternative model.
anova(modnullsim,
modaltsim)$Chisq[2] # Save the likelihood ratio test statistic.
}
})[3]
Take home message Method 4: p.value = 0.00429957
4.3 Notable:
------------
The models were fit with the option na.action = na.exclude.
Method 5.
=========
5.1 lmer() model calls:
-----------------------
# differences explained in 5.3
mod_min_naexclude <- lmer(formula = log(value)
~ 1
+ (1 + satz_typ | vp)
+ (1 + satz_typ | kontext),
data = wort12_lmm2_melt_nona,
REML = FALSE)
mod8_naexclude <- lmer(formula = log(value)
~ passung
+ (1 + satz_typ | vp)
+ (1 + satz_typ | kontext),
data = wort12_lmm2_melt_nona,
REML = FALSE)
5.2 Parametric Bootstrap:
-------------------------
identical to Method 4
Take home message Method 5: p.value = 0.0039996
5.3 Notable:
------------
The models were fit on the original dataset which only included complete cases.
That is, on a data set that was altered by the following code:
wort12_lmm2_melt_nona <-
wort12_lmm2_melt[complete.cases(wort12_lmm2_melt), ]
Question:
=========
Why do Method 2 and Method 3 yield p values that deviate by about 0.002 from
all other methods?
(I suspect the explanation of why I do not see it is really simple: I have
just been looking too long at this code and I am missing the obvious.)
Two final remarks:
(1) I ran everything on a cluster. That's why I repeated Methods 2 and 3.
The results they yield seem to be numerically stable.
(2) I plotted the likelihood ratio test statistics for all methods, compared
them to a chi-squared distribution with df=1, and plotted their density.
Everything looks normal; the chi-squared assumption is correct for these
models.
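[For anyone wanting to reproduce that check, a base-R sketch; lrt_sim stands for the vector of bootstrapped LRT statistics, and the rchisq() draw here is only a placeholder for it:

```r
set.seed(42)
lrt_sim <- rchisq(10000, df = 1)  # placeholder; use your bootstrap output here

## overlay the empirical CDF on the chi-squared(1) CDF ...
plot(ecdf(lrt_sim), main = "Bootstrapped LRT vs chi-squared(1)")
curve(pchisq(x, df = 1), add = TRUE, col = "red")

## ... and back the eyeball check with a Kolmogorov-Smirnov test
ks.test(lrt_sim, pchisq, df = 1)
```

A large KS p value is consistent with the df=1 reference distribution; a small one would suggest the chi-squared approximation is off for these models.]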
Thank you so much for your patience!
Best,
Christian Brauner
_______________________________________________
R-sig-mixed-models at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models