Hi,
I have a random-intercept logistic GLMM fit to a dataset of 650
observations. The model discriminates the dataset's binary outcomes
very nicely (C-index, a.k.a. ROC AUC, of .85). But a reviewer suggested that
in a dataset this small, the model's discrimination performance should be
validated through a bootstrap procedure.
My script draws a bootstrap sample from the dataset, fits the model to it,
then uses the somers2() function from the Hmisc package to calculate the
C-index of the bootstrap-fit model when used to discriminate the original
data. This is repeated 1000 times:
# The original dataset is called d; MyModel is the fitted glmer() model
library(lme4)
library(Hmisc)
n <- 1000
c.index <- numeric(length = n)
set.seed(2019)
for (i in 1:n) {
  # Draw an observation-level bootstrap sample
  bootsample <- d[sample(nrow(d), replace = TRUE), ]
  # Refit the model (same formula) to the bootstrap sample
  tmpmod <- update(MyModel, ~ ., data = bootsample)
  # C-index of the bootstrap-fit model's predictions for the original data
  c.index[i] <- somers2(predict(tmpmod, newdata = d, re.form = NULL,
                                allow.new.levels = TRUE), d$dep.var)["C"]
}
It turns out that at .854, the average discrimination across 1000 bootstrap
iterations is slightly higher than the original model's in-sample
discrimination (.85). There must be an error, no? Surely a model's
out-of-sample performance should, on average, be worse than its in-sample
performance?
My actual code includes additional, fairly messy bits which attempt to
refit non-converging models using alternative optimizers (from the optimx
package) and, failing that, draw a new bootstrap sample for that
iteration. In total, there were 128 bootstrap samples for which the model
failed to converge, necessitating an alternative bootstrap sample. My
suspicion is that this is the reason behind the too-good-to-be-true
results: the exclusion of those 128 bootstrap samples for which the model
failed to converge has introduced an optimistic bias into the bootstrap
sampling, such that favorable bootstrap samples are overrepresented and
unfavorable ones underrepresented. The question is: since it is impossible
to achieve convergence in every bootstrap sample, is there some heuristic
for estimating the degree of optimism introduced by bootstrap samples that
had to be discarded due to non-convergence?
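One alternative I have considered is to keep every bootstrap fit and merely
flag convergence warnings, then compare the C-index distributions of flagged
and unflagged fits. A stripped-down sketch of what I mean (with simulated
stand-in data, since my real d and MyModel are too messy to post):

library(lme4)
library(Hmisc)

# Simulated stand-in data: 50 groups of 13 observations, one fixed
# effect, random intercept (my real data and model are not reproduced)
set.seed(1)
d <- data.frame(g = factor(rep(1:50, each = 13)), x = rnorm(650))
d$dep.var <- rbinom(nrow(d), 1, plogis(d$x))
MyModel <- glmer(dep.var ~ x + (1 | g), data = d, family = binomial)

n <- 50  # small for illustration; I used 1000
c.index <- numeric(n)
converged <- rep(TRUE, n)
set.seed(2019)
for (i in 1:n) {
  bootsample <- d[sample(nrow(d), replace = TRUE), ]
  # Flag convergence (and other) warnings, but keep the fit
  tmpmod <- withCallingHandlers(
    update(MyModel, data = bootsample),
    warning = function(w) {
      converged[i] <<- FALSE
      invokeRestart("muffleWarning")
    })
  c.index[i] <- somers2(predict(tmpmod, newdata = d, re.form = NULL,
                                allow.new.levels = TRUE), d$dep.var)["C"]
}
# Do flagged fits discriminate systematically worse?
tapply(c.index, converged, mean)

(Since the C-index is rank-based, it makes no difference that predict()
returns link-scale values here.)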
Best,
J
Optimism introduced by non-converging models in bootstrap validation of GLMM (lme4)
2 messages · Juho Kristian Ruohonen, Ben Bolker
Quick thought: "non-convergence" doesn't necessarily mean the fit is
actually bad (false positives blah blah blah), and in most (all?) cases you
actually get a working fitted model (i.e., you could get the C-index). If
you compare the distribution of C-indices from converging and
non-converging models, what do you see?

Also: is bootstrapping at the level of observations OK, or should you be
bootstrapping at the level of groups (or hierarchically, i.e. both among
and within groups)?

I'm also not 100% sure that the discrimination must be lower in the
bootstrap samples -- among other things, the model is *not* maximizing
discrimination, it's maximizing likelihood (which are presumably
asymptotically equivalent but needn't be identical ... ?)
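A group-level bootstrap of the kind mentioned above could be sketched
roughly as follows. This is illustrative code with simulated stand-in data
and a made-up model, not anything from the thread; it shows a single
bootstrap draw in which whole groups are resampled with replacement and
relabelled so that a group drawn twice counts as two distinct clusters:

library(lme4)
library(Hmisc)

# Simulated stand-in data and model (not from the thread)
set.seed(1)
d <- data.frame(g = factor(rep(1:50, each = 13)), x = rnorm(650))
d$dep.var <- rbinom(nrow(d), 1, plogis(d$x))
MyModel <- glmer(dep.var ~ x + (1 | g), data = d, family = binomial)

# One group-level bootstrap draw: resample whole groups with replacement
set.seed(2019)
boot_groups <- sample(levels(d$g), replace = TRUE)
bootsample <- do.call(rbind, lapply(seq_along(boot_groups), function(k) {
  block <- d[d$g == boot_groups[k], ]
  block$g <- paste0("boot", k)  # fresh label per resampled cluster
  block
}))
bootsample$g <- factor(bootsample$g)
tmpmod <- update(MyModel, data = bootsample)

# All original group labels are "new" to the refit model, so their random
# effects are set to zero in prediction (hence allow.new.levels = TRUE)
somers2(predict(tmpmod, newdata = d, re.form = NULL,
                allow.new.levels = TRUE), d$dep.var)["C"]

The hierarchical variant would add a second stage resampling observations
within each drawn group.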
On 2019-05-12 1:26 p.m., Juho Kristian Ruohonen wrote:
_______________________________________________
R-sig-mixed-models at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models