
binomial fixed-effect p-values by simulation

18 messages · Ken Beath, Daniel Ezra Johnson, Ben Bolker +2 more

#
Consider the common situation of modeling multiple responses from
multiple subjects, where we want to know whether a fixed-effect
between-subjects variable -- for example, gender -- has a significant
effect.

In my field (quantitative sociolinguistics) the response is usually
binomial, and for some 30 years the usual procedure has been to carry
out logistic regression, setting the subject variable aside, and
assessing fixed effect significance with a likelihood-ratio test. When
between-subject variation is high, this produces exceptionally
anti-conservative p-values (see old.p.value below).

I am writing an article and some wrapper functions designed to
introduce the use of mixed models to this audience. With subject
included as a random effect, the p-values for fixed between-subject
effects are much more conservative, even using the likelihood-ratio
test (see lrt.p.value below). In other words, this is an improvement.

But as has been discussed many times on this list, likelihood-ratio
tests are still anti-conservative for fixed effect significance. I
believe the extent of anti-conservatism depends on the number of
observations relative to the number of groups, the balance of the
data, and probably other factors too.

I have seen the MCMC method discussed, but I think this is not yet
available for binomial data with lmer() -- is that correct? In
addition, that method, which operates in the parameter space, seems to
test a slightly different null hypothesis than the usual frequentist
one, though I may be misunderstanding this.

Would it be appropriate for me to carry out a simulation of the type
given below, to assess fixed effect significance?

The code starts by generating a dataset which happens to have a
potentially significant gender effect, according to the LRT. (In fact
the data is randomly generated and has no "real" gender effect, but
assume it was real data and we did not know that.)

A null model (no fixed effect) and an alternative model (fixed gender
effect) are fit to the data by lmer(), using a random effect for
subject in both cases.

It then takes the two parameters from the null model -- intercept and
subject s.d. -- and randomly generates new data. The simulated data is
fit with the alternative-style model and the proportion of times the
fixed effect magnitude exceeds that from the 'observed' data is the
p-value.

Does this procedure make sense? It would generalize to linear models
easily. Making it work for unbalanced data and when there are other
fixed effects in the model would not be that hard either, I think.

To my mind, this is a direct implementation of the frequentist p-value
concept via simulation. Am I missing something important?

Of course, especially with binomial data this simulation is quite slow
(i.e. compared to the unimplemented MCMC method). The code below, with
1000 simulations of a 2000-observation dataset, took 15 minutes to run
on a MacBook Pro. (The result was a p-value of 0.027, compared to
0.015 from the LRT, 0.009 from the Wald statistic... and 1.1e-56 from
the fixed-effects model!)

It's important to me to be able to give p-values that are
accurate/conservative and that I understand. So if anyone can give me
guidance on whether this simulation procedure makes sense, it would be
much appreciated.

And if there's a faster way to do something like this for binomial
data, please let me know.

Thanks,
Daniel

library(lme4)
library(boot)
set.seed(10)

observed <- data.frame(subject = rep(letters[1:20],each=100),
	gender = rep(c("M","F"),each=1000),
	response = rbinom(2000,1,rep(plogis(rnorm(20,0,3)),each=100)))

old <- glm(response~gender,binomial,observed)
old.p.value <- pchisq(old$null.deviance-old$deviance,df=1,lower.tail=F)
# 1.083e-56

null <- lmer(response ~ (1|subject),observed,binomial)
fixed <- lmer(response ~ gender + (1|subject),observed,binomial)

wald.p.value <-  summary(fixed)@coefs["genderM","Pr(>|z|)"] # 0.00894
lrt.p.value <- anova(null,fixed)[,"Pr(>Chisq)"][2] # 0.0152

null.intercept <- fixef(null) # -0.0914
null.subject.sd <- attr(VarCorr(null)$subject,"stddev") # 2.276
gender.observed <- fixef(fixed)[2] # -2.319
gender.simulated <- vector()

for (i in 1:1000) {
	simulated <- data.frame(subject = rep(letters[1:20],each=100),
		gender = rep(c("M","F"),each=1000),
		response = rbinom(2000,1,rep(plogis(rnorm(20,0,null.subject.sd)),each=100)))
	fixed.sim <- lmer(response ~ gender + (1|subject), simulated, binomial)
	gender.simulated[i] <- fixef(fixed.sim)[2]
}

# minutes later...

simulation.p.value <- mean(abs(gender.simulated)>=abs(gender.observed)) # 0.027
#
I just wrote a "simulate" method for glmer objects, to extend the
version that's already in lme4 for lmer objects.  Given a nested pair
of models m0, m1, the simulation distribution of t-statistics would
be

replicate(1000,get.t.statistic(refit(m1,simulate(m0))))

See simulate.lme in nlme, which also has a plot method
that displays the nominal and empirical p-values for REML
and ML estimates. 
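For the last step of that pipeline, turning the vector of simulated
statistics into an empirical two-sided p-value is a one-liner. A
minimal base-R sketch (`t.obs` and `t.sim` are illustrative names, not
part of lme4; the add-one correction is included so the estimate is
never exactly zero):

```r
# Empirical two-sided p-value from simulated null statistics.
# t.sim would come from something like
#   replicate(1000, get.t.statistic(refit(m1, simulate(m0))))
empirical.p <- function(t.obs, t.sim) {
  # add-one correction: the observed statistic counts as one draw
  # from the null, so the p-value is never exactly 0
  (1 + sum(abs(t.sim) >= abs(t.obs))) / (1 + length(t.sim))
}

# toy illustration with standard-normal "simulated" statistics
set.seed(1)
p <- empirical.p(t.obs = 2.5, t.sim = rnorm(1000))
```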

  The bad news is that it's essentially doing the same thing you
discuss below -- and therefore it's not any faster (I was able
to do about 10 replicates/minute ...)

  If you're interested let me know and I can send my
"glmersim.R"
 
  Ben Bolker
Daniel Ezra Johnson wrote:
#
Daniel,

This is almost parametric bootstrapping. Rather than looking at the
distribution of the parameter estimates, it looks at the distribution
of the test statistic (z value), and should be an improvement. I'm not
certain of the correct way to extract them, but
summary(fixed.sim)@coefs will return them.

Unless the sample size is small I wouldn't be particularly worried
about the distribution of the fixed effects estimates, and I expect in
your example there won't be much difference.

Another option is to use the sandwich estimator with logistic  
regression.

Ken
On 23/08/2008, at 8:22 AM, Daniel Ezra Johnson wrote:

#
My thanks to Daniel for an interesting topic for discussion on this
list.  Ben had earlier sent me private email with his code for
simulation from a glmer fit, and I think that the first thing I should
do is to incorporate that code, with thanks to Ben, into the lme4
package so we can do these simulations more easily and carry on this
informative discussion.
On Fri, Aug 22, 2008 at 7:45 PM, Ken Beath <ken at kjbeath.com.au> wrote:
It may be somewhat faster to take the value of fixef(fixed.sim) and
vcov(fixed.sim) and use those to create the z statistics.  The z
statistics should be

fixef(fixed.sim)/sqrt(diag(vcov(fixed.sim)))

This will avoid creating the whole summary object.
#
What would be the difference between simulating the z-statistic (if
I'm getting this, it would be determining what proportion of the
simulations have a z-statistic as large as the one from the observed
data) versus doing the same thing with the difference of
log-likelihoods?

One difference I see is that with the z-statistic approach there is no
need to fit a null model, only the alternative model (to data
generated by the null model)?
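For a single parameter the two are asymptotically equivalent: the
squared Wald z and the LRT chi-square converge to the same statistic
under the null. A quick sanity check of that claim with an ordinary
glm (toy data, no random effects, so it runs without lme4):

```r
# For one parameter, Wald z^2 and the LRT chi-square are asymptotically
# equivalent; check that they are close on simulated logistic data.
set.seed(42)
x <- rep(c(0, 1), each = 500)
y <- rbinom(1000, 1, plogis(-0.5 + 0.8 * x))

m1 <- glm(y ~ x, family = binomial)
m0 <- glm(y ~ 1, family = binomial)

z.sq <- (coef(summary(m1))["x", "z value"])^2
lrt  <- as.numeric(2 * (logLik(m1) - logLik(m0)))
# z.sq and lrt typically agree to within a few percent for data like these
```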

D
#
On 24/08/2008, at 2:16 AM, Daniel Ezra Johnson wrote:

They should produce similar results and which is better is probably of  
theoretical rather than practical interest.

  I will try this with a commercial program I have that does both.

Ken
#
We read, e.g. in Pinheiro and Bates, that one situation where
fixed-effect LRTs are anti-conservative is when the number of fixed
effect parameters being tested is large with respect to the number of
groups.

In the tests I'm doing, there's only a single binary fixed effect
factor being tested - a between-subjects factor like gender, as noted
earlier.

I'm finding evidence for pretty serious anti-conservatism here too
(e.g. LRT chi-square p=0.0001 vs. LRT simulation p=0.016), and I'm
working on some reproducible code to demonstrate this.
#
The other criterion is that the number of blocks has to be large.
I have seen *no* rules of thumb for how large ... In the presentation
that Doug Bates posted about recently
( http://www.stat.wisc.edu/~bates/PotsdamGLMM/GLMMD.pdf ),
p. 32, with 1934 observations and 60 groups, he uses an LRT. You could
try running your stuff on his example to see how well this works -- I
think all the code etc. is available from his web site -- although
there he gets p=0.796, so anti-conservatism would just make that
larger ...

  Ben
Daniel Ezra Johnson wrote:
#
A quick update on my progress implementing the simulation significance
test. If anyone is interested I can try to provide reproducible
examples illustrating these points.

1) I found that occasionally, refit(model.mer,response.vector)
produced a different result than glmer(response.vector ~ ...). My
solution is to use glmer() rather than refit() for each simulation,
and it doesn't appear to make it that much slower.

2) I found that the LRT approach does not seem to work as well as
simulating the z-statistic. In the LRT approach, occasionally either
the null model or the alternative model would come out with a "crazy"
log-likelihood -- e.g. one LL would be -400, the other -21000 -- and
this happened often enough in the right direction to throw off the
empirical p-values. So I'm going ahead with the z-statistic approach
(the same datasets don't give a "crazy" z-statistic).

3) There's still the problem of what to do when the simulated model
does not converge or has a false convergence - whether or not it's
biasing the empirical p-value to throw away those runs. It probably
is, in that whether or not the false convergence occurs is probably
not unrelated to the size of the fixed effect. But it doesn't appear
to be a direct relationship, so I'm comfortable tossing those runs for
now.
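One defensive pattern for point 3 (a sketch, not the actual code used
here; `fit.once` is a dummy stand-in for the real glmer call, rigged
to fail on some inputs so the pattern is visible) is to wrap each
simulation fit in tryCatch, record NA for failed runs, and report the
number dropped alongside the empirical p-value:

```r
# Sketch of guarding a simulation loop against fitting errors.
set.seed(2)
fit.once <- function(i) {
  if (i %% 5 == 0) stop("false convergence")  # simulate an occasional failure
  rnorm(1)                                    # stand-in for a z-statistic
}

safe.fit <- function(i) {
  tryCatch(fit.once(i), error = function(e) NA_real_)
}

z.sim <- vapply(1:20, safe.fit, numeric(1))
n.dropped <- sum(is.na(z.sim))   # report this with the empirical p-value
z.ok <- z.sim[!is.na(z.sim)]     # statistics from the runs that converged
```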

Probably all of these things are happening more often than they
usually would because I'm testing this on a rather extreme type of
data: binomial, 20 subjects with a large subject standard deviation
(5) and only 100 observations per subject, so there end up being a lot
of subjects with invariant responses, 0% or 100%.

And since my "observed" data has a Wald fixed effect p-value of 0.0005
(the p-value from the z-score), it's not that easy to judge whether
the simulation is slightly more conservative in a reasonable amount of
computing time!

Well, 1000 runs just finished and I got 2 where abs(z) >= z.obs, so
empirical p is 0.002.

I can provide more details if anyone is interested. Thanks again for
your help with this.

Daniel
#
Sorry if this has been covered elsewhere, but if my interest is in
testing a single fixed effect _term_ (all coefficients at once) is
there an appropriate statistic to simulate for a binomial model?

In other words, if I fit a linear model "glmodel" I can simulate one
of the F-statistics from anova(glmodel). If there's only one
coefficient for the term then F = t^2...
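The F = t^2 identity for a single-coefficient term is easy to verify
with an ordinary linear model (toy data, for illustration only):

```r
# For a term with one coefficient, the anova F statistic is the
# square of the summary t statistic.
set.seed(7)
d <- data.frame(x = rnorm(50))
d$y <- 1 + 0.5 * d$x + rnorm(50)

fit <- lm(y ~ x, data = d)
t.val <- coef(summary(fit))["x", "t value"]
f.val <- anova(fit)["x", "F value"]
# t.val^2 equals f.val (up to floating-point rounding)
```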

If I have a "glmmodel" I can do anova(glmmodel) but I wanted to make
sure the F-statistic reported there was a sensible thing to look at
since it wasn't quite the square of the z-statistic in the simple
case.

Maybe it doesn't have an F-distribution but it would still work well
as a single-number stand-in for the 'size of a fixed effect' in a
simulation...

Thanks,
D
#
On Mon, Aug 25, 2008 at 3:38 AM, Daniel Ezra Johnson
<danielezrajohnson at gmail.com> wrote:
Is there an F statistic reported in anova(glmmodel)?  I had a bug in
the code that allowed such a calculation in some circumstances but
Antje Nuthmann pointed out to me that it was incorrect and could
provoke an obscure error message.

The good news is that I changed the single-argument form of the anova
method in 0.999375-25 so the wrong results are not reported.  The bad
news is that no results are reported.  For a binomial or Poisson GLMM
I think the appropriate statistic would be a chi-square but I haven't
thought of exactly how it would be calculated.
#

  My guess, based on Littell et al 2003, would be that
something like  p %*% V^{-1} %*% p would give you a
quadratic form that would be chi-squared distributed with
rank(V) df, or (with an estimated scale parameter) F-distributed
with (rank(V), n-whatever) df -- or at least nominally
so, and if you're going to simulate anyway you're going
to find out how it's _really_ distributed ...

[I have a glmer fit named "zz", and a factor named "status"
that I want to test all levels == 0 simultaneously]

params <- grep("^status",names(fixef(zz)))
fixef(zz)[params] %*% solve(vcov(zz)[params,params]) %*% fixef(zz)[params]
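The same quadratic form can be illustrated end-to-end with an ordinary
glm, where `grp` is a made-up three-level factor (so the term
contributes two coefficients); for a glmer fit one would substitute
fixef() for coef():

```r
# Wald chi-square for a whole multi-level factor via the quadratic
# form b' V^{-1} b, illustrated with an ordinary glm.
set.seed(3)
d <- data.frame(grp = factor(rep(c("a", "b", "c"), each = 100)))
d$y <- rbinom(300, 1, c(a = 0.3, b = 0.5, c = 0.7)[as.character(d$grp)])

fit <- glm(y ~ grp, family = binomial, data = d)
idx <- grep("^grp", names(coef(fit)))       # name-matching, as in the thread

b <- coef(fit)[idx]
V <- vcov(fit)[idx, idx]
wald <- as.numeric(t(b) %*% solve(V) %*% b) # chi-squared, df = length(idx)
p.wald <- pchisq(wald, df = length(idx), lower.tail = FALSE)
```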

   cheers
    Ben
Daniel Ezra Johnson wrote:
#
On Mon, Aug 25, 2008 at 8:56 AM, Ben Bolker <bolker at ufl.edu> wrote:
(covers face and runs away screaming).

Umm, please don't use grep on names to determine which coefficients
are associated with a given factor.  That's what the terms and assign
attributes are for.

You split the fixed effects according to assign, as in the code for
the anova method.  For example
Classes 'terms', 'formula' length 3 r2 ~ btype + situ + mode + Gender * Anger
  ..- attr(*, "variables")= language list(r2, btype, situ, mode, Gender, Anger)
  ..- attr(*, "factors")= int [1:6, 1:6] 0 1 0 0 0 0 0 0 1 0 ...
  .. ..- attr(*, "dimnames")=List of 2
  .. .. ..$ : chr [1:6] "r2" "btype" "situ" "mode" ...
  .. .. ..$ : chr [1:6] "btype" "situ" "mode" "Gender" ...
  ..- attr(*, "term.labels")= chr [1:6] "btype" "situ" "mode" "Gender" ...
  ..- attr(*, "order")= int [1:6] 1 1 1 1 1 2
  ..- attr(*, "intercept")= int 1
  ..- attr(*, "response")= int 1
  ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
  ..- attr(*, "predvars")= language list(r2, btype, situ, mode, Gender, Anger)
  ..- attr(*, "dataClasses")= Named chr [1:6] "factor" "factor"
"factor" "factor" ...
  .. ..- attr(*, "names")= chr [1:6] "r2" "btype" "situ" "mode" ...
[1] 0 1 1 2 3 4 5 6

The assign attribute indicates that the first coefficient is the
intercept, the next two are associated with the first-order term
"btype", the next is associated with the first-order term "situ", ...,
the eighth is associated with the second-order term "Gender:Anger".
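A minimal sketch of using the assign attribute instead of grep, with
an invented data frame and formula:

```r
# Select the columns of a model matrix belonging to one term using
# the 'assign' attribute rather than grepping coefficient names.
d <- data.frame(f = factor(rep(c("a", "b", "c"), 4)),
                g = factor(rep(c("x", "y"), 6)))
X <- model.matrix(~ f + g, data = d)

asgn <- attr(X, "assign")    # 0 = intercept, 1 = first term, 2 = second, ...
labs <- attr(terms(~ f + g), "term.labels")

cols.f <- which(asgn == match("f", labs))  # columns for the 3-level factor f
# colnames(X)[cols.f] is c("fb", "fc")
```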
#
I was wondering if anybody could explain why an AIC can be calculated
for an lmer model with a quasipoisson distribution but not for a glm.
I have also found that the AICs are extremely sensitive in the lmer
model; for example, removing a term with very low t-values resulted in
an increase in AIC of 284.
Many thanks,
Anna

MODEL1:

mix<-lmer(realdis~sex+width+sess+sex:width+sex:sess+sess:width+(1|site),family=quasipoisson,data=move,subset=-52)
Generalized linear mixed model fit by the Laplace approximation
Formula: realdis ~ sex + width + sess + sex:width + sex:sess + sess:width +      (1 | site)
   Data: move
 Subset: -52
   AIC   BIC logLik deviance
 11019 11117  -5490    10979
Random effects:
 Groups   Name        Variance Std.Dev.
 site     (Intercept)  9.1673  3.0278
 Residual             76.4532  8.7438
Number of obs: 976, groups: site, 14

Fixed effects:
             Estimate Std. Error t value
(Intercept)   2.76589    0.91021  3.0387
sexm          0.17425    0.41856  0.4163
widthn        0.26750    0.99495  0.2689
widthw       -0.47361    0.55549 -0.8526
sess2        -0.43403    0.64479 -0.6731
sess3        -0.16228    0.85141 -0.1906
sess4        -0.40075    0.63427 -0.6318
sexm:widthn   0.78322    0.63070  1.2418
sexm:widthw   0.39836    0.44387  0.8975
sexm:sess2    0.02061    0.51386  0.0401
sexm:sess3    0.17900    0.52713  0.3396
sexm:sess4    0.36188    0.53644  0.6746
widthn:sess2  0.53589    0.83917  0.6386
widthw:sess2 -0.11500    0.65798 -0.1748
widthn:sess3 -0.19799    1.03927 -0.1905
widthw:sess3 -0.99033    0.86773 -1.1413
widthn:sess4 -0.17498    0.94033 -0.1861
widthw:sess4 -1.03199    0.75236 -1.3717

REMOVE WIDTH:SESS INTERACTION
 mix1<-lmer(realdis~sex+width+sess+sex:width+sex:sess+(1|site),family=quasipoisson,data=move,subset=-52)
Generalized linear mixed model fit by the Laplace approximation
Formula: realdis ~ sex + width + sess + sex:width + sex:sess + (1 | site)
   Data: move
 Subset: -52
   AIC   BIC logLik deviance
 11303 11372  -5638    11275
Random effects:
 Groups   Name        Variance Std.Dev.
 site     (Intercept)  6.8207  2.6117
 Residual             84.5884  9.1972
Number of obs: 976, groups: site, 14

Fixed effects:
            Estimate Std. Error t value
(Intercept)   3.0503     0.7807   3.907
sexm          0.1780     0.4249   0.419
widthn        0.1302     0.7417   0.176
widthw       -0.7784     0.5217  -1.492
sess2        -0.4633     0.4770  -0.971
sess3        -0.9002     0.4859  -1.853
sess4        -0.9513     0.4733  -2.010
sexm:widthn   0.7431     0.6553   1.134
sexm:widthw   0.3856     0.4592   0.840
sexm:sess2    0.0963     0.5314   0.181
sexm:sess3    0.2302     0.5511   0.418
sexm:sess4    0.3524     0.5481   0.643

Change in AIC =  11303 - 11019 = an increase of 284


#
I wouldn't put too much faith in the values of the AIC etc. for the
quasipoisson family.  I included that family as a convenience and I
believe that the parameter estimates make sense but I don't know about
the deviance and log-likelihood values.  I'm still working out the
effect of the common scale parameter in that family.
On Mon, Aug 25, 2008 at 10:46 AM, Renwick, A. R. <a.renwick at abdn.ac.uk> wrote:
#
Douglas Bates wrote:
Sorry about the running-away-screaming part.
  I'm just happy that you don't disagree that vehemently with the
statistical (as opposed to the R-programming) part of my advice ...

  thanks
   Ben
#
sorry for the naive question - how can one inspect the code for a
method rather than a function - like the anova method for lmer?
D
#
On Tue, Aug 26, 2008 at 9:19 AM, Daniel Ezra Johnson
<danielezrajohnson at gmail.com> wrote:
The simple way is to check the files in the source code package,
because they include the comments as well as the code.  The source
code archive is stored at http://r-forge.r-project.org/projects/lme4
under the SCM tab.  A direct URL is

http://r-forge.r-project.org/plugins/scmsvn/viewcvs.php/?root=lme4

The most recent version of the source code for essentially everything
associated with mer objects is in the file pkg/R/lmer.R

In general if you want to examine methods you first decide if you have
an S3 method or an S4 method.  S3 method names are returned by

methods(anova)

and S4 method names by

showMethods("anova")

To display the code for a particular S4 method use

getMethod("anova", "mer")