
binomial fixed-effect p-values by simulation

18 messages · Ken Beath, Daniel Ezra Johnson, Ben Bolker +2 more

#
Consider the common situation of modeling multiple responses from
multiple subjects, where we want to know whether a fixed-effect
between-subjects variable -- for example, gender -- has a significant
effect.

In my field (quantitative sociolinguistics) the response is usually
binomial, and for some 30 years the usual procedure has been to carry
out logistic regression, setting the subject variable aside, and
assessing fixed effect significance with a likelihood-ratio test. When
between-subject variation is high, this produces exceptionally
anti-conservative p-values (see old.p.value below).

I am writing an article and some wrapper functions designed to
introduce the use of mixed models to this audience. With subject
included as a random effect, the p-values for fixed between-subject
effects are much more conservative, even using the likelihood-ratio
test (see lrt.p.value below). In other words, this is an improvement.

But as has been discussed many times on this list, likelihood-ratio
tests are still anti-conservative for fixed effect significance. I
believe the extent of anti-conservatism depends on the number of
observations relative to the number of groups, the balance of the
data, and probably other factors too.

I have seen the MCMC method discussed, but I think this is not yet
available for binomial data with lmer() -- is that correct? In
addition, that method, which operates in the parameter space, seems to
test a slightly different null hypothesis than the usual frequentist
one, though I may be misunderstanding this.

Would it be appropriate for me to carry out a simulation of the type
given below, to assess fixed effect significance?

The code starts by generating a dataset which happens to have a
potentially significant gender effect, according to the LRT. (In fact
the data is randomly generated and has no "real" gender effect, but
assume it was real data and we did not know that.)

A null model (no fixed effect) and an alternative model (fixed gender
effect) are fit to the data by lmer(), using a random effect for
subject in both cases.

It then takes the two parameters from the null model -- intercept and
subject s.d. -- and randomly generates new data. The simulated data is
fit with the alternative-style model and the proportion of times the
fixed effect magnitude exceeds that from the 'observed' data is the
p-value.

Does this procedure make sense? It would generalize to linear models
easily. Making it work for unbalanced data and when there are other
fixed effects in the model would not be that hard either, I think.

To my mind, this is a direct implementation of the frequentist p-value
concept via simulation. Am I missing something important?

Of course, especially with binomial data this simulation is quite slow
(i.e. compared to the unimplemented MCMC method). The code below, with
1000 simulations of a 2000-observation dataset, took 15 minutes to run
on a MacBook Pro. (The result was a p-value of 0.027, compared to
0.015 from the LRT, 0.009 from the Wald statistic... and 1.1e-56 from
the fixed-effects model!)

It's important to me to be able to give p-values that are
accurate/conservative and that I understand. So if anyone can give me
guidance on whether this simulation procedure makes sense, it would be
much appreciated.

And if there's a faster way to do something like this for binomial
data, please let me know.

Thanks,
Daniel

library(lme4)
library(boot)
set.seed(10)

observed <- data.frame(subject = rep(letters[1:20],each=100),
	gender = rep(c("M","F"),each=1000),
	response = rbinom(2000,1,rep(plogis(rnorm(20,0,3)),each=100)))

old <- glm(response~gender,binomial,observed)
old.p.value <- pchisq(old$null.deviance-old$deviance,df=1,lower.tail=F)
# 1.083e-56

null <- lmer(response ~ (1|subject),observed,binomial)
fixed <- lmer(response ~ gender + (1|subject),observed,binomial)

wald.p.value <-  summary(fixed)@coefs["genderM","Pr(>|z|)"] # 0.00894
lrt.p.value <- anova(null,fixed)[,"Pr(>Chisq)"][2] # 0.0152

null.intercept <- fixef(null) # -0.0914
null.subject.sd <- attr(VarCorr(null)$subject,"stddev") # 2.276
gender.observed <- fixef(fixed)[2] # -2.319
gender.simulated <- vector()

for (i in 1:1000) {
	simulated <- data.frame(subject = rep(letters[1:20],each=100),
		gender = rep(c("M","F"),each=1000),
		response = rbinom(2000,1,rep(plogis(rnorm(20,0,null.subject.sd)),each=100)))
	fixed.sim <- lmer(response ~ gender + (1|subject), simulated, binomial)
	gender.simulated[i] <- fixef(fixed.sim)[2]
}

# minutes later...

simulation.p.value <- mean(abs(gender.simulated)>=abs(gender.observed)) # 0.027
#
I just wrote a "simulate" method for glmer objects, to extend the
version that's already in lme4 for lmer objects.  Given a nested pair
of models m0, m1, the simulation distribution of t-statistics would
be

replicate(1000,get.t.statistic(refit(m1,simulate(m0))))

See simulate.lme in nlme, which also has a plot method
that displays the nominal and empirical p-values for REML
and ML estimates. 
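For the last step of that pipeline, turning the vector of simulated
statistics into an empirical two-sided p-value is a one-liner. A
minimal base-R sketch (`t.obs` and `t.sim` are illustrative names, not
part of lme4; the add-one correction is included so the estimate is
never exactly zero):

```r
# Empirical two-sided p-value from simulated null statistics.
# t.sim would come from something like
#   replicate(1000, get.t.statistic(refit(m1, simulate(m0))))
empirical.p <- function(t.obs, t.sim) {
  # add-one correction: the observed statistic counts as one draw
  # from the null, so the p-value is never exactly 0
  (1 + sum(abs(t.sim) >= abs(t.obs))) / (1 + length(t.sim))
}

# toy illustration with standard-normal "simulated" statistics
set.seed(1)
p <- empirical.p(t.obs = 2.5, t.sim = rnorm(1000))
```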

  The bad news is that it's essentially doing the same thing you
discuss below -- and therefore it's not any faster (I was able
to do about 10 replicates/minute ...)

  If you're interested let me know and I can send my
"glmersim.R"
 
  Ben Bolker
Daniel Ezra Johnson wrote:
#
Daniel,

This is almost parametric bootstrapping. Rather than looking at the
distribution of the parameter estimates, it looks at the distribution
of the test statistic (z value), and should be an improvement. I'm not
certain of the correct way to extract them, but
summary(fixed.sim)@coefs will return them.

Unless the sample size is small I wouldn't be particularly worried
about the distribution of the fixed effects estimates, and I expect in
your example there won't be much difference.

Another option is to use the sandwich estimator with logistic  
regression.

Ken
On 23/08/2008, at 8:22 AM, Daniel Ezra Johnson wrote:

#
My thanks to Daniel for an interesting topic for discussion on this
list.  Ben had earlier sent me private email with his code for
simulation from a glmer fit, and I think that the first thing I should
do is to incorporate that code, with thanks to Ben, into the lme4
package so we can do these simulations more easily and carry on this
informative discussion.
On Fri, Aug 22, 2008 at 7:45 PM, Ken Beath <ken at kjbeath.com.au> wrote:
It may be somewhat faster to take the value of fixef(fixed.sim) and
vcov(fixed.sim) and use those to create the z statistics.  The z
statistics should be

fixef(fixed.sim)/sqrt(diag(vcov(fixed.sim)))

This will avoid creating the whole summary object.
#
What would be the difference between simulating the z-statistic (if
I'm getting this, it would be determining what proportion of the
simulations have a z-statistic as large as the one from the observed
data) versus doing the same thing with the difference of
log-likelihoods?

One difference I see is that with the z-statistic approach there is no
need to fit a null model, only the alternative model (to data
generated by the null model)?
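For a single parameter the two are asymptotically equivalent: the
squared Wald z and the LRT chi-square converge to the same statistic
under the null. A quick sanity check of that claim with an ordinary
glm (toy data, no random effects, so it runs without lme4):

```r
# For one parameter, Wald z^2 and the LRT chi-square are asymptotically
# equivalent; check that they are close on simulated logistic data.
set.seed(42)
x <- rep(c(0, 1), each = 500)
y <- rbinom(1000, 1, plogis(-0.5 + 0.8 * x))

m1 <- glm(y ~ x, family = binomial)
m0 <- glm(y ~ 1, family = binomial)

z.sq <- (coef(summary(m1))["x", "z value"])^2
lrt  <- as.numeric(2 * (logLik(m1) - logLik(m0)))
# z.sq and lrt typically agree to within a few percent for data like these
```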

D
#
On 24/08/2008, at 2:16 AM, Daniel Ezra Johnson wrote:

They should produce similar results and which is better is probably of  
theoretical rather than practical interest.

  I will try this with a commercial program I have that does both.

Ken
#
We read, e.g. in Pinheiro and Bates, that one situation where
fixed-effect LRTs are anti-conservative is when the number of fixed
effect parameters being tested is large with respect to the number of
groups.

In the tests I'm doing, there's only a single binary fixed effect
factor being tested - a between-subjects factor like gender, as noted
earlier.

I'm finding evidence for pretty serious anti-conservatism here too
(e.g. LRT chi-square p=0.0001 vs. LRT simulation p=0.016), and I'm
working on some reproducible code to demonstrate this.
#
The other criterion is that the number of blocks has to be large.
I have seen *no* rules of thumb for how large ... In the presentation
that Doug Bates posted about recently
( http://www.stat.wisc.edu/~bates/PotsdamGLMM/GLMMD.pdf ),
p. 32, with 1934 observations and 60 groups, he uses an LRT. You could
try running your stuff on his example to see how well this works -- I
think all the code etc. is available from his web site -- although
there he gets p=0.796, so anti-conservatism would just make that
larger ...

  Ben
Daniel Ezra Johnson wrote:
#
A quick update on my progress implementing the simulation significance
test. If anyone is interested I can try to provide reproducible
examples illustrating these points.

1) I found that occasionally, refit(model.mer,response.vector)
produced a different result than glmer(response.vector ~ ...). My
solution is to use glmer() rather than refit() for each simulation,
and it doesn't appear to make it that much slower.

2) I found that the LRT approach does not seem to work as well as
simulating the z-statistic. In the LRT approach, occasionally either
the null model or the alternative model would come out with a "crazy"
log-likelihood -- e.g. one LL would be -400, the other -21000 -- and
this happened often enough in the right direction to throw off the
empirical p-values. So I'm going ahead with the z-statistic approach
(the same datasets don't give a "crazy" z-statistic).

3) There's still the problem of what to do when the simulated model
does not converge or has a false convergence - whether or not it's
biasing the empirical p-value to throw away those runs. It probably
is, in that whether or not the false convergence occurs is probably
not unrelated to the size of the fixed effect. But it doesn't appear
to be a direct relationship, so I'm comfortable tossing those runs for
now.
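One defensive pattern for point 3 (a sketch, not the actual code used
here; `fit.once` is a dummy stand-in for the real glmer call, rigged
to fail on some inputs so the pattern is visible) is to wrap each
simulation fit in tryCatch, record NA for failed runs, and report the
number dropped alongside the empirical p-value:

```r
# Sketch of guarding a simulation loop against fitting errors.
set.seed(2)
fit.once <- function(i) {
  if (i %% 5 == 0) stop("false convergence")  # simulate an occasional failure
  rnorm(1)                                    # stand-in for a z-statistic
}

safe.fit <- function(i) {
  tryCatch(fit.once(i), error = function(e) NA_real_)
}

z.sim <- vapply(1:20, safe.fit, numeric(1))
n.dropped <- sum(is.na(z.sim))   # report this with the empirical p-value
z.ok <- z.sim[!is.na(z.sim)]     # statistics from the runs that converged
```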

Probably all of these things are happening more often than they
usually would because I'm testing this on a rather extreme type of
data: binomial, 20 subjects with a large subject standard deviation
(5) and only 100 observations per subject, so there end up being a lot
of subjects with invariant responses, 0% or 100%.

And since my "observed" data has a Wald fixed effect p-value of 0.0005
(the p-value from the z-score), it's not that easy to judge whether
the simulation is slightly more conservative in a reasonable amount of
computing time!

Well, 1000 runs just finished and I got 2 where abs(z) >= z.obs, so
empirical p is 0.002.

I can provide more details if anyone is interested. Thanks again for
your help with this.

Daniel
#
Sorry if this has been covered elsewhere, but if my interest is in
testing a single fixed effect _term_ (all coefficients at once) is
there an appropriate statistic to simulate for a binomial model?

In other words, if I fit a linear model "glmodel" I can simulate one
of the F-statistics from anova(glmodel). If there's only one
coefficient for the term then F = t^2...
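The F = t^2 identity for a single-coefficient term is easy to verify
with an ordinary linear model (toy data, for illustration only):

```r
# For a term with one coefficient, the anova F statistic is the
# square of the summary t statistic.
set.seed(7)
d <- data.frame(x = rnorm(50))
d$y <- 1 + 0.5 * d$x + rnorm(50)

fit <- lm(y ~ x, data = d)
t.val <- coef(summary(fit))["x", "t value"]
f.val <- anova(fit)["x", "F value"]
# t.val^2 equals f.val (up to floating-point rounding)
```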

If I have a "glmmodel" I can do anova(glmmodel) but I wanted to make
sure the F-statistic reported there was a sensible thing to look at
since it wasn't quite the square of the z-statistic in the simple
case.

Maybe it doesn't have an F-distribution but it would still work well
as a single-number stand-in for the 'size of a fixed effect' in a
simulation...

Thanks,
D
#
On Mon, Aug 25, 2008 at 3:38 AM, Daniel Ezra Johnson
<danielezrajohnson at gmail.com> wrote:
Is there an F statistic reported in anova(glmmodel)?  I had a bug in
the code that allowed such a calculation in some circumstances but
Antje Nuthmann pointed out to me that it was incorrect and could
provoke an obscure error message.

The good news is that I changed the single-argument form of the anova
method in 0.999375-25 so the wrong results are not reported.  The bad
news is that no results are reported.  For a binomial or Poisson GLMM
I think the appropriate statistic would be a chi-square but I haven't
thought of exactly how it would be calculated.
#

  My guess, based on Littell et al 2003, would be that
something like  p %*% V^{-1} %*% p would give you a
quadratic form that would be chi-squared distributed with
rank(V) df, or (with an estimated scale parameter) F-distributed
with (rank(V), n-whatever) df -- or at least nominally
so, and if you're going to simulate anyway you're going
to find out how it's _really_ distributed ...

[I have a glmer fit named "zz", and a factor named "status"
that I want to test all levels == 0 simultaneously]

params <- grep("^status",names(fixef(zz)))
fixef(zz)[params] %*% solve(vcov(zz)[params,params]) %*% fixef(zz)[params]
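The same quadratic form can be illustrated end-to-end with an ordinary
glm, where `grp` is a made-up three-level factor (so the term
contributes two coefficients); for a glmer fit one would substitute
fixef() for coef():

```r
# Wald chi-square for a whole multi-level factor via the quadratic
# form b' V^{-1} b, illustrated with an ordinary glm.
set.seed(3)
d <- data.frame(grp = factor(rep(c("a", "b", "c"), each = 100)))
d$y <- rbinom(300, 1, c(a = 0.3, b = 0.5, c = 0.7)[as.character(d$grp)])

fit <- glm(y ~ grp, family = binomial, data = d)
idx <- grep("^grp", names(coef(fit)))       # name-matching, as in the thread

b <- coef(fit)[idx]
V <- vcov(fit)[idx, idx]
wald <- as.numeric(t(b) %*% solve(V) %*% b) # chi-squared, df = length(idx)
p.wald <- pchisq(wald, df = length(idx), lower.tail = FALSE)
```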

   cheers
    Ben
Daniel Ezra Johnson wrote:
#
On Mon, Aug 25, 2008 at 8:56 AM, Ben Bolker <bolker at ufl.edu> wrote:
(covers face and runs away screaming).

Umm, please don't use grep on names to determine which coefficients
are associated with a given factor.  That's what the terms and assign
attributes are for.

You split the fixed effects according to assign, as in the code for
the anova method.  For example
Classes 'terms', 'formula' length 3 r2 ~ btype + situ + mode + Gender * Anger
  ..- attr(*, "variables")= language list(r2, btype, situ, mode, Gender, Anger)
  ..- attr(*, "factors")= int [1:6, 1:6] 0 1 0 0 0 0 0 0 1 0 ...
  .. ..- attr(*, "dimnames")=List of 2
  .. .. ..$ : chr [1:6] "r2" "btype" "situ" "mode" ...
  .. .. ..$ : chr [1:6] "btype" "situ" "mode" "Gender" ...
  ..- attr(*, "term.labels")= chr [1:6] "btype" "situ" "mode" "Gender" ...
  ..- attr(*, "order")= int [1:6] 1 1 1 1 1 2
  ..- attr(*, "intercept")= int 1
  ..- attr(*, "response")= int 1
  ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
  ..- attr(*, "predvars")= language list(r2, btype, situ, mode, Gender, Anger)
  ..- attr(*, "dataClasses")= Named chr [1:6] "factor" "factor"
"factor" "factor" ...
  .. ..- attr(*, "names")= chr [1:6] "r2" "btype" "situ" "mode" ...
[1] 0 1 1 2 3 4 5 6

The assign attribute indicates that the first coefficient is the
intercept, the next two are associated with the first-order term
"btype", the next is associated with the first-order term "situ", ...,
the eighth is associated with the second-order term "Gender:Anger".
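A minimal sketch of using the assign attribute instead of grep, with
an invented data frame and formula:

```r
# Select the columns of a model matrix belonging to one term using
# the 'assign' attribute rather than grepping coefficient names.
d <- data.frame(f = factor(rep(c("a", "b", "c"), 4)),
                g = factor(rep(c("x", "y"), 6)))
X <- model.matrix(~ f + g, data = d)

asgn <- attr(X, "assign")    # 0 = intercept, 1 = first term, 2 = second, ...
labs <- attr(terms(~ f + g), "term.labels")

cols.f <- which(asgn == match("f", labs))  # columns for the 3-level factor f
# colnames(X)[cols.f] is c("fb", "fc")
```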
#
I was wondering if anybody could explain why an AIC can be calculated
for an lmer model with a quasipoisson distribution but not for a glm.
I have also found that the AICs are extremely sensitive in the lmer
model; for example, removing a term with very low t-values resulted in
an increase in AIC of 284.
Many thanks,
Anna

MODEL1:

mix<-lmer(realdis~sex+width+sess+sex:width+sex:sess+sess:width+(1|site),family=quasipoisson,data=move,subset=-52)
Generalized linear mixed model fit by the Laplace approximation
Formula: realdis ~ sex + width + sess + sex:width + sex:sess + sess:width +      (1 | site)
   Data: move
 Subset: -52
   AIC   BIC logLik deviance
 11019 11117  -5490    10979
Random effects:
 Groups   Name        Variance Std.Dev.
 site     (Intercept)  9.1673  3.0278
 Residual             76.4532  8.7438
Number of obs: 976, groups: site, 14

Fixed effects:
             Estimate Std. Error t value
(Intercept)   2.76589    0.91021  3.0387
sexm          0.17425    0.41856  0.4163
widthn        0.26750    0.99495  0.2689
widthw       -0.47361    0.55549 -0.8526
sess2        -0.43403    0.64479 -0.6731
sess3        -0.16228    0.85141 -0.1906
sess4        -0.40075    0.63427 -0.6318
sexm:widthn   0.78322    0.63070  1.2418
sexm:widthw   0.39836    0.44387  0.8975
sexm:sess2    0.02061    0.51386  0.0401
sexm:sess3    0.17900    0.52713  0.3396
sexm:sess4    0.36188    0.53644  0.6746
widthn:sess2  0.53589    0.83917  0.6386
widthw:sess2 -0.11500    0.65798 -0.1748
widthn:sess3 -0.19799    1.03927 -0.1905
widthw:sess3 -0.99033    0.86773 -1.1413
widthn:sess4 -0.17498    0.94033 -0.1861
widthw:sess4 -1.03199    0.75236 -1.3717

REMOVE WIDTH:SESS INTERACTION
 mix1<-lmer(realdis~sex+width+sess+sex:width+sex:sess+(1|site),family=quasipoisson,data=move,subset=-52)
Generalized linear mixed model fit by the Laplace approximation
Formula: realdis ~ sex + width + sess + sex:width + sex:sess + (1 | site)
   Data: move
 Subset: -52
   AIC   BIC logLik deviance
 11303 11372  -5638    11275
Random effects:
 Groups   Name        Variance Std.Dev.
 site     (Intercept)  6.8207  2.6117
 Residual             84.5884  9.1972
Number of obs: 976, groups: site, 14

Fixed effects:
            Estimate Std. Error t value
(Intercept)   3.0503     0.7807   3.907
sexm          0.1780     0.4249   0.419
widthn        0.1302     0.7417   0.176
widthw       -0.7784     0.5217  -1.492
sess2        -0.4633     0.4770  -0.971
sess3        -0.9002     0.4859  -1.853
sess4        -0.9513     0.4733  -2.010
sexm:widthn   0.7431     0.6553   1.134
sexm:widthw   0.3856     0.4592   0.840
sexm:sess2    0.0963     0.5314   0.181
sexm:sess3    0.2302     0.5511   0.418
sexm:sess4    0.3524     0.5481   0.643

Change in AIC =  11303 - 11019 = an increase of 284


#
I wouldn't put too much faith in the values of the AIC etc. for the
quasipoisson family.  I included that family as a convenience and I
believe that the parameter estimates make sense but I don't know about
the deviance and log-likelihood values.  I'm still working out the
effect of the common scale parameter in that family.
On Mon, Aug 25, 2008 at 10:46 AM, Renwick, A. R. <a.renwick at abdn.ac.uk> wrote:
#
Douglas Bates wrote:
Sorry about the running-away-screaming part.
  I'm just happy that you don't disagree that vehemently with the
statistical (as opposed to the R-programming) part of my advice ...

  thanks
   Ben
#
sorry for the naive question - how can one inspect the code for a
method rather than a function - like the anova method for lmer?
D
#
On Tue, Aug 26, 2008 at 9:19 AM, Daniel Ezra Johnson
<danielezrajohnson at gmail.com> wrote:
The simple way is to check the files in the source code package,
because they include the comments as well as the code.  The source
code archive is stored at http://r-forge.r-project.org/projects/lme4
under the SCM tab.  A direct URL is

http://r-forge.r-project.org/plugins/scmsvn/viewcvs.php/?root=lme4

The most recent version of the source code for essentially everything
associated with mer objects is in the file pkg/R/lmer.R

In general if you want to examine methods you first decide if you have
an S3 method or an S4 method.  S3 method names are returned by

methods(anova)

and S4 method names by

showMethods("anova")

To display the code for a particular S4 method use

getMethod("anova", "mer")