Skip to content

glmer Z-test with individual random effects

7 messages · Jens Åström, Ben Bolker, Jarrod Hadfield +2 more

#
Dear list,

As I have read (Bolker et al. 2009 TREE), the Wald Z test is only
appropriate for GLMMs in cases without overdispersion.

Assuming we use family=poisson with lmer and tackle overdispersion by
incorporating an individual random effect AND this adequately "reduces"
the overdispersion, is it then OK to use the Wald z test as reported by
lmer?

In other words, are the p-values reported by lmer in those cases
useful/"correct"? Or do they suffer from the usual problems with
figuring out the number of parameters used by the random effects?

Secondly, is it good practice to judge lmer's capability of "reducing"
the overdispersion by summing the squared residuals (pearson) and
compare this to a chi square distribution (with N-1 degrees of freedom)?


Thankful for any comments or suggestions,

Jens ?str?m
#
On 11/11/2010 09:58 AM, Jens ?str?m wrote:
They are equivalent to assuming an infinite/large 'denominator degrees
of freedom'.  If you have a large sample size (both a large number of
total samples relative to the number of parameters, and a large number
of random-effects levels/blocks) then this should be reasonable -- if
not, then yes, the 'usual problems with figuring out the number of
parameters' is relevant.  On the other hand, if you're willing to assume
that the sample size is large, then likelihood ratio rests
(anova(model1,model2)) are probably better than the Wald tests anyway.
I would say this is reasonable, although again it's a rough guide
because the true degrees of freedom are a bit fuzzy -- it should
probably be at most N-(fixed effect degrees of freedom)?

   Would be happy to hear any conflicting opinions.

  Ben Bolker
#
The Wald tests (as represented in glm and glmer output of z-statistics
and p-values) are approximate for both glm models and glmm models
with non-identity links. The approximation can fail badly if the link is
highly non-linear over a region of the response that is relevant for a
parameter of interest.  The Hauck-Donner phenomenon, where the 
z-statistic decreases as the effect estimate increases, is an extreme 
example.
(This happens, e.g., as one of the levels being compared gives a
fitted value that moves close to a binomial proportion of 0 or 1, or 
close to a Poisson estimate of 0.)

The additional complication for a glmm is that the SE may have
two components -- e.g., a poisson or binomial error, and a random
normal error that is added on the scale of the linear predictor.  This
random normal error somewhat alleviates variance change effects
(including Hauck-Donner) that result from non-linearity in the link,
while adding uncertainty to the SE estimate.  If the contribution from
the glm family error term is largish relative to contribution from the 
random normal error (> 3 or 4  times as large?), treating the z-statistic 
as normal may not in many circumstances be too unreasonable,
even if the relevant degrees of freedom are as small as maybe 4
(e.g., where the test is for consistency across 5 locations). 

On data where I seem to have this situation, fairly consistently 
across a number of responses, I initially used MCMCglmm().
I was concerned about the contribution from the random normal 
error (including a contribution from an observation level random 
effect term).  I found I hat I had to choose a somewhat informative 
prior (inverse Wishart with V=1, nu=0.002) to consistently get 
convergence.  With this prior, the MCMCglmm credible intervals 
were remarkably close to glmer confidence intervals, treating the 
z-statistics as normal.  

No doubt the effect of the chosen prior is to insist that the true 
random normal error variance is fairly close to the variance as 
estimated by glmer().  A frustration (for me, at least) with using 
MCMCglmm is that I do not know just what MCMCglmm() is doing 
in this respect, short of doing some careful investigative exploration 
of the MCMC simulation results (which is an after-the-event check, 
where I'd like to know before the event).  Comments, Jarrod?

All this is to emphasise that we are not in the arena, unless the
relevant degrees of freedom are large, of precise science.

John Maindonald             email: john.maindonald at anu.edu.au
phone : +61 2 (6125)3473    fax  : +61 2(6125)5549
Centre for Mathematics & Its Applications, Room 1194,
John Dedman Mathematical Sciences Building (Building 27)
Australian National University, Canberra ACT 0200.
http://www.maths.anu.edu.au/~johnm
On 12/11/2010, at 2:18 AM, Ben Bolker wrote:

            
#
Hi John,

I'm not sure which part of the MCMCglmm routine is unclear but in a  
nut shell:

The data are distributed according to the specified distribution (e.g.  
Poisson ) with the dsitribution parameter equal to the linear  
predictor (nu) on the inverse link scale:

y ~ Pois(exp(nu))

A linear (mixed) model is proposed for the linear predictor:

nu ~ N(Xb+Zu, I*Ve)

In all models a "units" term is included which is the same as an  
observational-level random effect.  This is the residual term in the  
linear part of the model.

Priors are passed as  a list of the form:

prior=list(B=list(),
                  R=list(),
                  G=list(G1=list(),
                              G2=list()))


where B is the prior for the fixed effects, R is the prior for Ve (or  
possibly a residual covariance matrix depending on the model) and the  
elements of G are the priors for each of the random effect  
(co)variances.


The prior for b is multivariate normal Pr(b) ~ N(B$mu, B$V)

The prior for Ve is inverse-Wishart Pr(Ve) ~ IW(R$nu, R$V*R$nu) where  
the first term is the degree of belief and R$V*R$nu is the scale matrix

The prior for the random effects variances (Vg) are set in the same  
way as Ve, although its possible to also fit parameter expanded priors.

When the variance term is scalar (rather than a covariance matrix)  
then the inverse-Wishart and inverse-gamma are the same and IW(R$nu, R 
$V*R$nu) = IG(R$nu/2, R$V*R$nu/2) where the two parameters are the  
shape and scale. Your specification R=list(V=1, nu=0.002)  is  
therefore  IG(0.001, 0.001) which used to be used a lot (and probably  
still is) as a "weak" prior by users of WinBUGS.  The REML estimator  
for Ve should be equal to the posterior mode in MCMCglmm when the  
improper prior R=list(V=0, nu=-2)  is used. 0 will not be accepted by  
MCMCglmm so you would have to use something small (e.g. 1e-6). The  
prior is improper so you have to be careful when using it, and  
although the point estimate may have some useful properties (the same  
as the REML estimate) the full posterior distribution may not be ideal  
with small sample sizes.

The sampling scheme is:

update nu using MH updates or slice sampling
Gibbs sample b and u together
Gibbs sample (co)variance components sequentially.

For other models  such as ordinal regression or simultaneous/recursive  
models there are other schemes also, but these are quite specialised.

Cheers,

Jarrod
On 11 Nov 2010, at 23:45, John Maindonald wrote:

            

  
    
#
HI Jarrod -
The mechanics of what is happening are not my issue.  What bothers 
me is that I do not have any intuitive sense of how the choice of prior 
may be constraining the variance estimates.  Is it for example possible 
to say that the effect is comparable to adding  some number of degrees 
of freedom to the variance estimate?  What is the balance between
what comes from the prior, and what derives from the data?
John.

John Maindonald             email: john.maindonald at anu.edu.au
phone : +61 2 (6125)3473    fax  : +61 2(6125)5549
Centre for Mathematics & Its Applications, Room 1194,
John Dedman Mathematical Sciences Building (Building 27)
Australian National University, Canberra ACT 0200.
http://www.maths.anu.edu.au/~johnm
On 12/11/2010, at 10:51 PM, Jarrod Hadfield wrote:

            
#
Dear Ben,

Can I just check, is it necessary to have both a large total sample
size and a large number of levels of your random effect(s) for a
likelihood ratio test to be robust? Or does the large number of levels
requirement apply only to the Wald test? I'm referring to the part of
your answer below.

Thanks,

Andy.


"They are equivalent to assuming an infinite/large 'denominator degrees
of freedom'.  If you have a large sample size (both a large number of
total samples relative to the number of parameters, and a large number
of random-effects levels/blocks) then this should be reasonable -- if
not, then yes, the 'usual problems with figuring out the number of
parameters' is relevant.  On the other hand, if you're willing to assume
that the sample size is large, then likelihood ratio rests
(anova(model1,model2)) are probably better than the Wald tests anyway."
#
Hi,

I think this is a hard question to answer for complex models  
(otherwise MCMC would be redundant) but in simple situations the  
inverse-Wishart prior can be thought of as adding nu effects with  
known squared deviation V.

For example, we could have 5 levels of some factor (rfac) and observe  
10 data with pairs of observations associated with each level of rfac:

rfac<-gl(5,2)
e.rfac<-rnorm(5, 0, sqrt(0.1))
y<-rnorm(10)+e.rfac[rfac]
dat<-data.frame(y=y, rfac=rfac)

Here the intercept is 0.0, the residual variance is 1.0 and the  
variance of effects associated with rfac is 0.1. I'll call this last  
variance Vg.  We can fix the posterior for all parameters at their  
true values except Vg which has an IW prior with V=1 and nu=1:

prior1=list(B=list(mu=0, V=1e-6), R=list(V=1, fix=1),  
G=list(G1=list(V=1, nu=1)))

m1<-MCMCglmm(y~1, random=~rfac, data=dat, prior=prior1)

We could also observe an additional 1000 data points associated with a  
6th level of rfac which has an effect of 1.0:

  dat2<-data.frame(y=c(y,rnorm(1000)+1), rfac=c(rfac,rep(6, 1000)))

and analyse as before but with a flat prior (i.e. nu=0):

prior2=list(B=list(mu=0, V=1e-6), R=list(V=1, fix=1),  
G=list(G1=list(V=1, nu=0)))

m2<-MCMCglmm(y~1, random=~rfac, data=dat, prior=prior2)

The posterior distribution for Vg should be (almost) the same in both  
cases

plot(density(m1$VCV[,1]))
lines(density(m2$VCV[,1]), col="red")

With multi-parameter models, or quantities that are functions of two  
or more parameters, then this interpretation can be misleading.  
However, in many instances the idea does seem to serve as a good  
heuristic. The danger of course is thinking how can adding 1 prior  
effect (i.e. nu =1) make that much difference (compared to nu=0, say).  
It can when a) factor levels are poorly replicated b) replication  
within factor levels is poor and c) when prior V conflicts with the  
actual value of Vg.

Cheers,

Jarrod
On 13 Nov 2010, at 00:42, John Maindonald wrote: