GLMM for data with many zeros: poisson, lognormal-poisson or zero-inflated?
-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1
On 11-05-05 05:47 AM, Alicia Vald?s wrote:
Dear members of the list, This is my first message and as I am very new to GLMMs (and my statistical knowledge is far from broad) I hope to formulate my questions adequately. I have spent some days reading through the list archives to find a solution to my problem, but I am still not sure if I am doing right. I am analyzing data on number of seedlings emerged in experimental plots for a perennial plant (thus, my response variable are counts). I have performed my experiment on 6 different sites (random factor). Each site has two areas: one with presence and one with absence of populations of this particular species (so I have a factor called "species", which is binomial: 0/1).
Normally one would speak of *predictor* variables (which this is) as "binary" rather than "binomial"
Into each of these areas I have 4 replies. Two of these replies are under closed forest canopy and two in open areas (I called this fixed factor "cover" and it has two levels: "high" and "low"). Finally, each of the replies consists of 3 different treatments applied to sowed seeds (fixed factor with three levels: A, E, C). I have made periodic revisions of this experiment, so I have 7 dates of revision till now. I am trying to fit a GLMM with lmer ni package lme4, in order to see the effect of all these factors and their interactions in seedling emergence. I am thinking of fitting a different GLMM for each revision (1-7) to see if there are changes in significant factors.
Fitting a different GLMM for each one is simpler, but if you actually want to make any inferences about the changes between revisions (don't quite know what this means) you should do a single combined analysis and include interactions with 'revision' in order to test differences in the effects of factors across revisions: e.g. cover:revision (interaction between cover and revision) would parameterize the variation in effect of cover between revisions. 6 sites is a fairly small number for estimating a random effect, but not impossible -- you may discover, depending on the amount of noise in your data set, that the estimated across-site variance is zero. In principle you can also look at the variation across sites in the effects of factors (e.g. (1|site:cover) or (cover|site)), but you may run out of data if you try to do too much of this ...
Here is how part of my data (those for rev 1, site 1) look:
rev site species repl cover treat seedlings
1 1 0 1 low A
3
1 1 0 1 low C
0
1 1 0 1 low E
0
1 1 0 2 low A
5
1 1 0 2 low C
0
1 1 0 2 low E
3
1 1 0 3 high A
1
1 1 0 3 high C
0
1 1 0 3 high E
0
1 1 0 4 high A
0
1 1 0 4 high C
...
My first problem is that 66% of this count data (taking all the
revisions together) are zeros (and into each revision the percentage
of zeros is also high). Should I then fit a zero-inflated model? This
is not available in lmer, and I investigated zeroinfl in the package
pscl but it does not allow to include random effects (and, to my
knowledge, I should include "site" as a random effect).
Maybe, but not necessarily. If the means are low enough then zero-inflation may not be necessary. glmmADMB will fit zero-inflated Poisson (or negative binomial) models with a single random effect (the new version, coming soon, will allow multiple random effects).
Poisson family is strongly suggested for count data, but using goodfit in the vcd package I found that my data are significantly different from the Poisson distribution. So which family should I use? I have read about the problem with quasi distributions and why they are not any more available in lmer. I have also read some posts talking about overdispersion and translating the model to a lognormal-Poisson model by using an individual-level random variable and using family=poisson. I have tried this with my data and it worked, but I am really not sure if I have overdispersion or not, as I do not know which parameter do I have to look at in order to quantify overdispersion.
In general you want to look at the residual deviance or sum of squared Pearson residuals, although this is only approximate. Checking whether the overall (marginal) distribution is Poisson doesn't mean very much in general.
Here is my poisson model for the first revision (I made subsets of the data, and data1 corresponds to the first revision): model1<-lmer(seedlings~species+cover+treat+cover:treat+species:treat+ (1+species+treat|site),family=poisson,data1)
My guess is that this is a bit too complicated in the random effects.
I would see first if this works:
model1<-lmer(seedlings~treat*(species+cover)+(1|site),
family=poisson,data1)
then try to work up from there, seeing which combinations of (1)
zero-inflation (2) overdispersion (via negative binomial in glmmADMB or
via lognormal-Poisson in lme4) (3) more complex random effects your data
can accommodate. Basically, you want to specify the most complex
*reasonable* model that your data can handle, then (if you are doing
hypothesis tests) drop selected terms from the full model (one at a
time) and evaluate the difference in likelihood/deviance.
I'm not clear what 'revisions' are, and whether you want to treat them
as random effects (i.e., experimental blocks), or whether you have
specific interests/hypotheses about how the effects changed between
revisions ...
Then I created a new variable with a unique value for each observation and included this as an additional random-effect term data1$over <- 1:nrow(data1) And this is the lognormal-Poisson model: model2<-lmer(seedlings~species+cover+treat+cover:treat+species:treat+(1+species+treat|site)+(1|over),data = data1,family = poisson)
This is the summary for both models:
summary(model1)
Generalized linear mixed model fit by the Laplace approximation
Formula: seedlings ~ species + cover + treat + cover:treat +
species:treat + (1 + species + treat | site)
Data: data1
AIC BIC logLik deviance
407.3 463.1 -184.7 369.3
Random effects:
Groups Name Variance Std.Dev. Corr
site (Intercept) 0 0
species1 0 0 NaN
treatC 0 0 NaN NaN
treatE 0 0 NaN NaN NaN
Number of obs: 139, groups: site, 6
Fixed effects:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.190e-01 2.374e-01 0.922 0.3563
species1 -3.586e-01 2.830e-01 -1.267 0.2050
coverlow 1.113e-01 2.803e-01 0.397 0.6913
treatC -3.493e+01 2.432e+03 -0.014 0.9885
treatE -1.136e-03 3.175e-01 -0.004 0.9971
coverlow:treatC 1.761e+01 1.803e+03 0.010 0.9922
coverlow:treatE 7.783e-01 3.534e-01 2.202 0.0277 *
species1:treatC 1.768e+01 1.632e+03 0.011 0.9914
species1:treatE 4.277e-01 3.451e-01 1.240 0.2152
---
Signif. codes: 0 ?***? 0.001 ?**? 0.01 ?*? 0.05 ?.? 0.1 ? ? 1
Correlation of Fixed Effects:
(Intr) species1 covrlw treatC treatE cvrl:C cvrl:E species1:tC
species1 -0.539
coverlow -0.623 0.041
treatC 0.000 0.000 0.000
treatE -0.748 0.403 0.466 0.000
covrlw:trtC 0.000 0.000 0.000 -0.741 0.000
covrlw:trtE 0.494 -0.033 -0.793 0.000 -0.669 0.000
species1:treatC 0.000 0.000 0.000 -0.671 0.000 0.000 0.000
species1:treatE 0.442 -0.820 -0.034 0.000 -0.529 0.000 0.047 0.000
summary(model2)
Generalized linear mixed model fit by the Laplace approximation
Formula: seedlings ~ species + cover + treat + cover:treat +
species:treat + (1 + species + treat | site) + (1 | over)
Data: data1
AIC BIC logLik deviance
238.9 297.6 -99.47 198.9
Random effects:
Groups Name Variance Std.Dev. Corr
over (Intercept) 2.5371 1.5928
site (Intercept) 0.0000 0.0000
species1 0.0000 0.0000 NaN
treatC 0.0000 0.0000 NaN NaN
treatE 0.0000 0.0000 NaN NaN NaN
Number of obs: 139, groups: over, 139; site, 6
Fixed effects:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.8544 0.5720 -1.494 0.135
species1 -0.8135 0.6783 -1.199 0.230
coverlow 0.3388 0.6719 0.504 0.614
treatC -35.3404 6943.1491 -0.005 0.996
treatE 0.2528 0.7643 0.331 0.741
coverlow:treatC 17.1976 5840.7446 0.003 0.998
coverlow:treatE 0.8608 0.8823 0.976 0.329
species1:treatC 17.2652 3754.0680 0.005 0.996
species1:treatE 0.4864 0.8843 0.550 0.582
Correlation of Fixed Effects:
(Intr) species1 covrlw treatC treatE cvrl:C cvrl:E species1:tC
species1 -0.530
coverlow -0.631 0.035
treatC 0.000 0.000 0.000
treatE -0.748 0.396 0.472 0.000
covrlw:trtC 0.000 0.000 0.000 -0.841 0.000
covrlw:trtE 0.481 -0.027 -0.762 0.000 -0.635 0.000
species1:treatC 0.000 0.000 0.000 -0.541 0.000 0.000 0.000
species1:treatE 0.406 -0.767 -0.027 0.000 -0.528 0.000 0.020 0.000
I don't know why I get this 0 and NaN for the random effects. I tried
to look at the interaction of site with species an treatment, apart
from the single effect of site. Perhaps my formula specification is
not correct?
Besides, I performed an anova with the two models:
anova(model1,model2)
Data: data1
Models:
model1: seedlings ~ species + cover + treat + cover:treat +
species:treat + (1 + species +
model1: treat | site)
model2: seedlings ~ species + cover + treat + cover:treat +
species:treat + (1 + species +
model2: treat | site) + (1 | over)
Df AIC BIC logLik Chisq Chi Df Pr(>Chisq)
model1 19 407.30 463.06 -184.651
model2 20 238.93 297.62 -99.465 170.37 1 < 2.2e-16 ***
---
Signif. codes: 0 ?***? 0.001 ?**? 0.01 ?*? 0.05 ?.? 0.1 ? ? 1
So looking at this I believe I should choose model 2, but I still
don?t see how can I know if there is an important overdispersion in
model 1. And do you think this model 2 is acceptable having too many
zeros in my data? Or should I try to fit any other kind of model? I
believe that keeping site as a fixed effect and using zeroinfl in pscl
package would be not correct at all? do you agree?
And my last question is: do you think it is ok to fit a different
model for each revision, or could I try some kind of repeated-measures
approach? I don?t know if this is available for GLMMs, and it goes far
beyond my statistical knowledge, but perhaps there is some possible
way.
Thanks a lot for your attention, and sorry for the long message. Any
kind of help or hint would be very welcome!
Regards,
Alicia Vald?s Rapado
PhD Student
Dpto. Biolog?a Organismos y Sistemas - Unidad de Ecolog?a
Universidad de Oviedo
Espa?a - Spain
e-mail: valdesalicia.uo at uniovi.es
_______________________________________________ R-sig-mixed-models at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models
-----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.10 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ iEYEARECAAYFAk3EIn0ACgkQc5UpGjwzenNBlgCdGlY7lDGvE8/0k0uDpgpJiUYz utEAnR4mUzF6tseDK/cmUqpsYxZzh2Dv =kEMb -----END PGP SIGNATURE-----