GLMM for data with many zeros: poisson, lognormal-poisson or zero-inflated?

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
Dear members of the list,

This is my first message and as I am very new to GLMMs (and my
statistical knowledge is far from broad) I hope to formulate my
questions adequately. I have spent some days reading through the list
archives to find a solution to my problem, but I am still not sure if
I am doing right. I am analyzing data on number of seedlings emerged
in experimental plots for a perennial plant (thus, my response
variable are counts). I have performed my experiment on 6 different
sites (random factor). Each site has two areas: one with presence and
one with absence of populations of this particular species (so I have
a factor called "species", which is binomial: 0/1).
Normally one would speak of *predictor* variables (which this is) as
"binary" rather than "binomial"
Into each of these
areas I have 4 replies. Two of these replies are under closed forest
canopy and two in open areas (I called this fixed factor "cover" and
it has two levels: "high" and "low"). Finally, each of the replies
consists of 3 different treatments applied to sowed seeds (fixed
factor with three levels: A, E, C). I have made periodic revisions of
this experiment, so I have 7 dates of revision till now. I am trying
to fit a GLMM with lmer ni package lme4, in order to see the effect of
all these factors and their interactions in seedling emergence. I am
thinking of fitting a different GLMM for each revision (1-7) to see if
there are changes in significant factors.
Fitting a different GLMM for each one is simpler, but if you actually
want to make any inferences about the changes between revisions (don't
quite know what this means) you should do a single combined analysis and
include interactions with 'revision' in order to test differences in the
effects of factors across revisions: e.g. cover:revision (interaction
between cover and revision) would parameterize the variation in effect
of cover between revisions.

  6 sites is a fairly small number for estimating a random effect, but
not impossible -- you may discover, depending on the amount of noise in
your data set, that the estimated across-site variance is zero.  In
principle you can also look at the variation across sites in the effects
of factors (e.g. (1|site:cover) or (cover|site)), but you may run out of
data if you try to do too much of this ...
Here is how part of my data (those for rev 1, site 1) look:

rev         site        species  repl        cover     treat      seedlings

1             1             0             1             low        A
         3

1             1             0             1             low        C
         0

1             1             0             1             low        E
          0

1             1             0             2             low        A
         5

1             1             0             2             low        C
         0

1             1             0             2             low        E
          3

1             1             0             3             high       A
         1

1             1             0             3             high       C
         0

1             1             0             3             high       E
          0

1             1             0             4             high       A
         0

1             1             0             4             high       C

...

My first problem is that 66% of this count data (taking all the
revisions together) are zeros (and into each revision the percentage
of zeros is also high). Should I then fit a zero-inflated model? This
is not available in lmer, and I investigated zeroinfl in the package
pscl but it does not allow to include random effects (and, to my
knowledge, I should include "site" as a random effect). 
Maybe, but not necessarily.  If the means are low enough then
zero-inflation may not be necessary.  glmmADMB will fit zero-inflated
Poisson (or negative binomial) models with a single random effect (the
new version, coming soon, will allow multiple random effects).
Poisson family
is strongly suggested for count data, but using goodfit in the vcd
package I found that my data are significantly different from the
Poisson distribution. So which family should I use? I have read about
the problem with quasi distributions and why they are not any more
available in lmer. I have also read some posts talking about
overdispersion and translating the model to a lognormal-Poisson model
by using an individual-level random variable and using family=poisson.
I have tried this with my data and it worked, but I am really not sure
if I have overdispersion or not, as I do not know which parameter do I
have to look at in order to quantify overdispersion.
In general you want to look at the residual deviance or sum of
squared Pearson residuals, although this is only approximate. Checking
whether the overall (marginal) distribution is Poisson doesn't mean very
much in general.
Here is my poisson model for the first revision (I made subsets of the
data, and data1 corresponds to the first revision):

model1<-lmer(seedlings~species+cover+treat+cover:treat+species:treat+
(1+species+treat|site),family=poisson,data1)
My guess is that this is a bit too complicated in the random effects.
 I would see first if this works:

model1<-lmer(seedlings~treat*(species+cover)+(1|site),
       family=poisson,data1)

  then try to work up from there, seeing which combinations of (1)
zero-inflation (2) overdispersion (via negative binomial in glmmADMB or
via lognormal-Poisson in lme4) (3) more complex random effects your data
can accommodate.  Basically, you want to specify the most complex
*reasonable* model that your data can handle, then (if you are doing
hypothesis tests) drop selected terms from the full model (one at a
time) and evaluate the difference in likelihood/deviance.

  I'm not clear what 'revisions' are, and whether you want to treat them
as random effects (i.e., experimental blocks), or whether you have
specific interests/hypotheses about how the effects changed between
revisions ...
Then I created a new variable with a unique value for each observation
and included this as an additional random-effect term

data1$over <- 1:nrow(data1)

And this is the lognormal-Poisson model:

model2<-lmer(seedlings~species+cover+treat+cover:treat+species:treat+(1+species+treat|site)+(1|over),data
= data1,family = poisson)
This is the summary for both models:

summary(model1)
Generalized linear mixed model fit by the Laplace approximation
Formula: seedlings ~ species + cover + treat + cover:treat +
species:treat + (1 + species +      treat | site)
   Data: data1
   AIC   BIC logLik deviance
 407.3 463.1 -184.7    369.3
Random effects:
 Groups Name        Variance Std.Dev. Corr
 site   (Intercept)  0        0
        species1          0        0         NaN
        treatC        0        0         NaN   NaN
        treatE        0        0         NaN   NaN   NaN
Number of obs: 139, groups: site, 6

Fixed effects:
                 Estimate Std. Error z value Pr(>|z|)
(Intercept)     2.190e-01  2.374e-01   0.922   0.3563
species1            -3.586e-01  2.830e-01  -1.267   0.2050
coverlow        1.113e-01  2.803e-01   0.397   0.6913
treatC          -3.493e+01  2.432e+03  -0.014   0.9885
treatE          -1.136e-03  3.175e-01  -0.004   0.9971
coverlow:treatC  1.761e+01  1.803e+03   0.010   0.9922
coverlow:treatE  7.783e-01  3.534e-01   2.202   0.0277 *
species1:treatC       1.768e+01  1.632e+03   0.011   0.9914
species1:treatE       4.277e-01  3.451e-01   1.240   0.2152
---
Signif. codes:  0 ?***? 0.001 ?**? 0.01 ?*? 0.05 ?.? 0.1 ? ? 1

Correlation of Fixed Effects:
            (Intr) species1    covrlw treatC  treatE  cvrl:C cvrl:E species1:tC
species1         -0.539
coverlow    -0.623  0.041
treatC        0.000  0.000  0.000
treatE       -0.748  0.403  0.466  0.000
covrlw:trtC  0.000  0.000  0.000 -0.741  0.000
covrlw:trtE  0.494 -0.033 -0.793  0.000 -0.669  0.000
species1:treatC    0.000  0.000  0.000 -0.671  0.000  0.000  0.000
species1:treatE    0.442 -0.820 -0.034  0.000 -0.529  0.000  0.047  0.000
summary(model2)
Generalized linear mixed model fit by the Laplace approximation
Formula: seedlings ~ species + cover + treat + cover:treat +
species:treat + (1 + species +      treat | site) + (1 | over)
   Data: data1
   AIC   BIC logLik deviance
 238.9 297.6 -99.47    198.9
Random effects:
 Groups Name        Variance Std.Dev. Corr
 over   (Intercept) 2.5371   1.5928
 site   (Intercept) 0.0000   0.0000
        species1         0.0000   0.0000     NaN
        treatC       0.0000   0.0000     NaN   NaN
        treatE       0.0000   0.0000     NaN   NaN   NaN
Number of obs: 139, groups: over, 139; site, 6

Fixed effects:
                Estimate Std. Error z value Pr(>|z|)
(Intercept)      -0.8544     0.5720  -1.494    0.135
species1              -0.8135     0.6783  -1.199    0.230
coverlow          0.3388     0.6719   0.504    0.614
treatC           -35.3404  6943.1491  -0.005    0.996
treatE             0.2528     0.7643   0.331    0.741
coverlow:treatC   17.1976  5840.7446   0.003    0.998
coverlow:treatE    0.8608     0.8823   0.976    0.329
species1:treatC        17.2652  3754.0680   0.005    0.996
species1:treatE         0.4864     0.8843   0.550    0.582

Correlation of Fixed Effects:
            (Intr) species1    covrlw treatC  treatE  cvrl:C cvrl:E species1:tC
species1         -0.530
coverlow    -0.631  0.035
treatC        0.000  0.000  0.000
treatE       -0.748  0.396  0.472  0.000
covrlw:trtC  0.000  0.000  0.000 -0.841  0.000
covrlw:trtE  0.481 -0.027 -0.762  0.000 -0.635  0.000
species1:treatC    0.000  0.000  0.000 -0.541  0.000  0.000  0.000
species1:treatE    0.406 -0.767 -0.027  0.000 -0.528  0.000  0.020  0.000

I don't know why I get this 0 and NaN for the random effects. I tried
to look at the interaction of site with species an treatment, apart
from the single effect of site. Perhaps my formula specification is
not correct?

Besides, I performed an anova with the two models:

anova(model1,model2)
Data: data1
Models:
model1: seedlings ~ species + cover + treat + cover:treat +
species:treat + (1 + species +
model1:     treat | site)
model2: seedlings ~ species + cover + treat + cover:treat +
species:treat + (1 + species +
model2:     treat | site) + (1 | over)
       Df    AIC    BIC   logLik  Chisq Chi Df Pr(>Chisq)
model1 19 407.30 463.06 -184.651
model2 20 238.93 297.62  -99.465 170.37      1  < 2.2e-16 ***
---
Signif. codes:  0 ?***? 0.001 ?**? 0.01 ?*? 0.05 ?.? 0.1 ? ? 1

So looking at this I believe I should choose model 2, but I still
don?t see how can I know if there is an important overdispersion in
model 1. And do you think this model 2 is acceptable having too many
zeros in my data? Or should I try to fit any other kind of model? I
believe that keeping site as a fixed effect and using zeroinfl in pscl
package would be not correct at all? do you agree?

And my last question is: do you think it is ok to fit a different
model for each revision, or could I try some kind of repeated-measures
approach? I don?t know if this is available for GLMMs, and it goes far
beyond my statistical knowledge, but perhaps there is some possible
way.

Thanks a lot for your attention, and sorry for the long message. Any
kind of help or hint would be very welcome!

Regards,

Alicia Vald?s Rapado
PhD Student
Dpto. Biolog?a Organismos y Sistemas - Unidad de Ecolog?a
Universidad de Oviedo
Espa?a - Spain
e-mail: valdesalicia.uo at uniovi.es

_______________________________________________
R-sig-mixed-models at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.10 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAk3EIn0ACgkQc5UpGjwzenNBlgCdGlY7lDGvE8/0k0uDpgpJiUYz
utEAnR4mUzF6tseDK/cmUqpsYxZzh2Dv
=kEMb
-----END PGP SIGNATURE-----

GLMM for data with many zeros: poisson, lognormal-poisson or zero-inflated?

Thread (5 messages)