Zero-inflated mixed effects model - clarification of zeros modeled and R package questions - R-SIG-mixed-models

Sat, Jul 7, 2012 4:44 PM

A couple of quick responses:

Hi folks,

  [snip]

Thanks for replying so quickly, Alain - it's much appreciated. To follow up
on your comments:

- Re: Spatial autocorrelation - I have dealt with spatial
autocorrelation in the past, though with continuous log-normal data (no
random effects - hence I used a spatial autoregressive model). I have
mentioned the likelihood of spatial autocorrelation in the residuals to my
employer/supervisor; however, he has advised that we proceed with the model
without accounting for autocorrelation, expecting that a large part of it
may be explained by the environmental variables (which are no doubt
clustered) once the model is fitted. I'm skeptical, as some of these species
also might seek "safety in numbers", selecting sites based on the abundance
of conspecifics nearby, and large flocks at a given site are likely to
utilize habitat at neighboring sites as well (if suitable). We shall see!

BMB>  You can always do a post-fitting test, graphical or statistical, for
the presence of spatial autocorrelation -- if you don't see anything
(clustering of residuals in a spatial plot of residuals, significant
Moran's I, or interesting-looking spatial variogram/correlogram)
then you should be OK ...
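
As a rough illustration of the Moran's I check mentioned above, here is a minimal sketch in plain Python (the thread is about R, so this only shows the arithmetic). It assumes inverse-distance weights and that no two sites share identical coordinates; the function name and the `residuals`/`coords` inputs are hypothetical and do not come from the original posts.

```python
import math

def morans_i(residuals, coords):
    """Global Moran's I using inverse-distance weights (assumed weighting).

    residuals: one model residual per site
    coords:    one (x, y) tuple per site, all distinct
    """
    n = len(residuals)
    mean = sum(residuals) / n
    dev = [r - mean for r in residuals]

    cross = 0.0   # weighted sum of cross-products of deviations
    w_sum = 0.0   # sum of all weights
    for i in range(n):
        for j in range(n):
            if i == j:
                continue  # a site is not its own neighbour
            w = 1.0 / math.dist(coords[i], coords[j])
            cross += w * dev[i] * dev[j]
            w_sum += w

    total_ss = sum(d * d for d in dev)
    return (n / w_sum) * (cross / total_ss)

# Toy example: six sites on a line, residuals clustered by position.
toy_coords = [(float(i), 0.0) for i in range(6)]
print(morans_i([1, 1, 1, -1, -1, -1], toy_coords))
```

Values well above the null expectation of -1/(n-1) point to spatially clustered residuals. Significance is usually judged with a permutation test, which is not shown here.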

- Re: random effect in the binomial process of a ZIP - don't I
have to include this, given the repeated measures?

BMB>  It depends.  In principle, there could be a random effect in
the binomial process of the ZIP.  In practice, at some point the
model becomes too computationally unwieldy/unstable, due to complexity
and possible overfitting.  Again, you can take the general strategy
of leaving out potentially difficult model complications, then see if
you can detect them in the residuals (in this case, differences among
groups in the discrepancy between predicted and actual numbers of zeros).
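
The group-wise zero check described above can be sketched as follows. This is a minimal plain-Python illustration, assuming you already have, for each observation, a fitted zero-inflation probability and a fitted Poisson mean from some ZIP fit; the function names and inputs are hypothetical.

```python
import math
from collections import defaultdict

def zip_zero_prob(pi, lam):
    """P(Y = 0) under a ZIP: zero process plus Poisson zeros."""
    return pi + (1.0 - pi) * math.exp(-lam)

def zeros_by_group(groups, counts, pis, lams):
    """Return {group: (observed zeros, expected zeros)}.

    groups: group label (e.g. site) for each observation
    counts: observed counts
    pis:    fitted zero-inflation probabilities (assumed given)
    lams:   fitted Poisson means (assumed given)
    """
    observed = defaultdict(int)
    expected = defaultdict(float)
    for g, y, pi, lam in zip(groups, counts, pis, lams):
        observed[g] += int(y == 0)
        expected[g] += zip_zero_prob(pi, lam)
    return {g: (observed[g], expected[g]) for g in expected}

# Toy example with two sites.
print(zeros_by_group(["a", "a", "b"], [0, 3, 0],
                     [0.5, 0.5, 0.0], [1.0, 1.0, 0.0]))
```

If some groups show consistently more (or fewer) observed zeros than expected, that is a hint that the zero process varies among groups in a way the model has not captured.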

Thanks to everyone else as well for your input. After reading your
responses, and diving into the lit a little more, you've convinced me that
MCMC is the way to go. However, I now have a few more quick (hopefully?)
questions:
- Because I'm a tad afraid of WinBUGS, I decided to look at MCMCglmm as
well. I noticed that the course notes for MCMCglmm state that "*As is often
the case the parameters of the zero-inflation model mixes poorly... Poor
mixing is often associated with distributions that may not be zero-inflated
but instead over-dispersed.*"  Am I correct in thus assuming that if the
data are indeed zero-inflated, "poor mixing" is not a problem? Or might
this also arise through other means?

BMB>  Poor mixing can happen any time you have a complex model.
Check the trace plots.
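
Alongside eyeballing trace plots, a simple numerical companion is the lag autocorrelation of each parameter's chain. The sketch below is a plain-Python illustration only; `chain` stands for a hypothetical list of posterior draws for one parameter, and the function name is made up for this example.

```python
def lag_autocorr(chain, lag=1):
    """Sample autocorrelation of an MCMC chain at the given lag."""
    n = len(chain)
    mean = sum(chain) / n
    var = sum((x - mean) ** 2 for x in chain)
    cov = sum((chain[t] - mean) * (chain[t + lag] - mean)
              for t in range(n - lag))
    return cov / var

# A chain that drifts slowly is strongly autocorrelated (poor mixing).
drifting = [float(t) for t in range(100)]
print(lag_autocorr(drifting, lag=1))
```

Autocorrelation that stays close to 1 over many lags usually means the chain needs to be run longer, thinned, or the model reparameterized.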

- Is there an advantage to using MCMCglmm versus WinBUGS or vice versa? It
seems either one will take some time to correctly code/specify, so I might
as well go the route that makes the most sense/is more highly recommended.

BMB> WinBUGS is more flexible, MCMCglmm is (much) faster and easier
for those problems which it can handle.  If you don't see yourself
needing to go beyond the problems that MCMCglmm can handle, I would
stick with it.

- And most importantly: As I mentioned in my original message, we had
wanted to compare competing hypotheses for what shoreline attributes
influence shorebird distributions, and to then use MMI in prediction;
however, I've read that DIC is not recommended for mixed effects models
(even though MuMIn accepts MCMCglmm output). According to a post by Jarrod
Hadfield, this is especially true for non-Gaussian data because the level
of focus is on the sampled observations (i.e., for "observations (y) on
children within schools... DIC would be focused at 'can we predict how many
times *these* children miss the bus'"). What are my options then for
model comparison/selection and prediction? Recall that we want to estimate
the total abundance of each shorebird species within the entire study
region (with confidence intervals). I'm really stuck here...

BMB> DIC is indeed problematic for several reasons: there's the
level-of-focus problem, and the problem that its derivation assumes
multivariate normal posterior distributions ...  You could try to count
parameters in a naive way (i.e. one parameter per variance or
covariance parameter, which is probably the right way to do it
for the "population" level of focus -- see Vaida and Blanchard 2005),
and use AIC based on the mean deviance as suggested by
Brooks, S.  2002.  Discussion of the paper by Spiegelhalter, Best,
Carlin, and van der Linde.  Journal of the Royal Statistical Society
B.  64: 616-618.
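
One possible reading of this suggestion, sketched in plain Python: take the posterior mean of the deviance from the MCMC output, add twice a naively counted number of parameters, and (if MMI is still wanted) turn the resulting scores into Akaike-style weights. The function names, the inputs, and the exact form of the penalty are assumptions made for illustration, not something spelled out in the thread.

```python
import math

def aic_from_mean_deviance(deviance_samples, n_params):
    """AIC-like score: posterior mean deviance + 2 * (naive parameter count)."""
    mean_dev = sum(deviance_samples) / len(deviance_samples)
    return mean_dev + 2.0 * n_params

def akaike_weights(scores):
    """Standard Akaike weights computed from a list of AIC-like scores."""
    best = min(scores)
    rel = [math.exp(-0.5 * (s - best)) for s in scores]
    total = sum(rel)
    return [r / total for r in rel]

# Hypothetical deviance draws for two competing models.
score_a = aic_from_mean_deviance([100.0, 102.0, 98.0], n_params=3)
score_b = aic_from_mean_deviance([101.0, 105.0, 103.0], n_params=5)
print(akaike_weights([score_a, score_b]))
```

The naive count here would be one parameter per fixed effect plus one per variance or covariance parameter, matching the "population" level of focus.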

  I would also say that you could just hope that one model
stands out so that you don't have to use MMI ...

  Ben Bolker

Thanks in advance... this is a huge statistical leap for me.

Cheers,
Jenn

On Thu, Jun 21, 2012 at 8:43 PM, Paul Johnson <pauljohn32 at gmail.com> wrote:

Dear Jennifer:
Response below

On Wed, Jun 20, 2012 at 5:32 PM, Jennifer Barrett
<jenn.s.barrett at gmail.com> wrote:

Hi folks,


I'm looking for some guidance in regards to zero-inflated models with
repeated measures (i.e., random effect for site). My first question is more
of a statistical one, while the second is related to R packages. Apologies
for the long post; however, I want to make sure my concerns/questions are
clear!


Our project and dataset:


- The aim of our project is to 1) examine associations between shoreline
habitat characteristics and the abundance of several shorebird species; and
2) estimate the total abundance of each shorebird species within the entire
study region based on the models from 1) above, with confidence intervals.
Note that we will be using an information theoretic approach for 1) above,
and would like to use MMI for 2).

- Our response dataset consists of counts of shorebirds at >150 coastal
sites, conducted on the second Sunday of each month between the months of
Oct-March, over 10 years; however, not every site was surveyed in all
months (we've limited our dataset to those with a minimum of 3 counts in a
year).  Our response variable is thus the number of birds counted in a
given month/year at a given site. Note that we plan to model each year
separately.

- The habitat dataset consists of shoreline units within our entire study
region, with each unit characterized by exposure, substrate type...etc.
Using GIS, we've measured the length of shoreline belonging to shoreline
categories (e.g., sand, rock, mud) within each survey site, the average
exposure for the site, and other continuous attributes, as well as one
presence/absence covariate.

- Initial exploratory analysis has shown that the counts are zero-inflated.
While there may be some false zeros in our dataset (i.e., observer error),
the source of the zero-inflation is likely preference of shorebirds for
particular sites with particular features and avoidance of others (i.e.,
true zeros or "structural zeros"). Some zeros likely also arise because the
species does not saturate its habitat (i.e., habitat suitable, but
unoccupied - also a "true" zero), though again, the majority of the zeros
are likely structural.


Onto my questions:


1) I've been reading through the literature to decide what type of model
would best be suited for our dataset and questions. While all articles seem
to agree that the choice of a model needs to consider the source of excess
zeros, they seem to contradict one another in regards to what zeros are
being modeled in each component of a zero-inflated mixture model. Note that
I am not considering a two-part (i.e., conditional) model, because I do not
believe that all zeros arise from the occupancy process (as per Joseph et
al. 2009 and as noted above, zero abundance can occur by chance in our
system). Examples:


- Martin et al. (2005) state that when zero inflation is due to true zeros,
two-part or mixture models (ZIP or ZINB) are recommended, and that when
zero inflation is due to false zeros, a ZIB mixture model is recommended;
however, when zero inflation is due to both excess true and false zeros, a
Bayesian framework may be used, though there is no formal discussion in the
literature. NOTE: Since this article was published, Royle's N-mixture model
has addressed this issue; however, I cannot use this approach as my data do
not meet the assumption of a closed population during the study period.

- In contrast to Martin et al. (2005), Potts and Elith (2006) state that
the zero-inflated mixture model structure implies that zero observations
arising from the zero process are true negative observations, and that
those arising from the Poisson process are false negative observations
"that is, the habitat is suitable, but unoccupied" (p.155). However, on the
previous page, they defined false negative as "attributable to experimental
design... or observer error", and habitat that is "suitable, but unoccupied"
as a true negative, so I'm not sure which type of zero observation they are
really referring to here for the Poisson process.

- In contrast to both sources above, Zuur et al. (2009) state that in a ZIP
or ZINB, zeros are modeled as coming from two processes: the binomial
process, which models only false zeros (observer, design, and survey error),
and the Poisson (or Negbin) process, which models the true zeros and
counts. This is the opposite of what was stated by Potts and Elith.

- Finally, I've read other sources which state that ZIPs simply treat the
population as a mixture, with one set of subjects having a zero response;
in other words, there is no mention of whether the zero process is modeling
the "true" or "false" zeros.
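
For reference, the standard ZIP mixture that these sources are all describing can be written as below. Here \pi is the probability of the zero (binomial) process and \lambda is the Poisson mean; both symbols are notation chosen for this sketch, not taken from the cited papers.

```latex
\begin{aligned}
P(Y = 0) &= \pi + (1 - \pi)\, e^{-\lambda}, \\
P(Y = y) &= (1 - \pi)\, \frac{e^{-\lambda} \lambda^{y}}{y!}, \qquad y = 1, 2, \dots \\
P(\text{zero process} \mid Y = 0) &= \frac{\pi}{\pi + (1 - \pi)\, e^{-\lambda}}
\end{aligned}
```

The likelihood only says that an observed zero can come from either component. Whether a given component's zeros are labelled "true", "false" or "structural" is an interpretation placed on top of the model, which may be why the sources appear to disagree.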


Thinking about my system: there are a bunch of sites where the birds (of a
given species) never go (habitat is unsuitable), and a bunch where they do
go with varying levels of abundance (habitat is suitable, but some sites
are more favored than others, based on habitat features). Following the
last bullet above, a site that is suitable may have a count of zero simply
because the species wasn't present there on the survey day (i.e., true zero
occurring by chance). Given the contradicting information above, and the
consensus on the importance of considering the source of zeros in model
selection, I would very much appreciate if someone could clear this up for
me - or let me know if I'm completely missing something here? Perhaps this
question should be posed on a stats forum, but given question 2 below, I
thought I'd try here first.


2) Assuming that I'm on the right track with a ZIP, is there a package I
can use to model a ZIP with a random effect for site? I looked at glmmADMB;
however, the zero inflation can only be modeled as a constant. This doesn't
make sense for my system, as the zero-inflation will be a function of
habitat covariates (see above). Likewise, glmmPQL is not an option, as this
method does not yield log-likelihoods (and thus no AIC). I'm also thinking
that the random effect will have to be included in the zero process as well
- is this right?

Some of your jargon is unfamiliar to me--"true" and "false" zeros. I
suppose a false zero would be the result of a "hurdle process" (as in
the pscl package).  I've not seen a hurdle model joined in the same
model with a zero-inflation model.  Certainly not with "random effects"
apart from the inflated zeros.

I do not believe there is an ML solution for your problem
within easy reach. However, there are Bayesian answers. Please see the
package MCMCglmm.  It has a very well done pair of vignettes.

MCMCglmm has a ZIP family option, and you can add random effects.
Jarrod Hadfield has been a regular contributor here and I think if you
post your working example code he and others will be glad to help out.

pj



--
Paul E. Johnson
Professor, Political Science    Assoc. Director
1541 Lilac Lane, Room 504     Center for Research Methods
University of Kansas               University of Kansas
http://pj.freefaculty.org            http://quant.ku.edu

Jennifer Barrett, BSc., MRM
Research Associate
Centre for Wildlife Ecology
Simon Fraser University
