Skip to content

Question about zero-inflated Poisson glmer

29 messages · Thierry Onkelinx, Mollie Brooks, Ben Bolker +2 more

Messages 1–25 of 29

#
Dear group - I am currently fitting a Poisson glmer where I have an 
excess of outcomes that are zero (>95%). I am now debating on how to 
proceed and came up with three options:

1.) Just fit a regular glmer to the complete data. I am not fully sure 
how interpret the coefficients then, are they more optimizing towards 
distinguishing zero and non-zero, or also capturing the differences in 
those outcomes that are non-zero?

2.) Leave all zeros out of the data and fit a glmer to only those 
outcomes that are non-zero. Then, I would only learn about differences 
in the non-zero outcomes though.

3.) Use a zero-inflated Poisson model. My data is quite large-scale, so 
I am currently playing around with the EM implementation of Bolker et 
al. that alternates between fitting a glmer with data that are weighted 
according to their zero probability, and fitting a logistic regression 
for the probability that a data point is zero. The method is elaborated 
for the OWL data in: 
https://groups.nceas.ucsb.edu/non-linear-modeling/projects/owls/WRITEUP/owls.pdf

I am not fully sure how to interpret the results for the zero-inflated 
version though. Would I need to interpret the coefficients for the 
result of the glmer similar to as I would do for my idea of 2)? And then 
on top of that interpret the coefficients for the logistic regression 
regarding whether something is in the perfect or imperfect state? I am 
also not quite sure what the common approach for the zformula is here. 
The OWL elaborations only use zformula=z~1, so no random effect; I would 
use the same formula as for the glmer.

I am appreciating some help and pointers.

Thanks!
Philipp
#
Dear Philipp,

Do you have just lots of zero's, or more zero's than the Poisson
distribution can explain? Those are two different things. The example below
generates data from a Poisson distribution and has 99% zero's but no
zero-inflation. The second example has only 1% zero's but is clearly
zero-inflated.

set.seed(1)
n <- 1e8
sim <- rpois(n, lambda = 0.01)
mean(sim == 0)
hist(sim)

sim.infl <- rbinom(n, size = 1, prob = 0.99) * rpois(n, lambda = 1000)
mean(sim.infl == 0)
hist(sim.infl)

So before looking for zero-inflated models, try to model the zero's.

Best regards,


ir. Thierry Onkelinx
Instituut voor natuur- en bosonderzoek / Research Institute for Nature and
Forest
team Biometrie & Kwaliteitszorg / team Biometrics & Quality Assurance
Kliniekstraat 25
1070 Anderlecht
Belgium

To call in the statistician after the experiment is done may be no more
than asking him to perform a post-mortem examination: he may be able to say
what the experiment died of. ~ Sir Ronald Aylmer Fisher
The plural of anecdote is not data. ~ Roger Brinner
The combination of some data and an aching desire for an answer does not
ensure that a reasonable answer can be extracted from a given body of data.
~ John Tukey

2016-06-23 10:07 GMT+02:00 Philipp Singer <killver at gmail.com>:

  
  
#
Thanks Thierry - That totally makes sense. Is there some way of formally 
checking that, except thinking about the setting and underlying processes?
On 23.06.2016 11:04, Thierry Onkelinx wrote:

  
  
#
Dear Philipp,

1. Fit a Poisson model to the data.
2. Simulate a new response vector for the dataset according to the model.
3. Count the number of zero's in the simulated response vector.
4. Repeat step 2 and 3 a decent number of time and plot a histogram of the
number of zero's in the simulation. If the number of zero's in the original
dataset is larger than those in the simulations, then the model can't
capture all zero's. In such case, first try to update the model and repeat
the procedure. If that fails, look for zero-inflated models.

Best regards,

ir. Thierry Onkelinx
Instituut voor natuur- en bosonderzoek / Research Institute for Nature and
Forest
team Biometrie & Kwaliteitszorg / team Biometrics & Quality Assurance
Kliniekstraat 25
1070 Anderlecht
Belgium

To call in the statistician after the experiment is done may be no more
than asking him to perform a post-mortem examination: he may be able to say
what the experiment died of. ~ Sir Ronald Aylmer Fisher
The plural of anecdote is not data. ~ Roger Brinner
The combination of some data and an aching desire for an answer does not
ensure that a reasonable answer can be extracted from a given body of data.
~ John Tukey

2016-06-23 11:27 GMT+02:00 Philipp Singer <killver at gmail.com>:

  
  
#
Thanks! Actually, accounting for overdispersion is super important as it 
seems, then the zeros can be captured well.
On 23.06.2016 11:50, Thierry Onkelinx wrote:

  
  
#
Be careful when using overdispersion to model zero-inflation. Those are two
different things.

I've put some information together in
http://rpubs.com/INBOstats/zeroinflation

ir. Thierry Onkelinx
Instituut voor natuur- en bosonderzoek / Research Institute for Nature and
Forest
team Biometrie & Kwaliteitszorg / team Biometrics & Quality Assurance
Kliniekstraat 25
1070 Anderlecht
Belgium

To call in the statistician after the experiment is done may be no more
than asking him to perform a post-mortem examination: he may be able to say
what the experiment died of. ~ Sir Ronald Aylmer Fisher
The plural of anecdote is not data. ~ Roger Brinner
The combination of some data and an aching desire for an answer does not
ensure that a reasonable answer can be extracted from a given body of data.
~ John Tukey

2016-06-23 12:42 GMT+02:00 Philipp Singer <killver at gmail.com>:

  
  
#
Thanks, great information, that is really helpful.

I agree that those are different things, however when using a random 
effect for overdispersion, I can simulate the same number of zero 
outcomes (~95%).
On 23.06.2016 15:50, Thierry Onkelinx wrote:

  
  
#
Hi Philipp,

You could also try fitting the model with and without ZI using either glmmADMB or glmmTMB. Then compare the AICs. I believe model selection is useful for this, but I could be missing something since the simulation procedure that Thierry described seems to recommended more often.

https://github.com/glmmTMB/glmmTMB
http://glmmadmb.r-forge.r-project.org

glmmTMB is still in the development phase, but we?ve done a lot of testing.

cheers,
Mollie

------------------------
Mollie Brooks, PhD
Postdoctoral Researcher, Population Ecology Research Group
Department of Evolutionary Biology & Environmental Studies, University of Z?rich
http://www.popecol.org/team/mollie-brooks/

  
  
#
I would also comment that glmmTMB is likely to be much faster than the
lme4-based EM approach ...

  cheers
    Ben B.
On 16-06-23 12:47 PM, Mollie Brooks wrote:
I am not fully sure how to interpret the results for the
#
Did try glmmADMB but unfortunately it is way too slow for my data.

Did not know about glmmTMB, will try it out. Does it work with crossed 
random effects and how does it scale with more data? I will check the 
docu and try it though. Thanks for the info.
On 23.06.2016 19:14, Ben Bolker wrote:
#
glmmTMB does crossed RE. Ben did some timings in vignette("glmmTMB") and it was 2.3 times faster than glmer for one simple GLMM.

  
  
#
It indeed seems to run quite fast; had some trouble installing, but 
works now on my 3.3 R setup.

One question I have is regarding the specification of dispersion as I 
need to specify the dispformula. What is the difference here between 
just specifying fixed effects vs. also the random effects?
On 23.06.2016 23:07, Mollie Brooks wrote:

  
  
#
Update, I tried it like that, but receive an error message.

Warning message:
In nlminb(start = par, objective = fn, gradient = gr): NA/NaN function evaluation

Error in solve.default(hessian.fixed): Lapack routine dgesv: system is exactly singular: U[3,3] = 0
Traceback:

1. glmmTMB(y ~ 1 + x + (1 | b),
  .     data = data, family = "poisson", dispformula = ~1 + x)
2. sdreport(obj)
3. solve(hessian.fixed)
4. solve(hessian.fixed)
5. solve.default(hessian.fixed)

Any ideas on that?

BTW: Is it fine to post glmmTMB questions here, or should I rather use 
the github issue page, or is there maybe a dedicated mailing list?

Thanks,
Philipp
On 24.06.2016 14:35, Philipp Singer wrote:

  
  
#
Probably for now the glmmTMB issues page is best.

 When you go there:

  - details on installation problems/hiccups would be useful
  - a reproducible example for the problem listed below would be useful
  - dispformula is for allowing dispersion/residual variance to vary
with covariates (i.e., modeling heteroscedasticity)

  cheers
    Ben Bolker
On 16-06-24 09:13 AM, Philipp Singer wrote:
#
Thanks - I started an issue there to answer some of my questions.

Regarding the installation: I was trying to somehow do it in anaconda 
with a specific R kernel and had some issues. I am trying to resort that 
with the anaconda guys though, if I have a tutorial on how to properly 
setup glmmTMB in anaconda, I will let you know. The install worked fine 
in my standard R environment.
On 24.06.2016 15:40, Ben Bolker wrote:
2 days later
#
I have now played around more with the data an the models both using lme4
and glmmTMB.

I can report the following:

Modeling the data with a zero-inflated Poisson improves the model
significantly. However, when calling predict and simulating rpoissons, I
end up with nearly no values that are zero (in the original data there are
96% zero).

When I model the data with overdisperion by including an observation-level
random effect, I can also improve the model (not surprisingly due to the
random effect). When I predict outcomes by ignoring the observation-level
random effect (in lme4), I receive bad prediction if I compare it to the
original data. While many zeros can be captured (of course), the positive
outcomes can not be captured well.

Combining zero inflation and overdispersion further improves the model, but
I can only do that with glmmTMB and then have troubles doing predictions
ignoring the observation-level random effect.

Another side question:

In lme4, when I do:

m = glm(x~1,family="poisson")
rpois(n=len(data),lambda=predict(m, type='response',re.form=NA)

vs.

simulate(1,m,re.form=NA)

I receive different outcomes? Do I understand these function wrongly?

Would appreciate some more help/pointers!

Thanks,
Philipp

2016-06-24 15:52 GMT+02:00 Philipp Singer <killver at gmail.com>:

  
  
#
Dear Philipp,

How strong is the variance of the observation level random effect? I would
trust a model with large OLRE variance.

Best regards,

Thierry

ir. Thierry Onkelinx
Instituut voor natuur- en bosonderzoek / Research Institute for Nature and
Forest
team Biometrie & Kwaliteitszorg / team Biometrics & Quality Assurance
Kliniekstraat 25
1070 Anderlecht
Belgium

To call in the statistician after the experiment is done may be no more
than asking him to perform a post-mortem examination: he may be able to say
what the experiment died of. ~ Sir Ronald Aylmer Fisher
The plural of anecdote is not data. ~ Roger Brinner
The combination of some data and an aching desire for an answer does not
ensure that a reasonable answer can be extracted from a given body of data.
~ John Tukey

2016-06-27 14:59 GMT+02:00 Philipp Singer <killver at gmail.com>:

  
  
#
The variance is:

Conditional model:
 Groups            Name        Variance  Std.Dev.
 obs               (Intercept) 8.991e+01 9.4823139



2016-06-27 15:06 GMT+02:00 Thierry Onkelinx <thierry.onkelinx at inbo.be>:

  
  
#
Here is the fitted vs. residual plot for the observation-level poisson
model where the observation level has been removed as taken from:
https://stat.ethz.ch/pipermail/r-sig-mixed-models/2013q3/020817.html

So basically the prediction is always close to zero.

Note that this is just on a very small sample (1000 data points).

If I fit a nbinom2 to this smalle sample, I get predictions that are always
around ~20 (but never zero). Both plots are attached.

What I am wondering is whether I can do inference on a fixed parameter in
my model which is my main task of this study. The effect is similar in the
different models and in general I am only itnerested in whether it is
positive/negative and "significant" which it is. However, as can be seen,
the prediction looks not too good here.




2016-06-27 15:18 GMT+02:00 Philipp Singer <killver at gmail.com>:
-------------- next part --------------
A non-text attachment was scrubbed...
Name: fitted.png
Type: image/png
Size: 26021 bytes
Desc: not available
URL: <https://stat.ethz.ch/pipermail/r-sig-mixed-models/attachments/20160627/a817d3eb/attachment-0002.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: nbinom_fitted.png
Type: image/png
Size: 26194 bytes
Desc: not available
URL: <https://stat.ethz.ch/pipermail/r-sig-mixed-models/attachments/20160627/a817d3eb/attachment-0003.png>
#
Dear Philipp,

You've been bitten by observation level random effects. I've put together a
document about it on http://rpubs.com/INBOstats/OLRE. Bottomline you're
OKish when the standard devation of the OLRE smaller than 1. You're in
trouble when it's above 3. In between you need to check the model carefully.

Best regards,

ir. Thierry Onkelinx
Instituut voor natuur- en bosonderzoek / Research Institute for Nature and
Forest
team Biometrie & Kwaliteitszorg / team Biometrics & Quality Assurance
Kliniekstraat 25
1070 Anderlecht
Belgium

To call in the statistician after the experiment is done may be no more
than asking him to perform a post-mortem examination: he may be able to say
what the experiment died of. ~ Sir Ronald Aylmer Fisher
The plural of anecdote is not data. ~ Roger Brinner
The combination of some data and an aching desire for an answer does not
ensure that a reasonable answer can be extracted from a given body of data.
~ John Tukey

2016-06-27 16:17 GMT+02:00 Philipp Singer <killver at gmail.com>:

  
  
#
Well, as posted beforehand the std dev is 9.5 ... so does not seem too 
good then :/

Any other idea?
On 27.06.2016 17:31, Thierry Onkelinx wrote:

  
  
#
If there is overdispersion, then try a negative binomial model or a
zero-inflated negative binomial model. If not try a zero-inflated Poisson.
Adding relevant covariates can reduce overdispersion.

ir. Thierry Onkelinx
Instituut voor natuur- en bosonderzoek / Research Institute for Nature and
Forest
team Biometrie & Kwaliteitszorg / team Biometrics & Quality Assurance
Kliniekstraat 25
1070 Anderlecht
Belgium

To call in the statistician after the experiment is done may be no more
than asking him to perform a post-mortem examination: he may be able to say
what the experiment died of. ~ Sir Ronald Aylmer Fisher
The plural of anecdote is not data. ~ Roger Brinner
The combination of some data and an aching desire for an answer does not
ensure that a reasonable answer can be extracted from a given body of data.
~ John Tukey

2016-06-27 17:46 GMT+02:00 Philipp Singer <killver at gmail.com>:

  
  
#
NBinom was not really successfull unitl now, but will try to tune. 
Thanks for your help!

One point I forgot to mention was that apart from my excess of zeros, 
the lowest data outcome is 10, so there is a gap between zeri and 10. 
Could that be somehow a problem?
On 27.06.2016 21:59, Thierry Onkelinx wrote:

  
  
#
A gap between zero and the lowest non-zero count is normal for
zero-inflated data when the Poisson part has a high mean. In that case very
few (if any) zero's stem from the Poisson part.

Another option is to try a hurdle model. Or approximate a hurdle model by
fitting two separate models: a logistic regression (zero or not) and a
Poisson regression (count > 0). In case Prob(Poisson(mu) == 0) is small
then the approximation should be OK.

ir. Thierry Onkelinx
Instituut voor natuur- en bosonderzoek / Research Institute for Nature and
Forest
team Biometrie & Kwaliteitszorg / team Biometrics & Quality Assurance
Kliniekstraat 25
1070 Anderlecht
Belgium

To call in the statistician after the experiment is done may be no more
than asking him to perform a post-mortem examination: he may be able to say
what the experiment died of. ~ Sir Ronald Aylmer Fisher
The plural of anecdote is not data. ~ Roger Brinner
The combination of some data and an aching desire for an answer does not
ensure that a reasonable answer can be extracted from a given body of data.
~ John Tukey

2016-06-27 22:03 GMT+02:00 Philipp Singer <killver at gmail.com>:

  
  
#
Unfortunately, if I model the data with a zero-inflated negative 
binomial model (which appears to be the most appropriate model to me), 
the fitted values are never zero, but hover around a mean of 20, even 
though, as said, my data contains around 95% zeros.

I thought about hurdle models as well, but zero-inflated definitely fit 
the process better.
On 27.06.2016 21:59, Thierry Onkelinx wrote: