Poisson Regression: questions about tests of assumptions
6 messages · Eiko Fried, Achim Zeileis, Joshua Wiley +1 more
On Sun, 14 Oct 2012, Eiko Fried wrote:
I would like to test in R which regression model fits my data best. My dependent variable is a count and has a lot of zeros, and I would need some help to determine what model and family to use (poisson or quasipoisson, or zero-inflated poisson regression), and how to test the assumptions.

1) Poisson Regression: as far as I understand, the strong assumption is that dependent variable mean = variance. How do you test this? How close together do they have to be? Are unconditional or conditional mean and variance used for this? What do I do if this assumption does not hold?
There are various formal tests for this, e.g., dispersiontest() in package "AER". Alternatively, you can use a simple likelihood-ratio test (e.g., by means of lrtest() in "lmtest") between a poisson and negative binomial (NB) fit. The p-value can even be halved because the Poisson is on the border of the NB theta parameter range (theta = infty). However, overdispersion can already matter before this is detected by a significance test. Hence, if in doubt, I would simply use an NB model and you're on the safe side. And if the NB's estimated theta parameter turns out to be extremely large (say beyond 20 or 30), then you can still switch back to Poisson if you want.
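The two checks described above might look like this in practice. This is a sketch on simulated overdispersed data, using dispersiontest() from "AER" and lrtest() from "lmtest" as named in the reply; the data and coefficients are made up:

```r
## Simulated overdispersed counts (NB with theta = 1.5)
set.seed(1)
x <- rnorm(200)
y <- rnbinom(200, mu = exp(1 + 0.5 * x), size = 1.5)

library(AER)    # for dispersiontest()
library(MASS)   # for glm.nb()
library(lmtest) # for lrtest()

fm_pois <- glm(y ~ x, family = poisson)
fm_nb   <- glm.nb(y ~ x)

## Score test for overdispersion in the Poisson fit
dispersiontest(fm_pois)

## LR test Poisson vs. NB; the p-value can be halved because
## theta = Inf lies on the boundary of the parameter space
lrt <- lrtest(fm_pois, fm_nb)
0.5 * lrt[["Pr(>Chisq)"]][2]
```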
2) I read that if variance is greater than mean we have overdispersion, and a potential way to deal with this is including more independent variables, or family=quasipoisson. Does this distribution have any other requirements or assumptions? What test do I use to see whether 1) or 2) fits better - simply anova(m1,m2)?
quasipoisson yields the same parameter estimates as the poisson, only the inference is adjusted appropriately.
3) I also read that negative-binomial distribution can be used when overdispersion appears. How do I do this in R?
glm.nb() in "MASS" is one of standard options.
What is the difference to quasipoisson?
The NB is a likelihood-based model while the quasipoisson is not associated with a likelihood (but has the same conditional mean equation).
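A small illustration of this point, on simulated data: quasi-Poisson reproduces the Poisson point estimates exactly and only rescales the inference, while glm.nb() is a full likelihood model with its own logLik():

```r
set.seed(2)
x <- rnorm(100)
y <- rpois(100, lambda = exp(1 + 0.3 * x))

library(MASS)

fm_pois  <- glm(y ~ x, family = poisson)
fm_qpois <- glm(y ~ x, family = quasipoisson)
fm_nb    <- glm.nb(y ~ x)

all.equal(coef(fm_pois), coef(fm_qpois))  # TRUE: identical estimates
summary(fm_qpois)$dispersion              # only the standard errors are rescaled
logLik(fm_nb)                             # defined for NB; quasi has no likelihood
```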
4) Zero-inflated Poisson Regression: I read that using the vuong test checks what models fits better.
vuong(model.poisson, model.zero.poisson)
Is that correct?
It's one of the possibilities.
5) ats.ucla.edu has a section about zero-inflated Poisson regressions, and tests the zero-inflated model (a) against the standard Poisson model (b):
m.a <- zeroinfl(count ~ child + camper | persons, data = zinb)
m.b <- glm(count ~ child + camper, family = poisson, data = zinb)
vuong(m.a, m.b)
I don't understand what the "| persons" part of the first model does, and why you can compare these models at all. I had expected the regression to be the same and just use a different family.
I recommend you read the associated documentation. See
vignette("countreg", package = "pscl")
For glm.nb() I recommend its accompanying documentation, namely the MASS
book.
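For readers skimming the archive: in pscl's zeroinfl(), a two-part formula "y ~ x1 + x2 | z" puts x1 and x2 in the count component and z in the zero-inflation component, which is what the "| persons" part does. A sketch using the bioChemists data shipped with "pscl" (not the zinb data from the UCLA page):

```r
library(pscl)
data("bioChemists", package = "pscl")

## Count component: fem + ment; zero-inflation component: kid5
fm <- zeroinfl(art ~ fem + ment | kid5, data = bioChemists)
summary(fm)  # two coefficient blocks: count model and zero-inflation model
```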
hth,
Z
Hi Eiko,

This is not a direct response to your question, but I thought you might find these pages helpful. On negative binomial: http://www.ats.ucla.edu/stat/R/dae/nbreg.htm and zero-inflated poisson: http://www.ats.ucla.edu/stat/R/dae/zipoisson.htm In general this page lists a variety of different models with examples: http://www.ats.ucla.edu/stat/dae/

I wrote the code for most of those pages. Some of your questions seem a little more specific than would be addressed on those general introductory pages, but if there are particular things you would like to see done, I am open to hearing about it.

One thing I do want to mention is that for publication or presentation purposes, graphs of predicted counts can be quite nice for readers. I tried to show how to do that in all of those pages, and you can even use bootstrapping (for which I gave examples at least in the zero-inflated models) to get robust confidence intervals.

Cheers,
Josh
On Sun, Oct 14, 2012 at 3:14 PM, Eiko Fried <torvon at gmail.com> wrote:
Thank you for the detailed answer, that was really helpful.
I did some extensive reading and calculating in the hours since your
reply, and have a few (hopefully much better informed) follow-up questions.
1) In the vignette("countreg", package = "pscl"), LLH, AIC and BIC values
are listed for the models Negative-binomial (NB), Zero-Inflated (ZI), ZI
NB, Hurdle NB, and Poisson (Standard). And although I found a way to
determine LLH and DF for all models, BIC & AIC values are not displayed by
default, not even by the code given in the vignette. How do I calculate
these? (AIC is given by default in only 2 models, BIC in none.)
2) For the zero-inflated models, the first block of count model
coefficients is only in the output in order to compare the changes,
correct? That is, in my results in a paper I would only report the second
block of (zero-inflation) model coefficients? Or do I misunderstand
something? I am confused because in their large table to compare
coefficients, they primarily compare the first block of coefficients (p. 18)
R> fm <- list("ML-Pois" = fm_pois, "Quasi-Pois" = fm_qpois, "NB" = fm_nbin,
+ "Hurdle-NB" = fm_hurdle, "ZINB" = fm_zinb)
R> sapply(fm, function(x) coef(x)[1:8])
3)
There are various formal tests for this, e.g., dispersiontest() in package "AER".
I have to run 9 models - I am testing the influence of several predictors on different individual symptoms of a mental disorder, as "counted" in the last week (0 = never in the last week, to 3 = on all days within the last week). So I'm regressing the same predictors onto 9 different symptoms in 9 models.

dispersiontest() for these 9 models results in 3-4 overdispersed models (depending on whether testing one- or two-sided at the p = .05 level), 2 underdispersed models, and 4 non-significant models. The by far largest dispersion value is 2.1, in a model that is not overdispersed according to the test - but that's the symptom with 80% zeros, so maybe that plays a role here. Does NB also make sense in underdispersed models?

However, overdispersion can already matter before this is detected by a
significance test. Hence, if in doubt, I would simply use an NB model and you're on the safe side. And if the NB's estimated theta parameter turns out to be extremely large (say beyond 20 or 30), then you can still switch back to Poisson if you want.
Out of the 9 models, the 3 overdispersed NB models "worked" with theta
estimates between 0.4 and 8.6. The results look just fine, so I guess NB is
appropriate here.
4 other models (including the 2 underdispersed ones) gave warnings (theta
iteration / alternation limit reached), and although the other values look
ok (estimates, LLH, AIC), theta estimates are between 1,000 and 11,000.
Could you recommend me what to do here?
Thanks
T
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Joshua Wiley Ph.D. Student, Health Psychology Programmer Analyst II, Statistical Consulting Group University of California, Los Angeles https://joshuawiley.com/
Just a side note for your 4th question: for a small sample, the Clarke test instead of the Vuong test might be more appropriate, and the calculation is so simple that even Excel can handle it :-)
============================== WenSui Liu Credit Risk Manager, 53 Bancorp wensui.liu at 53.com 513-295-4370
On Sun, 14 Oct 2012, Eiko Fried wrote:
Thank you for the detailed answer, that was really helpful. I did some
extensive reading and calculating in the hours since your reply,
and have a few (hopefully much better informed) follow-up questions.

1) In the vignette("countreg", package = "pscl"), LLH, AIC and BIC
values are listed for the models Negative-binomial (NB), Zero-Inflated
(ZI), ZI NB, Hurdle NB, and Poisson (Standard). And although I found a
way to determine LLH and DF for all models, BIC & AIC values are not
displayed by default, not even by the code given in the vignette. How
do I calculate these? (AIC is given by default in only 2 models, BIC
in none.)
The code underlying the vignette can always be inspected:
edit(vignette("countreg", package = "pscl"))
The JSS also hosts a cleaned-up version of the replication code.
Most likelihood-based models provide a logLik() method which extracts the
fitted log-likelihood along with the corresponding degrees of freedom.
Then the default AIC() method can compute the AIC and AIC(..., k = log(n))
computes the corresponding BIC. This is hinted at in Table 3 of the
vignette. If "n", the number of observations, can be extracted by nobs(),
then also BIC() works. This is not yet the case for zeroinfl/hurdle,
though.
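The logLik()/AIC() route described above can be sketched as follows. A plain Poisson glm is used here so the example is self-contained; zeroinfl()/hurdle() fits from "pscl" provide logLik() in the same way:

```r
set.seed(3)
y  <- rpois(50, lambda = 2)
fm <- glm(y ~ 1, family = poisson)

ll <- logLik(fm)
n  <- length(y)       # or nobs(fm) where a method exists
k  <- attr(ll, "df")  # number of estimated parameters

AIC(fm)                      # -2*logLik + 2*k
-2 * as.numeric(ll) + 2 * k  # the same value, computed by hand
AIC(fm, k = log(n))          # BIC: -2*logLik + log(n)*k
```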
2) For the zero-inflated models, the first block of count model coefficients is only in the output in order to compare the changes, correct? That is, in my results in a paper I would only report the second block of (zero-inflation) model coefficients? Or do I misunderstand something?
No, no, yes. Hurdle and zero-inflation models have two linear predictors, one for the zero hurdle/inflation and one for the truncated/un-inflated count component. Both are typically of interest for different aspects of the data.
I am
confused because in their large table to compare coefficients, they
primarily compare the first block of coefficients (p. 18)
R> fm <- list("ML-Pois" = fm_pois, "Quasi-Pois" = fm_qpois, "NB" = fm_nbin,
+ "Hurdle-NB" = fm_hurdle, "ZINB" = fm_zinb)
R> sapply(fm, function(x) coef(x)[1:8])
All models considered have a predictor for the mean of the count component, hence this can be compared across all models.
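The two coefficient blocks can be extracted separately via coef()'s model argument in "pscl". A sketch on simulated zero-inflated data (all numbers made up):

```r
library(pscl)
set.seed(4)
x <- rnorm(200)
## 30% structural zeros on top of a Poisson count process
y <- ifelse(rbinom(200, 1, 0.3) == 1, 0, rpois(200, exp(1 + 0.5 * x)))

fm_zip <- zeroinfl(y ~ x)
coef(fm_zip, model = "count")  # un-inflated count component
coef(fm_zip, model = "zero")   # zero-inflation component
```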
3)
There are various formal tests for this, e.g., dispersiontest()
in package "AER".
I have to run 9 models - I am testing the influence of several
predictors on different individual symptoms of a mental disorder, as
"counted" in the last week (0 = never in the last week, to 3 = on all days
within the last week). So I'm regressing the same predictors onto 9
different symptoms in 9 models.
That's not really a count response. I guess an ordered response model (see e.g. polr() in MASS or package ordinal) would probably be more appropriate.
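The ordered-response alternative might look like this with polr() from "MASS", treating the 0-3 score as an ordered factor rather than a count. Variable names and data are invented for illustration:

```r
library(MASS)
set.seed(5)
d <- data.frame(pred1 = rnorm(300), pred2 = rnorm(300))
## 0-3 symptom score with many zeros, as an ordered factor
d$symptom <- factor(sample(0:3, 300, replace = TRUE,
                           prob = c(0.8, 0.1, 0.06, 0.04)),
                    levels = 0:3, ordered = TRUE)

## Proportional-odds logistic regression
fm_ord <- polr(symptom ~ pred1 + pred2, data = d, Hess = TRUE)
summary(fm_ord)
```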
dispersiontest() for these 9 models results in 3-4 overdispersed models (depending on whether testing one- or two-sided at the p = .05 level), 2 underdispersed models, and 4 non-significant models. The by far largest dispersion value is 2.1, in a model that is not overdispersed according to the test - but that's the symptom with 80% zeros, so maybe that plays a role here. Does NB also make sense in underdispersed models?
The variance function of the NB2 model is mu + 1/theta * mu^2. Hence for finite theta, the variance is always larger than the mean.
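Plugging in numbers makes the point concrete: for any finite theta the NB2 variance exceeds the mean, so NB cannot represent underdispersion.

```r
mu    <- 2
theta <- 1.5
mu + mu^2 / theta  # about 4.67, always > mu for finite theta
```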
However, overdispersion can already matter before this is
detected by a significance test. Hence, if in doubt, I would
simply use an NB model and you're on the safe side. And if the
NB's estimated theta parameter turns out to be extremely large
(say beyond 20 or 30), then you can still switch back to Poisson
if you want.
Out of the 9 models, the 3 overdispersed NB models "worked" with theta
estimates between 0.4 and 8.6. The results look just fine, so I guess NB is
appropriate here.
4 other models (including the 2 underdispersed ones) gave warnings (theta
iteration / alternation limit reached), and although the other values look
ok (estimates, LLH, AIC), theta estimates are between 1,000 and 11,000.
Could you recommend what to do here?
As I said before: A theta around 20 or 30 is already so large that it is virtually identical to a Poisson fit (theta = Inf). These values clearly indicate that theta is not finite.

However, this almost certainly stems from your response, which is not really count data. As I said above: An ordered response model will probably work better here. If you have mainly variation between 0 vs. larger but not much among the levels 1/2/3, a binary response (0 vs. larger) may be best.

hth,
Z
Thanks T
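The binary fallback Achim mentions (0 vs. larger) could be sketched as a logistic regression; variable names and data here are made up:

```r
set.seed(6)
d <- data.frame(pred1 = rnorm(300),
                symptom_score = sample(0:3, 300, replace = TRUE,
                                       prob = c(0.8, 0.1, 0.06, 0.04)))

## Collapse the 0-3 score to "any symptom at all"
d$any_symptom <- as.integer(d$symptom_score > 0)
fm_bin <- glm(any_symptom ~ pred1, family = binomial, data = d)
summary(fm_bin)
```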