Poisson Regression: questions about tests of assumptions
6 messages · Eiko Fried, Achim Zeileis, Joshua Wiley +1 more
On Sun, 14 Oct 2012, Eiko Fried wrote:
I would like to test in R which regression model fits my data best. My dependent variable is a count and has a lot of zeros, and I would need some help to determine what model and family to use (poisson or quasipoisson, or zero-inflated poisson regression), and how to test the assumptions.

1) Poisson Regression: as far as I understand, the strong assumption is that dependent variable mean = variance. How do you test this? How close together do they have to be? Are unconditional or conditional mean and variance used for this? What do I do if this assumption does not hold?
There are various formal tests for this, e.g., dispersiontest() in package "AER". Alternatively, you can use a simple likelihood-ratio test (e.g., by means of lrtest() in "lmtest") between a poisson and negative binomial (NB) fit. The p-value can even be halved because the Poisson is on the border of the NB theta parameter range (theta = infty). However, overdispersion can already matter before this is detected by a significance test. Hence, if in doubt, I would simply use an NB model and you're on the safe side. And if the NB's estimated theta parameter turns out to be extremely large (say beyond 20 or 30), then you can still switch back to Poisson if you want.
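The two checks described above might look like this in practice. This is a sketch on simulated overdispersed data, using dispersiontest() from "AER" and lrtest() from "lmtest" as named in the reply; the data and coefficients are made up:

```r
## Simulated overdispersed counts (NB with theta = 1.5)
set.seed(1)
x <- rnorm(200)
y <- rnbinom(200, mu = exp(1 + 0.5 * x), size = 1.5)

library(AER)    # for dispersiontest()
library(MASS)   # for glm.nb()
library(lmtest) # for lrtest()

fm_pois <- glm(y ~ x, family = poisson)
fm_nb   <- glm.nb(y ~ x)

## Score test for overdispersion in the Poisson fit
dispersiontest(fm_pois)

## LR test Poisson vs. NB; the p-value can be halved because
## theta = Inf lies on the boundary of the parameter space
lrt <- lrtest(fm_pois, fm_nb)
0.5 * lrt[["Pr(>Chisq)"]][2]
```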
2) I read that if variance is greater than mean we have overdispersion, and a potential way to deal with this is including more independent variables, or family=quasipoisson. Does this distribution have any other requirements or assumptions? What test do I use to see whether 1) or 2) fits better - simply anova(m1,m2)?
quasipoisson yields the same parameter estimates as the poisson, only the inference is adjusted appropriately.
3) I also read that negative-binomial distribution can be used when overdispersion appears. How do I do this in R?
glm.nb() in "MASS" is one of standard options.
What is the difference to quasipoisson?
The NB is a likelihood-based model while the quasipoisson is not associated with a likelihood (but has the same conditional mean equation).
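A small illustration of this point, on simulated data: quasi-Poisson reproduces the Poisson point estimates exactly and only rescales the inference, while glm.nb() is a full likelihood model with its own logLik():

```r
set.seed(2)
x <- rnorm(100)
y <- rpois(100, lambda = exp(1 + 0.3 * x))

library(MASS)

fm_pois  <- glm(y ~ x, family = poisson)
fm_qpois <- glm(y ~ x, family = quasipoisson)
fm_nb    <- glm.nb(y ~ x)

all.equal(coef(fm_pois), coef(fm_qpois))  # TRUE: identical estimates
summary(fm_qpois)$dispersion              # only the standard errors are rescaled
logLik(fm_nb)                             # defined for NB; quasi has no likelihood
```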
4) Zero-inflated Poisson Regression: I read that using the vuong test checks what models fits better.
vuong(model.poisson, model.zero.poisson)
Is that correct?
It's one of the possibilities.
5) ats.ucla.edu has a section about zero-inflated Poisson regressions, and tests the zero-inflated model (a) against the standard Poisson model (b):
m.a <- zeroinfl(count ~ child + camper | persons, data = zinb)
m.b <- glm(count ~ child + camper, family = poisson, data = zinb)
vuong(m.a, m.b)
I don't understand what the "| persons" part of the first model does, and why you can compare these models at all. I had expected the regression to be the same and just use a different family.
I recommend you read the associated documentation. See
vignette("countreg", package = "pscl")
For glm.nb() I recommend its accompanying documentation, namely the MASS
book.
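For readers skimming the archive: in pscl's zeroinfl(), a two-part formula "y ~ x1 + x2 | z" puts x1 and x2 in the count component and z in the zero-inflation component, which is what the "| persons" part does. A sketch using the bioChemists data shipped with "pscl" (not the zinb data from the UCLA page):

```r
library(pscl)
data("bioChemists", package = "pscl")

## Count component: fem + ment; zero-inflation component: kid5
fm <- zeroinfl(art ~ fem + ment | kid5, data = bioChemists)
summary(fm)  # two coefficient blocks: count model and zero-inflation model
```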
hth,
Z
Hi Eiko,

This is not a direct response to your question, but I thought you might find these pages helpful. On negative binomial: http://www.ats.ucla.edu/stat/R/dae/nbreg.htm and zero-inflated poisson: http://www.ats.ucla.edu/stat/R/dae/zipoisson.htm In general this page lists a variety of different models with examples: http://www.ats.ucla.edu/stat/dae/

I wrote the code for most of those pages. Some of your questions seem a little more specific than would be addressed on those general introductory pages, but if there are particular things you would like to see done, I am open to hearing about it.

One thing I do want to mention is that for publication or presentation purposes, graphs of predicted counts can be quite nice for readers. I tried to show how to do that in all of those pages, and you can even use bootstrapping (for which I gave examples at least in the zero-inflated models) to get robust confidence intervals.

Cheers,
Josh
On Sun, Oct 14, 2012 at 3:14 PM, Eiko Fried <torvon at gmail.com> wrote:
Thank you for the detailed answer, that was really helpful.
I did some extensive reading and calculating in the hours since your
reply, and have a few (hopefully much better informed) follow-up questions.
1) In the vignette("countreg", package = "pscl"), LLH, AIC and BIC values
are listed for the models Negative-binomial (NB), Zero-Inflated (ZI), ZI
NB, Hurdle NB, and Poisson (Standard). And although I found a way to
determine LLH and DF for all models, BIC & AIC values are not displayed by
default, not even by the code given in the vignette. How do I calculate
these? (AIC is given by default in only 2 models, BIC in none.)
2) For the zero-inflated models, the first block of count model
coefficients is only in the output in order to compare the changes,
correct? That is, in my results in a paper I would only report the second
block of (zero-inflation) model coefficients? Or do I misunderstand
something? I am confused because in their large table to compare
coefficients, they primarily compare the first block of coefficients (p. 18)
R> fm <- list("ML-Pois" = fm_pois, "Quasi-Pois" = fm_qpois, "NB" = fm_nbin,
+ "Hurdle-NB" = fm_hurdle, "ZINB" = fm_zinb)
R> sapply(fm, function(x) coef(x)[1:8])
3)
There are various formal tests for this, e.g., dispersiontest() in package "AER".
I have to run 9 models - I am testing the influence of several predictors on different individual symptoms of a mental disorder, as "counted" in the last week (0 = never in the last week, to 3 = on all days within the last week). So I'm regressing the same predictors onto 9 different symptoms in 9 models.

dispersiontest() for these 9 models results in 3-4 overdispersed models (depending on whether testing one- or two-sided at the p = .05 level), 2 underdispersed models, and 4 non-significant models. The by far largest dispersion value is 2.1, in a model that is not overdispersed according to the test - but that's the symptom with 80% zeros, so maybe that plays a role here. Does NB also make sense in underdispersed models?

However, overdispersion can already matter before this is detected by a
significance test. Hence, if in doubt, I would simply use an NB model and you're on the safe side. And if the NB's estimated theta parameter turns out to be extremely large (say beyond 20 or 30), then you can still switch back to Poisson if you want.
Out of the 9 models, the 3 overdispersed NB models "worked" with theta
estimates between 0.4 and 8.6. The results look just fine, so I guess NB is
appropriate here.
4 other models (including the 2 underdispersed ones) gave warnings (theta
iteration / alternation limit reached), and although the other values look
ok (estimates, LLH, AIC), theta estimates are between 1,000 and 11,000.
Could you recommend me what to do here?
Thanks
T
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Joshua Wiley Ph.D. Student, Health Psychology Programmer Analyst II, Statistical Consulting Group University of California, Los Angeles https://joshuawiley.com/
Just a side note for your 4th question: for a small sample, the Clarke test instead of the Vuong test might be more appropriate, and the calculation is so simple that even Excel can handle it :-)
============================== WenSui Liu Credit Risk Manager, 53 Bancorp wensui.liu at 53.com 513-295-4370
On Sun, 14 Oct 2012, Eiko Fried wrote:
Thank you for the detailed answer, that was really helpful. I did some
extensive reading and calculating in the hours since your reply,
and have a few (hopefully much better informed) follow-up questions.

1) In the vignette("countreg", package = "pscl"), LLH, AIC and BIC
values are listed for the models Negative-binomial (NB), Zero-Inflated
(ZI), ZI NB, Hurdle NB, and Poisson (Standard). And although I found a
way to determine LLH and DF for all models, BIC & AIC values are not
displayed by default, not even by the code given in the vignette. How
do I calculate these? (AIC is given by default in only 2 models, BIC
in none.)
The code underlying the vignette can always be inspected:
edit(vignette("countreg", package = "pscl"))
The JSS also hosts a cleaned-up version of the replication code.
Most likelihood-based models provide a logLik() method which extracts the
fitted log-likelihood along with the corresponding degrees of freedom.
Then the default AIC() method can compute the AIC and AIC(..., k = log(n))
computes the corresponding BIC. This is hinted at in Table 3 of the
vignette. If "n", the number of observations, can be extracted by nobs(),
then also BIC() works. This is not yet the case for zeroinfl/hurdle,
though.
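The logLik()/AIC() route described above can be sketched as follows. A plain Poisson glm is used here so the example is self-contained; zeroinfl()/hurdle() fits from "pscl" provide logLik() in the same way:

```r
set.seed(3)
y  <- rpois(50, lambda = 2)
fm <- glm(y ~ 1, family = poisson)

ll <- logLik(fm)
n  <- length(y)       # or nobs(fm) where a method exists
k  <- attr(ll, "df")  # number of estimated parameters

AIC(fm)                      # -2*logLik + 2*k
-2 * as.numeric(ll) + 2 * k  # the same value, computed by hand
AIC(fm, k = log(n))          # BIC: -2*logLik + log(n)*k
```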
2) For the zero-inflated models, the first block of count model coefficients is only in the output in order to compare the changes, correct? That is, in my results in a paper I would only report the second block of (zero-inflation) model coefficients? Or do I misunderstand something?
No, no, yes. Hurdle and zero-inflation models have two linear predictors, one for the zero hurdle/inflation and one for the truncated/un-inflated count component. Both are typically of interest for different aspects of the data.
I am
confused because in their large table to compare coefficients, they
primarily compare the first block of coefficients (p. 18)
R> fm <- list("ML-Pois" = fm_pois, "Quasi-Pois" = fm_qpois, "NB" = fm_nbin,
+ "Hurdle-NB" = fm_hurdle, "ZINB" = fm_zinb)
R> sapply(fm, function(x) coef(x)[1:8])
All models considered have a predictor for the mean of the count component, hence this can be compared across all models.
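The two coefficient blocks can be extracted separately via coef()'s model argument in "pscl". A sketch on simulated zero-inflated data (all numbers made up):

```r
library(pscl)
set.seed(4)
x <- rnorm(200)
## 30% structural zeros on top of a Poisson count process
y <- ifelse(rbinom(200, 1, 0.3) == 1, 0, rpois(200, exp(1 + 0.5 * x)))

fm_zip <- zeroinfl(y ~ x)
coef(fm_zip, model = "count")  # un-inflated count component
coef(fm_zip, model = "zero")   # zero-inflation component
```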
3)
There are various formal tests for this, e.g., dispersiontest()
in package "AER".
I have to run 9 models - I am testing the influence of several
predictors on different individual symptoms of a mental disorder, as
"counted" in the last week (0 = never in the last week, to 3 = on all days
within the last week). So I'm regressing the same predictors onto 9
different symptoms in 9 models.
That's not really a count response. I guess an ordered response model (see e.g. polr() in MASS or package ordinal) would probably be more appropriate.
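The ordered-response alternative might look like this with polr() from "MASS", treating the 0-3 score as an ordered factor rather than a count. Variable names and data are invented for illustration:

```r
library(MASS)
set.seed(5)
d <- data.frame(pred1 = rnorm(300), pred2 = rnorm(300))
## 0-3 symptom score with many zeros, as an ordered factor
d$symptom <- factor(sample(0:3, 300, replace = TRUE,
                           prob = c(0.8, 0.1, 0.06, 0.04)),
                    levels = 0:3, ordered = TRUE)

## Proportional-odds logistic regression
fm_ord <- polr(symptom ~ pred1 + pred2, data = d, Hess = TRUE)
summary(fm_ord)
```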
dispersiontest() for these 9 models results in 3-4 overdispersed models (depending on whether testing one- or two-sided at the p = .05 level), 2 underdispersed models, and 4 non-significant models. The by far largest dispersion value is 2.1, in a model that is not overdispersed according to the test - but that's the symptom with 80% zeros, so maybe that plays a role here. Does NB also make sense in underdispersed models?
The variance function of the NB2 model is mu + 1/theta * mu^2. Hence for finite theta, the variance is always larger than the mean.
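Plugging in numbers makes the point concrete: for any finite theta the NB2 variance exceeds the mean, so NB cannot represent underdispersion.

```r
mu    <- 2
theta <- 1.5
mu + mu^2 / theta  # about 4.67, always > mu for finite theta
```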
However, overdispersion can already matter before this is
detected by a significance test. Hence, if in doubt, I would
simply use an NB model and you're on the safe side. And if the
NB's estimated theta parameter turns out to be extremely large
(say beyond 20 or 30), then you can still switch back to Poisson
if you want.
Out of the 9 models, the 3 overdispersed NB models "worked" with theta
estimates between 0.4 and 8.6. The results look just fine, so I guess NB is
appropriate here.
4 other models (including the 2 underdispersed ones) gave warnings (theta
iteration / alternation limit reached), and although the other values look
ok (estimates, LLH, AIC), theta estimates are between 1,000 and 11,000.
Could you recommend what to do here?
As I said before: A theta around 20 or 30 is already so large that it is virtually identical to a Poisson fit (theta = Inf). These values clearly indicate that theta is not finite.

However, this almost certainly stems from your response, which is not really count data. As I said above: An ordered response model will probably work better here. If you have mainly variation between 0 vs. larger but not much among the levels 1/2/3, a binary response (0 vs. larger) may be best.

hth,
Z
Thanks T
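The binary fallback Achim mentions (0 vs. larger) could be sketched as a logistic regression; variable names and data here are made up:

```r
set.seed(6)
d <- data.frame(pred1 = rnorm(300),
                symptom_score = sample(0:3, 300, replace = TRUE,
                                       prob = c(0.8, 0.1, 0.06, 0.04)))

## Collapse the 0-3 score to "any symptom at all"
d$any_symptom <- as.integer(d$symptom_score > 0)
fm_bin <- glm(any_symptom ~ pred1, family = binomial, data = d)
summary(fm_bin)
```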