Skip to content

hurdle model

9 messages · Gavin Simpson, Jari Oksanen, Peter Solymos +2 more

#
On Thu, 2010-08-19 at 10:30 +0200, Yingjie Zhang wrote:
Please keep discussion on list; just because I replied doesn't give you
a direct line to my inbox...

Why would you want a quasi-likelihood when you could have the real
thing? Seriously, if there is no likelihood you can't do likelihood
ratio tests, compare models using AIC/BIC etc.

Just use hurdle() if it fits the form of model you are after. don't
worry about likelihoods, quasi or otherwise. Check you are happy with
the range of models you could fit with hurdle() and use it. If you
aren't happy then you'd need to look elsewhere, but don't get hung up on
the quasi-likelihood bit.

My Tuppence,

G

  
    
#
Hi,

There is a reason why am I addict to Quasi likelihood, since Hurdle  from 'pscl' use Zero Truncated Poisson regression for the non-zero part, which incapable of handling the over-disperson comes from the positive part of the data. Apparently, Quasi likelihood is at least a better choice. I've noticed the hurdle they used for the paper comes from package 'stats' instead of 'pscl', I didn't find this version of hurdle in r...
On 19 Aug 2010, at 10:55, Gavin Simpson wrote:

            
#
On Thu, 2010-08-19 at 11:14 +0200, Yingjie Zhang wrote:
Quasi-likelihood isn't solving the "over-dispersion comes from positive
part". It is a means of fitting models, just like maximum likelihood
etc. It will be the authors model that does the accounting for over
dispersion. They solve the parameters of this model using
quasi-likelihood.

Your claim about hurdle in stats is incorrect:
no object named ?hurdle? was found
no object named ?hurdle? was found

So they must be using something else. Here's a thought; why not give us
the reference/citation for the paper you are reading --- it is difficult
to speculate further without more details like the actual paper?

Hurdle models fit a point mass at zero, whilst the count part of the
model is truncated to not allow any further zeros be produced from it.

A zeroinflated (zeroinfl() in pscl) model fits a point mass at zero and
has an untruncated count model which will allow extra zeros be produced.

In both cases a negative binomial model may be fitted to the count part,
which may be sufficient to cope with remaining overdispersion in the
count part of your model.

I think you would be better off thinking where the overdispersion is
coming from and choosing an appropriate means to model it. You are being
blinded by this talk of quasi-likelihoods. There may well be a way of
fitting the model you want in R without resorting to quasi-likelihood
tricks. But as you haven't told us what model you want to fit or a
citation for the paper you want to replicate, there isn't much further
we can do.

HTH

G

  
    
#
Thanks for the details, the paper is 'Comparing species abundance models' by Joanne M.Potts, Jane Elith.  Click the link... on page 158, in the table, they compare 5 models, both Quasi-likelihood and Hurdle are mentioned.

http://www.sciencedirect.com/science?_ob=ArticleURL&_udi=B6VBS-4KD5C2N-1&_user=794998&_coverDate=11%2F16%2F2006&_rdoc=1&_fmt=high&_orig=search&_sort=d&_docanchor=&view=c&_searchStrId=1435498227&_rerunOrigin=google&_acct=C000043466&_version=1&_urlVersion=0&_userid=794998&md5=fc0c4ebc77917948c90f8f0ee3bbe141

Maybe we went too far, when I read the paper above, I just thought it would be interesting to try the method they mentioned. My data have both characteristics: excess 0s and over dispersion of  positive part. And I am quite  convinced that the 0s have a single source ... that's why I didn't use ZIP/ZINB. 

Maybe for the excess 0s, over-dispersion and one source of 0s, the best model is Hurdle with truncated negative binomial, but my motive is to make sure that which ML method that Hurdle use.

Cheers,
Yingjie
On 19 Aug 2010, at 11:49, Gavin Simpson wrote:

            
#
On Thu, 2010-08-19 at 13:20 +0200, Yingjie Zhang wrote:
They fit several models and compare them:

     I. Poisson
    II. Negative Binomial
   III. Quasi-likelihood
    IV. Hurdle model
     V. zero-inflated model

III should be a quasi-poisson model, i.e. you fit the Poisson GLM using
quasi-likelihood and model the dispersion parameter \phi alongside the
usual Poisson GLM parameters.

Section 2.3 of their paper on the hurdle model doesn't even mention
"quasi". Though they do mention this in Table2.

Reading this, I think they cooked this model themselves - you can fit a
binomial model yourself for the presence absence and then fit a count
model for the samples predicted to be present from the binomial part. To
make things simple I suspect they fitted the count part as quasi-Poisson
but no-where does it say exactly what they did.

You would be better off fitting the hurdle as I mentioned using hurdle()
in pscl; fitting things using quasi-likelihood is just asking for
trouble if there are proper likelihood options available.

Read the vignette that accompanies the pscl package for details of how
it fits the various models including the hurdle. This includes the
likelihood functions that are optimised as part of the fitting.

HTH

G

  
    
#
On Thu, 2010-08-19 at 14:54 +0300, Gavin Simpson wrote:
I know that at least Jane Elith has an email address (I have used it
years ago), so you could ask her. However, it may be  that their hurdle
model uses just Poisson, and there is a minor mistake in their Table 2. 

You can use quasipoisson() or poisson() in glm() in a very natural way:
the fitting happens via iteratively reweighted least squares, and all
you need to define is the relationship between fitted values and
variance. If you look at poisson() and quasipoisson() functions in R
(these provide the backbone of the glm(..., family=)), you see that the
differences are that quasipoissoin()$aic() always returns NA, and
quasipoisson() lacks item simulate(). Otherwise they work in a similar
way. Except in poisson() you take the scale (\phi) to be 1, and in
quasipoisson() you estimate the scale from the fitted model. Then you
just multiply standard errors with the scale, use F tests instead of
Chisq in anova() etc.

I am not sure (or actually, I don't think) that this fitting parallelism
extends to *truncated* Poisson that is used in pscl::hurdle(). Although
you can do fitting by stages, and fit quasipoisson() glm for above-zero
values, I don't think this is the correct thing to do when you are not
allowed to have new zeros. However, the truncated poisson likelihood
model is a huge improvement over hand-fitting glm with iteratively
reweighted least squares and assuming constant variance/fit
relationship.

If you are worried about the overdispersion of the above-zero count
data, use the truncated negative binomial model offerred by
pscl::hurdle(). It is designed for the purpose (and has a more exciting
narrative for ecologists).

Cheers, Jari Oksanen
#
Dear All,

I had a quick look at the internal functions used by pscl::hurdle to
do the numerical optimization by optim. It clearly corresponds to the
hurdle model defined in the paper/vignette, where the zero component
is based on a right censored random variable, that is 0 if the
original count data is 0 and 1 otherwise. The likelihood function for
the zero model corresponds to a censored Poisson model. The count
estimation part is based on left truncated Poisson. This is a
conditional inference thinking, the zero model tells us what
determines if the data is 0 or >0, and once the observations are >0,
than what determines the exact count. If estimates from the 2 models
are identical, it means that 0s can arise from the same Poisson
distribution as the counts. So this is not really a mixture as it is
the case with the zeroinfl() model. The resulting log-likelihood is
still valid, and table 1 clearly states that the hurdle model is based
on ML (maximum likelihood).

It is not the same estimating procedure as for the quasipoisson, where
a likelihood-like function is used to get estimates (note that
parameter estimates are the same as for Poisson, but SEs and the
dispaersion parameter are different).

To handle other overdispersion than zero inflation, one can choose NB
instead of Poisson in hurdle. The quasipoisson family is not allowed
there.

Cheers,

Peter

P?ter S?lymos
Alberta Biodiversity Monitoring Institute
and Boreal Avian Modelling project
Department of Biological Sciences
CW 405, Biological Sciences Bldg
University of Alberta
Edmonton, Alberta, T6G 2E9, Canada
Phone: 780.492.8534
Fax: 780.492.7635
email <- paste("solymos", "ualberta.ca", sep = "@")
http://www.abmi.ca
http://sites.google.com/site/psolymos
On Thu, Aug 19, 2010 at 7:22 AM, Jari Oksanen <jari.oksanen at oulu.fi> wrote:
#
Yingjie Zhang wrote:
I don't know what the "classical" model is.  I would be tempted to fit
this as a three-category ordinal model (i.e. estimating the probability
of (0, 1-20, >20) [putting the breakpoint wherever it seems to make
sense].  It doesn't make much sense to me to try to fit numbers in the
range 1-20 and numbers in the range of thousands together into the same
distribution -- presumably (??) three different processes are occurring.
(How informative do you think the range of variation in the middle
(1-20) category is?  In the extreme (100-) category?) Having 1000
species adds another complexity --it is less than optimal to fit 1000
separate responses. Random effects model, possibly with phylogenetic or
functional groupings included?  Unfortunately, putting these pieces
together -- ordinal models, random effects, large data sets -- makes the
problem challenging.  I would check out the ordinal and MCMCglmm packages.