
mixed multinomial regression for count data with overdispersion & zero-inflation

10 messages · Highland Statistics Ltd, Stéphanie Périquet, dave fournier

#
Incorrect.
1. Calculate the percentage of zeros in your observed data.
2. Fit a model....this can be a model without any zero-inflation component.
3. Simulate 1000 data sets from your model and, for each simulated data 
set, calculate the percentage of zeros.
4. Compare the results in 3 with those in 1.

5. Even nicer....
5a. Plot a simple frequency table for the original data: 
plot(table(Response), type = "h").
5b. Calculate a table() for each of your simulated data.
5c. Calculate the average frequency table.
5d. Compare 5a and 5c.
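In R, steps 5a-5d might look like the following sketch (here `mod` is a fitted model and `Response` the observed count vector; both are placeholders):

```r
# 5a. Frequency table of the observed counts
plot(table(Response), type = "h", xlab = "count", ylab = "frequency")

# 5b/5c. Tabulate each simulated data set over a common range of counts,
#        then average the 1000 frequency tables
sims <- simulate(mod, nsim = 1000)
rng  <- 0:max(c(Response, unlist(sims)))
avg  <- rowMeans(sapply(sims, function(s) table(factor(s, levels = rng))))

# 5d. Overlay the average simulated frequencies on the observed ones
points(rng, avg, col = "red", pch = 16, cex = 0.5)
```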

For a nice example and R code, see:
Zuur AF & Ieno EN (2016). A protocol for conducting and presenting 
results of regression-type analyses. Methods in Ecology and Evolution.
doi: 10.1111/2041-210X.12577

Comes out in 2 weeks or so.

Kind regards,

Alain

#
Dear Alain,

Thanks for your reply and advice! I will try to do that and wait for your
very timely paper to come out to be sure I did the right thing!

Best,
Stephanie

On 17 May 2016 at 12:08, Highland Statistics Ltd <highstat at highstat.com>
wrote:

#
On 17/05/2016 18:53, Stéphanie Périquet wrote:
Stephanie,

Although it does not cover multinomial models directly, this one may be 
of use as well:

Beginner's Guide to Zero-Inflated Models with R (2016). Zuur AF and Ieno EN
http://highstat.com/BGZIM.htm

Sorry for the self-references.

Kind regards,

Alain

#
Yeah thanks Alain, I'm definitely planning to buy this book!

So I looked at the zeros in my data based on your advice and did the
following:
mod <- glmer(count ~ item + item:season + item:moon + item:season:moon +
             (1 | indiv/obs) + (1 | id),
             family = poisson, nAGQ = 0, data = diet3)
z <- simulate(mod, nsim = 1000)
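The percentages discussed below can be obtained from `z` along these lines (a sketch; `diet3$count` is the observed response):

```r
# Percentage of zeros in the observed counts
100 * mean(diet3$count == 0)

# Percentage of zeros in each of the 1000 simulated data sets,
# and their average
pzero <- 100 * sapply(z, function(s) mean(s == 0))
mean(pzero)
```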

For the original data I have 69.3% zeros, while the average over the 1000
simulations is 63.5%. Is there a way to statistically compare these two
values? Or could you say that these two figures are not very different and
that a zero-inflation model might not be necessary?

Best,
Stephanie

On 17 May 2016 at 20:21, Highland Statistics Ltd <highstat at highstat.com>
wrote:

#
On 18/05/2016 08:26, Stéphanie Périquet wrote:
Stephanie,

Make a histogram of the 1000 values of the percentage of zeros....and 
present the 69.3% as a big blue/red dot. If the dot for your observed 
data is in the tails, you have a problem.

I don't see the point of a test in your case. Such a simulation is close 
to bootstrapping...so I guess you can come up with a test somehow. If 
you do this type of analysis in a Bayesian framework it is often (and 
confusingly) called a Bayesian p-value (counting how often the simulated 
value is larger than your observed one).

I would just go for the histogram...seems you are lucky.
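As a sketch of that histogram in R (assuming `pzero` holds the 1000 simulated percentages of zeros and 69.3 is the observed value):

```r
obs <- 69.3  # observed percentage of zeros

hist(pzero, breaks = 30, xlim = range(c(pzero, obs)),
     xlab = "% zeros in simulated data", main = "")
points(obs, 0, col = "red", pch = 16, cex = 2)  # observed value as a big dot

# The "Bayesian p-value" mentioned above: how often a simulated
# value is at least as large as the observed one
mean(pzero >= obs)
```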

Alain

#
Ok, thanks for the details. And the histogram confirms that I do have
zero inflation: the value from the original data falls outside the
distribution of the simulated values...

Best,
Stephanie

On 18 May 2016 at 09:39, Highland Statistics Ltd <highstat at highstat.com>
wrote:

1 day later
#
Hi,

I ran your data outside of R. It was immediately clear that the version 
of glmmADMB I had in R was missing a feature: the rescaling of badly 
scaled parameters in the model. After a simple rescaling I ran your data 
according to your model hypothesis, with and without zero inflation (ZI) 
and with negative binomial type 1 and type 2 overdispersion. The model 
converged easily and produced the following results. The objective 
function value is the negative log-likelihood, so smaller is better. 
Clearly zero inflation is indicated, and type 2 fits a lot better than 
type 1.


Type 2
# Number of parameters = 33  Objective function value = 5698.40  Maximum gradient component = 7.96141e-05
# pz:
  0.0000
# Number of parameters = 34  Objective function value = 5672.69  Maximum gradient component = 4.96526e-05
# pz:
0.118535625347
Type 1
# Number of parameters = 33  Objective function value = 5954.95  Maximum gradient component = 9.43931e-05
# pz:
  0.0000
# Number of parameters = 34  Objective function value = 5936.85  Maximum gradient component = 7.44436e-05
# pz:
0.0684308821563
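Since the objective function is the negative log-likelihood, the comparisons above can also be phrased as AIC differences or a likelihood-ratio statistic (a sketch; note that pz sits on the boundary of its parameter space, which makes the naive 1-df chi-squared test conservative):

```r
# Negative log-likelihoods and parameter counts reported above
nll  <- c(nb2 = 5698.40, nb2_zi = 5672.69, nb1 = 5954.95, nb1_zi = 5936.85)
npar <- c(33, 34, 33, 34)
aic  <- 2 * nll + 2 * npar   # smaller is better

# Likelihood-ratio statistic for adding zero inflation to the NB2 model
lrt <- 2 * (nll["nb2"] - nll["nb2_zi"])   # 51.42 on 1 df: overwhelming
```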
#
I found a version of glmmadmb for Windows 64-bit (I sadly assume) that 
almost does the job out of the box. Using the information from R-Forge

    http://glmmadmb.r-forge.r-project.org/

which points to

       http://www.admb-project.org/buildbot/glmmadmb/

where I found this exe

             glmmadmb-mingw64-r3274-windows10-mingw64.exe

  Using your glmmadmb.pin and glmmadmb.dat files (which you renamed to
glmmadmb1.pin and glmmadmb1.dat) I ran the following script.

    ./glmmadmb-mingw64-r3274-windows10-mingw64.exe -crit 1.e-6 -ind glmmadmb1.dat -ainp glmmadmb.par -maxph 5 -shess -noinit -phase 7

(I have added two extra phases and modified the convergence criterion to 
1.e-6.)

which converged with nice estimates etc. Now, one of the really nice 
things about actually fitting the model, rather than resorting to other 
diagnostics, is that if you are successful you can look at the fit. In 
this case that means the squared residuals divided by the predicted 
variance. These are in the output glmmadmb.rep file.

I sorted them in R and looked at the large ones at the end.
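That sorting might look like the following sketch in R; the column layout of glmmadmb.rep is an assumption here and may differ between glmmADMB versions:

```r
# Assumed layout: one row per observation with index, observed value,
# predicted value, and (obs - pred)^2 / predicted variance
d <- read.table("glmmadmb.rep",
                col.names = c("index", "obs", "pred", "ssq"))
tail(d[order(d$ssq), ], 12)  # the twelve largest scaled squared residuals
```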


index           obs    pred             sq. resid / pred. var.
1206 1206  319 3.47536e+01 1.24498e+01
305   305 4490 2.18655e+03 1.29949e+01
1074 1074  413 5.13552e+01 1.36380e+01
1385 1385 4002 1.68879e+03 1.69679e+01
854   854  224 1.22219e+01 1.96515e+01
1691 1691 2713 8.33316e+02 2.27056e+01
1427 1427 1732 3.92621e+02 2.44684e+01
1433 1433 1612 3.25266e+02 2.72590e+01
1313 1313 1815 3.52356e+02 3.25137e+01
341   341 2031 3.55824e+02 4.22336e+01
191   191 5814 7.18097e+02 1.93656e+02
599   599 3586 2.68911e+02 2.19118e+02

So you have a few gigantic outliers. This is fairly common with data 
for which the model has problems converging. But rather than celebrating 
the failure of the model as a useful diagnostic for bad data, it is 
really useful to coax it into fitting the data and to investigate the 
residuals. If you can explain them, it should help.
#
Hi Dave,

Thanks for all this detailed info. I'm on Mac, unfortunately. But as the
model with nbinom2 runs and is a better fit, I'll inspect the rep file for
this one. Or is there a way to access this directly in R?

Best,
Stephanie
On 20 May 2016 at 05:05, dave fournier <davef at otter-rsch.com> wrote:

#
My mac virtual machine is actually feeling more chipper than my windows 
virtual machine today, so I'll build you a special version that seems to 
tease out solutions for both nb1 and nb2 and email it. But the real 
issue, I think, is diagnosing problems with difficult data sets.

One problem is that the glmmadmb R package simply reports that the 
Hessian is not positive definite and quits. See this link

https://stat.ethz.ch/pipermail/r-sig-mixed-models/2016q1/024527.html

for a case where one could conclude that the problem with fitting the 
model was due to
confounding between the zero inflation and overdispersion.

Now, with your model, for one run I identified the negative eigenvalue 
-15.28948399 of the Hessian. Here are the


     eigenvalues unsorted:    1.917979995e-10 -0.0004551907790 
0.001293591754 0.003821048249 0.01431672721 0.1009762164 0.03925858165 
0.07403629854 0.1091338760 0.1562519927 0.1703625637 0.1927551515 
0.2234392242 0.2309947849 0.3469581818 0.3041749405 0.3580105168 
0.3942153084 0.4397529164 0.5078767603 0.5728201455 0.6012492489 
0.6789170419 0.7369245582 0.7971275668 0.8833795401 0.9287787445 
0.9508016682 1.037844369 1.049178898 1.707802018 2.724758782 3.365700202 
-15.28948399

and the eigenvector for that negative eigenvalue:
  -0.0006942405368 -0.07288724821  0.04506821612  0.08379747141 
0.1184035873   0.1124181332   0.4898312779   0.2647606687 
-0.005219962867 -0.01694772700 -0.02195235540 -0.0004078488080 
-0.1039487736   0.4007922401 -0.007979620610 -0.02801923429 
-0.03638402585 -0.0004982359168  -0.1679864668   0.6019723457 
-0.01250628628 -0.04275353216 -0.05511398984 -0.001275366327 
-0.2523634973  0.09075331140 -0.0008355685062 -0.005755760214 
-0.007503459862 0.0001200448790 -0.02765030934 0.0001257008991 
-0.009465390476 -0.07780205094

This is a more difficult case, as it seems to involve almost all the 
parameters. However, the largest components are all for the parameters 
of the linear predictor. So it is saying that maybe your model is a bit 
overparameterized, or equivalently that the parameters of the linear 
predictor are a bit confounded.
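One way to see which parameters dominate such an eigenvector is to sort its components by absolute size (a sketch, with `v` assumed to hold the 34 components listed above):

```r
# Components of the eigenvector for the negative eigenvalue,
# ordered by absolute magnitude; large loadings flag the
# parameters involved in the confounding
ord <- order(abs(v), decreasing = TRUE)
data.frame(component = ord, loading = v[ord])[1:10, ]
```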

Now, in linear regression models one can try to deal with this situation 
by employing ridge regression. Really this is just putting a quadratic 
penalty on the parameters. We can do this and decrease the size of the 
penalty in stages, finally, if desired, doing away with it entirely. I 
set this up in the version of glmmadmb I am sending you.
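Conceptually, the staged ridge penalty amounts to minimizing nll(beta) + lambda * sum(beta^2) for a decreasing sequence of lambda values, warm-starting each stage from the previous solution. A toy sketch of the idea (not the actual glmmadmb internals):

```r
# Minimize a penalized negative log-likelihood, relaxing the ridge
# penalty lambda towards zero and warm-starting each stage
fit_ridge <- function(nll, beta0, lambdas = c(10, 1, 0.1, 0)) {
  beta <- beta0
  for (lam in lambdas) {
    penalized <- function(b) nll(b) + lam * sum(b^2)
    beta <- optim(beta, penalized, method = "BFGS")$par
  }
  beta
}
```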

However, that does not deal with your outlier problem. For some reason a 
lot of count-data analyses get published without any analysis of the 
residuals (at this point a disparaging remark about sociology is 
probably in order).

These are the worst outliers for the nb1 and nb2 models for your data:

1074 1074  413 5.13552e+01 1.36380e+01
1385 1385 4002 1.68879e+03 1.69679e+01
854   854  224 1.22219e+01 1.96515e+01
1691 1691 2713 8.33316e+02 2.27056e+01
1427 1427 1732 3.92621e+02 2.44684e+01
1433 1433 1612 3.25266e+02 2.72590e+01
1313 1313 1815 3.52356e+02 3.25137e+01
341   341 2031 3.55824e+02 4.22336e+01
191   191 5814 7.18097e+02 1.93656e+02
599   599 3586 2.68911e+02 2.19118e+02

1385 1385 4002 1.24681e+03 2.36563e+01
335   335 3012 4.87436e+02 5.08038e+01
1427 1427 1732 1.64012e+02 5.82439e+01
1433 1433 1612 1.39872e+02 6.02005e+01
1313 1313 1815 1.29395e+02 8.53171e+01
341   341 2031 1.39804e+02 9.94021e+01
1691 1691 2713 2.16615e+02 1.11783e+02
191   191 5814 6.96952e+02 1.45975e+02
599   599 3586 1.46454e+02 3.13865e+02

The value 3.13865e+02 corresponds to a residual of over 17 standard 
deviations (sqrt(313.865) ≈ 17.7). One might expect that these large 
outliers have a large influence on the parameter estimates and will 
invalidate any significance tests one might want to carry out.