
Predictions from zero-inflated or hurdle models

11 messages · Ben Bolker, Ruben Arslan, Jarrod Hadfield

#
Dear list,

I wanted to ask: Is there any (maybe just back of the envelope) way to
obtain a response prediction for zero-inflated or hurdle type models?
I've fit such models in MCMCglmm, but I don't work in ecology and my
previous experience with explaining such models to "my audience" did not
bode well. When it comes to humans, the researchers I presented to are not
used to offspring count being zero-inflated (or acquainted with that
concept), but in my historical data with high infant mortality, it is (in
modern data it's actually slightly underdispersed).

Currently I'm using lme4 and simply splitting my models into two stages
(finding a mate and having offspring).
That's okay too, but in one population the effect of interest is not
clearly visible in either stage, only when both are taken together (but
then the outcome is zero-inflated).
I expect to be given a hard time for this and hence thought I'd use a
binomial model with the outcome offspring>0 as my main model, but that
turns out to be hard to explain too and doesn't really do the data justice.

Basically I don't want to be forced to discuss my smallest population as a
non-replication of the effect because I was insufficiently able to explain
the statistics behind my reasoning that the effect shows.

Best regards,

Ruben Arslan

Georg August University Göttingen
Biological Personality Psychology
Georg Elias Müller Institute of Psychology
Goßlerstr. 14
37073 Göttingen
Germany
Tel.: +49 551 3920704
https://psych.uni-goettingen.de/en/arslan
https://formr.org
#
Ruben Arslan <rubenarslan at ...> writes:
I think the back-of-the-envelope answer would be that for a two-stage
model with a prediction of p_i for the probability of having a non-zero
response (or in the case of zero-inflated models, the probability of
_not_ having a structural zero) and a prediction of n_i for the conditional
part of the model, the mean predicted value is p_i*n_i and the 
variance is _approximately_ (p_i*n_i)^2*(var(p_i)/p_i^2 + var(n_i)/n_i^2)
(this is assuming
that you haven't built in any correlation between p_i and n_i, which
would be hard in lme4 but _might_ be possible under certain circumstances
via a multitype model in MCMCglmm).

  Does that help?
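A minimal R sketch of this back-of-the-envelope calculation, with made-up posterior summaries standing in for p_i, n_i, and their variances (all numbers here are illustrative, not from any fitted model):

```r
# Combine the two model parts: mean = p_i * n_i, and a delta-method
# variance for the product of two (assumed independent) quantities.
p_i <- 0.8   # P(non-zero response), e.g. from the binary part
n_i <- 2.5   # conditional mean, e.g. from the count part
vp  <- 0.01  # var(p_i)
vn  <- 0.09  # var(n_i)

mean_pred <- p_i * n_i
var_pred  <- mean_pred^2 * (vp / p_i^2 + vn / n_i^2)

mean_pred  # 2
var_pred   # 0.1201
```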
#
Dear Dr Bolker,

I'd thought about something like this; one point of asking was to see whether
(a) it's implemented already, because I'll probably make dumb mistakes while
trying, or (b) it's not implemented because it's a bad idea.
Your response and the MCMCglmm course notes make me hope that it's (c) not
implemented because nobody has done it yet, or (d) so simple that everybody
does it on the fly.

So I tried my hand and would appreciate corrections. I am sure there is
some screw-up or an inelegant approach in there.
I included code for dealing with mcmc.lists because that's what I have, and
I'm not entirely sure that the way I deal with them is correct either.

I started with a zero-altered model, because those fit fastest and
according to the course notes have the least complex likelihood.
Because I know not what I do, I'm not dealing with my random effects at all.

I pasted a model summary below to show what I've applied the function to.
The function gives the following quantiles when applied to 19 chains of
that model.
         5%         50%         95%
5.431684178 5.561211207 5.690655200
         5%         50%         95%
5.003974382 5.178192327 5.348246558

Warm regards,

Ruben Arslan

HPDpredict_za <- function(object, predictor) {
  # Requires ggplot2 (qplot) and reshape2 (melt); %ends_with%, %begins_with%
  # and %contains% are string-matching helper operators (not base R).
  # Extract the fixed-effect samples from either a list of chains or a
  # single MCMCglmm fit.
  if (!inherits(object, "MCMCglmm")) {
    if (length(object[[1]]$Residual$nrt) > 1) {
      object <- lapply(object, FUN = function(x) { x$Residual$nrt <- 2; x })
    }
    Sol <- mcmc.list(lapply(object, FUN = function(x) { x$Sol }))
    vars <- colnames(Sol[[1]])
  } else {
    Sol <- as.data.frame(object$Sol)
    vars <- names(Sol)
  }
  za_predictor <- vars[vars %ends_with% predictor & vars %begins_with% "traitza_"]
  za_intercept_name <- vars[!vars %contains% ":" & vars %begins_with% "traitza_"]
  intercept <- Sol[, "(Intercept)"]
  za_intercept <- Sol[, za_intercept_name]
  l1 <- Sol[, predictor]
  l2 <- Sol[, za_predictor]
  if (is.list(object)) {
    intercept <- unlist(intercept)
    za_intercept <- unlist(za_intercept)
    l1 <- unlist(l1)
    l2 <- unlist(l2)
  }
  # Predicted response at the intercept ...
  py_0 <- dpois(0, exp(intercept + za_intercept))
  y_ygt0 <- exp(intercept)
  at_intercept <- (1 - py_0) * y_ygt0
  # ... and with the predictor set to 1.
  py_0 <- dpois(0, exp(intercept + za_intercept + l2))
  y_ygt0 <- exp(intercept + l1)
  at_predictor_1 <- (1 - py_0) * y_ygt0
  print(qplot(at_intercept))
  print(qplot(at_predictor_1))
  df <- data.frame("intercept" = at_intercept)
  df[, predictor] <- at_predictor_1
  print(qplot(x = variable, y = value, data = suppressMessages(melt(df)),
              fill = variable, alpha = I(0.40), geom = "violin"))
  print(quantile(at_intercept, probs = c(0.05, 0.5, 0.95)))
  print(quantile(at_predictor_1, probs = c(0.05, 0.5, 0.95)))
  invisible(df)
}
Iterations = 100001:299901
 Thinning interval  = 100
 Sample size  = 2000

 DIC: 349094

 G-structure:  ~idh(trait):idParents

                      post.mean l-95% CI u-95% CI eff.samp
children.idParents       0.0189   0.0164   0.0214     1729
za_children.idParents    0.2392   0.2171   0.2622     1647

 R-structure:  ~idh(trait):units

                  post.mean l-95% CI u-95% CI eff.samp
children.units        0.144    0.139    0.148     1715
za_children.units     1.000    1.000    1.000        0

 Location effects: children ~ trait * (maternalage.factor + paternalloss +
maternalloss + center(nr.siblings) + birth.cohort + urban + male +
paternalage.mean + paternalage.diff)

                                            post.mean  l-95% CI  u-95% CI eff.samp   pMCMC
(Intercept)                                  2.088717  2.073009  2.103357     2000 <0.0005 ***
traitza_children                            -1.933491 -1.981945 -1.887863     2000 <0.0005 ***
maternalage.factor(14,20]                    0.007709 -0.014238  0.027883     1500   0.460
maternalage.factor(35,50]                    0.006350 -0.009634  0.024107     2000   0.462
paternallossTRUE                             0.000797 -0.022716  0.025015     2000   0.925
maternallossTRUE                            -0.015542 -0.040240  0.009549     2000   0.226
center(nr.siblings)                          0.005869  0.004302  0.007510     2000 <0.0005 ***
birth.cohort(1703,1722]                     -0.045487 -0.062240 -0.028965     2000 <0.0005 ***
birth.cohort(1722,1734]                     -0.055872 -0.072856 -0.036452     2000 <0.0005 ***
birth.cohort(1734,1743]                     -0.039770 -0.056580 -0.020907     2000 <0.0005 ***
birth.cohort(1743,1750]                     -0.030713 -0.048301 -0.012214     2000   0.002 **
urban                                       -0.076748 -0.093240 -0.063002     2567 <0.0005 ***
male                                         0.106074  0.095705  0.115742     2000 <0.0005 ***
paternalage.mean                            -0.024119 -0.033133 -0.014444     2000 <0.0005 ***
paternalage.diff                            -0.018367 -0.032083 -0.005721     2000   0.007 **
traitza_children:maternalage.factor(14,20]  -0.116510 -0.182432 -0.051978     1876   0.001 ***
traitza_children:maternalage.factor(35,50]  -0.045196 -0.094485  0.002640     2000   0.075 .
traitza_children:paternallossTRUE           -0.171957 -0.238218 -0.104820     2000 <0.0005 ***
traitza_children:maternallossTRUE           -0.499539 -0.566825 -0.430637     2000 <0.0005 ***
traitza_children:center(nr.siblings)        -0.023723 -0.028676 -0.018746     1848 <0.0005 ***
traitza_children:birth.cohort(1703,1722]    -0.026012 -0.074250  0.026024     2000   0.319
traitza_children:birth.cohort(1722,1734]    -0.279418 -0.329462 -0.227187     2000 <0.0005 ***
traitza_children:birth.cohort(1734,1743]    -0.260165 -0.312659 -0.204462     2130 <0.0005 ***
traitza_children:birth.cohort(1743,1750]    -0.481457 -0.534568 -0.426648     2000 <0.0005 ***
traitza_children:urban                      -0.604108 -0.645169 -0.562554     1702 <0.0005 ***
traitza_children:male                       -0.414988 -0.444589 -0.387005     2000 <0.0005 ***
traitza_children:paternalage.mean            0.006545 -0.018570  0.036227     2000   0.651
traitza_children:paternalage.diff           -0.097982 -0.136302 -0.060677     2000 <0.0005 ***
On Mon, Mar 9, 2015 at 10:12 PM Ben Bolker <bbolker at gmail.com> wrote:
7 days later
#
Dear list,

I've made a reproducible example of the zero-altered prediction,
in the hope that someone will have a look and reassure me that I'm going
about this the right way.
I was a bit confused by the point about p_i and n_i being correlated (they
are in my case), but I think this was a red herring for me
since I'm not deriving the variance analytically.
The script is here: https://gist.github.com/rubenarslan/aeacdd306b3d061819a6
and if you don't want to run the simulation fit yourself, I've put an RDS
file of the fit here:
https://dl.dropboxusercontent.com/u/1078620/m1.rds

Best regards,

Ruben Arslan
On Tue, Mar 10, 2015 at 1:22 PM Ruben Arslan <rubenarslan at gmail.com> wrote:
#
Hi,

Sorry - I should have replied to this post earlier. I've been working
on predict/simulate methods for all MCMCglmm models (including
zero-inflated/altered/hurdle/truncated models) and these are now
largely complete (together with newdata options).

When predictions are to be taken after marginalising the random  
effects (including the `residual' over-dispersion) it is not possible  
to obtain closed form expressions. The options that will be available  
in MCMCglmm are:

1) algebraic approximation
2) numerical integration
3) simulation

1) and 2) are currently only accurate when the random/residual effect  
structure implies no covariance between the Poisson part and the  
binary part.

1) is reasonably accurate for zero-inflated distributions, but can be  
pretty poor for the remainder because they all involve zero-truncated  
Poisson log-normal distributions and my taylor approximation for the  
mean is less than ideal (any suggestions would be helpful).

2) could be extended to double integration in which case covariance  
between the Poisson part and the binary part could be handled.

In your code, part of the problem is that you have fitted a zapoisson,
but the prediction is based on a zipoisson (with complementary log-log
link rather than logit link).

In all zero-inflated/altered/hurdle/truncated models

E[y] = E[(1-prob)*meanc]

where prob is the probabilty of a zero in the binary part and meanc is  
the mean of a Poisson distribution (zipoisson) or a zero-truncated  
poisson (zapoisson and hupoisson). If we have eta_1 as the linear  
predictor for the poisson part and eta_2 as the linear predictor for  
the binary part:

In zipoisson: prob = plogis(eta_2)     and meanc = exp(eta_1)
In zapoisson: prob = exp(-exp(eta_2))  and meanc = exp(eta_1)/(1-exp(-exp(eta_1)))
In hupoisson: prob = plogis(eta_2)     and meanc = exp(eta_1)/(1-exp(-exp(eta_1)))
In ztpoisson: prob = 0                 and meanc = exp(eta_1)/(1-exp(-exp(eta_1)))
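As a sketch, the four cases above can be written directly in R for fixed linear predictors (the values of eta_1 and eta_2 below are arbitrary illustrative numbers, not from any fitted model):

```r
# Data-scale mean E[y] = (1 - prob) * meanc for given linear predictors
# eta_1 (Poisson part) and eta_2 (binary part).
eta_1 <- 0.5
eta_2 <- 1.0

meanc_pois <- exp(eta_1)                           # Poisson mean
meanc_zt   <- exp(eta_1) / (1 - exp(-exp(eta_1)))  # zero-truncated Poisson mean

ey_zi <- (1 - plogis(eta_2))    * meanc_pois  # zipoisson
ey_za <- (1 - exp(-exp(eta_2))) * meanc_zt    # zapoisson
ey_hu <- (1 - plogis(eta_2))    * meanc_zt    # hupoisson
ey_zt <- meanc_zt                             # ztpoisson (prob = 0)
```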

In each case the linear predictor has a `fixed' part and a `random'  
part which I'll denote as `a' and `u' respectively. Ideally we would  
want

E[(1-prob)*meanc] taken over u_1 & u_2

if prob and meanc are independent this is a bit easier as

E[y] = E[1-prob]E[meanc]

and the two expectations only need to be taken with respect to their
respective random effects. If we have sd_1 and sd_2 as the standard
deviations of the two sets of random effects, then for the zapoisson:

   normal.evd <- function(x, mu, v){
     exp(-exp(x))*dnorm(x, mu, sqrt(v))
   }
   normal.zt <- function(x, mu, v){
     exp(x)/(1-exp(-exp(x)))*dnorm(x, mu, sqrt(v))
   }

   pred <- function(a_1, a_2, sd_1, sd_2){
     prob <- 1-integrate(normal.evd, qnorm(0.0001, a_2, sd_2),
                         qnorm(0.9999, a_2, sd_2), a_2, sd_2)[[1]]
     meanc <- integrate(normal.zt, qnorm(0.0001, a_1, sd_1),
                        qnorm(0.9999, a_1, sd_1), a_1, sd_1)[[1]]
     prob*meanc
   }

   # gives the expected value with reasonable accuracy. As an example:

   x <- rnorm(300)
   l1 <- rnorm(300, 1/2+x, sqrt(1))
   l2 <- rnorm(300, 1-x, sqrt(1))

   y <- rbinom(300, 1, 1-exp(-exp(l2)))
   y[which(y==1)] <- qpois(runif(sum(y==1), dpois(0, exp(l1[which(y==1)])), 1),
                           exp(l1[which(y==1)]))
   # cunning sampler from Peter Dalgaard (R-sig-mixed)

   data <- data.frame(y=y, x=x)
   prior <- list(R=list(V=diag(2), nu=0.002, fix=2))

   m1 <- MCMCglmm(y~trait+trait:x-1, rcov=~idh(trait):units, data=data,
                  family="zapoisson", prior=prior)

   b_1 <- colMeans(m1$Sol)[c(1,3)]
   b_2 <- colMeans(m1$Sol)[c(2,4)]
   sd_1 <- mean(sqrt(m1$VCV[,1]))
   sd_2 <- mean(sqrt(m1$VCV[,2]))

   # Note: it is more accurate to take the posterior mean prediction rather
   # than the prediction from the posterior means as I've done here, but for
   # illustration:

   x.pred <- seq(-3, 3, length=100)
   p <- 1:100
   for(i in 1:100){
     p[i] <- pred(a_1 = b_1[1]+x.pred[i]*b_1[2], a_2 = b_2[1]+x.pred[i]*b_2[2],
                  sd_1=sd_1, sd_2=sd_2)
   }

   plot(y~x)
   lines(p~x.pred)

Cheers,

Jarrod




Quoting Ruben Arslan <rubenarslan at gmail.com> on Tue, 17 Mar 2015 13:33:25 +0000:
#
Hi Jarrod,

thanks for the extensive reply! This helps a lot, though it sounds like I
was hubristic to attempt this myself.
I tried using the approach you mapped out in the function gist
<https://gist.github.com/rubenarslan/aeacdd306b3d061819a6> I posted. I
simply put the pred function in a loop, so that I wouldn't make any
mistakes while vectorising and since I don't care about performance at this
point.

Of course, I have some follow-up questions. I'm sorry if I'm being
a pain; I really appreciate the free advice, but understand of course if
other things take precedence.

1. You're not "down with" developing publicly are you? Because I sure would
like to test-drive the newdata prediction and simulation functions..

2. Could you make sure that I got this right: "When predictions are to be
taken after marginalising the random effects (including the `residual'
over-dispersion) it is not possible to obtain closed form expressions."
That is basically my scenario, right? In the example I included, I also had
a group-level random effect (family). Or are you talking about the "trait"
as the random effect (as in your example), so that my scenario is different and
I cannot apply the numerical double integration procedure you posted?
To be clear about my prediction goal without using language that I might be
using incorrectly: I want to show what the average effect in the response
unit, number of children, is in my population(s). I have data on whole
populations and am using all of it (except individuals that don't have
completed fertility yet, because I have yet to find a way to model both
zero-inflation and right censoring).

3. "Numerical integration could be extended to double integration in which
case covariance between the Poisson part and the binary part could be
handled." That is what you posted an example of and it applies to my
scenario, because I specified a prior R=list(V=diag(2), nu=1.002, fix=2)
and rcov=~idh(trait):units, random=~idh(trait):idParents?
But this double integration approach is something you just wrote
off-the-cuff and I probably shouldn't use it in a publication? Or is this
in the forthcoming MCMCglmm release and I might actually be able to refer
to it once I get to submitting?

4. Could I change my model specification to forbid covariance between the
two parts and not shoot myself in the foot? Would this allow for a more
valid/tested approach to prediction?

5. When I use your method on my real data, I get less variation around the
prediction "for the reference level" than for all other factor levels.
My reference level actually has fewer cases than the others, so this isn't
"right" in a way.
Is this because I'm not doing newdata prediction? I get the "right"-looking
uncertainty if I bootstrap newdata predictions in lme4.
Sorry if this is children's logic :-)
Here's an image of the prediction
<http://rpubs.com/rubenarslan/mcmcglmm_pred> and the raw data
<http://rpubs.com/rubenarslan/raw>.

Many thanks for any answers that you feel inclined to give.

Best wishes,

Ruben
On Tue, Mar 17, 2015 at 5:31 PM Jarrod Hadfield <j.hadfield at ed.ac.uk> wrote:
#
Hi,

two more questions, sorry.

6. Could it be that where you wrote
normal.evd<-function(x, mu, v){
exp(-exp(x))*dnorm(x, mu, sqrt(v))
}

you're not actually passing the variance but the standard deviation? So the
functions should be:

normal.evd<-function(x, mu, sd){
exp(-exp(x))*dnorm(x, mu, sd)
}
normal.zt<-function(x, mu, sd){
exp(x)/(1-exp(-exp(x)))*dnorm(x, mu, sd)
}

Because at least in the example, you were already taking the square root
here: sd_1<-mean(sqrt(m1$VCV[,1]))
Maybe this was a copy-paste from a source where you do it differently? Or
I misunderstood...

An addition to question 5:
5b. When I use the correct inverse link function in my old code but don't
do double integration, I do get a "reasonable" amount of uncertainty around
the reference factor. The predicted means are too low, which I put down to
this approach ignoring the VCV. I was just wondering how this could happen
(actually, seeing the "reasonable" amount of uncertainty was what made me
hopeful that my approach might not be entirely wrong).

Best,

Ruben
On Tue, Mar 17, 2015 at 7:53 PM Ruben Arslan <rubenarslan at gmail.com> wrote:
  
#
Hi,

1/ I have no problem with it - I'm just lazy about it.

2/ This is the case for all GLMMs. In your case you have two sources of
variation to average over: family effects and units
(observation-level) effects. However, they are independent of each
other, so you can replace (for example) u_1 with family_1+units_1
and v_1 with v_family+v_units.

3/ In your model the zero-alteration and the Poisson process are
treated as independent because you use an idh structure. Here, the
covariance is set to zero. You could use a us structure, which would
then allow a non-zero covariance between family effects (the
covariance is not identifiable at the units level). I did not post a
double-integration example; the integration was of the form


int_{u_{2}} (1-prob) d_{u_{2}} * int_{u_{1}} meanc d_{u_{1}}

Double integration would involve

int_{u_{1}}int_{u_{2}} (1-prob)*meanc d_{u_{2}}d_{u_{1}}

For example, we have a = [a_{1}, a_{2}]' and V the covariance matrix
of the random effects.


library(mvtnorm)
library(cubature)

normal.zap<-function(x, mu, V){
  (1-exp(-exp(x[2])))*dmvnorm(x, mu, V)*exp(x[1])/(1-exp(-exp(x[1])))
}

pred2<-function(a,V){

  low1<-qnorm(0.0001, a[1],sqrt(V[1,1]))
  upp1<-qnorm(0.9999, a[1],sqrt(V[1,1]))
  low2<-qnorm(0.0001, a[2],sqrt(V[2,2]))
  upp2<-qnorm(0.9999, a[2],sqrt(V[2,2]))

  adaptIntegrate(normal.zap, c(low1,low2),c(upp1,upp2), mu=a, V=V)$integral

}


a<-runif(2)
V<-diag(2)

pred2(a,V)


should give the same answer as pred when V is diagonal (after fixing
my mistakes in the last post). The forthcoming version of MCMCglmm
just has the single-integration method, but I guess I could put
this into the next version too.


4/ Currently you have zero covariances, but if you think they exist I
would model them. Remember that a nice feature of zap models is that
the standard Poisson model is a special case. In terms of random
effects, this occurs when the correlation between family effects for
the two processes is 1 (i.e. fit ~family rather than
~idh(trait):family).
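For reference, these alternatives map onto random-effect formulas like the following (a sketch using the grouping variable from Ruben's model; idh() and us() are MCMCglmm variance-structure functions):

```r
# Alternative random-effect structures for the family term, from least
# to most constrained:
random = ~ us(trait):idParents   # estimates a covariance between the two parts
random = ~ idh(trait):idParents  # covariance fixed at zero
random = ~ idParents             # one shared effect: correlation of 1, i.e.
                                 # the standard-Poisson special case
```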

5/ I'm not sure I understand the question. The uncertainty here is  
arising because of uncertainty in the parameters of the model? If so,  
possibilities are that i) there is more information for the reference  
level because other levels are more confounded with other predictors  
in the model ii) the reference level is associated with larger  
outcomes and so the same change on the link scale generates larger  
differences on the data scale.

Cheers,

Jarrod




Quoting Ruben Arslan <rubenarslan at gmail.com> on Tue, 17 Mar 2015 18:53:33 +0000:
#
Hi,

6/ you are correct - my mistake.


5b/ I'm still not sure what you mean by the uncertainty, but ignoring
the random effects will result in underprediction. For a Poisson
log-normal the mean is exp(a_1+v_1/2) rather than exp(a_1) for the
standard Poisson.
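A quick Monte Carlo sketch of that correction (a and v below are arbitrary illustrative values): averaging exp(a + u) over u ~ N(0, v) gives exp(a + v/2), so plugging in only the fixed part exp(a) underpredicts.

```r
# Marginal mean of a Poisson log-normal: E[exp(a + u)] with u ~ N(0, v)
# equals exp(a + v/2); exp(a) alone (random effect ignored) is too small.
set.seed(1)
a <- 0.5
v <- 0.4
u <- rnorm(1e6, 0, sqrt(v))

marginal_mean <- mean(exp(a + u))  # simulated marginal mean
exp(a)                             # conditional mean at u = 0: underpredicts
exp(a + v / 2)                     # matches the simulated marginal mean
```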

Cheers,

Jarrod



Quoting Ruben Arslan <rubenarslan at gmail.com> on Tue, 17 Mar 2015 19:21:06 +0000:
#
Hi,

thanks for your reply. This clears some things up.

1. Colour me interested then, if you ever find the time to put it on
GitHub. Maybe after seeing the nonsense I've wrought in the following
points, you might find that sending me a zip of the pre-release would save
you time compared to guessing where I went wrong?

2. I'm not sure I understand.
You're saying I can add the VCV components for the family random effect to
the respective unit components.
But I would only want that if I were interested in predicting the effect for
any family, which in my case is not the goal, since my population data
include all families and I want to know the effect for the average family
in that population (to compare it to other populations).

3. Oh right, I'm sorry I think I got a little confused/sidetracked, these
models have been with me for a while now and I learned about us()/idh()
back then.

4. Okay, I will keep this in mind, but am for now focusing on getting the
prediction for this simpler case right.

5. I made a reproducible report
<http://rpubs.com/rubenarslan/repro_variab_mcmcglmm> to show the problem. I
think the problem is not due to the changes I made in my prediction helper
function but to something more fundamental.
It shows that, for some toy data, the credibility intervals for predictions
depend on the reference level I choose for a factor predictor.
This is less bad with the method I used initially, simply using the inverse
link.
However, both methods _underpredict_ as far as I can tell.

Surely I'm missing an important step, but I hope this illustrates what
confuses me. Of course the reasons you map out for the discrepancy
(variable confounding etc) are sound in theory, but I don't see these wide
CIs using any other method (simple descriptives, lme4).
Oh and I used a different data simulation, because the one you used
actually induces zero-deflation as far as I can tell (a lot of zeroes, but
lambda is very low too).

Best regards and again thanks so much for taking the time,

Ruben
On Wed, Mar 18, 2015 at 7:06 AM Jarrod Hadfield <j.hadfield at ed.ac.uk> wrote:
9 days later
#
Hi Ruben,

Regarding 2/ you want to add the family variance if you want to  
predict what the average response will be across families. If you want  
to predict for a specific family (for example one where the random  
effect = 0) then you should not include the variance.

Cheers,

Jarrod



Quoting Ruben Arslan <rubenarslan at gmail.com> on Wed, 18 Mar 2015 13:51:50 +0000: