Dear list, I wanted to ask: Is there any (maybe just back of the envelope) way to obtain a response prediction for zero-inflated or hurdle type models? I've fit such models in MCMCglmm, but I don't work in ecology and my previous experience with explaining such models to "my audience" did not bode well. When it comes to humans, the researchers I presented to are not used to offspring count being zero-inflated (or acquainted with that concept), but in my historical data with high infant mortality, it is (in modern data it's actually slightly underdispersed). Currently I'm using lme4 and simply splitting my models into two stages (finding a mate and having offspring). That's okay too, but in one population the effect of interest is not clearly visible in either stage, only when both are taken together (but then the outcome is zero-inflated). I expect to be given a hard time for this and hence thought I'd use a binomial model with the outcome offspring>0 as my main model, but that turns out to be hard to explain too and doesn't really do the data justice. Basically I don't want to be forced to discuss my smallest population as a non-replication of the effect because I was insufficiently able to explain the statistics behind my reasoning that the effect shows. Best regards, Ruben Arslan Georg August University G?ttingen Biological Personality Psychology Georg Elias M?ller Institute of Psychology Go?lerstr. 14 37073 G?ttingen Germany Tel.: +49 551 3920704 https://psych.uni-goettingen.de/en/arslan https://formr.org
Predictions from zero-inflated or hurdle models
11 messages · Ben Bolker, Ruben Arslan, Jarrod Hadfield
Ruben Arslan <rubenarslan at ...> writes:
Dear list, I wanted to ask: Is there any (maybe just back of the envelope) way to obtain a response prediction for zero-inflated or hurdle type models? I've fit such models in MCMCglmm, but I don't work in ecology and my previous experience with explaining such models to "my audience" did not bode well. When it comes to humans, the researchers I presented to are not used to offspring count being zero-inflated (or acquainted with that concept), but in my historical data with high infant mortality, it is (in modern data it's actually slightly underdispersed). Currently I'm using lme4 and simply splitting my models into two stages (finding a mate and having offspring). That's okay too, but in one population the effect of interest is not clearly visible in either stage, only when both are taken together (but then the outcome is zero-inflated). I expect to be given a hard time for this and hence thought I'd use a binomial model with the outcome offspring>0 as my main model, but that turns out to be hard to explain too and doesn't really do the data justice. Basically I don't want to be forced to discuss my smallest population as a non-replication of the effect because I was insufficiently able to explain the statistics behind my reasoning that the effect shows.
I think the back-of-the envelope answer would be that for a two-stage model with a prediction of p_i for the probability of having a non-zero response (or in the case of zero-inflated models, the probability of _not_ having a structural zero) and a prediction of n_i for the conditional part of the model, the mean predicted value is p_i*n_i and the variance is _approximately_ (p_i*n_i)^2*(var(p_i)/p_i^2 + var(n_i)/n_i^2) (this is assuming that you haven't built in any correlation between p_i and n_i, which would be hard in lme4 but _might_ be possible under certain circumstances via a multitype model in MCMCglmm). Does that help?
Dear Dr Bolker,
I'd thought about something like this, one point of asking was to see whether
a) it's implemented already, because I'll probably make dumb mistakes while
trying b) it's not implemented because it's a bad idea.
Your response and the MCMCglmm course notes make me hope that it's c) not
implemented because nobody did yet or d) it's so simple that everybody does
it on-the-fly.
So I tried my hand and would appreciate corrections. I am sure there is
some screw-up or an inelegant approach in there.
I included code for dealing with mcmc.lists because that's what I have and
I'm not entirely sure how I deal with them is correct either.
I started with a zero-altered model, because those fit fastest and
according to the course notes have the least complex likelihood.
Because I know not what I do, I'm not dealing with my random effects at all.
I pasted a model summary below to show what I've applied the below function
to. The function gives the following quantiles when applied to 19 chains of
that model.
5% 50% 95%
5.431684178 5.561211207 5.690655200
5% 50% 95%
5.003974382 5.178192327 5.348246558
Warm regards,
Ruben Arslan
HPDpredict_za = function(object, predictor) {
if(class(object) != "MCMCglmm") {
if(length( object[[1]]$Residual$nrt )>1) {
object = lapply(object,FUN=function(x) { x$Residual$nrt<-2;x })
}
Sol = mcmc.list(lapply(object,FUN=function(x) { x$Sol}))
vars = colnames(Sol[[1]])
} else {
Sol = as.data.frame(object$Sol)
vars = names(Sol)
}
za_predictor = vars[ vars %ends_with% predictor & vars %begins_with%
"traitza_"]
za_intercept_name = vars[ ! vars %contains% ":" & vars %begins_with%
"traitza_"]
intercept = Sol[,"(Intercept)"]
za_intercept = Sol[, za_intercept_name]
l1 = Sol[, predictor ]
l2 = Sol[, za_predictor ]
if(is.list(object)) {
intercept = unlist(intercept)
za_intercept = unlist(za_intercept)
l1 = unlist(l1)
l2 = unlist(l2)
}
py_0 = dpois(0, exp(intercept + za_intercept))
y_ygt0 = exp(intercept)
at_intercept = (1-py_0) * y_ygt0
py_0 = dpois(0, exp(intercept + za_intercept + l2))
y_ygt0 = exp(intercept + l1)
at_predictor_1 = (1-py_0) * y_ygt0
print(qplot(at_intercept))
print(qplot(at_predictor_1))
df = data.frame("intercept" = at_intercept)
df[, predictor] = at_predictor_1
print(qplot(x=variable,
y=value,data=suppressMessages(melt(df)),fill=variable,alpha=I(0.40), geom =
'violin'))
print(quantile(at_intercept, probs = c(0.05,0.5,0.95)))
print(quantile(at_predictor_1, probs = c(0.05,0.5,0.95)))
invisible(df)
}
summary(object[[1]])
Iterations = 100001:299901
Thinning interval = 100
Sample size = 2000
DIC: 349094
G-structure: ~idh(trait):idParents
post.mean l-95% CI u-95% CI eff.samp
children.idParents 0.0189 0.0164 0.0214 1729
za_children.idParents 0.2392 0.2171 0.2622 1647
R-structure: ~idh(trait):units
post.mean l-95% CI u-95% CI eff.samp
children.units 0.144 0.139 0.148 1715
za_children.units 1.000 1.000 1.000 0
Location effects: children ~ trait * (maternalage.factor + paternalloss +
maternalloss + center(nr.siblings) + birth.cohort + urban + male +
paternalage.mean + paternalage.diff)
post.mean l-95% CI u-95% CI
eff.samp pMCMC
(Intercept) 2.088717 2.073009 2.103357
2000 <0.0005 ***
traitza_children -1.933491 -1.981945 -1.887863
2000 <0.0005 ***
maternalage.factor(14,20] 0.007709 -0.014238 0.027883
1500 0.460
maternalage.factor(35,50] 0.006350 -0.009634 0.024107
2000 0.462
paternallossTRUE 0.000797 -0.022716 0.025015
2000 0.925
maternallossTRUE -0.015542 -0.040240 0.009549
2000 0.226
center(nr.siblings) 0.005869 0.004302 0.007510
2000 <0.0005 ***
birth.cohort(1703,1722] -0.045487 -0.062240 -0.028965
2000 <0.0005 ***
birth.cohort(1722,1734] -0.055872 -0.072856 -0.036452
2000 <0.0005 ***
birth.cohort(1734,1743] -0.039770 -0.056580 -0.020907
2000 <0.0005 ***
birth.cohort(1743,1750] -0.030713 -0.048301 -0.012214
2000 0.002 **
urban -0.076748 -0.093240 -0.063002
2567 <0.0005 ***
male 0.106074 0.095705 0.115742
2000 <0.0005 ***
paternalage.mean -0.024119 -0.033133 -0.014444
2000 <0.0005 ***
paternalage.diff -0.018367 -0.032083 -0.005721
2000 0.007 **
traitza_children:maternalage.factor(14,20] -0.116510 -0.182432 -0.051978
1876 0.001 ***
traitza_children:maternalage.factor(35,50] -0.045196 -0.094485 0.002640
2000 0.075 .
traitza_children:paternallossTRUE -0.171957 -0.238218 -0.104820
2000 <0.0005 ***
traitza_children:maternallossTRUE -0.499539 -0.566825 -0.430637
2000 <0.0005 ***
traitza_children:center(nr.siblings) -0.023723 -0.028676 -0.018746
1848 <0.0005 ***
traitza_children:birth.cohort(1703,1722] -0.026012 -0.074250 0.026024
2000 0.319
traitza_children:birth.cohort(1722,1734] -0.279418 -0.329462 -0.227187
2000 <0.0005 ***
traitza_children:birth.cohort(1734,1743] -0.260165 -0.312659 -0.204462
2130 <0.0005 ***
traitza_children:birth.cohort(1743,1750] -0.481457 -0.534568 -0.426648
2000 <0.0005 ***
traitza_children:urban -0.604108 -0.645169 -0.562554
1702 <0.0005 ***
traitza_children:male -0.414988 -0.444589 -0.387005
2000 <0.0005 ***
traitza_children:paternalage.mean 0.006545 -0.018570 0.036227
2000 0.651
traitza_children:paternalage.diff -0.097982 -0.136302 -0.060677
2000 <0.0005 ***
On Mon, Mar 9, 2015 at 10:12 PM Ben Bolker <bbolker at gmail.com> wrote:
Ruben Arslan <rubenarslan at ...> writes:
Dear list, I wanted to ask: Is there any (maybe just back of the envelope) way to obtain a response prediction for zero-inflated or hurdle type models? I've fit such models in MCMCglmm, but I don't work in ecology and my previous experience with explaining such models to "my audience" did not bode well. When it comes to humans, the researchers I presented to are
not
used to offspring count being zero-inflated (or acquainted with that concept), but in my historical data with high infant mortality, it is (in modern data it's actually slightly underdispersed). Currently I'm using lme4 and simply splitting my models into two stages (finding a mate and having offspring). That's okay too, but in one population the effect of interest is not clearly visible in either stage, only when both are taken together (but then the outcome is zero-inflated). I expect to be given a hard time for this and hence thought I'd use a binomial model with the outcome offspring>0 as my main model, but that turns out to be hard to explain too and doesn't really do the data justice. Basically I don't want to be forced to discuss my smallest population as
a
non-replication of the effect because I was insufficiently able to
explain
the statistics behind my reasoning that the effect shows.
I think the back-of-the envelope answer would be that for a two-stage model with a prediction of p_i for the probability of having a non-zero response (or in the case of zero-inflated models, the probability of _not_ having a structural zero) and a prediction of n_i for the conditional part of the model, the mean predicted value is p_i*n_i and the variance is _approximately_ (p_i*n_i)^2*(var(p_i)/p_i^2 + var(n_i)/n_i^2) (this is assuming that you haven't built in any correlation between p_i and n_i, which would be hard in lme4 but _might_ be possible under certain circumstances via a multitype model in MCMCglmm). Does that help?
_______________________________________________ R-sig-mixed-models at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models
7 days later
Dear list, I've made a reproducible example of the zero-altered prediction, in the hope that someone will have a look and reassure me that I'm going about this the right way. I was a bit confused by the point about p_i and n_i being correlated (they are in my case), but I think this was a red herring for me since I'm not deriving the variance analytically. The script is here: https://gist.github.com/rubenarslan/aeacdd306b3d061819a6 and if you don't want to run the simulation fit yourself, I've put an RDS file of the fit here: https://dl.dropboxusercontent.com/u/1078620/m1.rds Best regards, Ruben Arslan
On Tue, Mar 10, 2015 at 1:22 PM Ruben Arslan <rubenarslan at gmail.com> wrote:
Dear Dr Bolker,
I'd thought about something like this, one point of asking was to see whether
a) it's implemented already, because I'll probably make dumb mistakes while
trying b) it's not implemented because it's a bad idea.
Your response and the MCMCglmm course notes make me hope that it's c) not
implemented because nobody did yet or d) it's so simple that everybody does
it on-the-fly.
So I tried my hand and would appreciate corrections. I am sure there is
some screw-up or an inelegant approach in there.
I included code for dealing with mcmc.lists because that's what I have and
I'm not entirely sure how I deal with them is correct either.
I started with a zero-altered model, because those fit fastest and
according to the course notes have the least complex likelihood.
Because I know not what I do, I'm not dealing with my random effects at
all.
I pasted a model summary below to show what I've applied the below
function to. The function gives the following quantiles when applied to 19
chains of that model.
5% 50% 95%
5.431684178 5.561211207 5.690655200
5% 50% 95%
5.003974382 5.178192327 5.348246558
Warm regards,
Ruben Arslan
HPDpredict_za = function(object, predictor) {
if(class(object) != "MCMCglmm") {
if(length( object[[1]]$Residual$nrt )>1) {
object = lapply(object,FUN=function(x) { x$Residual$nrt<-2;x })
}
Sol = mcmc.list(lapply(object,FUN=function(x) { x$Sol}))
vars = colnames(Sol[[1]])
} else {
Sol = as.data.frame(object$Sol)
vars = names(Sol)
}
za_predictor = vars[ vars %ends_with% predictor & vars %begins_with%
"traitza_"]
za_intercept_name = vars[ ! vars %contains% ":" & vars %begins_with%
"traitza_"]
intercept = Sol[,"(Intercept)"]
za_intercept = Sol[, za_intercept_name]
l1 = Sol[, predictor ]
l2 = Sol[, za_predictor ]
if(is.list(object)) {
intercept = unlist(intercept)
za_intercept = unlist(za_intercept)
l1 = unlist(l1)
l2 = unlist(l2)
}
py_0 = dpois(0, exp(intercept + za_intercept))
y_ygt0 = exp(intercept)
at_intercept = (1-py_0) * y_ygt0
py_0 = dpois(0, exp(intercept + za_intercept + l2))
y_ygt0 = exp(intercept + l1)
at_predictor_1 = (1-py_0) * y_ygt0
print(qplot(at_intercept))
print(qplot(at_predictor_1))
df = data.frame("intercept" = at_intercept)
df[, predictor] = at_predictor_1
print(qplot(x=variable,
y=value,data=suppressMessages(melt(df)),fill=variable,alpha=I(0.40), geom =
'violin'))
print(quantile(at_intercept, probs = c(0.05,0.5,0.95)))
print(quantile(at_predictor_1, probs = c(0.05,0.5,0.95)))
invisible(df)
}
summary(object[[1]])
Iterations = 100001:299901
Thinning interval = 100
Sample size = 2000
DIC: 349094
G-structure: ~idh(trait):idParents
post.mean l-95% CI u-95% CI eff.samp
children.idParents 0.0189 0.0164 0.0214 1729
za_children.idParents 0.2392 0.2171 0.2622 1647
R-structure: ~idh(trait):units
post.mean l-95% CI u-95% CI eff.samp
children.units 0.144 0.139 0.148 1715
za_children.units 1.000 1.000 1.000 0
Location effects: children ~ trait * (maternalage.factor + paternalloss +
maternalloss + center(nr.siblings) + birth.cohort + urban + male +
paternalage.mean + paternalage.diff)
post.mean l-95% CI u-95% CI
eff.samp pMCMC
(Intercept) 2.088717 2.073009 2.103357
2000 <0.0005 ***
traitza_children -1.933491 -1.981945 -1.887863
2000 <0.0005 ***
maternalage.factor(14,20] 0.007709 -0.014238 0.027883
1500 0.460
maternalage.factor(35,50] 0.006350 -0.009634 0.024107
2000 0.462
paternallossTRUE 0.000797 -0.022716 0.025015
2000 0.925
maternallossTRUE -0.015542 -0.040240 0.009549
2000 0.226
center(nr.siblings) 0.005869 0.004302 0.007510
2000 <0.0005 ***
birth.cohort(1703,1722] -0.045487 -0.062240 -0.028965
2000 <0.0005 ***
birth.cohort(1722,1734] -0.055872 -0.072856 -0.036452
2000 <0.0005 ***
birth.cohort(1734,1743] -0.039770 -0.056580 -0.020907
2000 <0.0005 ***
birth.cohort(1743,1750] -0.030713 -0.048301 -0.012214
2000 0.002 **
urban -0.076748 -0.093240 -0.063002
2567 <0.0005 ***
male 0.106074 0.095705 0.115742
2000 <0.0005 ***
paternalage.mean -0.024119 -0.033133 -0.014444
2000 <0.0005 ***
paternalage.diff -0.018367 -0.032083 -0.005721
2000 0.007 **
traitza_children:maternalage.factor(14,20] -0.116510 -0.182432 -0.051978
1876 0.001 ***
traitza_children:maternalage.factor(35,50] -0.045196 -0.094485 0.002640
2000 0.075 .
traitza_children:paternallossTRUE -0.171957 -0.238218 -0.104820
2000 <0.0005 ***
traitza_children:maternallossTRUE -0.499539 -0.566825 -0.430637
2000 <0.0005 ***
traitza_children:center(nr.siblings) -0.023723 -0.028676 -0.018746
1848 <0.0005 ***
traitza_children:birth.cohort(1703,1722] -0.026012 -0.074250 0.026024
2000 0.319
traitza_children:birth.cohort(1722,1734] -0.279418 -0.329462 -0.227187
2000 <0.0005 ***
traitza_children:birth.cohort(1734,1743] -0.260165 -0.312659 -0.204462
2130 <0.0005 ***
traitza_children:birth.cohort(1743,1750] -0.481457 -0.534568 -0.426648
2000 <0.0005 ***
traitza_children:urban -0.604108 -0.645169 -0.562554
1702 <0.0005 ***
traitza_children:male -0.414988 -0.444589 -0.387005
2000 <0.0005 ***
traitza_children:paternalage.mean 0.006545 -0.018570 0.036227
2000 0.651
traitza_children:paternalage.diff -0.097982 -0.136302 -0.060677
2000 <0.0005 ***
On Mon, Mar 9, 2015 at 10:12 PM Ben Bolker <bbolker at gmail.com> wrote:
Ruben Arslan <rubenarslan at ...> writes:
Dear list, I wanted to ask: Is there any (maybe just back of the envelope) way to obtain a response prediction for zero-inflated or hurdle type models? I've fit such models in MCMCglmm, but I don't work in ecology and my previous experience with explaining such models to "my audience" did not bode well. When it comes to humans, the researchers I presented to are
not
used to offspring count being zero-inflated (or acquainted with that concept), but in my historical data with high infant mortality, it is
(in
modern data it's actually slightly underdispersed). Currently I'm using lme4 and simply splitting my models into two stages (finding a mate and having offspring). That's okay too, but in one population the effect of interest is not clearly visible in either stage, only when both are taken together (but then the outcome is zero-inflated). I expect to be given a hard time for this and hence thought I'd use a binomial model with the outcome offspring>0 as my main model, but that turns out to be hard to explain too and doesn't really do the data justice. Basically I don't want to be forced to discuss my smallest population
as a
non-replication of the effect because I was insufficiently able to
explain
the statistics behind my reasoning that the effect shows.
I think the back-of-the envelope answer would be that for a two-stage model with a prediction of p_i for the probability of having a non-zero response (or in the case of zero-inflated models, the probability of _not_ having a structural zero) and a prediction of n_i for the conditional part of the model, the mean predicted value is p_i*n_i and the variance is _approximately_ (p_i*n_i)^2*(var(p_i)/p_i^2 + var(n_i)/n_i^2) (this is assuming that you haven't built in any correlation between p_i and n_i, which would be hard in lme4 but _might_ be possible under certain circumstances via a multitype model in MCMCglmm). Does that help?
_______________________________________________ R-sig-mixed-models at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models
Hi,
Sorry - I should have replied to this post earlier. I've been working
on predict/simulate methods for all MCMcglmm models (including
zero-inflated/altered/hurdle/truncated models) and these are now
largely complete (together with newdata options).
When predictions are to be taken after marginalising the random
effects (including the `residual' over-dispersion) it is not possible
to obtain closed form expressions. The options that will be available
in MCMCglmm are:
1) algebraic approximation
2) numerical integration
3) simulation
1) and 2) are currently only accurate when the random/residual effect
structure implies no covariance between the Poisson part and the
binary part.
1) is reasonably accurate for zero-inflated distributions, but can be
pretty poor for the remainder because they all involve zero-truncated
Poisson log-normal distributions and my taylor approximation for the
mean is less than ideal (any suggestions would be helpful).
2) could be extended to double integration in which case covariance
between the Poisson part and the binary part could be handled.
In your code, part of the problem is that you have fitted a zapoisson,
but the prediction is based on a zipoisson (with complementary log-log
link rather than logt link).
In all zero-inflated/altered/hurdle/truncated models
E[y] = E[(1-prob)*meanc]
where prob is the probabilty of a zero in the binary part and meanc is
the mean of a Poisson distribution (zipoisson) or a zero-truncated
poisson (zapoisson and hupoisson). If we have eta_1 as the linear
predictor for the poisson part and eta_2 as the linear predictor for
the binary part:
In zipoisson: prob = plogis(eta_2) and meanc = exp(eta_1)
In zapoisson: prob = exp(-exp(eta_2)) and meanc =
exp(eta_1)/(1-exp(-exp(eta_1)))
In hupoisson: prob = plogis(eta_2) and meanc =
exp(eta_1)/(1-exp(-exp(eta_1)))
In ztpoisson: prob = 0 and meanc =
exp(eta_1)/(1-exp(-exp(eta_1)))
In each case the linear predictor has a `fixed' part and a `random'
part which I'll denote as `a' and `u' respectively. Ideally we would
want
E[(1-prob)*meanc] taken over u_1 & u_2
if prob and meanc are independent this is a bit easier as
E[y] = E[1-prob]E[meanc]
and the two expectations ony need to taken with repsect to their
respective random effects. If we have sd_1 and sd_2 as the standard
deviations of the two sets of random effects then for the zapoisson:
normal.evd<-function(x, mu, v){
exp(-exp(x))*dnorm(x, mu, sqrt(v))
}
normal.zt<-function(x, mu, v){
exp(x)/(1-exp(-exp(x)))*dnorm(x, mu, sqrt(v))
}
pred<-function(a_1, a_2, sd_1, sd_2){
prob<-1-integrate(normal.evd, qnorm(0.0001, a_2,sd_2),
qnorm(0.9999, a_2,sd_2), a_2,sd_2)[[1]]
meanc<-integrate(normal.zt, qnorm(0.0001, a_1,sd_1),
qnorm(0.9999, a_1,sd_1), a_1,sd_1)[[1]]
prob*meanc
}
# gives the expected value with reasonable accuracy. As an example:
x<-rnorm(300)
l1<-rnorm(300, 1/2+x, sqrt(1))
l2<-rnorm(300, 1-x, sqrt(1))
y<-rbinom(300, 1, 1-exp(-exp(l2)))
y[which(y==1)]<-qpois(runif(sum(y==1), dpois(0,
exp(l1[which(y==1)])), 1), exp(l1[which(y==1)]))
# cunning sampler from Peter Dalgaard (R-sig-mixed)
data=data.frame(y=y, x=x)
prior=list(R=list(V=diag(2), nu=0.002, fix=2))
m1<-MCMCglmm(y~trait+trait:x-1, rcov=~idh(trait):units, data=data,
family="zapoisson", prior=prior)
b_1<-colMeans(m1$Sol)[c(1,3)]
b_2<-colMeans(m1$Sol)[c(2,4)]
sd_1<-mean(sqrt(m1$VCV[,1]))
sd_2<-mean(sqrt(m1$VCV[,2]))
# note it is more accurate to take the posterior mean prediction
rather than the prediction from the posterior means as I've done here,
but for illustration:
x.pred<-seq(-3,3,length=100)
p<-1:100
for(i in 1:100){
p[i]<-pred(a_1 = b_1[1]+x.pred[i]*b_1[2], a_2 =
b_2[1]+x.pred[i]*b_2[2], sd_1=sd_1, sd_2=sd_2)
}
plot(y~x)
lines(p~x.pred)
Cheers,
Jarrod
Quoting Ruben Arslan <rubenarslan at gmail.com> on Tue, 17 Mar 2015
13:33:25 +0000:
Dear list, I've made a reproducible example of the zero-altered prediction, in the hope that someone will have a look and reassure me that I'm going about this the right way. I was a bit confused by the point about p_i and n_i being correlated (they are in my case), but I think this was a red herring for me since I'm not deriving the variance analytically. The script is here: https://gist.github.com/rubenarslan/aeacdd306b3d061819a6 and if you don't want to run the simulation fit yourself, I've put an RDS file of the fit here: https://dl.dropboxusercontent.com/u/1078620/m1.rds Best regards, Ruben Arslan On Tue, Mar 10, 2015 at 1:22 PM Ruben Arslan <rubenarslan at gmail.com> wrote:
Dear Dr Bolker,
I'd thought about something like this, one point of asking was to
see whether
a) it's implemented already, because I'll probably make dumb mistakes while
trying b) it's not implemented because it's a bad idea.
Your response and the MCMCglmm course notes make me hope that it's c) not
implemented because nobody did yet or d) it's so simple that everybody does
it on-the-fly.
So I tried my hand and would appreciate corrections. I am sure there is
some screw-up or an inelegant approach in there.
I included code for dealing with mcmc.lists because that's what I have and
I'm not entirely sure how I deal with them is correct either.
I started with a zero-altered model, because those fit fastest and
according to the course notes have the least complex likelihood.
Because I know not what I do, I'm not dealing with my random effects at
all.
I pasted a model summary below to show what I've applied the below
function to. The function gives the following quantiles when applied to 19
chains of that model.
5% 50% 95%
5.431684178 5.561211207 5.690655200
5% 50% 95%
5.003974382 5.178192327 5.348246558
Warm regards,
Ruben Arslan
HPDpredict_za = function(object, predictor) {
if(class(object) != "MCMCglmm") {
if(length( object[[1]]$Residual$nrt )>1) {
object = lapply(object,FUN=function(x) { x$Residual$nrt<-2;x })
}
Sol = mcmc.list(lapply(object,FUN=function(x) { x$Sol}))
vars = colnames(Sol[[1]])
} else {
Sol = as.data.frame(object$Sol)
vars = names(Sol)
}
za_predictor = vars[ vars %ends_with% predictor & vars %begins_with%
"traitza_"]
za_intercept_name = vars[ ! vars %contains% ":" & vars %begins_with%
"traitza_"]
intercept = Sol[,"(Intercept)"]
za_intercept = Sol[, za_intercept_name]
l1 = Sol[, predictor ]
l2 = Sol[, za_predictor ]
if(is.list(object)) {
intercept = unlist(intercept)
za_intercept = unlist(za_intercept)
l1 = unlist(l1)
l2 = unlist(l2)
}
py_0 = dpois(0, exp(intercept + za_intercept))
y_ygt0 = exp(intercept)
at_intercept = (1-py_0) * y_ygt0
py_0 = dpois(0, exp(intercept + za_intercept + l2))
y_ygt0 = exp(intercept + l1)
at_predictor_1 = (1-py_0) * y_ygt0
print(qplot(at_intercept))
print(qplot(at_predictor_1))
df = data.frame("intercept" = at_intercept)
df[, predictor] = at_predictor_1
print(qplot(x=variable,
y=value,data=suppressMessages(melt(df)),fill=variable,alpha=I(0.40), geom =
'violin'))
print(quantile(at_intercept, probs = c(0.05,0.5,0.95)))
print(quantile(at_predictor_1, probs = c(0.05,0.5,0.95)))
invisible(df)
}
summary(object[[1]])
Iterations = 100001:299901
Thinning interval = 100
Sample size = 2000
DIC: 349094
G-structure: ~idh(trait):idParents
post.mean l-95% CI u-95% CI eff.samp
children.idParents 0.0189 0.0164 0.0214 1729
za_children.idParents 0.2392 0.2171 0.2622 1647
R-structure: ~idh(trait):units
post.mean l-95% CI u-95% CI eff.samp
children.units 0.144 0.139 0.148 1715
za_children.units 1.000 1.000 1.000 0
Location effects: children ~ trait * (maternalage.factor + paternalloss +
maternalloss + center(nr.siblings) + birth.cohort + urban + male +
paternalage.mean + paternalage.diff)
post.mean l-95% CI u-95% CI
eff.samp pMCMC
(Intercept) 2.088717 2.073009 2.103357
2000 <0.0005 ***
traitza_children -1.933491 -1.981945 -1.887863
2000 <0.0005 ***
maternalage.factor(14,20] 0.007709 -0.014238 0.027883
1500 0.460
maternalage.factor(35,50] 0.006350 -0.009634 0.024107
2000 0.462
paternallossTRUE 0.000797 -0.022716 0.025015
2000 0.925
maternallossTRUE -0.015542 -0.040240 0.009549
2000 0.226
center(nr.siblings) 0.005869 0.004302 0.007510
2000 <0.0005 ***
birth.cohort(1703,1722] -0.045487 -0.062240 -0.028965
2000 <0.0005 ***
birth.cohort(1722,1734] -0.055872 -0.072856 -0.036452
2000 <0.0005 ***
birth.cohort(1734,1743] -0.039770 -0.056580 -0.020907
2000 <0.0005 ***
birth.cohort(1743,1750] -0.030713 -0.048301 -0.012214
2000 0.002 **
urban -0.076748 -0.093240 -0.063002
2567 <0.0005 ***
male 0.106074 0.095705 0.115742
2000 <0.0005 ***
paternalage.mean -0.024119 -0.033133 -0.014444
2000 <0.0005 ***
paternalage.diff -0.018367 -0.032083 -0.005721
2000 0.007 **
traitza_children:maternalage.factor(14,20] -0.116510 -0.182432 -0.051978
1876 0.001 ***
traitza_children:maternalage.factor(35,50] -0.045196 -0.094485 0.002640
2000 0.075 .
traitza_children:paternallossTRUE -0.171957 -0.238218 -0.104820
2000 <0.0005 ***
traitza_children:maternallossTRUE -0.499539 -0.566825 -0.430637
2000 <0.0005 ***
traitza_children:center(nr.siblings) -0.023723 -0.028676 -0.018746
1848 <0.0005 ***
traitza_children:birth.cohort(1703,1722] -0.026012 -0.074250 0.026024
2000 0.319
traitza_children:birth.cohort(1722,1734] -0.279418 -0.329462 -0.227187
2000 <0.0005 ***
traitza_children:birth.cohort(1734,1743] -0.260165 -0.312659 -0.204462
2130 <0.0005 ***
traitza_children:birth.cohort(1743,1750] -0.481457 -0.534568 -0.426648
2000 <0.0005 ***
traitza_children:urban -0.604108 -0.645169 -0.562554
1702 <0.0005 ***
traitza_children:male -0.414988 -0.444589 -0.387005
2000 <0.0005 ***
traitza_children:paternalage.mean 0.006545 -0.018570 0.036227
2000 0.651
traitza_children:paternalage.diff -0.097982 -0.136302 -0.060677
2000 <0.0005 ***
On Mon, Mar 9, 2015 at 10:12 PM Ben Bolker <bbolker at gmail.com> wrote:
Ruben Arslan <rubenarslan at ...> writes:
Dear list, I wanted to ask: Is there any (maybe just back of the envelope) way to obtain a response prediction for zero-inflated or hurdle type models? I've fit such models in MCMCglmm, but I don't work in ecology and my previous experience with explaining such models to "my audience" did not bode well. When it comes to humans, the researchers I presented to are
not
used to offspring count being zero-inflated (or acquainted with that concept), but in my historical data with high infant mortality, it is
(in
modern data it's actually slightly underdispersed). Currently I'm using lme4 and simply splitting my models into two stages (finding a mate and having offspring). That's okay too, but in one population the effect of interest is not clearly visible in either stage, only when both are taken together (but then the outcome is zero-inflated). I expect to be given a hard time for this and hence thought I'd use a binomial model with the outcome offspring>0 as my main model, but that turns out to be hard to explain too and doesn't really do the data justice. Basically I don't want to be forced to discuss my smallest population
as a
non-replication of the effect because I was insufficiently able to
explain
the statistics behind my reasoning that the effect shows.
I think the back-of-the envelope answer would be that for a two-stage model with a prediction of p_i for the probability of having a non-zero response (or in the case of zero-inflated models, the probability of _not_ having a structural zero) and a prediction of n_i for the conditional part of the model, the mean predicted value is p_i*n_i and the variance is _approximately_ (p_i*n_i)^2*(var(p_i)/p_i^2 + var(n_i)/n_i^2) (this is assuming that you haven't built in any correlation between p_i and n_i, which would be hard in lme4 but _might_ be possible under certain circumstances via a multitype model in MCMCglmm). Does that help?
_______________________________________________ R-sig-mixed-models at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models
[[alternative HTML version deleted]]
_______________________________________________ R-sig-mixed-models at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models
The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.
Hi Jarrod, thanks for the extensive reply! This helps a lot, though it sounds like I was hubristic to attempt this myself. I tried using the approach you mapped out in the function gist <https://gist.github.com/rubenarslan/aeacdd306b3d061819a6> I posted. I simply put the pred function in a loop, so that I wouldn't make any mistakes while vectorising and since I don't care about performance at this point. Of course, I have some follow up questions though.. I'm sorry if I'm being a pain, I really appreciate the free advice, but understand of course if other things take precedence. 1. You're not "down with" developing publicly are you? Because I sure would like to test-drive the newdata prediction and simulation functions.. 2. Could you make sure that I got this right: "When predictions are to be taken after marginalising the random effects (including the `residual' over-dispersion) it is not possible to obtain closed form expressions." That is basically my scenario, right? In the example I included, I also had a group-level random effect (family). Ore are you talking about the "trait" as the random effect (as in your example) and my scenario is different and I cannot apply the numerical double integration procedure you posted? To be clear about my prediction goal without using language that I might be using incorrectly: I want to show what the average effect in the response unit, number of children, is in my population(s). I have data on whole populations and am using all of it (except individuals that don't have completed fertility yet, because I have yet to find a way to model both zero-inflation and right censoring). 3. "Numerical integration could be extended to double integration in which case covariance between the Poisson part and the binary part could be handled." That is what you posted an example of and it applies to my scenario, because I specified a prior R=list(V=diag(2), nu=1.002, fix=2) and rcov=~idh(trait):units, random=~idh(trait):idParents? But this double integration approach is something you just wrote off-the-cuff and I probably shouldn't use it in a publication? Or is this in the forthcoming MCMCglmm release and I might actually be able to refer to it once I get to submitting? 4. Could I change my model specification to forbid covariance between the two parts and not shoot myself in the foot? Would this allow for a more valid/tested approach to prediction? 5. When I use your method on my real data, I get less variation around the prediction "for the reference level" than for all other factor levels. My reference level actually has fewer cases than the others, so this isn't "right" in a way. Is this because I'm not doing newdata prediction? I get the "right" looking uncertainty if I bootstrap newdata predictions in lme4, Sorry if this is children's logic :-) Here's an image of the prediction <http://rpubs.com/rubenarslan/mcmcglmm_pred> and the raw data <http://rpubs.com/rubenarslan/raw>. Many thanks for any answers that you feel inclined to give. Best wishes, Ruben
On Tue, Mar 17, 2015 at 5:31 PM Jarrod Hadfield <j.hadfield at ed.ac.uk> wrote:
Hi,
Sorry - I should have replied to this post earlier. I've been working
on predict/simulate methods for all MCMcglmm models (including
zero-inflated/altered/hurdle/truncated models) and these are now
largely complete (together with newdata options).
When predictions are to be taken after marginalising the random
effects (including the `residual' over-dispersion) it is not possible
to obtain closed form expressions. The options that will be available
in MCMCglmm are:
1) algebraic approximation
2) numerical integration
3) simulation
1) and 2) are currently only accurate when the random/residual effect
structure implies no covariance between the Poisson part and the
binary part.
1) is reasonably accurate for zero-inflated distributions, but can be
pretty poor for the remainder because they all involve zero-truncated
Poisson log-normal distributions and my taylor approximation for the
mean is less than ideal (any suggestions would be helpful).
2) could be extended to double integration in which case covariance
between the Poisson part and the binary part could be handled.
In your code, part of the problem is that you have fitted a zapoisson,
but the prediction is based on a zipoisson (with complementary log-log
link rather than logt link).
In all zero-inflated/altered/hurdle/truncated models
E[y] = E[(1-prob)*meanc]
where prob is the probabilty of a zero in the binary part and meanc is
the mean of a Poisson distribution (zipoisson) or a zero-truncated
poisson (zapoisson and hupoisson). If we have eta_1 as the linear
predictor for the poisson part and eta_2 as the linear predictor for
the binary part:
In zipoisson: prob = plogis(eta_2) and meanc = exp(eta_1)
In zapoisson: prob = exp(-exp(eta_2)) and meanc =
exp(eta_1)/(1-exp(-exp(eta_1)))
In hupoisson: prob = plogis(eta_2) and meanc =
exp(eta_1)/(1-exp(-exp(eta_1)))
In ztpoisson: prob = 0 and meanc =
exp(eta_1)/(1-exp(-exp(eta_1)))
In each case the linear predictor has a `fixed' part and a `random'
part which I'll denote as `a' and `u' respectively. Ideally we would
want
E[(1-prob)*meanc] taken over u_1 & u_2
if prob and meanc are independent this is a bit easier as
E[y] = E[1-prob]E[meanc]
and the two expectations ony need to taken with repsect to their
respective random effects. If we have sd_1 and sd_2 as the standard
deviations of the two sets of random effects then for the zapoisson:
normal.evd<-function(x, mu, v){
exp(-exp(x))*dnorm(x, mu, sqrt(v))
}
normal.zt<-function(x, mu, v){
exp(x)/(1-exp(-exp(x)))*dnorm(x, mu, sqrt(v))
}
pred<-function(a_1, a_2, sd_1, sd_2){
prob<-1-integrate(normal.evd, qnorm(0.0001, a_2,sd_2),
qnorm(0.9999, a_2,sd_2), a_2,sd_2)[[1]]
meanc<-integrate(normal.zt, qnorm(0.0001, a_1,sd_1),
qnorm(0.9999, a_1,sd_1), a_1,sd_1)[[1]]
prob*meanc
}
# gives the expected value with reasonable accuracy. As an example:
x<-rnorm(300)
l1<-rnorm(300, 1/2+x, sqrt(1))
l2<-rnorm(300, 1-x, sqrt(1))
y<-rbinom(300, 1, 1-exp(-exp(l2)))
y[which(y==1)]<-qpois(runif(sum(y==1), dpois(0,
exp(l1[which(y==1)])), 1), exp(l1[which(y==1)]))
# cunning sampler from Peter Dalgaard (R-sig-mixed)
data=data.frame(y=y, x=x)
prior=list(R=list(V=diag(2), nu=0.002, fix=2))
m1<-MCMCglmm(y~trait+trait:x-1, rcov=~idh(trait):units, data=data,
family="zapoisson", prior=prior)
b_1<-colMeans(m1$Sol)[c(1,3)]
b_2<-colMeans(m1$Sol)[c(2,4)]
sd_1<-mean(sqrt(m1$VCV[,1]))
sd_2<-mean(sqrt(m1$VCV[,2]))
# note it is more accurate to take the posterior mean prediction
rather than the prediction from the posterior means as I've done here,
but for illustration:
x.pred<-seq(-3,3,length=100)
p<-1:100
for(i in 1:100){
p[i]<-pred(a_1 = b_1[1]+x.pred[i]*b_1[2], a_2 =
b_2[1]+x.pred[i]*b_2[2], sd_1=sd_1, sd_2=sd_2)
}
plot(y~x)
lines(p~x.pred)
Cheers,
Jarrod
Quoting Ruben Arslan <rubenarslan at gmail.com> on Tue, 17 Mar 2015
13:33:25 +0000:
Dear list, I've made a reproducible example of the zero-altered prediction, in the hope that someone will have a look and reassure me that I'm going about this the right way. I was a bit confused by the point about p_i and n_i being correlated
(they
are in my case), but I think this was a red herring for me since I'm not deriving the variance analytically. The script is here: https://gist.github.com/rubenarslan/
aeacdd306b3d061819a6
and if you don't want to run the simulation fit yourself, I've put an RDS file of the fit here: https://dl.dropboxusercontent.com/u/1078620/m1.rds Best regards, Ruben Arslan On Tue, Mar 10, 2015 at 1:22 PM Ruben Arslan <rubenarslan at gmail.com>
wrote:
Dear Dr Bolker, I'd thought about something like this, one point of asking was to see whether a) it's implemented already, because I'll probably make dumb mistakes
while
trying b) it's not implemented because it's a bad idea. Your response and the MCMCglmm course notes make me hope that it's c)
not
implemented because nobody did yet or d) it's so simple that everybody
does
it on-the-fly. So I tried my hand and would appreciate corrections. I am sure there is some screw-up or an inelegant approach in there. I included code for dealing with mcmc.lists because that's what I have
and
I'm not entirely sure how I deal with them is correct either. I started with a zero-altered model, because those fit fastest and according to the course notes have the least complex likelihood. Because I know not what I do, I'm not dealing with my random effects at all. I pasted a model summary below to show what I've applied the below function to. The function gives the following quantiles when applied to
19
chains of that model.
5% 50% 95%
5.431684178 5.561211207 5.690655200
5% 50% 95%
5.003974382 5.178192327 5.348246558
Warm regards,
Ruben Arslan
HPDpredict_za = function(object, predictor) {
if(class(object) != "MCMCglmm") {
if(length( object[[1]]$Residual$nrt )>1) {
object = lapply(object,FUN=function(x) { x$Residual$nrt<-2;x })
}
Sol = mcmc.list(lapply(object,FUN=function(x) { x$Sol}))
vars = colnames(Sol[[1]])
} else {
Sol = as.data.frame(object$Sol)
vars = names(Sol)
}
za_predictor = vars[ vars %ends_with% predictor & vars %begins_with%
"traitza_"]
za_intercept_name = vars[ ! vars %contains% ":" & vars %begins_with%
"traitza_"]
intercept = Sol[,"(Intercept)"]
za_intercept = Sol[, za_intercept_name]
l1 = Sol[, predictor ]
l2 = Sol[, za_predictor ]
if(is.list(object)) {
intercept = unlist(intercept)
za_intercept = unlist(za_intercept)
l1 = unlist(l1)
l2 = unlist(l2)
}
py_0 = dpois(0, exp(intercept + za_intercept))
y_ygt0 = exp(intercept)
at_intercept = (1-py_0) * y_ygt0
py_0 = dpois(0, exp(intercept + za_intercept + l2))
y_ygt0 = exp(intercept + l1)
at_predictor_1 = (1-py_0) * y_ygt0
print(qplot(at_intercept))
print(qplot(at_predictor_1))
df = data.frame("intercept" = at_intercept)
df[, predictor] = at_predictor_1
print(qplot(x=variable,
y=value,data=suppressMessages(melt(df)),fill=variable,alpha=I(0.40),
geom =
'violin')) print(quantile(at_intercept, probs = c(0.05,0.5,0.95))) print(quantile(at_predictor_1, probs = c(0.05,0.5,0.95))) invisible(df) }
summary(object[[1]])
Iterations = 100001:299901
Thinning interval = 100
Sample size = 2000
DIC: 349094
G-structure: ~idh(trait):idParents
post.mean l-95% CI u-95% CI eff.samp
children.idParents 0.0189 0.0164 0.0214 1729
za_children.idParents 0.2392 0.2171 0.2622 1647
R-structure: ~idh(trait):units
post.mean l-95% CI u-95% CI eff.samp
children.units 0.144 0.139 0.148 1715
za_children.units 1.000 1.000 1.000 0
Location effects: children ~ trait * (maternalage.factor +
paternalloss +
maternalloss + center(nr.siblings) + birth.cohort + urban + male +
paternalage.mean + paternalage.diff)
post.mean l-95% CI u-95% CI
eff.samp pMCMC
(Intercept) 2.088717 2.073009 2.103357
2000 <0.0005 ***
traitza_children -1.933491 -1.981945 -1.887863
2000 <0.0005 ***
maternalage.factor(14,20] 0.007709 -0.014238 0.027883
1500 0.460
maternalage.factor(35,50] 0.006350 -0.009634 0.024107
2000 0.462
paternallossTRUE 0.000797 -0.022716 0.025015
2000 0.925
maternallossTRUE -0.015542 -0.040240 0.009549
2000 0.226
center(nr.siblings) 0.005869 0.004302 0.007510
2000 <0.0005 ***
birth.cohort(1703,1722] -0.045487 -0.062240 -0.028965
2000 <0.0005 ***
birth.cohort(1722,1734] -0.055872 -0.072856 -0.036452
2000 <0.0005 ***
birth.cohort(1734,1743] -0.039770 -0.056580 -0.020907
2000 <0.0005 ***
birth.cohort(1743,1750] -0.030713 -0.048301 -0.012214
2000 0.002 **
urban -0.076748 -0.093240 -0.063002
2567 <0.0005 ***
male 0.106074 0.095705 0.115742
2000 <0.0005 ***
paternalage.mean -0.024119 -0.033133 -0.014444
2000 <0.0005 ***
paternalage.diff -0.018367 -0.032083 -0.005721
2000 0.007 **
traitza_children:maternalage.factor(14,20] -0.116510 -0.182432
-0.051978
1876 0.001 *** traitza_children:maternalage.factor(35,50] -0.045196 -0.094485
0.002640
2000 0.075 . traitza_children:paternallossTRUE -0.171957 -0.238218
-0.104820
2000 <0.0005 *** traitza_children:maternallossTRUE -0.499539 -0.566825
-0.430637
2000 <0.0005 *** traitza_children:center(nr.siblings) -0.023723 -0.028676
-0.018746
1848 <0.0005 *** traitza_children:birth.cohort(1703,1722] -0.026012 -0.074250
0.026024
2000 0.319 traitza_children:birth.cohort(1722,1734] -0.279418 -0.329462
-0.227187
2000 <0.0005 *** traitza_children:birth.cohort(1734,1743] -0.260165 -0.312659
-0.204462
2130 <0.0005 *** traitza_children:birth.cohort(1743,1750] -0.481457 -0.534568
-0.426648
2000 <0.0005 *** traitza_children:urban -0.604108 -0.645169 -0.562554 1702 <0.0005 *** traitza_children:male -0.414988 -0.444589 -0.387005 2000 <0.0005 *** traitza_children:paternalage.mean 0.006545 -0.018570
0.036227
2000 0.651 traitza_children:paternalage.diff -0.097982 -0.136302
-0.060677
2000 <0.0005 *** On Mon, Mar 9, 2015 at 10:12 PM Ben Bolker <bbolker at gmail.com> wrote:
Ruben Arslan <rubenarslan at ...> writes:
Dear list, I wanted to ask: Is there any (maybe just back of the envelope) way
to
obtain a response prediction for zero-inflated or hurdle type models? I've fit such models in MCMCglmm, but I don't work in ecology and my previous experience with explaining such models to "my audience" did
not
bode well. When it comes to humans, the researchers I presented to
are
not
used to offspring count being zero-inflated (or acquainted with that concept), but in my historical data with high infant mortality, it is
(in
modern data it's actually slightly underdispersed). Currently I'm using lme4 and simply splitting my models into two
stages
(finding a mate and having offspring). That's okay too, but in one population the effect of interest is not clearly visible in either stage, only when both are taken together
(but
then the outcome is zero-inflated). I expect to be given a hard time for this and hence thought I'd use a binomial model with the outcome offspring>0 as my main model, but
that
turns out to be hard to explain too and doesn't really do the data justice. Basically I don't want to be forced to discuss my smallest population
as a
non-replication of the effect because I was insufficiently able to
explain
the statistics behind my reasoning that the effect shows.
I think the back-of-the envelope answer would be that for a two-stage model with a prediction of p_i for the probability of having a non-zero response (or in the case of zero-inflated models, the probability of _not_ having a structural zero) and a prediction of n_i for the conditional part of the model, the mean predicted value is p_i*n_i and the variance is _approximately_ (p_i*n_i)^2*(var(p_i)/p_i^2 +
var(n_i)/n_i^2)
(this is assuming that you haven't built in any correlation between p_i and n_i, which would be hard in lme4 but _might_ be possible under certain
circumstances
via a multitype model in MCMCglmm). Does that help?
_______________________________________________ R-sig-mixed-models at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models
[[alternative HTML version deleted]]
_______________________________________________ R-sig-mixed-models at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models
-- The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.
Hi,
two more questions, sorry.
6.
could it be that where you wrote
normal.evd<-function(x, mu, v){
exp(-exp(x))*dnorm(x, mu, sqrt(v))
}
you're not actually passing the variance but the standard deviation? so the
functions should be
normal.evd<-function(x, mu, sd){
exp(-exp(x))*dnorm(x, mu, sd)
}
normal.zt<-function(x, mu, sd){
exp(x)/(1-exp(-exp(x)))*dnorm(x, mu, sd)
}
Because at least in the example, you were taking the square root here
already: sd_1<-mean(sqrt(m1$VCV[,1]))
Maybe this was a copy-paste from the source where you do it differently? Or
I misunderstood...
An addition to question 5:
5. When I use the correct inverse link function in my old code, but don't
do double integration, I do get a "reasonable" amount of uncertainty around
the reference factor. The predicted means are too low, which I put down to
this approach ignoring the VCV. I was just wondering how this could happen
(actually seeing the "reasonable" amount of uncertainty was what made me
hopeful that my approach might not be entirely wrong).
Best,
Ruben
On Tue, Mar 17, 2015 at 7:53 PM Ruben Arslan <rubenarslan at gmail.com> wrote:
Hi Jarrod, thanks for the extensive reply! This helps a lot, though it sounds like I was hubristic to attempt this myself. I tried using the approach you mapped out in the function gist <https://gist.github.com/rubenarslan/aeacdd306b3d061819a6> I posted. I simply put the pred function in a loop, so that I wouldn't make any mistakes while vectorising and since I don't care about performance at this point. Of course, I have some follow up questions though.. I'm sorry if I'm being a pain, I really appreciate the free advice, but understand of course if other things take precedence. 1. You're not "down with" developing publicly are you? Because I sure would like to test-drive the newdata prediction and simulation functions.. 2. Could you make sure that I got this right: "When predictions are to be taken after marginalising the random effects (including the `residual' over-dispersion) it is not possible to obtain closed form expressions." That is basically my scenario, right? In the example I included, I also had a group-level random effect (family). Ore are you talking about the "trait" as the random effect (as in your example) and my scenario is different and I cannot apply the numerical double integration procedure you posted? To be clear about my prediction goal without using language that I might be using incorrectly: I want to show what the average effect in the response unit, number of children, is in my population(s). I have data on whole populations and am using all of it (except individuals that don't have completed fertility yet, because I have yet to find a way to model both zero-inflation and right censoring). 3. "Numerical integration could be extended to double integration in which case covariance between the Poisson part and the binary part could be handled." That is what you posted an example of and it applies to my scenario, because I specified a prior R=list(V=diag(2), nu=1.002, fix=2) and rcov=~idh(trait):units, random=~idh(trait):idParents? But this double integration approach is something you just wrote off-the-cuff and I probably shouldn't use it in a publication? Or is this in the forthcoming MCMCglmm release and I might actually be able to refer to it once I get to submitting? 4. Could I change my model specification to forbid covariance between the two parts and not shoot myself in the foot? Would this allow for a more valid/tested approach to prediction? 5. When I use your method on my real data, I get less variation around the prediction "for the reference level" than for all other factor levels. My reference level actually has fewer cases than the others, so this isn't "right" in a way. Is this because I'm not doing newdata prediction? I get the "right" looking uncertainty if I bootstrap newdata predictions in lme4, Sorry if this is children's logic :-) Here's an image of the prediction <http://rpubs.com/rubenarslan/mcmcglmm_pred> and the raw data <http://rpubs.com/rubenarslan/raw>. Many thanks for any answers that you feel inclined to give. Best wishes, Ruben On Tue, Mar 17, 2015 at 5:31 PM Jarrod Hadfield <j.hadfield at ed.ac.uk> wrote:
Hi,
Sorry - I should have replied to this post earlier. I've been working
on predict/simulate methods for all MCMcglmm models (including
zero-inflated/altered/hurdle/truncated models) and these are now
largely complete (together with newdata options).
When predictions are to be taken after marginalising the random
effects (including the `residual' over-dispersion) it is not possible
to obtain closed form expressions. The options that will be available
in MCMCglmm are:
1) algebraic approximation
2) numerical integration
3) simulation
1) and 2) are currently only accurate when the random/residual effect
structure implies no covariance between the Poisson part and the
binary part.
1) is reasonably accurate for zero-inflated distributions, but can be
pretty poor for the remainder because they all involve zero-truncated
Poisson log-normal distributions and my taylor approximation for the
mean is less than ideal (any suggestions would be helpful).
2) could be extended to double integration in which case covariance
between the Poisson part and the binary part could be handled.
In your code, part of the problem is that you have fitted a zapoisson,
but the prediction is based on a zipoisson (with complementary log-log
link rather than logt link).
In all zero-inflated/altered/hurdle/truncated models
E[y] = E[(1-prob)*meanc]
where prob is the probabilty of a zero in the binary part and meanc is
the mean of a Poisson distribution (zipoisson) or a zero-truncated
poisson (zapoisson and hupoisson). If we have eta_1 as the linear
predictor for the poisson part and eta_2 as the linear predictor for
the binary part:
In zipoisson: prob = plogis(eta_2) and meanc = exp(eta_1)
In zapoisson: prob = exp(-exp(eta_2)) and meanc =
exp(eta_1)/(1-exp(-exp(eta_1)))
In hupoisson: prob = plogis(eta_2) and meanc =
exp(eta_1)/(1-exp(-exp(eta_1)))
In ztpoisson: prob = 0 and meanc =
exp(eta_1)/(1-exp(-exp(eta_1)))
In each case the linear predictor has a `fixed' part and a `random'
part which I'll denote as `a' and `u' respectively. Ideally we would
want
E[(1-prob)*meanc] taken over u_1 & u_2
if prob and meanc are independent this is a bit easier as
E[y] = E[1-prob]E[meanc]
and the two expectations ony need to taken with repsect to their
respective random effects. If we have sd_1 and sd_2 as the standard
deviations of the two sets of random effects then for the zapoisson:
normal.evd<-function(x, mu, v){
exp(-exp(x))*dnorm(x, mu, sqrt(v))
}
normal.zt<-function(x, mu, v){
exp(x)/(1-exp(-exp(x)))*dnorm(x, mu, sqrt(v))
}
pred<-function(a_1, a_2, sd_1, sd_2){
prob<-1-integrate(normal.evd, qnorm(0.0001, a_2,sd_2),
qnorm(0.9999, a_2,sd_2), a_2,sd_2)[[1]]
meanc<-integrate(normal.zt, qnorm(0.0001, a_1,sd_1),
qnorm(0.9999, a_1,sd_1), a_1,sd_1)[[1]]
prob*meanc
}
# gives the expected value with reasonable accuracy. As an example:
x<-rnorm(300)
l1<-rnorm(300, 1/2+x, sqrt(1))
l2<-rnorm(300, 1-x, sqrt(1))
y<-rbinom(300, 1, 1-exp(-exp(l2)))
y[which(y==1)]<-qpois(runif(sum(y==1), dpois(0,
exp(l1[which(y==1)])), 1), exp(l1[which(y==1)]))
# cunning sampler from Peter Dalgaard (R-sig-mixed)
data=data.frame(y=y, x=x)
prior=list(R=list(V=diag(2), nu=0.002, fix=2))
m1<-MCMCglmm(y~trait+trait:x-1, rcov=~idh(trait):units, data=data,
family="zapoisson", prior=prior)
b_1<-colMeans(m1$Sol)[c(1,3)]
b_2<-colMeans(m1$Sol)[c(2,4)]
sd_1<-mean(sqrt(m1$VCV[,1]))
sd_2<-mean(sqrt(m1$VCV[,2]))
# note it is more accurate to take the posterior mean prediction
rather than the prediction from the posterior means as I've done here,
but for illustration:
x.pred<-seq(-3,3,length=100)
p<-1:100
for(i in 1:100){
p[i]<-pred(a_1 = b_1[1]+x.pred[i]*b_1[2], a_2 =
b_2[1]+x.pred[i]*b_2[2], sd_1=sd_1, sd_2=sd_2)
}
plot(y~x)
lines(p~x.pred)
Cheers,
Jarrod
Quoting Ruben Arslan <rubenarslan at gmail.com> on Tue, 17 Mar 2015
13:33:25 +0000:
Dear list, I've made a reproducible example of the zero-altered prediction, in the hope that someone will have a look and reassure me that I'm going about this the right way. I was a bit confused by the point about p_i and n_i being correlated
(they
are in my case), but I think this was a red herring for me since I'm not deriving the variance analytically. The script is here: https://gist.github.com/rubenarslan/
aeacdd306b3d061819a6
and if you don't want to run the simulation fit yourself, I've put an
RDS
file of the fit here: https://dl.dropboxusercontent.com/u/1078620/m1.rds Best regards, Ruben Arslan On Tue, Mar 10, 2015 at 1:22 PM Ruben Arslan <rubenarslan at gmail.com>
wrote:
Dear Dr Bolker, I'd thought about something like this, one point of asking was to see whether a) it's implemented already, because I'll probably make dumb mistakes
while
trying b) it's not implemented because it's a bad idea. Your response and the MCMCglmm course notes make me hope that it's c)
not
implemented because nobody did yet or d) it's so simple that everybody
does
it on-the-fly. So I tried my hand and would appreciate corrections. I am sure there is some screw-up or an inelegant approach in there. I included code for dealing with mcmc.lists because that's what I have
and
I'm not entirely sure how I deal with them is correct either. I started with a zero-altered model, because those fit fastest and according to the course notes have the least complex likelihood. Because I know not what I do, I'm not dealing with my random effects at all. I pasted a model summary below to show what I've applied the below function to. The function gives the following quantiles when applied
to 19
chains of that model.
5% 50% 95%
5.431684178 5.561211207 5.690655200
5% 50% 95%
5.003974382 5.178192327 5.348246558
Warm regards,
Ruben Arslan
HPDpredict_za = function(object, predictor) {
if(class(object) != "MCMCglmm") {
if(length( object[[1]]$Residual$nrt )>1) {
object = lapply(object,FUN=function(x) { x$Residual$nrt<-2;x })
}
Sol = mcmc.list(lapply(object,FUN=function(x) { x$Sol}))
vars = colnames(Sol[[1]])
} else {
Sol = as.data.frame(object$Sol)
vars = names(Sol)
}
za_predictor = vars[ vars %ends_with% predictor & vars %begins_with%
"traitza_"]
za_intercept_name = vars[ ! vars %contains% ":" & vars %begins_with%
"traitza_"]
intercept = Sol[,"(Intercept)"]
za_intercept = Sol[, za_intercept_name]
l1 = Sol[, predictor ]
l2 = Sol[, za_predictor ]
if(is.list(object)) {
intercept = unlist(intercept)
za_intercept = unlist(za_intercept)
l1 = unlist(l1)
l2 = unlist(l2)
}
py_0 = dpois(0, exp(intercept + za_intercept))
y_ygt0 = exp(intercept)
at_intercept = (1-py_0) * y_ygt0
py_0 = dpois(0, exp(intercept + za_intercept + l2))
y_ygt0 = exp(intercept + l1)
at_predictor_1 = (1-py_0) * y_ygt0
print(qplot(at_intercept))
print(qplot(at_predictor_1))
df = data.frame("intercept" = at_intercept)
df[, predictor] = at_predictor_1
print(qplot(x=variable,
y=value,data=suppressMessages(melt(df)),fill=variable,alpha=I(0.40),
geom =
'violin')) print(quantile(at_intercept, probs = c(0.05,0.5,0.95))) print(quantile(at_predictor_1, probs = c(0.05,0.5,0.95))) invisible(df) }
summary(object[[1]])
Iterations = 100001:299901
Thinning interval = 100
Sample size = 2000
DIC: 349094
G-structure: ~idh(trait):idParents
post.mean l-95% CI u-95% CI eff.samp
children.idParents 0.0189 0.0164 0.0214 1729
za_children.idParents 0.2392 0.2171 0.2622 1647
R-structure: ~idh(trait):units
post.mean l-95% CI u-95% CI eff.samp
children.units 0.144 0.139 0.148 1715
za_children.units 1.000 1.000 1.000 0
Location effects: children ~ trait * (maternalage.factor +
paternalloss +
maternalloss + center(nr.siblings) + birth.cohort + urban + male +
paternalage.mean + paternalage.diff)
post.mean l-95% CI u-95%
CI
eff.samp pMCMC (Intercept) 2.088717 2.073009
2.103357
2000 <0.0005 *** traitza_children -1.933491 -1.981945
-1.887863
2000 <0.0005 *** maternalage.factor(14,20] 0.007709 -0.014238
0.027883
1500 0.460 maternalage.factor(35,50] 0.006350 -0.009634
0.024107
2000 0.462 paternallossTRUE 0.000797 -0.022716
0.025015
2000 0.925 maternallossTRUE -0.015542 -0.040240
0.009549
2000 0.226 center(nr.siblings) 0.005869 0.004302
0.007510
2000 <0.0005 *** birth.cohort(1703,1722] -0.045487 -0.062240
-0.028965
2000 <0.0005 *** birth.cohort(1722,1734] -0.055872 -0.072856
-0.036452
2000 <0.0005 *** birth.cohort(1734,1743] -0.039770 -0.056580
-0.020907
2000 <0.0005 *** birth.cohort(1743,1750] -0.030713 -0.048301
-0.012214
2000 0.002 ** urban -0.076748 -0.093240
-0.063002
2567 <0.0005 *** male 0.106074 0.095705
0.115742
2000 <0.0005 *** paternalage.mean -0.024119 -0.033133
-0.014444
2000 <0.0005 *** paternalage.diff -0.018367 -0.032083
-0.005721
2000 0.007 ** traitza_children:maternalage.factor(14,20] -0.116510 -0.182432
-0.051978
1876 0.001 *** traitza_children:maternalage.factor(35,50] -0.045196 -0.094485
0.002640
2000 0.075 . traitza_children:paternallossTRUE -0.171957 -0.238218
-0.104820
2000 <0.0005 *** traitza_children:maternallossTRUE -0.499539 -0.566825
-0.430637
2000 <0.0005 *** traitza_children:center(nr.siblings) -0.023723 -0.028676
-0.018746
1848 <0.0005 *** traitza_children:birth.cohort(1703,1722] -0.026012 -0.074250
0.026024
2000 0.319 traitza_children:birth.cohort(1722,1734] -0.279418 -0.329462
-0.227187
2000 <0.0005 *** traitza_children:birth.cohort(1734,1743] -0.260165 -0.312659
-0.204462
2130 <0.0005 *** traitza_children:birth.cohort(1743,1750] -0.481457 -0.534568
-0.426648
2000 <0.0005 *** traitza_children:urban -0.604108 -0.645169
-0.562554
1702 <0.0005 *** traitza_children:male -0.414988 -0.444589
-0.387005
2000 <0.0005 *** traitza_children:paternalage.mean 0.006545 -0.018570
0.036227
2000 0.651 traitza_children:paternalage.diff -0.097982 -0.136302
-0.060677
2000 <0.0005 *** On Mon, Mar 9, 2015 at 10:12 PM Ben Bolker <bbolker at gmail.com> wrote:
Ruben Arslan <rubenarslan at ...> writes:
Dear list, I wanted to ask: Is there any (maybe just back of the envelope) way
to
obtain a response prediction for zero-inflated or hurdle type
models?
I've fit such models in MCMCglmm, but I don't work in ecology and my previous experience with explaining such models to "my audience"
did not
bode well. When it comes to humans, the researchers I presented to
are
not
used to offspring count being zero-inflated (or acquainted with that concept), but in my historical data with high infant mortality, it
is
(in
modern data it's actually slightly underdispersed). Currently I'm using lme4 and simply splitting my models into two
stages
(finding a mate and having offspring). That's okay too, but in one population the effect of interest is not clearly visible in either stage, only when both are taken together
(but
then the outcome is zero-inflated). I expect to be given a hard time for this and hence thought I'd use
a
binomial model with the outcome offspring>0 as my main model, but
that
turns out to be hard to explain too and doesn't really do the data justice. Basically I don't want to be forced to discuss my smallest
population
as a
non-replication of the effect because I was insufficiently able to
explain
the statistics behind my reasoning that the effect shows.
I think the back-of-the envelope answer would be that for a
two-stage
model with a prediction of p_i for the probability of having a
non-zero
response (or in the case of zero-inflated models, the probability of _not_ having a structural zero) and a prediction of n_i for the conditional part of the model, the mean predicted value is p_i*n_i and the variance is _approximately_ (p_i*n_i)^2*(var(p_i)/p_i^2 +
var(n_i)/n_i^2)
(this is assuming that you haven't built in any correlation between p_i and n_i, which would be hard in lme4 but _might_ be possible under certain
circumstances
via a multitype model in MCMCglmm). Does that help?
_______________________________________________ R-sig-mixed-models at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models
[[alternative HTML version deleted]]
_______________________________________________ R-sig-mixed-models at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models
-- The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.
Hi,
1/ I have no problem with it - I'm just lazy about it.
2/ This is the case for all GLMM. In your case you have two sources of
variation to average over: family effects and units
(observation-level) effects. However, they are independent of each
other and so you can replace (for example) u_1 with family_1+units_1
and v_1 with v_family+v_units.
3/ In your model the zero-alteration and the Poisson process are
treated as independent as you use an idh structure. Here, the
covariance is set to zero. You could use an us structure which would
then allow a non-zero covariance between family effects (the
covariance is not identifiable at the units level). I did not post a
double integration example, the integration was of the form
int_{u_{1}} 1-prob d_{u_{1}} * int_{u_{2}} meanc d_{u_{2}}
Double integration would involve
int_{u_{1}}int_{u_{2}} (1-prob)*meanc d_{u_{2}}d_{u_{1}}
For example with have a = [a_{1}, a_{2}]' and V the covariance matrix
of random effects.
library(mvtnorm)
library(cubature)
normal.zap<-function(x, mu, V){
(1-exp(-exp(x[2])))*dmvnorm(x, mu, V)*exp(x[1])/(1-exp(-exp(x[1])))
}
pred2<-function(a,V){
low1<-qnorm(0.0001, a[1],sqrt(V[1,1]))
upp1<-qnorm(0.9999, a[1],sqrt(V[1,1]))
low2<-qnorm(0.0001, a[2],sqrt(V[2,2]))
upp2<-qnorm(0.9999, a[2],sqrt(V[2,2]))
adaptIntegrate(normal.zap, c(low1,low2),c(upp1,upp2), mu=a, V=V)$integral
}
a<-runif(2)
V<-diag(2)
pred2(a,V)
should give the same answer as pred when V is diagonal (after fixing
my mistakes in the last post). The forthcoming version of MCMCglmm
just has the single integration method, but I guess I could also put
this into the next version too.
4/ Currently you have zero covariances, but if you think they exist I
would model them. Remember, that a nice feature of zap models is that
the standard Poisson model is a special case. In terms of random
effects, this occurs when the correlation between family effects for
the two processes is 1 (i.e. fit ~family, rather than
idh(trait):family).
5/ I'm not sure I understand the question. The uncertainty here is
arising because of uncertainty in the parameters of the model? If so,
possibilities are that i) there is more information for the reference
level because other levels are more confounded with other predictors
in the model ii) the reference level is associated with larger
outcomes and so the same change on the link scale generates larger
differences on the data scale.
Cheers,
Jarrod
Quoting Ruben Arslan <rubenarslan at gmail.com> on Tue, 17 Mar 2015
18:53:33 +0000:
Hi Jarrod, thanks for the extensive reply! This helps a lot, though it sounds like I was hubristic to attempt this myself. I tried using the approach you mapped out in the function gist <https://gist.github.com/rubenarslan/aeacdd306b3d061819a6> I posted. I simply put the pred function in a loop, so that I wouldn't make any mistakes while vectorising and since I don't care about performance at this point. Of course, I have some follow up questions though.. I'm sorry if I'm being a pain, I really appreciate the free advice, but understand of course if other things take precedence. 1. You're not "down with" developing publicly are you? Because I sure would like to test-drive the newdata prediction and simulation functions.. 2. Could you make sure that I got this right: "When predictions are to be taken after marginalising the random effects (including the `residual' over-dispersion) it is not possible to obtain closed form expressions." That is basically my scenario, right? In the example I included, I also had a group-level random effect (family). Ore are you talking about the "trait" as the random effect (as in your example) and my scenario is different and I cannot apply the numerical double integration procedure you posted? To be clear about my prediction goal without using language that I might be using incorrectly: I want to show what the average effect in the response unit, number of children, is in my population(s). I have data on whole populations and am using all of it (except individuals that don't have completed fertility yet, because I have yet to find a way to model both zero-inflation and right censoring). 3. "Numerical integration could be extended to double integration in which case covariance between the Poisson part and the binary part could be handled." That is what you posted an example of and it applies to my scenario, because I specified a prior R=list(V=diag(2), nu=1.002, fix=2) and rcov=~idh(trait):units, random=~idh(trait):idParents? But this double integration approach is something you just wrote off-the-cuff and I probably shouldn't use it in a publication? Or is this in the forthcoming MCMCglmm release and I might actually be able to refer to it once I get to submitting? 4. Could I change my model specification to forbid covariance between the two parts and not shoot myself in the foot? Would this allow for a more valid/tested approach to prediction? 5. When I use your method on my real data, I get less variation around the prediction "for the reference level" than for all other factor levels. My reference level actually has fewer cases than the others, so this isn't "right" in a way. Is this because I'm not doing newdata prediction? I get the "right" looking uncertainty if I bootstrap newdata predictions in lme4, Sorry if this is children's logic :-) Here's an image of the prediction <http://rpubs.com/rubenarslan/mcmcglmm_pred> and the raw data <http://rpubs.com/rubenarslan/raw>. Many thanks for any answers that you feel inclined to give. Best wishes, Ruben On Tue, Mar 17, 2015 at 5:31 PM Jarrod Hadfield <j.hadfield at ed.ac.uk> wrote:
Hi,
Sorry - I should have replied to this post earlier. I've been working
on predict/simulate methods for all MCMcglmm models (including
zero-inflated/altered/hurdle/truncated models) and these are now
largely complete (together with newdata options).
When predictions are to be taken after marginalising the random
effects (including the `residual' over-dispersion) it is not possible
to obtain closed form expressions. The options that will be available
in MCMCglmm are:
1) algebraic approximation
2) numerical integration
3) simulation
1) and 2) are currently only accurate when the random/residual effect
structure implies no covariance between the Poisson part and the
binary part.
1) is reasonably accurate for zero-inflated distributions, but can be
pretty poor for the remainder because they all involve zero-truncated
Poisson log-normal distributions and my taylor approximation for the
mean is less than ideal (any suggestions would be helpful).
2) could be extended to double integration in which case covariance
between the Poisson part and the binary part could be handled.
In your code, part of the problem is that you have fitted a zapoisson,
but the prediction is based on a zipoisson (with complementary log-log
link rather than logt link).
In all zero-inflated/altered/hurdle/truncated models
E[y] = E[(1-prob)*meanc]
where prob is the probabilty of a zero in the binary part and meanc is
the mean of a Poisson distribution (zipoisson) or a zero-truncated
poisson (zapoisson and hupoisson). If we have eta_1 as the linear
predictor for the poisson part and eta_2 as the linear predictor for
the binary part:
In zipoisson: prob = plogis(eta_2) and meanc = exp(eta_1)
In zapoisson: prob = exp(-exp(eta_2)) and meanc =
exp(eta_1)/(1-exp(-exp(eta_1)))
In hupoisson: prob = plogis(eta_2) and meanc =
exp(eta_1)/(1-exp(-exp(eta_1)))
In ztpoisson: prob = 0 and meanc =
exp(eta_1)/(1-exp(-exp(eta_1)))
In each case the linear predictor has a `fixed' part and a `random'
part which I'll denote as `a' and `u' respectively. Ideally we would
want
E[(1-prob)*meanc] taken over u_1 & u_2
if prob and meanc are independent this is a bit easier as
E[y] = E[1-prob]E[meanc]
and the two expectations ony need to taken with repsect to their
respective random effects. If we have sd_1 and sd_2 as the standard
deviations of the two sets of random effects then for the zapoisson:
normal.evd<-function(x, mu, v){
exp(-exp(x))*dnorm(x, mu, sqrt(v))
}
normal.zt<-function(x, mu, v){
exp(x)/(1-exp(-exp(x)))*dnorm(x, mu, sqrt(v))
}
pred<-function(a_1, a_2, sd_1, sd_2){
prob<-1-integrate(normal.evd, qnorm(0.0001, a_2,sd_2),
qnorm(0.9999, a_2,sd_2), a_2,sd_2)[[1]]
meanc<-integrate(normal.zt, qnorm(0.0001, a_1,sd_1),
qnorm(0.9999, a_1,sd_1), a_1,sd_1)[[1]]
prob*meanc
}
# gives the expected value with reasonable accuracy. As an example:
x<-rnorm(300)
l1<-rnorm(300, 1/2+x, sqrt(1))
l2<-rnorm(300, 1-x, sqrt(1))
y<-rbinom(300, 1, 1-exp(-exp(l2)))
y[which(y==1)]<-qpois(runif(sum(y==1), dpois(0,
exp(l1[which(y==1)])), 1), exp(l1[which(y==1)]))
# cunning sampler from Peter Dalgaard (R-sig-mixed)
data=data.frame(y=y, x=x)
prior=list(R=list(V=diag(2), nu=0.002, fix=2))
m1<-MCMCglmm(y~trait+trait:x-1, rcov=~idh(trait):units, data=data,
family="zapoisson", prior=prior)
b_1<-colMeans(m1$Sol)[c(1,3)]
b_2<-colMeans(m1$Sol)[c(2,4)]
sd_1<-mean(sqrt(m1$VCV[,1]))
sd_2<-mean(sqrt(m1$VCV[,2]))
# note it is more accurate to take the posterior mean prediction
rather than the prediction from the posterior means as I've done here,
but for illustration:
x.pred<-seq(-3,3,length=100)
p<-1:100
for(i in 1:100){
p[i]<-pred(a_1 = b_1[1]+x.pred[i]*b_1[2], a_2 =
b_2[1]+x.pred[i]*b_2[2], sd_1=sd_1, sd_2=sd_2)
}
plot(y~x)
lines(p~x.pred)
Cheers,
Jarrod
Quoting Ruben Arslan <rubenarslan at gmail.com> on Tue, 17 Mar 2015
13:33:25 +0000:
Dear list, I've made a reproducible example of the zero-altered prediction, in the hope that someone will have a look and reassure me that I'm going about this the right way. I was a bit confused by the point about p_i and n_i being correlated
(they
are in my case), but I think this was a red herring for me since I'm not deriving the variance analytically. The script is here: https://gist.github.com/rubenarslan/
aeacdd306b3d061819a6
and if you don't want to run the simulation fit yourself, I've put an RDS file of the fit here: https://dl.dropboxusercontent.com/u/1078620/m1.rds Best regards, Ruben Arslan On Tue, Mar 10, 2015 at 1:22 PM Ruben Arslan <rubenarslan at gmail.com>
wrote:
Dear Dr Bolker, I'd thought about something like this, one point of asking was to see whether a) it's implemented already, because I'll probably make dumb mistakes
while
trying b) it's not implemented because it's a bad idea. Your response and the MCMCglmm course notes make me hope that it's c)
not
implemented because nobody did yet or d) it's so simple that everybody
does
it on-the-fly. So I tried my hand and would appreciate corrections. I am sure there is some screw-up or an inelegant approach in there. I included code for dealing with mcmc.lists because that's what I have
and
I'm not entirely sure how I deal with them is correct either. I started with a zero-altered model, because those fit fastest and according to the course notes have the least complex likelihood. Because I know not what I do, I'm not dealing with my random effects at all. I pasted a model summary below to show what I've applied the below function to. The function gives the following quantiles when applied to
19
chains of that model.
5% 50% 95%
5.431684178 5.561211207 5.690655200
5% 50% 95%
5.003974382 5.178192327 5.348246558
Warm regards,
Ruben Arslan
HPDpredict_za = function(object, predictor) {
if(class(object) != "MCMCglmm") {
if(length( object[[1]]$Residual$nrt )>1) {
object = lapply(object,FUN=function(x) { x$Residual$nrt<-2;x })
}
Sol = mcmc.list(lapply(object,FUN=function(x) { x$Sol}))
vars = colnames(Sol[[1]])
} else {
Sol = as.data.frame(object$Sol)
vars = names(Sol)
}
za_predictor = vars[ vars %ends_with% predictor & vars %begins_with%
"traitza_"]
za_intercept_name = vars[ ! vars %contains% ":" & vars %begins_with%
"traitza_"]
intercept = Sol[,"(Intercept)"]
za_intercept = Sol[, za_intercept_name]
l1 = Sol[, predictor ]
l2 = Sol[, za_predictor ]
if(is.list(object)) {
intercept = unlist(intercept)
za_intercept = unlist(za_intercept)
l1 = unlist(l1)
l2 = unlist(l2)
}
py_0 = dpois(0, exp(intercept + za_intercept))
y_ygt0 = exp(intercept)
at_intercept = (1-py_0) * y_ygt0
py_0 = dpois(0, exp(intercept + za_intercept + l2))
y_ygt0 = exp(intercept + l1)
at_predictor_1 = (1-py_0) * y_ygt0
print(qplot(at_intercept))
print(qplot(at_predictor_1))
df = data.frame("intercept" = at_intercept)
df[, predictor] = at_predictor_1
print(qplot(x=variable,
y=value,data=suppressMessages(melt(df)),fill=variable,alpha=I(0.40),
geom =
'violin')) print(quantile(at_intercept, probs = c(0.05,0.5,0.95))) print(quantile(at_predictor_1, probs = c(0.05,0.5,0.95))) invisible(df) }
summary(object[[1]])
Iterations = 100001:299901
Thinning interval = 100
Sample size = 2000
DIC: 349094
G-structure: ~idh(trait):idParents
post.mean l-95% CI u-95% CI eff.samp
children.idParents 0.0189 0.0164 0.0214 1729
za_children.idParents 0.2392 0.2171 0.2622 1647
R-structure: ~idh(trait):units
post.mean l-95% CI u-95% CI eff.samp
children.units 0.144 0.139 0.148 1715
za_children.units 1.000 1.000 1.000 0
Location effects: children ~ trait * (maternalage.factor +
paternalloss +
maternalloss + center(nr.siblings) + birth.cohort + urban + male +
paternalage.mean + paternalage.diff)
post.mean l-95% CI u-95% CI
eff.samp pMCMC
(Intercept) 2.088717 2.073009 2.103357
2000 <0.0005 ***
traitza_children -1.933491 -1.981945 -1.887863
2000 <0.0005 ***
maternalage.factor(14,20] 0.007709 -0.014238 0.027883
1500 0.460
maternalage.factor(35,50] 0.006350 -0.009634 0.024107
2000 0.462
paternallossTRUE 0.000797 -0.022716 0.025015
2000 0.925
maternallossTRUE -0.015542 -0.040240 0.009549
2000 0.226
center(nr.siblings) 0.005869 0.004302 0.007510
2000 <0.0005 ***
birth.cohort(1703,1722] -0.045487 -0.062240 -0.028965
2000 <0.0005 ***
birth.cohort(1722,1734] -0.055872 -0.072856 -0.036452
2000 <0.0005 ***
birth.cohort(1734,1743] -0.039770 -0.056580 -0.020907
2000 <0.0005 ***
birth.cohort(1743,1750] -0.030713 -0.048301 -0.012214
2000 0.002 **
urban -0.076748 -0.093240 -0.063002
2567 <0.0005 ***
male 0.106074 0.095705 0.115742
2000 <0.0005 ***
paternalage.mean -0.024119 -0.033133 -0.014444
2000 <0.0005 ***
paternalage.diff -0.018367 -0.032083 -0.005721
2000 0.007 **
traitza_children:maternalage.factor(14,20] -0.116510 -0.182432
-0.051978
1876 0.001 *** traitza_children:maternalage.factor(35,50] -0.045196 -0.094485
0.002640
2000 0.075 . traitza_children:paternallossTRUE -0.171957 -0.238218
-0.104820
2000 <0.0005 *** traitza_children:maternallossTRUE -0.499539 -0.566825
-0.430637
2000 <0.0005 *** traitza_children:center(nr.siblings) -0.023723 -0.028676
-0.018746
1848 <0.0005 *** traitza_children:birth.cohort(1703,1722] -0.026012 -0.074250
0.026024
2000 0.319 traitza_children:birth.cohort(1722,1734] -0.279418 -0.329462
-0.227187
2000 <0.0005 *** traitza_children:birth.cohort(1734,1743] -0.260165 -0.312659
-0.204462
2130 <0.0005 *** traitza_children:birth.cohort(1743,1750] -0.481457 -0.534568
-0.426648
2000 <0.0005 *** traitza_children:urban -0.604108 -0.645169 -0.562554 1702 <0.0005 *** traitza_children:male -0.414988 -0.444589 -0.387005 2000 <0.0005 *** traitza_children:paternalage.mean 0.006545 -0.018570
0.036227
2000 0.651 traitza_children:paternalage.diff -0.097982 -0.136302
-0.060677
2000 <0.0005 *** On Mon, Mar 9, 2015 at 10:12 PM Ben Bolker <bbolker at gmail.com> wrote:
Ruben Arslan <rubenarslan at ...> writes:
Dear list, I wanted to ask: Is there any (maybe just back of the envelope) way
to
obtain a response prediction for zero-inflated or hurdle type models? I've fit such models in MCMCglmm, but I don't work in ecology and my previous experience with explaining such models to "my audience" did
not
bode well. When it comes to humans, the researchers I presented to
are
not
used to offspring count being zero-inflated (or acquainted with that concept), but in my historical data with high infant mortality, it is
(in
modern data it's actually slightly underdispersed). Currently I'm using lme4 and simply splitting my models into two
stages
(finding a mate and having offspring). That's okay too, but in one population the effect of interest is not clearly visible in either stage, only when both are taken together
(but
then the outcome is zero-inflated). I expect to be given a hard time for this and hence thought I'd use a binomial model with the outcome offspring>0 as my main model, but
that
turns out to be hard to explain too and doesn't really do the data justice. Basically I don't want to be forced to discuss my smallest population
as a
non-replication of the effect because I was insufficiently able to
explain
the statistics behind my reasoning that the effect shows.
I think the back-of-the envelope answer would be that for a two-stage model with a prediction of p_i for the probability of having a non-zero response (or in the case of zero-inflated models, the probability of _not_ having a structural zero) and a prediction of n_i for the conditional part of the model, the mean predicted value is p_i*n_i and the variance is _approximately_ (p_i*n_i)^2*(var(p_i)/p_i^2 +
var(n_i)/n_i^2)
(this is assuming that you haven't built in any correlation between p_i and n_i, which would be hard in lme4 but _might_ be possible under certain
circumstances
via a multitype model in MCMCglmm). Does that help?
_______________________________________________ R-sig-mixed-models at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models
[[alternative HTML version deleted]]
_______________________________________________ R-sig-mixed-models at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models
-- The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.
The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.
Hi, 6/ you are correct - my mistake. 5b/ I'm still not sure what you mean by the uncertainty, but ignoring the random effects will result in under prediction. For a Poisson log-normal the mean is exp(a_1+v_1/2) rather than exp(a_1) for the standard Poisson. Cheers, Jarrod Quoting Ruben Arslan <rubenarslan at gmail.com> on Tue, 17 Mar 2015 19:21:06 +0000:
Hi,
two more questions, sorry.
6.
could it be that where you wrote
normal.evd<-function(x, mu, v){
exp(-exp(x))*dnorm(x, mu, sqrt(v))
}
you're not actually passing the variance but the standard deviation? so the
functions should be
normal.evd<-function(x, mu, sd){
exp(-exp(x))*dnorm(x, mu, sd)
}
normal.zt<-function(x, mu, sd){
exp(x)/(1-exp(-exp(x)))*dnorm(x, mu, sd)
}
Because at least in the example, you were taking the square root here
already: sd_1<-mean(sqrt(m1$VCV[,1]))
Maybe this was a copy-paste from the source where you do it differently? Or
I misunderstood...
An addition to question 5:
5. When I use the correct inverse link function in my old code, but don't
do double integration, I do get a "reasonable" amount of uncertainty around
the reference factor. The predicted means are too low, which I put down to
this approach ignoring the VCV. I was just wondering how this could happen
(actually seeing the "reasonable" amount of uncertainty was what made me
hopeful that my approach might not be entirely wrong).
Best,
Ruben
On Tue, Mar 17, 2015 at 7:53 PM Ruben Arslan <rubenarslan at gmail.com> wrote:
Hi Jarrod, thanks for the extensive reply! This helps a lot, though it sounds like I was hubristic to attempt this myself. I tried using the approach you mapped out in the function gist <https://gist.github.com/rubenarslan/aeacdd306b3d061819a6> I posted. I simply put the pred function in a loop, so that I wouldn't make any mistakes while vectorising and since I don't care about performance at this point. Of course, I have some follow up questions though.. I'm sorry if I'm being a pain, I really appreciate the free advice, but understand of course if other things take precedence. 1. You're not "down with" developing publicly are you? Because I sure would like to test-drive the newdata prediction and simulation functions.. 2. Could you make sure that I got this right: "When predictions are to be taken after marginalising the random effects (including the `residual' over-dispersion) it is not possible to obtain closed form expressions." That is basically my scenario, right? In the example I included, I also had a group-level random effect (family). Ore are you talking about the "trait" as the random effect (as in your example) and my scenario is different and I cannot apply the numerical double integration procedure you posted? To be clear about my prediction goal without using language that I might be using incorrectly: I want to show what the average effect in the response unit, number of children, is in my population(s). I have data on whole populations and am using all of it (except individuals that don't have completed fertility yet, because I have yet to find a way to model both zero-inflation and right censoring). 3. "Numerical integration could be extended to double integration in which case covariance between the Poisson part and the binary part could be handled." That is what you posted an example of and it applies to my scenario, because I specified a prior R=list(V=diag(2), nu=1.002, fix=2) and rcov=~idh(trait):units, random=~idh(trait):idParents? But this double integration approach is something you just wrote off-the-cuff and I probably shouldn't use it in a publication? Or is this in the forthcoming MCMCglmm release and I might actually be able to refer to it once I get to submitting? 4. Could I change my model specification to forbid covariance between the two parts and not shoot myself in the foot? Would this allow for a more valid/tested approach to prediction? 5. When I use your method on my real data, I get less variation around the prediction "for the reference level" than for all other factor levels. My reference level actually has fewer cases than the others, so this isn't "right" in a way. Is this because I'm not doing newdata prediction? I get the "right" looking uncertainty if I bootstrap newdata predictions in lme4, Sorry if this is children's logic :-) Here's an image of the prediction <http://rpubs.com/rubenarslan/mcmcglmm_pred> and the raw data <http://rpubs.com/rubenarslan/raw>. Many thanks for any answers that you feel inclined to give. Best wishes, Ruben On Tue, Mar 17, 2015 at 5:31 PM Jarrod Hadfield <j.hadfield at ed.ac.uk> wrote:
Hi,
Sorry - I should have replied to this post earlier. I've been working
on predict/simulate methods for all MCMcglmm models (including
zero-inflated/altered/hurdle/truncated models) and these are now
largely complete (together with newdata options).
When predictions are to be taken after marginalising the random
effects (including the `residual' over-dispersion) it is not possible
to obtain closed form expressions. The options that will be available
in MCMCglmm are:
1) algebraic approximation
2) numerical integration
3) simulation
1) and 2) are currently only accurate when the random/residual effect
structure implies no covariance between the Poisson part and the
binary part.
1) is reasonably accurate for zero-inflated distributions, but can be
pretty poor for the remainder because they all involve zero-truncated
Poisson log-normal distributions and my taylor approximation for the
mean is less than ideal (any suggestions would be helpful).
2) could be extended to double integration in which case covariance
between the Poisson part and the binary part could be handled.
In your code, part of the problem is that you have fitted a zapoisson,
but the prediction is based on a zipoisson (with complementary log-log
link rather than logt link).
In all zero-inflated/altered/hurdle/truncated models
E[y] = E[(1-prob)*meanc]
where prob is the probabilty of a zero in the binary part and meanc is
the mean of a Poisson distribution (zipoisson) or a zero-truncated
poisson (zapoisson and hupoisson). If we have eta_1 as the linear
predictor for the poisson part and eta_2 as the linear predictor for
the binary part:
In zipoisson: prob = plogis(eta_2) and meanc = exp(eta_1)
In zapoisson: prob = exp(-exp(eta_2)) and meanc =
exp(eta_1)/(1-exp(-exp(eta_1)))
In hupoisson: prob = plogis(eta_2) and meanc =
exp(eta_1)/(1-exp(-exp(eta_1)))
In ztpoisson: prob = 0 and meanc =
exp(eta_1)/(1-exp(-exp(eta_1)))
In each case the linear predictor has a `fixed' part and a `random'
part which I'll denote as `a' and `u' respectively. Ideally we would
want
E[(1-prob)*meanc] taken over u_1 & u_2
if prob and meanc are independent this is a bit easier as
E[y] = E[1-prob]E[meanc]
and the two expectations ony need to taken with repsect to their
respective random effects. If we have sd_1 and sd_2 as the standard
deviations of the two sets of random effects then for the zapoisson:
normal.evd<-function(x, mu, v){
exp(-exp(x))*dnorm(x, mu, sqrt(v))
}
normal.zt<-function(x, mu, v){
exp(x)/(1-exp(-exp(x)))*dnorm(x, mu, sqrt(v))
}
pred<-function(a_1, a_2, sd_1, sd_2){
prob<-1-integrate(normal.evd, qnorm(0.0001, a_2,sd_2),
qnorm(0.9999, a_2,sd_2), a_2,sd_2)[[1]]
meanc<-integrate(normal.zt, qnorm(0.0001, a_1,sd_1),
qnorm(0.9999, a_1,sd_1), a_1,sd_1)[[1]]
prob*meanc
}
# gives the expected value with reasonable accuracy. As an example:
x<-rnorm(300)
l1<-rnorm(300, 1/2+x, sqrt(1))
l2<-rnorm(300, 1-x, sqrt(1))
y<-rbinom(300, 1, 1-exp(-exp(l2)))
y[which(y==1)]<-qpois(runif(sum(y==1), dpois(0,
exp(l1[which(y==1)])), 1), exp(l1[which(y==1)]))
# cunning sampler from Peter Dalgaard (R-sig-mixed)
data=data.frame(y=y, x=x)
prior=list(R=list(V=diag(2), nu=0.002, fix=2))
m1<-MCMCglmm(y~trait+trait:x-1, rcov=~idh(trait):units, data=data,
family="zapoisson", prior=prior)
b_1<-colMeans(m1$Sol)[c(1,3)]
b_2<-colMeans(m1$Sol)[c(2,4)]
sd_1<-mean(sqrt(m1$VCV[,1]))
sd_2<-mean(sqrt(m1$VCV[,2]))
# note it is more accurate to take the posterior mean prediction
rather than the prediction from the posterior means as I've done here,
but for illustration:
x.pred<-seq(-3,3,length=100)
p<-1:100
for(i in 1:100){
p[i]<-pred(a_1 = b_1[1]+x.pred[i]*b_1[2], a_2 =
b_2[1]+x.pred[i]*b_2[2], sd_1=sd_1, sd_2=sd_2)
}
plot(y~x)
lines(p~x.pred)
Cheers,
Jarrod
Quoting Ruben Arslan <rubenarslan at gmail.com> on Tue, 17 Mar 2015
13:33:25 +0000:
Dear list, I've made a reproducible example of the zero-altered prediction, in the hope that someone will have a look and reassure me that I'm going about this the right way. I was a bit confused by the point about p_i and n_i being correlated
(they
are in my case), but I think this was a red herring for me since I'm not deriving the variance analytically. The script is here: https://gist.github.com/rubenarslan/
aeacdd306b3d061819a6
and if you don't want to run the simulation fit yourself, I've put an
RDS
file of the fit here: https://dl.dropboxusercontent.com/u/1078620/m1.rds Best regards, Ruben Arslan On Tue, Mar 10, 2015 at 1:22 PM Ruben Arslan <rubenarslan at gmail.com>
wrote:
Dear Dr Bolker, I'd thought about something like this, one point of asking was to see whether a) it's implemented already, because I'll probably make dumb mistakes
while
trying b) it's not implemented because it's a bad idea. Your response and the MCMCglmm course notes make me hope that it's c)
not
implemented because nobody did yet or d) it's so simple that everybody
does
it on-the-fly. So I tried my hand and would appreciate corrections. I am sure there is some screw-up or an inelegant approach in there. I included code for dealing with mcmc.lists because that's what I have
and
I'm not entirely sure how I deal with them is correct either. I started with a zero-altered model, because those fit fastest and according to the course notes have the least complex likelihood. Because I know not what I do, I'm not dealing with my random effects at all. I pasted a model summary below to show what I've applied the below function to. The function gives the following quantiles when applied
to 19
chains of that model.
5% 50% 95%
5.431684178 5.561211207 5.690655200
5% 50% 95%
5.003974382 5.178192327 5.348246558
Warm regards,
Ruben Arslan
HPDpredict_za = function(object, predictor) {
if(class(object) != "MCMCglmm") {
if(length( object[[1]]$Residual$nrt )>1) {
object = lapply(object,FUN=function(x) { x$Residual$nrt<-2;x })
}
Sol = mcmc.list(lapply(object,FUN=function(x) { x$Sol}))
vars = colnames(Sol[[1]])
} else {
Sol = as.data.frame(object$Sol)
vars = names(Sol)
}
za_predictor = vars[ vars %ends_with% predictor & vars %begins_with%
"traitza_"]
za_intercept_name = vars[ ! vars %contains% ":" & vars %begins_with%
"traitza_"]
intercept = Sol[,"(Intercept)"]
za_intercept = Sol[, za_intercept_name]
l1 = Sol[, predictor ]
l2 = Sol[, za_predictor ]
if(is.list(object)) {
intercept = unlist(intercept)
za_intercept = unlist(za_intercept)
l1 = unlist(l1)
l2 = unlist(l2)
}
py_0 = dpois(0, exp(intercept + za_intercept))
y_ygt0 = exp(intercept)
at_intercept = (1-py_0) * y_ygt0
py_0 = dpois(0, exp(intercept + za_intercept + l2))
y_ygt0 = exp(intercept + l1)
at_predictor_1 = (1-py_0) * y_ygt0
print(qplot(at_intercept))
print(qplot(at_predictor_1))
df = data.frame("intercept" = at_intercept)
df[, predictor] = at_predictor_1
print(qplot(x=variable,
y=value,data=suppressMessages(melt(df)),fill=variable,alpha=I(0.40),
geom =
'violin')) print(quantile(at_intercept, probs = c(0.05,0.5,0.95))) print(quantile(at_predictor_1, probs = c(0.05,0.5,0.95))) invisible(df) }
summary(object[[1]])
Iterations = 100001:299901
Thinning interval = 100
Sample size = 2000
DIC: 349094
G-structure: ~idh(trait):idParents
post.mean l-95% CI u-95% CI eff.samp
children.idParents 0.0189 0.0164 0.0214 1729
za_children.idParents 0.2392 0.2171 0.2622 1647
R-structure: ~idh(trait):units
post.mean l-95% CI u-95% CI eff.samp
children.units 0.144 0.139 0.148 1715
za_children.units 1.000 1.000 1.000 0
Location effects: children ~ trait * (maternalage.factor +
paternalloss +
maternalloss + center(nr.siblings) + birth.cohort + urban + male +
paternalage.mean + paternalage.diff)
post.mean l-95% CI u-95%
CI
eff.samp pMCMC (Intercept) 2.088717 2.073009
2.103357
2000 <0.0005 *** traitza_children -1.933491 -1.981945
-1.887863
2000 <0.0005 *** maternalage.factor(14,20] 0.007709 -0.014238
0.027883
1500 0.460 maternalage.factor(35,50] 0.006350 -0.009634
0.024107
2000 0.462 paternallossTRUE 0.000797 -0.022716
0.025015
2000 0.925 maternallossTRUE -0.015542 -0.040240
0.009549
2000 0.226 center(nr.siblings) 0.005869 0.004302
0.007510
2000 <0.0005 *** birth.cohort(1703,1722] -0.045487 -0.062240
-0.028965
2000 <0.0005 *** birth.cohort(1722,1734] -0.055872 -0.072856
-0.036452
2000 <0.0005 *** birth.cohort(1734,1743] -0.039770 -0.056580
-0.020907
2000 <0.0005 *** birth.cohort(1743,1750] -0.030713 -0.048301
-0.012214
2000 0.002 ** urban -0.076748 -0.093240
-0.063002
2567 <0.0005 *** male 0.106074 0.095705
0.115742
2000 <0.0005 *** paternalage.mean -0.024119 -0.033133
-0.014444
2000 <0.0005 *** paternalage.diff -0.018367 -0.032083
-0.005721
2000 0.007 ** traitza_children:maternalage.factor(14,20] -0.116510 -0.182432
-0.051978
1876 0.001 *** traitza_children:maternalage.factor(35,50] -0.045196 -0.094485
0.002640
2000 0.075 . traitza_children:paternallossTRUE -0.171957 -0.238218
-0.104820
2000 <0.0005 *** traitza_children:maternallossTRUE -0.499539 -0.566825
-0.430637
2000 <0.0005 *** traitza_children:center(nr.siblings) -0.023723 -0.028676
-0.018746
1848 <0.0005 *** traitza_children:birth.cohort(1703,1722] -0.026012 -0.074250
0.026024
2000 0.319 traitza_children:birth.cohort(1722,1734] -0.279418 -0.329462
-0.227187
2000 <0.0005 *** traitza_children:birth.cohort(1734,1743] -0.260165 -0.312659
-0.204462
2130 <0.0005 *** traitza_children:birth.cohort(1743,1750] -0.481457 -0.534568
-0.426648
2000 <0.0005 *** traitza_children:urban -0.604108 -0.645169
-0.562554
1702 <0.0005 *** traitza_children:male -0.414988 -0.444589
-0.387005
2000 <0.0005 *** traitza_children:paternalage.mean 0.006545 -0.018570
0.036227
2000 0.651 traitza_children:paternalage.diff -0.097982 -0.136302
-0.060677
2000 <0.0005 *** On Mon, Mar 9, 2015 at 10:12 PM Ben Bolker <bbolker at gmail.com> wrote:
Ruben Arslan <rubenarslan at ...> writes:
Dear list, I wanted to ask: Is there any (maybe just back of the envelope) way
to
obtain a response prediction for zero-inflated or hurdle type
models?
I've fit such models in MCMCglmm, but I don't work in ecology and my previous experience with explaining such models to "my audience"
did not
bode well. When it comes to humans, the researchers I presented to
are
not
used to offspring count being zero-inflated (or acquainted with that concept), but in my historical data with high infant mortality, it
is
(in
modern data it's actually slightly underdispersed). Currently I'm using lme4 and simply splitting my models into two
stages
(finding a mate and having offspring). That's okay too, but in one population the effect of interest is not clearly visible in either stage, only when both are taken together
(but
then the outcome is zero-inflated). I expect to be given a hard time for this and hence thought I'd use
a
binomial model with the outcome offspring>0 as my main model, but
that
turns out to be hard to explain too and doesn't really do the data justice. Basically I don't want to be forced to discuss my smallest
population
as a
non-replication of the effect because I was insufficiently able to
explain
the statistics behind my reasoning that the effect shows.
I think the back-of-the envelope answer would be that for a
two-stage
model with a prediction of p_i for the probability of having a
non-zero
response (or in the case of zero-inflated models, the probability of _not_ having a structural zero) and a prediction of n_i for the conditional part of the model, the mean predicted value is p_i*n_i and the variance is _approximately_ (p_i*n_i)^2*(var(p_i)/p_i^2 +
var(n_i)/n_i^2)
(this is assuming that you haven't built in any correlation between p_i and n_i, which would be hard in lme4 but _might_ be possible under certain
circumstances
via a multitype model in MCMCglmm). Does that help?
_______________________________________________ R-sig-mixed-models at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models
[[alternative HTML version deleted]]
_______________________________________________ R-sig-mixed-models at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models
-- The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.
The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.
Hi, thanks for your reply. This clears some things up. 1. Colour me interested then, if you ever find the time to put it on Github. Maybe after seeing the nonsense I've wrought in the following points you might find sending me a zip of the pre-release might save you some time as opposed to guessing where I went wrong? 2. I'm not sure I understand. You're saying I can add the VCV components for the family random effect to the respective unit components. But I would only want that if I was interested in predicting the effect for any family, which in my case is not the goal, since my population data includes all families and I want to know the effect for the average family in that population (to compare it to other populations). 3. Oh right, I'm sorry I think I got a little confused/sidetracked, these models have been with me for a while now and I learned about us()/idh() back then. 4. Okay, I will keep this in mind, but am for now focusing on getting the prediction for this simpler case right. 5. I made a reproducible report <http://rpubs.com/rubenarslan/repro_variab_mcmcglmm> to show the problem. I think the problem is not due to the changes I made in my prediction helper function but to some fundamental. It shows that for some toy data, the credibility intervals for predictions depend on the reference I choose for a factor predictor. This is less bad with the method I used initially simply using the inverse link. However, both methods _underpredict_ as far as I can tell. Surely I'm missing an important step, but I hope this illustrates what confuses me. Of course the reasons you map out for the discrepancy (variable confounding etc) are sound in theory, but I don't see these wide CIs using any other method (simple descriptives, lme4). Oh and I used a different data simulation, because the one you used actually induces zero-deflation as far as I can tell (a lot of zeroes, but lambda is very low too). Best regards and again thanks so much for taking the time, Ruben
On Wed, Mar 18, 2015 at 7:06 AM Jarrod Hadfield <j.hadfield at ed.ac.uk> wrote:
Hi, 6/ you are correct - my mistake. 5b/ I'm still not sure what you mean by the uncertainty, but ignoring the random effects will result in under prediction. For a Poisson log-normal the mean is exp(a_1+v_1/2) rather than exp(a_1) for the standard Poisson. Cheers, Jarrod Quoting Ruben Arslan <rubenarslan at gmail.com> on Tue, 17 Mar 2015 19:21:06 +0000:
Hi,
two more questions, sorry.
6.
could it be that where you wrote
normal.evd<-function(x, mu, v){
exp(-exp(x))*dnorm(x, mu, sqrt(v))
}
you're not actually passing the variance but the standard deviation? so
the
functions should be
normal.evd<-function(x, mu, sd){
exp(-exp(x))*dnorm(x, mu, sd)
}
normal.zt<-function(x, mu, sd){
exp(x)/(1-exp(-exp(x)))*dnorm(x, mu, sd)
}
Because at least in the example, you were taking the square root here
already: sd_1<-mean(sqrt(m1$VCV[,1]))
Maybe this was a copy-paste from the source where you do it differently?
Or
I misunderstood... An addition to question 5: 5. When I use the correct inverse link function in my old code, but don't do double integration, I do get a "reasonable" amount of uncertainty
around
the reference factor. The predicted means are too low, which I put down
to
this approach ignoring the VCV. I was just wondering how this could
happen
(actually seeing the "reasonable" amount of uncertainty was what made me hopeful that my approach might not be entirely wrong). Best, Ruben On Tue, Mar 17, 2015 at 7:53 PM Ruben Arslan <rubenarslan at gmail.com>
wrote:
Hi Jarrod, thanks for the extensive reply! This helps a lot, though it sounds like
I
was hubristic to attempt this myself. I tried using the approach you mapped out in the function gist <https://gist.github.com/rubenarslan/aeacdd306b3d061819a6> I posted. I simply put the pred function in a loop, so that I wouldn't make any mistakes while vectorising and since I don't care about performance at
this
point. Of course, I have some follow up questions though.. I'm sorry if I'm
being
a pain, I really appreciate the free advice, but understand of course if other things take precedence. 1. You're not "down with" developing publicly are you? Because I sure would like to test-drive the newdata prediction and simulation
functions..
2. Could you make sure that I got this right: "When predictions are to
be
taken after marginalising the random effects (including the `residual' over-dispersion) it is not possible to obtain closed form expressions." That is basically my scenario, right? In the example I included, I also had a group-level random effect (family). Ore are you talking about the "trait" as the random effect (as in your example) and my scenario is different and I cannot apply the numerical double integration procedure
you
posted? To be clear about my prediction goal without using language that I might be using incorrectly: I want to show what the average effect in the response unit, number of children, is in my population(s). I have data
on
whole populations and am using all of it (except individuals that don't have completed fertility yet, because I have yet to find a way to model both zero-inflation and right censoring). 3. "Numerical integration could be extended to double integration in which case covariance between the Poisson part and the binary part could be handled." That is what you posted an example of and it applies to my scenario, because I specified a prior R=list(V=diag(2), nu=1.002, fix=2) and rcov=~idh(trait):units, random=~idh(trait):idParents? But this double integration approach is something you just wrote off-the-cuff and I probably shouldn't use it in a publication? Or is
this
in the forthcoming MCMCglmm release and I might actually be able to
refer
to it once I get to submitting? 4. Could I change my model specification to forbid covariance between
the
two parts and not shoot myself in the foot? Would this allow for a more valid/tested approach to prediction? 5. When I use your method on my real data, I get less variation around
the
prediction "for the reference level" than for all other factor levels. My reference level actually has fewer cases than the others, so this isn't "right" in a way. Is this because I'm not doing newdata prediction? I get the "right" looking uncertainty if I bootstrap newdata predictions in lme4, Sorry if this is children's logic :-) Here's an image of the prediction <http://rpubs.com/rubenarslan/mcmcglmm_pred> and the raw data <http://rpubs.com/rubenarslan/raw>. Many thanks for any answers that you feel inclined to give. Best wishes, Ruben On Tue, Mar 17, 2015 at 5:31 PM Jarrod Hadfield <j.hadfield at ed.ac.uk> wrote:
Hi,
Sorry - I should have replied to this post earlier. I've been working
on predict/simulate methods for all MCMcglmm models (including
zero-inflated/altered/hurdle/truncated models) and these are now
largely complete (together with newdata options).
When predictions are to be taken after marginalising the random
effects (including the `residual' over-dispersion) it is not possible
to obtain closed form expressions. The options that will be available
in MCMCglmm are:
1) algebraic approximation
2) numerical integration
3) simulation
1) and 2) are currently only accurate when the random/residual effect
structure implies no covariance between the Poisson part and the
binary part.
1) is reasonably accurate for zero-inflated distributions, but can be
pretty poor for the remainder because they all involve zero-truncated
Poisson log-normal distributions and my taylor approximation for the
mean is less than ideal (any suggestions would be helpful).
2) could be extended to double integration in which case covariance
between the Poisson part and the binary part could be handled.
In your code, part of the problem is that you have fitted a zapoisson,
but the prediction is based on a zipoisson (with complementary log-log
link rather than logt link).
In all zero-inflated/altered/hurdle/truncated models
E[y] = E[(1-prob)*meanc]
where prob is the probabilty of a zero in the binary part and meanc is
the mean of a Poisson distribution (zipoisson) or a zero-truncated
poisson (zapoisson and hupoisson). If we have eta_1 as the linear
predictor for the poisson part and eta_2 as the linear predictor for
the binary part:
In zipoisson: prob = plogis(eta_2) and meanc = exp(eta_1)
In zapoisson: prob = exp(-exp(eta_2)) and meanc =
exp(eta_1)/(1-exp(-exp(eta_1)))
In hupoisson: prob = plogis(eta_2) and meanc =
exp(eta_1)/(1-exp(-exp(eta_1)))
In ztpoisson: prob = 0 and meanc =
exp(eta_1)/(1-exp(-exp(eta_1)))
In each case the linear predictor has a `fixed' part and a `random'
part which I'll denote as `a' and `u' respectively. Ideally we would
want
E[(1-prob)*meanc] taken over u_1 & u_2
if prob and meanc are independent this is a bit easier as
E[y] = E[1-prob]E[meanc]
and the two expectations ony need to taken with repsect to their
respective random effects. If we have sd_1 and sd_2 as the standard
deviations of the two sets of random effects then for the zapoisson:
normal.evd<-function(x, mu, v){
exp(-exp(x))*dnorm(x, mu, sqrt(v))
}
normal.zt<-function(x, mu, v){
exp(x)/(1-exp(-exp(x)))*dnorm(x, mu, sqrt(v))
}
pred<-function(a_1, a_2, sd_1, sd_2){
prob<-1-integrate(normal.evd, qnorm(0.0001, a_2,sd_2),
qnorm(0.9999, a_2,sd_2), a_2,sd_2)[[1]]
meanc<-integrate(normal.zt, qnorm(0.0001, a_1,sd_1),
qnorm(0.9999, a_1,sd_1), a_1,sd_1)[[1]]
prob*meanc
}
# gives the expected value with reasonable accuracy. As an example:
x<-rnorm(300)
l1<-rnorm(300, 1/2+x, sqrt(1))
l2<-rnorm(300, 1-x, sqrt(1))
y<-rbinom(300, 1, 1-exp(-exp(l2)))
y[which(y==1)]<-qpois(runif(sum(y==1), dpois(0,
exp(l1[which(y==1)])), 1), exp(l1[which(y==1)]))
# cunning sampler from Peter Dalgaard (R-sig-mixed)
data=data.frame(y=y, x=x)
prior=list(R=list(V=diag(2), nu=0.002, fix=2))
m1<-MCMCglmm(y~trait+trait:x-1, rcov=~idh(trait):units, data=data,
family="zapoisson", prior=prior)
b_1<-colMeans(m1$Sol)[c(1,3)]
b_2<-colMeans(m1$Sol)[c(2,4)]
sd_1<-mean(sqrt(m1$VCV[,1]))
sd_2<-mean(sqrt(m1$VCV[,2]))
# note it is more accurate to take the posterior mean prediction
rather than the prediction from the posterior means as I've done here,
but for illustration:
x.pred<-seq(-3,3,length=100)
p<-1:100
for(i in 1:100){
p[i]<-pred(a_1 = b_1[1]+x.pred[i]*b_1[2], a_2 =
b_2[1]+x.pred[i]*b_2[2], sd_1=sd_1, sd_2=sd_2)
}
plot(y~x)
lines(p~x.pred)
Cheers,
Jarrod
Quoting Ruben Arslan <rubenarslan at gmail.com> on Tue, 17 Mar 2015
13:33:25 +0000:
Dear list, I've made a reproducible example of the zero-altered prediction, in the hope that someone will have a look and reassure me that I'm
going
about this the right way. I was a bit confused by the point about p_i and n_i being correlated
(they
are in my case), but I think this was a red herring for me since I'm not deriving the variance analytically. The script is here: https://gist.github.com/rubenarslan/
aeacdd306b3d061819a6
and if you don't want to run the simulation fit yourself, I've put an
RDS
file of the fit here: https://dl.dropboxusercontent.com/u/1078620/m1.rds Best regards, Ruben Arslan On Tue, Mar 10, 2015 at 1:22 PM Ruben Arslan <rubenarslan at gmail.com>
wrote:
Dear Dr Bolker, I'd thought about something like this, one point of asking was to see whether a) it's implemented already, because I'll probably make dumb
mistakes
while
trying b) it's not implemented because it's a bad idea. Your response and the MCMCglmm course notes make me hope that it's
c)
not
implemented because nobody did yet or d) it's so simple that
everybody
does
it on-the-fly. So I tried my hand and would appreciate corrections. I am sure
there is
some screw-up or an inelegant approach in there. I included code for dealing with mcmc.lists because that's what I
have
and
I'm not entirely sure how I deal with them is correct either. I started with a zero-altered model, because those fit fastest and according to the course notes have the least complex likelihood. Because I know not what I do, I'm not dealing with my random
effects at
all. I pasted a model summary below to show what I've applied the below function to. The function gives the following quantiles when applied
to 19
chains of that model.
5% 50% 95%
5.431684178 5.561211207 5.690655200
5% 50% 95%
5.003974382 5.178192327 5.348246558
Warm regards,
Ruben Arslan
HPDpredict_za = function(object, predictor) {
if(class(object) != "MCMCglmm") {
if(length( object[[1]]$Residual$nrt )>1) {
object = lapply(object,FUN=function(x) { x$Residual$nrt<-2;x })
}
Sol = mcmc.list(lapply(object,FUN=function(x) { x$Sol}))
vars = colnames(Sol[[1]])
} else {
Sol = as.data.frame(object$Sol)
vars = names(Sol)
}
za_predictor = vars[ vars %ends_with% predictor & vars %begins_with%
"traitza_"]
za_intercept_name = vars[ ! vars %contains% ":" & vars %begins_with%
"traitza_"]
intercept = Sol[,"(Intercept)"]
za_intercept = Sol[, za_intercept_name]
l1 = Sol[, predictor ]
l2 = Sol[, za_predictor ]
if(is.list(object)) {
intercept = unlist(intercept)
za_intercept = unlist(za_intercept)
l1 = unlist(l1)
l2 = unlist(l2)
}
py_0 = dpois(0, exp(intercept + za_intercept))
y_ygt0 = exp(intercept)
at_intercept = (1-py_0) * y_ygt0
py_0 = dpois(0, exp(intercept + za_intercept + l2))
y_ygt0 = exp(intercept + l1)
at_predictor_1 = (1-py_0) * y_ygt0
print(qplot(at_intercept))
print(qplot(at_predictor_1))
df = data.frame("intercept" = at_intercept)
df[, predictor] = at_predictor_1
print(qplot(x=variable,
y=value,data=suppressMessages(melt(df)),fill=variable,alpha=
I(0.40),
geom =
'violin')) print(quantile(at_intercept, probs = c(0.05,0.5,0.95))) print(quantile(at_predictor_1, probs = c(0.05,0.5,0.95))) invisible(df) }
summary(object[[1]])
Iterations = 100001:299901
Thinning interval = 100
Sample size = 2000
DIC: 349094
G-structure: ~idh(trait):idParents
post.mean l-95% CI u-95% CI eff.samp
children.idParents 0.0189 0.0164 0.0214 1729
za_children.idParents 0.2392 0.2171 0.2622 1647
R-structure: ~idh(trait):units
post.mean l-95% CI u-95% CI eff.samp
children.units 0.144 0.139 0.148 1715
za_children.units 1.000 1.000 1.000 0
Location effects: children ~ trait * (maternalage.factor +
paternalloss +
maternalloss + center(nr.siblings) + birth.cohort + urban + male +
paternalage.mean + paternalage.diff)
post.mean l-95% CI
u-95%
CI
eff.samp pMCMC (Intercept) 2.088717 2.073009
2.103357
2000 <0.0005 *** traitza_children -1.933491 -1.981945
-1.887863
2000 <0.0005 *** maternalage.factor(14,20] 0.007709 -0.014238
0.027883
1500 0.460 maternalage.factor(35,50] 0.006350 -0.009634
0.024107
2000 0.462 paternallossTRUE 0.000797 -0.022716
0.025015
2000 0.925 maternallossTRUE -0.015542 -0.040240
0.009549
2000 0.226 center(nr.siblings) 0.005869 0.004302
0.007510
2000 <0.0005 *** birth.cohort(1703,1722] -0.045487 -0.062240
-0.028965
2000 <0.0005 *** birth.cohort(1722,1734] -0.055872 -0.072856
-0.036452
2000 <0.0005 *** birth.cohort(1734,1743] -0.039770 -0.056580
-0.020907
2000 <0.0005 *** birth.cohort(1743,1750] -0.030713 -0.048301
-0.012214
2000 0.002 ** urban -0.076748 -0.093240
-0.063002
2567 <0.0005 *** male 0.106074 0.095705
0.115742
2000 <0.0005 *** paternalage.mean -0.024119 -0.033133
-0.014444
2000 <0.0005 *** paternalage.diff -0.018367 -0.032083
-0.005721
2000 0.007 ** traitza_children:maternalage.factor(14,20] -0.116510 -0.182432
-0.051978
1876 0.001 *** traitza_children:maternalage.factor(35,50] -0.045196 -0.094485
0.002640
2000 0.075 . traitza_children:paternallossTRUE -0.171957 -0.238218
-0.104820
2000 <0.0005 *** traitza_children:maternallossTRUE -0.499539 -0.566825
-0.430637
2000 <0.0005 *** traitza_children:center(nr.siblings) -0.023723 -0.028676
-0.018746
1848 <0.0005 *** traitza_children:birth.cohort(1703,1722] -0.026012 -0.074250
0.026024
2000 0.319 traitza_children:birth.cohort(1722,1734] -0.279418 -0.329462
-0.227187
2000 <0.0005 *** traitza_children:birth.cohort(1734,1743] -0.260165 -0.312659
-0.204462
2130 <0.0005 *** traitza_children:birth.cohort(1743,1750] -0.481457 -0.534568
-0.426648
2000 <0.0005 *** traitza_children:urban -0.604108 -0.645169
-0.562554
1702 <0.0005 *** traitza_children:male -0.414988 -0.444589
-0.387005
2000 <0.0005 *** traitza_children:paternalage.mean 0.006545 -0.018570
0.036227
2000 0.651 traitza_children:paternalage.diff -0.097982 -0.136302
-0.060677
2000 <0.0005 *** On Mon, Mar 9, 2015 at 10:12 PM Ben Bolker <bbolker at gmail.com>
wrote:
Ruben Arslan <rubenarslan at ...> writes:
Dear list, I wanted to ask: Is there any (maybe just back of the envelope)
way
to
obtain a response prediction for zero-inflated or hurdle type
models?
I've fit such models in MCMCglmm, but I don't work in ecology
and my
previous experience with explaining such models to "my audience"
did not
bode well. When it comes to humans, the researchers I presented
to
are
not
used to offspring count being zero-inflated (or acquainted with
that
concept), but in my historical data with high infant mortality,
it
is
(in
modern data it's actually slightly underdispersed). Currently I'm using lme4 and simply splitting my models into two
stages
(finding a mate and having offspring). That's okay too, but in one population the effect of interest is
not
clearly visible in either stage, only when both are taken
together
(but
then the outcome is zero-inflated). I expect to be given a hard time for this and hence thought I'd
use
a
binomial model with the outcome offspring>0 as my main model, but
that
turns out to be hard to explain too and doesn't really do the data justice. Basically I don't want to be forced to discuss my smallest
population
as a
non-replication of the effect because I was insufficiently able
to
explain
the statistics behind my reasoning that the effect shows.
I think the back-of-the envelope answer would be that for a
two-stage
model with a prediction of p_i for the probability of having a
non-zero
response (or in the case of zero-inflated models, the probability
of
_not_ having a structural zero) and a prediction of n_i for the conditional part of the model, the mean predicted value is p_i*n_i and the variance is _approximately_ (p_i*n_i)^2*(var(p_i)/p_i^2 +
var(n_i)/n_i^2)
(this is assuming that you haven't built in any correlation between p_i and n_i,
which
would be hard in lme4 but _might_ be possible under certain
circumstances
via a multitype model in MCMCglmm). Does that help?
_______________________________________________ R-sig-mixed-models at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models
[[alternative HTML version deleted]]
_______________________________________________ R-sig-mixed-models at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models
-- The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.
-- The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.
9 days later
Hi Ruben, Regarding 2/ you want to add the family variance if you want to predict what the average response will be across families. If you want to predict for a specific family (for example one where the random effect = 0) then you should not include the variance. Cheers, Jarrod Quoting Ruben Arslan <rubenarslan at gmail.com> on Wed, 18 Mar 2015 13:51:50 +0000:
Hi, thanks for your reply. This clears some things up. 1. Colour me interested then, if you ever find the time to put it on Github. Maybe after seeing the nonsense I've wrought in the following points you might find sending me a zip of the pre-release might save you some time as opposed to guessing where I went wrong? 2. I'm not sure I understand. You're saying I can add the VCV components for the family random effect to the respective unit components. But I would only want that if I was interested in predicting the effect for any family, which in my case is not the goal, since my population data includes all families and I want to know the effect for the average family in that population (to compare it to other populations). 3. Oh right, I'm sorry I think I got a little confused/sidetracked, these models have been with me for a while now and I learned about us()/idh() back then. 4. Okay, I will keep this in mind, but am for now focusing on getting the prediction for this simpler case right. 5. I made a reproducible report <http://rpubs.com/rubenarslan/repro_variab_mcmcglmm> to show the problem. I think the problem is not due to the changes I made in my prediction helper function but to some fundamental. It shows that for some toy data, the credibility intervals for predictions depend on the reference I choose for a factor predictor. This is less bad with the method I used initially simply using the inverse link. However, both methods _underpredict_ as far as I can tell. Surely I'm missing an important step, but I hope this illustrates what confuses me. Of course the reasons you map out for the discrepancy (variable confounding etc) are sound in theory, but I don't see these wide CIs using any other method (simple descriptives, lme4). Oh and I used a different data simulation, because the one you used actually induces zero-deflation as far as I can tell (a lot of zeroes, but lambda is very low too). Best regards and again thanks so much for taking the time, Ruben On Wed, Mar 18, 2015 at 7:06 AM Jarrod Hadfield <j.hadfield at ed.ac.uk> wrote:
Hi, 6/ you are correct - my mistake. 5b/ I'm still not sure what you mean by the uncertainty, but ignoring the random effects will result in under prediction. For a Poisson log-normal the mean is exp(a_1+v_1/2) rather than exp(a_1) for the standard Poisson. Cheers, Jarrod Quoting Ruben Arslan <rubenarslan at gmail.com> on Tue, 17 Mar 2015 19:21:06 +0000:
Hi,
two more questions, sorry.
6.
could it be that where you wrote
normal.evd<-function(x, mu, v){
exp(-exp(x))*dnorm(x, mu, sqrt(v))
}
you're not actually passing the variance but the standard deviation? so
the
functions should be
normal.evd<-function(x, mu, sd){
exp(-exp(x))*dnorm(x, mu, sd)
}
normal.zt<-function(x, mu, sd){
exp(x)/(1-exp(-exp(x)))*dnorm(x, mu, sd)
}
Because at least in the example, you were taking the square root here
already: sd_1<-mean(sqrt(m1$VCV[,1]))
Maybe this was a copy-paste from the source where you do it differently?
Or
I misunderstood... An addition to question 5: 5. When I use the correct inverse link function in my old code, but don't do double integration, I do get a "reasonable" amount of uncertainty
around
the reference factor. The predicted means are too low, which I put down
to
this approach ignoring the VCV. I was just wondering how this could
happen
(actually seeing the "reasonable" amount of uncertainty was what made me hopeful that my approach might not be entirely wrong). Best, Ruben On Tue, Mar 17, 2015 at 7:53 PM Ruben Arslan <rubenarslan at gmail.com>
wrote:
Hi Jarrod, thanks for the extensive reply! This helps a lot, though it sounds like
I
was hubristic to attempt this myself. I tried using the approach you mapped out in the function gist <https://gist.github.com/rubenarslan/aeacdd306b3d061819a6> I posted. I simply put the pred function in a loop, so that I wouldn't make any mistakes while vectorising and since I don't care about performance at
this
point. Of course, I have some follow up questions though.. I'm sorry if I'm
being
a pain, I really appreciate the free advice, but understand of course if other things take precedence. 1. You're not "down with" developing publicly are you? Because I sure would like to test-drive the newdata prediction and simulation
functions..
2. Could you make sure that I got this right: "When predictions are to
be
taken after marginalising the random effects (including the `residual' over-dispersion) it is not possible to obtain closed form expressions." That is basically my scenario, right? In the example I included, I also had a group-level random effect (family). Ore are you talking about the "trait" as the random effect (as in your example) and my scenario is different and I cannot apply the numerical double integration procedure
you
posted? To be clear about my prediction goal without using language that I might be using incorrectly: I want to show what the average effect in the response unit, number of children, is in my population(s). I have data
on
whole populations and am using all of it (except individuals that don't have completed fertility yet, because I have yet to find a way to model both zero-inflation and right censoring). 3. "Numerical integration could be extended to double integration in which case covariance between the Poisson part and the binary part could be handled." That is what you posted an example of and it applies to my scenario, because I specified a prior R=list(V=diag(2), nu=1.002, fix=2) and rcov=~idh(trait):units, random=~idh(trait):idParents? But this double integration approach is something you just wrote off-the-cuff and I probably shouldn't use it in a publication? Or is
this
in the forthcoming MCMCglmm release and I might actually be able to
refer
to it once I get to submitting? 4. Could I change my model specification to forbid covariance between
the
two parts and not shoot myself in the foot? Would this allow for a more valid/tested approach to prediction? 5. When I use your method on my real data, I get less variation around
the
prediction "for the reference level" than for all other factor levels. My reference level actually has fewer cases than the others, so this isn't "right" in a way. Is this because I'm not doing newdata prediction? I get the "right" looking uncertainty if I bootstrap newdata predictions in lme4, Sorry if this is children's logic :-) Here's an image of the prediction <http://rpubs.com/rubenarslan/mcmcglmm_pred> and the raw data <http://rpubs.com/rubenarslan/raw>. Many thanks for any answers that you feel inclined to give. Best wishes, Ruben On Tue, Mar 17, 2015 at 5:31 PM Jarrod Hadfield <j.hadfield at ed.ac.uk> wrote:
Hi,
Sorry - I should have replied to this post earlier. I've been working
on predict/simulate methods for all MCMcglmm models (including
zero-inflated/altered/hurdle/truncated models) and these are now
largely complete (together with newdata options).
When predictions are to be taken after marginalising the random
effects (including the `residual' over-dispersion) it is not possible
to obtain closed form expressions. The options that will be available
in MCMCglmm are:
1) algebraic approximation
2) numerical integration
3) simulation
1) and 2) are currently only accurate when the random/residual effect
structure implies no covariance between the Poisson part and the
binary part.
1) is reasonably accurate for zero-inflated distributions, but can be
pretty poor for the remainder because they all involve zero-truncated
Poisson log-normal distributions and my taylor approximation for the
mean is less than ideal (any suggestions would be helpful).
2) could be extended to double integration in which case covariance
between the Poisson part and the binary part could be handled.
In your code, part of the problem is that you have fitted a zapoisson,
but the prediction is based on a zipoisson (with complementary log-log
link rather than logt link).
In all zero-inflated/altered/hurdle/truncated models
E[y] = E[(1-prob)*meanc]
where prob is the probabilty of a zero in the binary part and meanc is
the mean of a Poisson distribution (zipoisson) or a zero-truncated
poisson (zapoisson and hupoisson). If we have eta_1 as the linear
predictor for the poisson part and eta_2 as the linear predictor for
the binary part:
In zipoisson: prob = plogis(eta_2) and meanc = exp(eta_1)
In zapoisson: prob = exp(-exp(eta_2)) and meanc =
exp(eta_1)/(1-exp(-exp(eta_1)))
In hupoisson: prob = plogis(eta_2) and meanc =
exp(eta_1)/(1-exp(-exp(eta_1)))
In ztpoisson: prob = 0 and meanc =
exp(eta_1)/(1-exp(-exp(eta_1)))
In each case the linear predictor has a `fixed' part and a `random'
part which I'll denote as `a' and `u' respectively. Ideally we would
want
E[(1-prob)*meanc] taken over u_1 & u_2
if prob and meanc are independent this is a bit easier as
E[y] = E[1-prob]E[meanc]
and the two expectations ony need to taken with repsect to their
respective random effects. If we have sd_1 and sd_2 as the standard
deviations of the two sets of random effects then for the zapoisson:
normal.evd<-function(x, mu, v){
exp(-exp(x))*dnorm(x, mu, sqrt(v))
}
normal.zt<-function(x, mu, v){
exp(x)/(1-exp(-exp(x)))*dnorm(x, mu, sqrt(v))
}
pred<-function(a_1, a_2, sd_1, sd_2){
prob<-1-integrate(normal.evd, qnorm(0.0001, a_2,sd_2),
qnorm(0.9999, a_2,sd_2), a_2,sd_2)[[1]]
meanc<-integrate(normal.zt, qnorm(0.0001, a_1,sd_1),
qnorm(0.9999, a_1,sd_1), a_1,sd_1)[[1]]
prob*meanc
}
# gives the expected value with reasonable accuracy. As an example:
x<-rnorm(300)
l1<-rnorm(300, 1/2+x, sqrt(1))
l2<-rnorm(300, 1-x, sqrt(1))
y<-rbinom(300, 1, 1-exp(-exp(l2)))
y[which(y==1)]<-qpois(runif(sum(y==1), dpois(0,
exp(l1[which(y==1)])), 1), exp(l1[which(y==1)]))
# cunning sampler from Peter Dalgaard (R-sig-mixed)
data=data.frame(y=y, x=x)
prior=list(R=list(V=diag(2), nu=0.002, fix=2))
m1<-MCMCglmm(y~trait+trait:x-1, rcov=~idh(trait):units, data=data,
family="zapoisson", prior=prior)
b_1<-colMeans(m1$Sol)[c(1,3)]
b_2<-colMeans(m1$Sol)[c(2,4)]
sd_1<-mean(sqrt(m1$VCV[,1]))
sd_2<-mean(sqrt(m1$VCV[,2]))
# note it is more accurate to take the posterior mean prediction
rather than the prediction from the posterior means as I've done here,
but for illustration:
x.pred<-seq(-3,3,length=100)
p<-1:100
for(i in 1:100){
p[i]<-pred(a_1 = b_1[1]+x.pred[i]*b_1[2], a_2 =
b_2[1]+x.pred[i]*b_2[2], sd_1=sd_1, sd_2=sd_2)
}
plot(y~x)
lines(p~x.pred)
Cheers,
Jarrod
Quoting Ruben Arslan <rubenarslan at gmail.com> on Tue, 17 Mar 2015
13:33:25 +0000:
Dear list, I've made a reproducible example of the zero-altered prediction, in the hope that someone will have a look and reassure me that I'm
going
about this the right way. I was a bit confused by the point about p_i and n_i being correlated
(they
are in my case), but I think this was a red herring for me since I'm not deriving the variance analytically. The script is here: https://gist.github.com/rubenarslan/
aeacdd306b3d061819a6
and if you don't want to run the simulation fit yourself, I've put an
RDS
file of the fit here: https://dl.dropboxusercontent.com/u/1078620/m1.rds Best regards, Ruben Arslan On Tue, Mar 10, 2015 at 1:22 PM Ruben Arslan <rubenarslan at gmail.com>
wrote:
Dear Dr Bolker, I'd thought about something like this, one point of asking was to see whether a) it's implemented already, because I'll probably make dumb
mistakes
while
trying b) it's not implemented because it's a bad idea. Your response and the MCMCglmm course notes make me hope that it's
c)
not
implemented because nobody did yet or d) it's so simple that
everybody
does
it on-the-fly. So I tried my hand and would appreciate corrections. I am sure
there is
some screw-up or an inelegant approach in there. I included code for dealing with mcmc.lists because that's what I
have
and
I'm not entirely sure how I deal with them is correct either. I started with a zero-altered model, because those fit fastest and according to the course notes have the least complex likelihood. Because I know not what I do, I'm not dealing with my random
effects at
all. I pasted a model summary below to show what I've applied the below function to. The function gives the following quantiles when applied
to 19
chains of that model.
5% 50% 95%
5.431684178 5.561211207 5.690655200
5% 50% 95%
5.003974382 5.178192327 5.348246558
Warm regards,
Ruben Arslan
HPDpredict_za = function(object, predictor) {
if(class(object) != "MCMCglmm") {
if(length( object[[1]]$Residual$nrt )>1) {
object = lapply(object,FUN=function(x) { x$Residual$nrt<-2;x })
}
Sol = mcmc.list(lapply(object,FUN=function(x) { x$Sol}))
vars = colnames(Sol[[1]])
} else {
Sol = as.data.frame(object$Sol)
vars = names(Sol)
}
za_predictor = vars[ vars %ends_with% predictor & vars %begins_with%
"traitza_"]
za_intercept_name = vars[ ! vars %contains% ":" & vars %begins_with%
"traitza_"]
intercept = Sol[,"(Intercept)"]
za_intercept = Sol[, za_intercept_name]
l1 = Sol[, predictor ]
l2 = Sol[, za_predictor ]
if(is.list(object)) {
intercept = unlist(intercept)
za_intercept = unlist(za_intercept)
l1 = unlist(l1)
l2 = unlist(l2)
}
py_0 = dpois(0, exp(intercept + za_intercept))
y_ygt0 = exp(intercept)
at_intercept = (1-py_0) * y_ygt0
py_0 = dpois(0, exp(intercept + za_intercept + l2))
y_ygt0 = exp(intercept + l1)
at_predictor_1 = (1-py_0) * y_ygt0
print(qplot(at_intercept))
print(qplot(at_predictor_1))
df = data.frame("intercept" = at_intercept)
df[, predictor] = at_predictor_1
print(qplot(x=variable,
y=value,data=suppressMessages(melt(df)),fill=variable,alpha=
I(0.40),
geom =
'violin')) print(quantile(at_intercept, probs = c(0.05,0.5,0.95))) print(quantile(at_predictor_1, probs = c(0.05,0.5,0.95))) invisible(df) }
summary(object[[1]])
Iterations = 100001:299901
Thinning interval = 100
Sample size = 2000
DIC: 349094
G-structure: ~idh(trait):idParents
post.mean l-95% CI u-95% CI eff.samp
children.idParents 0.0189 0.0164 0.0214 1729
za_children.idParents 0.2392 0.2171 0.2622 1647
R-structure: ~idh(trait):units
post.mean l-95% CI u-95% CI eff.samp
children.units 0.144 0.139 0.148 1715
za_children.units 1.000 1.000 1.000 0
Location effects: children ~ trait * (maternalage.factor +
paternalloss +
maternalloss + center(nr.siblings) + birth.cohort + urban + male +
paternalage.mean + paternalage.diff)
post.mean l-95% CI
u-95%
CI
eff.samp pMCMC (Intercept) 2.088717 2.073009
2.103357
2000 <0.0005 *** traitza_children -1.933491 -1.981945
-1.887863
2000 <0.0005 *** maternalage.factor(14,20] 0.007709 -0.014238
0.027883
1500 0.460 maternalage.factor(35,50] 0.006350 -0.009634
0.024107
2000 0.462 paternallossTRUE 0.000797 -0.022716
0.025015
2000 0.925 maternallossTRUE -0.015542 -0.040240
0.009549
2000 0.226 center(nr.siblings) 0.005869 0.004302
0.007510
2000 <0.0005 *** birth.cohort(1703,1722] -0.045487 -0.062240
-0.028965
2000 <0.0005 *** birth.cohort(1722,1734] -0.055872 -0.072856
-0.036452
2000 <0.0005 *** birth.cohort(1734,1743] -0.039770 -0.056580
-0.020907
2000 <0.0005 *** birth.cohort(1743,1750] -0.030713 -0.048301
-0.012214
2000 0.002 ** urban -0.076748 -0.093240
-0.063002
2567 <0.0005 *** male 0.106074 0.095705
0.115742
2000 <0.0005 *** paternalage.mean -0.024119 -0.033133
-0.014444
2000 <0.0005 *** paternalage.diff -0.018367 -0.032083
-0.005721
2000 0.007 ** traitza_children:maternalage.factor(14,20] -0.116510 -0.182432
-0.051978
1876 0.001 *** traitza_children:maternalage.factor(35,50] -0.045196 -0.094485
0.002640
2000 0.075 . traitza_children:paternallossTRUE -0.171957 -0.238218
-0.104820
2000 <0.0005 *** traitza_children:maternallossTRUE -0.499539 -0.566825
-0.430637
2000 <0.0005 *** traitza_children:center(nr.siblings) -0.023723 -0.028676
-0.018746
1848 <0.0005 *** traitza_children:birth.cohort(1703,1722] -0.026012 -0.074250
0.026024
2000 0.319 traitza_children:birth.cohort(1722,1734] -0.279418 -0.329462
-0.227187
2000 <0.0005 *** traitza_children:birth.cohort(1734,1743] -0.260165 -0.312659
-0.204462
2130 <0.0005 *** traitza_children:birth.cohort(1743,1750] -0.481457 -0.534568
-0.426648
2000 <0.0005 *** traitza_children:urban -0.604108 -0.645169
-0.562554
1702 <0.0005 *** traitza_children:male -0.414988 -0.444589
-0.387005
2000 <0.0005 *** traitza_children:paternalage.mean 0.006545 -0.018570
0.036227
2000 0.651 traitza_children:paternalage.diff -0.097982 -0.136302
-0.060677
2000 <0.0005 *** On Mon, Mar 9, 2015 at 10:12 PM Ben Bolker <bbolker at gmail.com>
wrote:
Ruben Arslan <rubenarslan at ...> writes:
Dear list, I wanted to ask: Is there any (maybe just back of the envelope)
way
to
obtain a response prediction for zero-inflated or hurdle type
models?
I've fit such models in MCMCglmm, but I don't work in ecology
and my
previous experience with explaining such models to "my audience"
did not
bode well. When it comes to humans, the researchers I presented
to
are
not
used to offspring count being zero-inflated (or acquainted with
that
concept), but in my historical data with high infant mortality,
it
is
(in
modern data it's actually slightly underdispersed). Currently I'm using lme4 and simply splitting my models into two
stages
(finding a mate and having offspring). That's okay too, but in one population the effect of interest is
not
clearly visible in either stage, only when both are taken
together
(but
then the outcome is zero-inflated). I expect to be given a hard time for this and hence thought I'd
use
a
binomial model with the outcome offspring>0 as my main model, but
that
turns out to be hard to explain too and doesn't really do the data justice. Basically I don't want to be forced to discuss my smallest
population
as a
non-replication of the effect because I was insufficiently able
to
explain
the statistics behind my reasoning that the effect shows.
I think the back-of-the envelope answer would be that for a
two-stage
model with a prediction of p_i for the probability of having a
non-zero
response (or in the case of zero-inflated models, the probability
of
_not_ having a structural zero) and a prediction of n_i for the conditional part of the model, the mean predicted value is p_i*n_i and the variance is _approximately_ (p_i*n_i)^2*(var(p_i)/p_i^2 +
var(n_i)/n_i^2)
(this is assuming that you haven't built in any correlation between p_i and n_i,
which
would be hard in lme4 but _might_ be possible under certain
circumstances
via a multitype model in MCMCglmm). Does that help?
_______________________________________________ R-sig-mixed-models at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models
[[alternative HTML version deleted]]
_______________________________________________ R-sig-mixed-models at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models
-- The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.
-- The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.
The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.