zero-truncated Poisson
Thanks Ben and Peter for your thoughtful replies. This gives me quite a bit to chew on and digest but I think I can analyze these data. -----Original Message----- From: Ben Bolker [mailto:bbolker at gmail.com] Sent: Sunday, April 10, 2011 10:38 AM To: Peter Solymos Cc: Stratford, Jeffrey; r-sig-ecology at r-project.org Subject: Re: [R-sig-eco] zero-truncated Poisson
On 11-04-09 03:07 AM, Peter Solymos wrote:
Jeff,
The zero truncated Poisson distribution can be described as a
probability function conditional on Y>0:
Pr(Y=y|Y>0) = Pr(Y=y) / (1-Pr(Y=0)), y=1,2,3,...
This leads to this function 'nll' for the negative log-likelihood:
nll <- function(theta, y, X) {
lambda <- exp(drop(X %*% theta))
-sum(dpois(y, lambda, log=TRUE) - log(1-exp(-lambda)))
}
where theta is the vector of parameters, y is the observation vector,
X is the design matrix.
A simple simulation is useful to check that the theory works. I used
'optim' with the default Nelder-Mead simplex algorithm, initial values
are based in non-truncated Poisson estimates to speed up convergence:
theta <- c(1, 2)
X <- model.matrix(~rnorm(100))
y <- rpois(100,exp(X %*% theta))
X <- X[y>0,]
y <- y[y>0]
inits <- coef(glm(y ~ X-1, family=poisson))
res <- optim(par=inits, fn=nll, y=y, X=X)
cbind(theta=theta, theta.hat=res$par)
Here I used the data you provided, note that 'hessian=TRUE' for some
subsequent SE calculations:
inits <- coef(glm(taken ~ mass + year, blja, family=poisson))
y <- blja$taken
X <- model.matrix(~ mass + year, blja)
res <- optim(par=inits, fn=nll, y=y, X=X, hessian=TRUE)
Vector of parameters that minimize 'nll' (maximize the log-likelihood)
given the data:
cf <- res$par
Standard error (note that due to minimization, I used the Hessian
instead of -H):
se <- sqrt(diag(solve(res$hessian)))
z and p values and the final output table:
zstat <- cf/se
pval <- 2 * pnorm(-abs(zstat))
out <- cbind(cf, se, zstat, pval)
colnames(out) <- c("Estimate", "Std. Error", "z value", "Pr(>|z|)")
printCoefmat(out, digits = 4, signif.legend = FALSE)
For your data set, this gives:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.76392 0.10659 -7.167 7.66e-13 ***
mass 0.24008 0.01262 19.028 < 2e-16 ***
yeartwo -0.29006 0.11974 -2.423 0.0154 *
Cheers,
Peter
P?ter S?lymos
Alberta Biodiversity Monitoring Institute
and Boreal Avian Modelling project
Department of Biological Sciences
CW 405, Biological Sciences Bldg
University of Alberta
Edmonton, Alberta, T6G 2E9, Canada
Phone: 780.492.8534
Fax: 780.492.7635
email <- paste("solymos", "ualberta.ca", sep = "@")
http://www.abmi.ca
http://www.borealbirds.ca
http://sites.google.com/site/psolymos
I agree. I would just like to point out that, if you define the
truncated Poisson distribution function appropriately, you can then do
this very compactly via the mle2 function in the bbmle package.
dtruncpois <- function(x,lambda,log=FALSE) {
r <- ifelse(x==0,-Inf,dpois(x,lambda,log=TRUE)-
log(1-dpois(0,lambda,log=FALSE)))
if (log) r else exp(r)
}
library(bbmle)
(m1 <- mle2(taken~dtruncpois(exp(logmu)),
parameters=list(logmu~mass+year),
data=blja,
start=list(logmu=1)))
coef(m1)
summary(m1) ## gives a table almost equivalent to Peter's
confint(m1) ## profile confidence intervals
After a bit of digging it seems that VGAM::vglm fails when it hits an
NaN -- I haven't figured out more than that it gets this on the *second*
step, not the first, of the optimization. When it works VGAM should be
faster than the sort of generalized ML estimation that Peter and I are
suggesting, but I have fairly often found it to be not very robust when
I try it.
If you plot the data (with a regular Poisson GLM superimposed):
library(ggplot2)
ggplot(blja,aes(x=mass,y=taken,colour=year))+
stat_sum(alpha=0.5)+geom_smooth(method="glm",family="poisson")+
theme_bw()
You see that there are two observations with very large masses.
It turns out that removing these two observations allows vglm() to work:
glm1 <- vglm(taken ~ mass + year,family=pospoisson,
data=blja,subset=mass<10)
m2 <- mle2(taken~dtruncpois(exp(logmu)),
parameters=list(logmu~mass+year),
data=subset(blja,mass<10),
start=list(logmu=1))
It also changes the estimate of the mass effect considerably:
confint(m1)
confint(m2)
Ben Bolker