bigglm() results different from glm()
This is a surprisingly interesting problem that took a while to debug, because the computations all seemed correct. Your model hasn't converged yet. You can get the right answer either by running longer:
summary(m1big_longer)
Large data regression model: bigglm(y ~ ttment, data = dat, family = poisson(link = "log"),
chunksize = 100000, maxit = 20)
Sample size = 100000
Coef (95% CI) SE p
(Intercept) 2.304 2.301 2.307 0.001 0
ttment2 0.405 0.401 0.408 0.002 0
or supplying starting values:
summary(m1big_started)
Large data regression model: bigglm(y ~ ttment, data = dat, family = poisson(link = "log"),
chunksize = 100000, start = c(2, 0))
Sample size = 100000
Coef (95% CI) SE p
(Intercept) 2.304 2.301 2.307 0.001 0
ttment2 0.405 0.401 0.408 0.002 0
The bug is that you weren't told about the lack of convergence. There is a flag in the object, but it is only set after the model is converged, so it is not there when convergence fails.
m1big$converged
NULL
m1big_longer$converged
[1] TRUE
m1big_started$converged
[1] TRUE
For the next version I will make sure there is a clear warning when the model hasn't converged. The default maximum number of iterations is fairly small, by design --- if it isn't working, you want to find out and specify starting values rather than wait for dozens of potentially slow iterations. This strategy obviously breaks down when you don't notice that failure. :(
-thomas
On Mon, 16 Mar 2009, Francisco J. Zagmutt wrote:
Dear all, I am using the bigglm package to fit a few GLM's to a large dataset (3 million rows, 6 columns). While trying to fit a Poisson GLM I noticed that the coefficient estimates were very different from what I obtained when estimating the model on a smaller dataset using glm(), I wrote a very basic toy example to compare the results of bigglm() against a glm() call. Consider the following code:
require(biglm)
options(digits=6, scipen=3, contrasts = c("contr.treatment", "contr.poly"))
dat=data.frame(y =c(rpois(50000, 10),rpois(50000, 15)), ttment=gl(2,50000))
m1 <- glm(y~ttment, data=dat, family=poisson(link="log"))
m1big <- bigglm(y~ttment , data=dat, family=poisson(link="log"))
summary(m1)
<snipped output for this email>
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.30305 0.00141 1629 <2e-16 ***
ttment2 0.40429 0.00183 221 <2e-16 ***
Null deviance: 151889 on 99999 degrees of freedom
Residual deviance: 101848 on 99998 degrees of freedom
AIC: 533152
summary(m1big)
Large data regression model: bigglm(y ~ ttment, data = dat, family =
poisson(link = "log"))
Sample size = 100000
Coef (95% CI) SE p
(Intercept) 2.651 2.650 2.653 0.001 0
ttment2 4.346 4.344 4.348 0.001 0
m1big$deviance
[1] 287158986 Notice that the coefficients and deviance are quite different in the model estimated using bigglm(). If I change the chunk to seq(1000,10000,1000) the estimates remain the same. Can someone help me understand what is causing these differences? Here is my version info:
version
_ platform i386-pc-mingw32 arch i386 os mingw32 system i386, mingw32 status major 2 minor 8.1 year 2008 month 12 day 22 svn rev 47281 language R version.string R version 2.8.1 (2008-12-22) Many thanks in advance for your help, Francisco -- Francisco J. Zagmutt Vose Consulting 2891 20th Street Boulder, CO, 80304 USA www.voseconsulting.com
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Thomas Lumley Assoc. Professor, Biostatistics tlumley at u.washington.edu University of Washington, Seattle