Prev 173946 / 398503 Next

bigglm() results different from glm()

Tue, Mar 17, 2009 12:50 AM

This is a surprisingly interesting problem that took a while to debug, because the computations all seemed correct.

Your model hasn't converged yet.  You can get the right answer either by running longer:

Large data regression model: bigglm(y ~ ttment, data = dat, family = poisson(link = "log"),
     chunksize = 100000, maxit = 20)
Sample size =  100000
              Coef  (95%   CI)    SE p
(Intercept) 2.304 2.301 2.307 0.001 0
ttment2     0.405 0.401 0.408 0.002 0

or supplying starting values:

Large data regression model: bigglm(y ~ ttment, data = dat, family = poisson(link = "log"),
     chunksize = 100000, start = c(2, 0))
Sample size =  100000
              Coef  (95%   CI)    SE p
(Intercept) 2.304 2.301 2.307 0.001 0
ttment2     0.405 0.401 0.408 0.002 0


The bug is that you weren't told about the lack of convergence.  There is a flag in the object, but it is only set after the model is converged, so it is not there when convergence fails.

NULL

[1] TRUE

[1] TRUE

For the next version I will make sure there is a clear warning when the model hasn't converged.  The default maximum number of iterations is fairly small, by design --- if it isn't working, you want to find out and specify starting values rather than wait for dozens of potentially slow iterations.  This strategy obviously breaks down when you don't notice that failure. :(

      -thomas

On Mon, 16 Mar 2009, Francisco J. Zagmutt wrote:

Dear all,

I am using the bigglm package to fit a few GLM's to a large dataset (3 million 
rows, 6 columns).  While trying to fit a Poisson GLM I noticed that the 
coefficient estimates were very different from what I obtained when estimating 
the model on a smaller dataset using glm(), I wrote a very basic toy example to 
compare the results of bigglm() against a glm() call.  Consider the following 
code:

require(biglm)
options(digits=6, scipen=3, contrasts = c("contr.treatment", "contr.poly"))
dat=data.frame(y =c(rpois(50000, 10),rpois(50000, 15)), ttment=gl(2,50000))
m1 <- glm(y~ttment, data=dat, family=poisson(link="log"))
m1big <- bigglm(y~ttment , data=dat, family=poisson(link="log"))
summary(m1)

<snipped output for this email>
Coefficients:
           Estimate Std. Error z value Pr(>|z|)
(Intercept)  2.30305    0.00141    1629   <2e-16 ***
ttment2      0.40429    0.00183     221   <2e-16 ***

   Null deviance: 151889  on 99999  degrees of freedom
Residual deviance: 101848  on 99998  degrees of freedom
AIC: 533152

summary(m1big)

Large data regression model: bigglm(y ~ ttment, data = dat, family = 
poisson(link = "log"))
Sample size =  100000
            Coef  (95%   CI)    SE p
(Intercept) 2.651 2.650 2.653 0.001 0
ttment2     4.346 4.344 4.348 0.001 0

m1big$deviance

[1] 287158986


Notice that the coefficients and deviance are quite different in the model 
estimated using bigglm(). If I change the chunk to seq(1000,10000,1000) the 
estimates remain the same.

Can someone help me understand what is causing these differences?

Here is my version info:

version

              _
platform       i386-pc-mingw32
arch           i386
os             mingw32
system         i386, mingw32
status
major          2
minor          8.1
year           2008
month          12
day            22
svn rev        47281
language       R
version.string R version 2.8.1 (2008-12-22)


Many thanks in advance for your help,

Francisco

-- 
Francisco J. Zagmutt
Vose Consulting
2891 20th Street
Boulder, CO, 80304
USA
www.voseconsulting.com

______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Thomas Lumley			Assoc. Professor, Biostatistics
tlumley at u.washington.edu	University of Washington, Seattle

Thread (8 messages)

Francisco J. Zagmutt bigglm() results different from glm() Mar 16 Thomas Lumley bigglm() results different from glm() Mar 17 John Fox bigglm() results different from glm() Mar 17 Thomas Lumley bigglm() results different from glm() Mar 17 Francisco J. Zagmutt bigglm() results different from glm() Mar 17 Francisco J. Zagmutt bigglm() results different from glm() Mar 17 Thomas Lumley bigglm() results different from glm() Mar 18 Francisco J. Zagmutt bigglm() results different from glm() Mar 18