Francisco J. Zagmutt
Vose Consulting
2891 20th Street
Boulder, CO, 80304
USA
www.voseconsulting.com
Thomas Lumley wrote:
>
> There's a digits= argument to the print method.
>
>> a <- bigglm(ff,data=trees, chunksize=10, sandwich=TRUE)
>> print(summary(a),digits=5)
> Large data regression model: bigglm(ff, data = trees, chunksize = 10,
> sandwich = TRUE)
> Sample size = 31
> Coef (95% CI) SE p
> (Intercept) -6.63162 -8.08729 -5.17595 0.72783 0
> log(Girth) 1.98265 1.87132 2.09398 0.05567 0
> log(Height) 1.11712 0.73339 1.50085 0.19186 0
> Sandwich (model-robust) standard errors
>
>
> I suppose I should make it take its default from options(digits)-3 or
> something.
>
> -thomas
>
>
> On Tue, 17 Mar 2009, Francisco J. Zagmutt wrote:
>
>> Dear Thomas and John,
>>
>> Thanks for your prompt reply and for looking at your code to explain
>> these differences. I see you replied very late at night, so I am sorry
>> if my little problem kept you awake!
>>
>> As you pointed out, the model indeed converges (as reported in
>> model$converged) once I specify a larger number of iterations.
>>
>> A very minor comment: it seems that the reporting of the estimates in
>> summary.biglm() is not taking the parameters from options(digits).
>> For example, using the same data and models as before:
>>
>>> require(biglm)
>>> options(digits=8)
>>> dat=data.frame(y =c(rpois(50000, 10),rpois(50000, 15)),
>>> ttment=gl(2,50000))
>>> m1 <- glm(y~ttment, data=dat, family=poisson(link="log"))
>>> m1big <- bigglm(y~ttment , data=dat, family=poisson(link="log"),
>>> maxit=20)
>>
>>> summary(m1)
>> <Snipped>
>> Coefficients:
>> Estimate Std. Error z value Pr(>|z|)
>> (Intercept) 2.3019509 0.0014147 1627.21 < 2.2e-16 ***
>> ttment2 0.4052002 0.0018264 221.86 < 2.2e-16 ***
>>
>>> summary(m1big)
>> Coef (95% CI) SE p
>> (Intercept) 2.302 2.299 2.305 0.001 0
>> ttment2 0.405 0.402 0.409 0.002 0
>>
>>
>> To get more digits I can extract the point estimates using
>> coef(m1big), but after looking at str(m1big) the only way I could
>> figure to extract the p-values was:
>>
>>> summary(m1big)$mat[,"p"]
>> (Intercept) ttment2
>> 0 0
>>
>> Is there a way I can get summary.biglm() to report more digits directly?
>>
>> Thanks again,
>>
>> Francisco
>>
>>
>>
>> --
>> Francisco J. Zagmutt
>> Vose Consulting
>> 2891 20th Street
>> Boulder, CO, 80304
>> USA
>> www.voseconsulting.com
>>
>>
>> Thomas Lumley wrote:
>>>
>>> Yes, the slow convergence is easier to get with the log link.
>>> Overshooting the correct coefficient vector has more dramatic effects
>>> on the fitted values and weights, and in this example the starting
>>> value of (0,0) is a long way from the truth. The same sort of thing
>>> happens in the Cox model, where there are real data sets that will
>>> cause numeric overflow in a careless implementation.
>>>
>>> It might be worth trying to guess better starting values: saving an
>>> iteration or two would be useful with large data sets.
>>>
>>> -thomas
>>>
>>>
>>> On Tue, 17 Mar 2009, John Fox wrote:
>>>
>>>> Dear Francisco,
>>>>
>>>> I was able to duplicate the problem that you reported, and in addition
>>>> discovered that the problem seems to be peculiar to the poisson family.
>>>> lm(y~ttment, data=dat) and biglm(y~ttment, data=dat) produce identical
>>>> results, as do glm(y~ttment, data=dat) and bigglm(y~ttment, data=dat).
>>>> Another example, with the binomial family:
>>>>
>>>>> set.seed(12345)
>>>>> dat=data.frame(y =c(rpois(50000, 10),rpois(50000, 15)),
>>>> ttment=gl(2,50000))
>>>>> dat$y0 <- ifelse(dat$y > 12, 1, 0)
>>>>> m1b <- glm(y0~ttment, data=dat, family=binomial)
>>>>> m1bigb <- bigglm(y0~ttment , data=dat, family=binomial())
>>>>> coef(m1b)
>>>> (Intercept) ttment2
>>>> -1.33508 2.34263
>>>>> coef(m1bigb)
>>>> (Intercept) ttment2
>>>> -1.33508 2.34263
>>>>> deviance(m1b)
>>>> [1] 109244
>>>>> deviance(m1bigb)
>>>> [1] 109244
>>>>
>>>> I'm copying this message to Thomas Lumley, who's the author and
>>>> maintainer
>>>> of the biglm package.
>>>>
>>>> Regards,
>>>> John
>>>>
>>>>
>>>>> -----Original Message-----
>>>>> From: r-help-bounces at r-project.org
>>>>> [mailto:r-help-bounces at r-project.org]
>>>> On
>>>>> Behalf Of Francisco J. Zagmutt
>>>>> Sent: March-16-09 10:26 PM
>>>>> To: r-help at stat.math.ethz.ch
>>>>> Subject: [R] bigglm() results different from glm()
>>>>>
>>>>> Dear all,
>>>>>
>>>>> I am using the bigglm package to fit a few GLM's to a large dataset (3
>>>>> million rows, 6 columns). While trying to fit a Poisson GLM I noticed
>>>>> that the coefficient estimates were very different from what I
>>>>> obtained
>>>>> when estimating the model on a smaller dataset using glm(), I wrote a
>>>>> very basic toy example to compare the results of bigglm() against a
>>>>> glm() call. Consider the following code:
>>>>>
>>>>>
>>>>> > require(biglm)
>>>>> > options(digits=6, scipen=3, contrasts = c("contr.treatment",
>>>>> "contr.poly"))
>>>>> > dat=data.frame(y =c(rpois(50000, 10),rpois(50000, 15)),
>>>>> ttment=gl(2,50000))
>>>>> > m1 <- glm(y~ttment, data=dat, family=poisson(link="log"))
>>>>> > m1big <- bigglm(y~ttment , data=dat, family=poisson(link="log"))
>>>>> > summary(m1)
>>>>>
>>>>> <snipped output for this email>
>>>>> Coefficients:
>>>>> Estimate Std. Error z value Pr(>|z|)
>>>>> (Intercept) 2.30305 0.00141 1629 <2e-16 ***
>>>>> ttment2 0.40429 0.00183 221 <2e-16 ***
>>>>>
>>>>> Null deviance: 151889 on 99999 degrees of freedom
>>>>> Residual deviance: 101848 on 99998 degrees of freedom
>>>>> AIC: 533152
>>>>>
>>>>> > summary(m1big)
>>>>> Large data regression model: bigglm(y ~ ttment, data = dat, family =
>>>>> poisson(link = "log"))
>>>>> Sample size = 100000
>>>>> Coef (95% CI) SE p
>>>>> (Intercept) 2.651 2.650 2.653 0.001 0
>>>>> ttment2 4.346 4.344 4.348 0.001 0
>>>>>
>>>>> > m1big$deviance
>>>>> [1] 287158986
>>>>>
>>>>>
>>>>> Notice that the coefficients and deviance are quite different in the
>>>>> model estimated using bigglm(). If I change the chunk to
>>>>> seq(1000,10000,1000) the estimates remain the same.
>>>>>
>>>>> Can someone help me understand what is causing these differences?
>>>>>
>>>>> Here is my version info:
>>>>>
>>>>> > version
>>>>> _
>>>>> platform i386-pc-mingw32
>>>>> arch i386
>>>>> os mingw32
>>>>> system i386, mingw32
>>>>> status
>>>>> major 2
>>>>> minor 8.1
>>>>> year 2008
>>>>> month 12
>>>>> day 22
>>>>> svn rev 47281
>>>>> language R
>>>>> version.string R version 2.8.1 (2008-12-22)
>>>>>
>>>>>
>>>>> Many thanks in advance for your help,
>>>>>
>>>>> Francisco
>>>>>
>>>>> -- Francisco J. Zagmutt
>>>>> Vose Consulting
>>>>> 2891 20th Street
>>>>> Boulder, CO, 80304
>>>>> USA
>>>>> www.voseconsulting.com
>>>>>
>>>>> ______________________________________________
>>>>> R-help at r-project.org mailing list
>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>>> PLEASE do read the posting guide
>>>> http://www.R-project.org/posting-guide.html
>>>>> and provide commented, minimal, self-contained, reproducible code.
>>>>
>>>> ______________________________________________
>>>> R-help at r-project.org mailing list
>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>> PLEASE do read the posting guide
>>>> http://www.R-project.org/posting-guide.html
>>>> and provide commented, minimal, self-contained, reproducible code.
>>>>
>>>
>>> Thomas Lumley Assoc. Professor, Biostatistics
>>> tlumley at u.washington.edu University of Washington, Seattle
>>>
>>> ______________________________________________
>>> R-help at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide
>>> http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>>
>>
>
> Thomas Lumley Assoc. Professor, Biostatistics
> tlumley at u.washington.edu University of Washington, Seattle
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>