Unexpected behavior with weights in binomial glm()
I haven't followed this thread closely, but if perfect separation in a binomial glm is the problem, google it. e.g. http://www.ats.ucla.edu/stat/mult_pkg/faq/general/complete_separation_logit_models.htm This presumably explains your concerns about coefficient agreement. -- Bert On Sun, Sep 30, 2012 at 4:47 AM, Josh Browning
<rockclimber112358 at gmail.com> wrote:
Hi David, Yes, I agree that the results are "very similar" but I don't understand why they are not exactly equal given that the data sets are identical. And yes, this 1% numerical difference is hugely important to me. I have another data set (much larger than this toy example) that works on the aggregated data (returning a coefficient of about 1) but returns the warning about perfect separation on the non-aggregated data (and a coefficient of about 1e15). So, I'd at least like to be able to understand where this numerical difference is coming from and, preferably, a way to tweak my glm() runs (possibly adjusting the numerical precision somehow???) so that this doesn't happen. Josh On Sat, Sep 29, 2012 at 7:50 PM, David Winsemius <dwinsemius at comcast.net> wrote:
On Sep 29, 2012, at 7:10 AM, Josh Browning wrote:
Hi useRs, I'm experiencing something quite weird with glm() and weights, and maybe someone can explain what I'm doing wrong. I have a dataset where each row represents a single case, and I run glm(...,family="binomial") and get my coefficients. However, some of my cases have the exact same values for predictor variables, so I should be able to aggregate up my data frame and run glm(..., family="binomial",weights=wts) and get the same coefficients (maybe this is my incorrect assumption, but I can't see why it would be). Anyways, here's a minimum working example below:
d = data.frame( RESP=c(rep(1,5),rep(0,5)), INDEP=c(1,1,1,1,0,0,0,0,0,0) ) glm( RESP ~ INDEP, family="binomial", data=d )
Call: glm(formula = RESP ~ INDEP, family = "binomial", data = d)
Coefficients:
(Intercept) INDEP
-1.609 21.176
Degrees of Freedom: 9 Total (i.e. Null); 8 Residual
Null Deviance: 13.86
Residual Deviance: 5.407 AIC: 9.407
dAgg = aggregate( d$RESP, by=list(d$RESP, d$INDEP), FUN=length )
colnames(dAgg) = c("RESP","INDEP","WT")
glm( RESP ~ INDEP, family="binomial", data=dAgg, weights=WT )
Call: glm(formula = RESP ~ INDEP, family = "binomial", data = dAgg,
weights = WT)
Coefficients:
(Intercept) INDEP
-1.609 20.975
Degrees of Freedom: 2 Total (i.e. Null); 1 Residual
Null Deviance: 13.86
Residual Deviance: 5.407 AIC: 9.407
Those two results look very similar and it is with a data situation that seems somewhat extreme. The concern is for the 1% numerical difference in the regression coefficient? Am I reading you correctly? -- David Winsemius, MD Alameda, CA, USA
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Bert Gunter Genentech Nonclinical Biostatistics Internal Contact Info: Phone: 467-7374 Website: http://pharmadevelopment.roche.com/index/pdb/pdb-functional-groups/pdb-biostatistics/pdb-ncb-home.htm