Thank you all for your help and advice. This wasn't quite the answer I was looking for, but these concepts make more sense to me now and I think I should be able to resolve the issues I've been having. Thanks again!
On Sun, Sep 30, 2012 at 6:26 PM, David Winsemius <dwinsemius at comcast.net> wrote:
On Sep 30, 2012, at 4:47 AM, Josh Browning wrote:
Hi David, Yes, I agree that the results are "very similar" but I don't understand why they are not exactly equal given that the data sets are identical. And yes, this 1% numerical difference is hugely important to me.
I believe you will find it is not important at all.
I have another data set (much larger than this toy example) where glm() on the aggregated data returns a coefficient of about 1, but on the non-aggregated data it warns about perfect separation and returns a coefficient of about 1e15. So I'd at least like to understand where this numerical difference is coming from and, preferably, find a way to tweak my glm() runs (possibly by adjusting the numerical precision somehow?) so that this doesn't happen.
Here. This shows how you can ratchet up your desired precision:
dAgg = aggregate( d$RESP, by=list(d$RESP, d$INDEP), FUN=length )
colnames(dAgg) = c("RESP","INDEP","WT")
glm( RESP ~ INDEP, family="binomial", data=dAgg, weights=WT,
     control = glm.control(epsilon = 1e-10, maxit = 50) )
glm( RESP ~ INDEP, family="binomial", data=dAgg, weights=WT,
     control = glm.control(epsilon = 1e-11, maxit = 30) )
glm( RESP ~ INDEP, family="binomial", data=dAgg, weights=WT,
     control = glm.control(epsilon = 1e-12, maxit = 50) )
glm( RESP ~ INDEP, family="binomial", data=dAgg, weights=WT,
     control = glm.control(epsilon = 1e-13, maxit = 50) )
... and finally you get a message that _should_ demonstrate why I thought asking for greater numerical precision in an extreme case such as this was a "Fool's errand". The "true" value is positive infinity, but double-precision floating point is only 8 bytes, so we can only get to an estimated odds ratio of exp(32.975) = 2.09344e+14, which corresponds to odds of about 200 trillion to 1. Further increases in 'maxit' and decreases in 'epsilon' are vigorously ignored, ... and rightfully so.
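The arithmetic is easy to check in R (32.975 being the INDEP coefficient reached at the tightest tolerance above):

```r
# Log-odds coefficient from the tightest-tolerance fit above
beta <- 32.975
exp(beta)  # estimated odds ratio, roughly 2.09e14
```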
After you do such analyses long enough, you learn that coefficients above 5 are suspicious and those above 10 are usually a sign of an error in data preparation.
(I do not think this is an example of complete separation, since the ranges of the predictors overlap for the two outcomes. But who knows, I have made (many) errors before.)
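A quick cross-tabulation of the toy data (the same d defined in the example below) makes the distinction visible: the INDEP = 1 cell carries only RESP = 1, while INDEP = 0 mixes both outcomes, which is quasi-complete rather than complete separation. A sketch:

```r
d <- data.frame(RESP  = c(rep(1, 5), rep(0, 5)),
                INDEP = c(1, 1, 1, 1, 0, 0, 0, 0, 0, 0))

# Rows: INDEP, columns: RESP
table(d$INDEP, d$RESP)
#     0 1
#   0 5 1
#   1 0 4
# The empty cell (INDEP = 1, RESP = 0) is what drives the INDEP
# coefficient toward +Inf; INDEP = 0 still carries both outcomes.
```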
--
David,
Josh

On Sat, Sep 29, 2012 at 7:50 PM, David Winsemius <dwinsemius at comcast.net> wrote:
On Sep 29, 2012, at 7:10 AM, Josh Browning wrote:
Hi useRs, I'm experiencing something quite weird with glm() and weights, and maybe someone can explain what I'm doing wrong. I have a dataset where each row represents a single case, and I run glm(...,family="binomial") and get my coefficients. However, some of my cases have the exact same values for predictor variables, so I should be able to aggregate up my data frame and run glm(..., family="binomial",weights=wts) and get the same coefficients (maybe this is my incorrect assumption, but I can't see why it would be). Anyways, here's a minimum working example below:
d = data.frame( RESP=c(rep(1,5),rep(0,5)), INDEP=c(1,1,1,1,0,0,0,0,0,0) )
glm( RESP ~ INDEP, family="binomial", data=d )
Call:  glm(formula = RESP ~ INDEP, family = "binomial", data = d)

Coefficients:
(Intercept)        INDEP
     -1.609       21.176

Degrees of Freedom: 9 Total (i.e. Null);  8 Residual
Null Deviance:      13.86
Residual Deviance: 5.407        AIC: 9.407
dAgg = aggregate( d$RESP, by=list(d$RESP, d$INDEP), FUN=length )
colnames(dAgg) = c("RESP","INDEP","WT")
glm( RESP ~ INDEP, family="binomial", data=dAgg, weights=WT )
Call:  glm(formula = RESP ~ INDEP, family = "binomial", data = dAgg,
    weights = WT)

Coefficients:
(Intercept)        INDEP
     -1.609       20.975

Degrees of Freedom: 2 Total (i.e. Null);  1 Residual
Null Deviance:      13.86
Residual Deviance: 5.407        AIC: 9.407
Those two results look very similar, and it is with a data situation that seems somewhat extreme. Is the concern the 1% numerical difference in the regression coefficient? Am I reading you correctly?

--
David Winsemius, MD
Alameda, CA, USA
David Winsemius, MD
Alameda, CA, USA
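As a closing sketch (with made-up, non-separated data d2, not from the thread): when the MLE is finite, the case-level fit and the weighted aggregated fit agree to numerical precision, which suggests the weights= formulation itself is equivalent and the ~1% discrepancy above is purely an artifact of the diverging estimate.

```r
# Hypothetical non-separated data: both INDEP levels carry both outcomes
d2 <- data.frame(RESP  = c(1, 1, 0, 1, 0, 0, 1, 0),
                 INDEP = c(1, 1, 1, 0, 0, 0, 1, 0))
fit_case <- glm(RESP ~ INDEP, family = "binomial", data = d2)

# Same aggregation recipe as in the thread
d2Agg <- aggregate(d2$RESP, by = list(d2$RESP, d2$INDEP), FUN = length)
colnames(d2Agg) <- c("RESP", "INDEP", "WT")
fit_agg <- glm(RESP ~ INDEP, family = "binomial", data = d2Agg, weights = WT)

# With a finite MLE the two fits coincide (up to IRLS tolerance)
all.equal(coef(fit_case), coef(fit_agg), tolerance = 1e-6)
```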