Thank you all for your help and advice. This wasn't quite the answer I was looking for, but these concepts make more sense to me now and I think I should be able to resolve the issues I've been having. Thanks again!
On Sun, Sep 30, 2012 at 6:26 PM, David Winsemius <dwinsemius at comcast.net> wrote:
On Sep 30, 2012, at 4:47 AM, Josh Browning wrote:
Hi David, Yes, I agree that the results are "very similar" but I don't understand why they are not exactly equal given that the data sets are identical. And yes, this 1% numerical difference is hugely important to me.
I believe you will find it is not important at all.
I have another data set (much larger than this toy example) where glm() on the aggregated data returns a coefficient of about 1, but on the non-aggregated data it warns about perfect separation and returns a coefficient of about 1e15. So I'd at least like to understand where this numerical difference is coming from and, preferably, find a way to tweak my glm() runs (possibly by adjusting the numerical precision somehow?) so that this doesn't happen.
Here. This shows how you can ratchet up your desired precision:
dAgg = aggregate( d$RESP, by=list(d$RESP, d$INDEP), FUN=length )
colnames(dAgg) = c("RESP","INDEP","WT")
glm( RESP ~ INDEP, family="binomial", data=dAgg, weights=WT,
     control = glm.control(epsilon = 1e-10, maxit = 50) )
glm( RESP ~ INDEP, family="binomial", data=dAgg, weights=WT,
     control = glm.control(epsilon = 1e-11, maxit = 30) )
glm( RESP ~ INDEP, family="binomial", data=dAgg, weights=WT,
     control = glm.control(epsilon = 1e-12, maxit = 50) )
glm( RESP ~ INDEP, family="binomial", data=dAgg, weights=WT,
     control = glm.control(epsilon = 1e-13, maxit = 50) )
... and finally you get a message that _should_ demonstrate why I thought asking for greater numerical precision in an extreme case such as this was a "Fool's errand". The "true" value is positive infinity, but double-precision floating point is only 8 bytes, so we can only get to an estimated odds ratio of exp(32.975) = 2.09344e+14, which corresponds to odds of about 200 trillion to 1. Further increases in 'maxit' and decreases in 'epsilon' are vigorously ignored, ... and rightfully so.
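The arithmetic is easy to check in R (32.975 being the INDEP coefficient reached at the tightest tolerance above):

```r
# Log-odds coefficient from the tightest-tolerance fit above
beta <- 32.975
exp(beta)  # estimated odds ratio, roughly 2.09e14
```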
After you do such analyses long enough, you learn that coefficients above 5 are suspicious and those above 10 are usually a sign of an error in data preparation.
(I do not think this is an example of complete separation, since the ranges of the predictors overlap for the two outcomes. But who knows, I have made (many) errors before.)
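A quick cross-tabulation of the toy data (the same d defined in the example below) makes the distinction visible: the INDEP = 1 cell carries only RESP = 1, while INDEP = 0 mixes both outcomes, which is quasi-complete rather than complete separation. A sketch:

```r
d <- data.frame(RESP  = c(rep(1, 5), rep(0, 5)),
                INDEP = c(1, 1, 1, 1, 0, 0, 0, 0, 0, 0))

# Rows: INDEP, columns: RESP
table(d$INDEP, d$RESP)
#     0 1
#   0 5 1
#   1 0 4
# The empty cell (INDEP = 1, RESP = 0) is what drives the INDEP
# coefficient toward +Inf; INDEP = 0 still carries both outcomes.
```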
--
David,
Josh

On Sat, Sep 29, 2012 at 7:50 PM, David Winsemius <dwinsemius at comcast.net> wrote:
On Sep 29, 2012, at 7:10 AM, Josh Browning wrote:
Hi useRs, I'm experiencing something quite weird with glm() and weights, and maybe someone can explain what I'm doing wrong. I have a dataset where each row represents a single case, and I run glm(...,family="binomial") and get my coefficients. However, some of my cases have the exact same values for predictor variables, so I should be able to aggregate up my data frame and run glm(..., family="binomial",weights=wts) and get the same coefficients (maybe this is my incorrect assumption, but I can't see why it would be). Anyways, here's a minimum working example below:
d = data.frame( RESP=c(rep(1,5),rep(0,5)), INDEP=c(1,1,1,1,0,0,0,0,0,0) )
glm( RESP ~ INDEP, family="binomial", data=d )
Call:  glm(formula = RESP ~ INDEP, family = "binomial", data = d)

Coefficients:
(Intercept)        INDEP
     -1.609       21.176

Degrees of Freedom: 9 Total (i.e. Null);  8 Residual
Null Deviance:      13.86
Residual Deviance: 5.407        AIC: 9.407
dAgg = aggregate( d$RESP, by=list(d$RESP, d$INDEP), FUN=length )
colnames(dAgg) = c("RESP","INDEP","WT")
glm( RESP ~ INDEP, family="binomial", data=dAgg, weights=WT )
Call:  glm(formula = RESP ~ INDEP, family = "binomial", data = dAgg,
    weights = WT)

Coefficients:
(Intercept)        INDEP
     -1.609       20.975

Degrees of Freedom: 2 Total (i.e. Null);  1 Residual
Null Deviance:      13.86
Residual Deviance: 5.407        AIC: 9.407
Those two results look very similar, and it is with a data situation that seems somewhat extreme. Is the concern the 1% numerical difference in the regression coefficient? Am I reading you correctly?

--
David Winsemius, MD
Alameda, CA, USA
David Winsemius, MD
Alameda, CA, USA
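As a closing sketch (with made-up, non-separated data d2, not from the thread): when the MLE is finite, the case-level fit and the weighted aggregated fit agree to numerical precision, which suggests the weights= formulation itself is equivalent and the ~1% discrepancy above is purely an artifact of the diverging estimate.

```r
# Hypothetical non-separated data: both INDEP levels carry both outcomes
d2 <- data.frame(RESP  = c(1, 1, 0, 1, 0, 0, 1, 0),
                 INDEP = c(1, 1, 1, 0, 0, 0, 1, 0))
fit_case <- glm(RESP ~ INDEP, family = "binomial", data = d2)

# Same aggregation recipe as in the thread
d2Agg <- aggregate(d2$RESP, by = list(d2$RESP, d2$INDEP), FUN = length)
colnames(d2Agg) <- c("RESP", "INDEP", "WT")
fit_agg <- glm(RESP ~ INDEP, family = "binomial", data = d2Agg, weights = WT)

# With a finite MLE the two fits coincide (up to IRLS tolerance)
all.equal(coef(fit_case), coef(fit_agg), tolerance = 1e-6)
```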