binomial glm for relevant feature selection?

Ben Liblit

Sun, Nov 10, 2002 3:50 PM

As suggested in my earlier message, I have a large population of 
independent variables and a binary dependent outcome.  It is expected 
that only a few of the independent variables actually contribute to the 
outcome, and I'd like to find those.

If it wasn't already obvious, I am *not* a statistician.  Not even 
close.  :-)  Statistician colleagues have suggested that I use logistic 
regression for this problem.  My understanding is that logistic 
regression is available in R as glm(..., family=binomial).

When I use this solver on fictitious data, though, the answers I expect 
are not the answers I see.  Consider the following fictitious data, 
where "z" is the dependent binary outcome, "y" is irrelevant noise, and 
"x" is actually relevant to predicting the outcome:

	  x y z
	1 8 7 1
	2 8 3 1
	3 0 5 0
	4 0 9 0
	5 8 1 1

If I feed this data to glm(z ~ x + y) using the default gaussian family, 
the results make some sense to me.  The estimated coefficient for x is 
positive and the corresponding "Pr(>|t|)" value is tiny (<2e-16), which 
I take to imply a high degree of confidence that larger values of x 
correlate with increased likelihood of z.  Conversely, the estimated 
coefficient for y has a "Pr(>|t|)" value of 0.552, which I take to imply 
that there is no strong correlation between y and z.  Good.

However, I've been told that I want to use family=binomial for a 
logistic regression problem with a binary dependent outcome like this. 
If I give this data to glm(z ~ x + y, family=binomial), the results 
become quite mysterious.  I receive a warning that "Algorithm did not 
converge".  The "Pr(>|t|)" values for x and y are 0.916 and 1.000 
respectively, which would seem to indicate that neither one correlates 
with the outcome.
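For readers who want to reproduce this, here is a minimal sketch of the two fits described above. The names `d`, `fit_gauss`, and `fit_binom` are illustrative and not from the original message, and the exact warning text and column labels may differ between R versions.

```r
# Toy data from the message: z is 1 exactly when x is 8
d <- data.frame(x = c(8, 8, 0, 0, 8),
                y = c(7, 3, 5, 9, 1),
                z = c(1, 1, 0, 0, 1))

# glm() defaults to family = gaussian, i.e. ordinary least squares
fit_gauss <- glm(z ~ x + y, data = d)
print(coef(summary(fit_gauss)))

# Logistic regression on the same data; expect warnings here
fit_binom <- glm(z ~ x + y, family = binomial, data = d)
print(coef(summary(fit_binom)))
```

The last column of each printed table holds the p-values discussed in the message.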

I realize that this is not a problem with R.  It is a problem with my 
understanding of what R is doing.  But you all have been so helpful thus 
far, perhaps I can impose on you to give me one more clue?  What am I 
doing wrong here?  What should I be looking at that I'm not?

Thank you, once again!

-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
r-help mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
Send "info", "help", or "[un]subscribe"
(in the "body", not the subject !)  To: r-help-request at stat.math.ethz.ch
_._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._

Brian Ripley

Sun, Nov 10, 2002 11:32 PM

On Sun, 10 Nov 2002, Ben Liblit wrote:

> As suggested in my earlier message, I have a large population of
> independent variables and a binary dependent outcome.  It is expected
> that only a few of the independent variables actually contribute to the
> outcome, and I'd like to find those.
>
> If it wasn't already obvious, I am *not* a statistician.  Not even
> close.  :-)  Statistician colleagues have suggested that I use logistic
> regression for this problem.  My understanding is that logistic
> regression is available in R as glm(..., family=binomial).
>
> When I use this solver on fictitious data, though, the answers I expect
> are not the answers I see.  Consider the following fictitious data,
> where "z" is the dependent binary outcome, "y" is irrelevant noise, and
> "x" is actually relevant to predicting the outcome:
>
> 	  x y z
> 	1 8 7 1
> 	2 8 3 1
> 	3 0 5 0
> 	4 0 9 0
> 	5 8 1 1
>
> If I feed this data to glm(z ~ x + y) using the default gaussian family,
> the results make some sense to me.  The estimated coefficient for x is
> positive and the corresponding "Pr(>|t|)" value is tiny (<2e-16), which
> I take to imply a high degree of confidence that larger values of x
> correlate with increased likelihood of z.  Conversely, the estimated
> coefficient for y has a "Pr(>|t|)" value of 0.552, which I take to imply
> that there is no strong correlation between y and z.  Good.
>
> However, I've been told that I want to use family=binomial for a
> logistic regression problem with a binary dependent outcome like this.
> If I give this data to glm(z ~ x + y, family=binomial), the results
> become quite mysterious.  I receive a warning that "Algorithm did not
> converge".  The "Pr(>|t|)" values for x and y are 0.916 and 1.000
> respectively, which would seem to indicate that neither one correlates
> with the outcome.
>
> I realize that this is not a problem with R.  It is a problem with my
> understanding of what R is doing.  But you all have been so helpful thus
> far, perhaps I can impose on you to give me one more clue?  What am I
> doing wrong here?  What should I be looking at that I'm not?

Your problem is linearly separable, and you are seeing the Hauck-Donner
effect.  This is rare (but by no means unknown) in real problems, and
means the Wald test as used by the t values is unreliable.

More details are in Venables & Ripley (1999, 2002); look up Hauck-Donner in
the index.  It's a technical point and the explanation is technical, but
there is also a practical summary there.
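To make the point about separability and the Wald test concrete, here is a rough sketch on the toy data from the question. It first shows that x alone splits the outcomes perfectly, then compares the Wald p-values from summary() with likelihood ratio tests from drop1(). Using drop1(..., test = "Chisq") is an assumption on my part about a reasonable alternative, not necessarily the exact advice given in the book; the object names are illustrative.

```r
# Toy data from the question: z is 1 exactly when x is 8
d <- data.frame(x = c(8, 8, 0, 0, 8),
                y = c(7, 3, 5, 9, 1),
                z = c(1, 1, 0, 0, 1))

# Separation check: each value of x occurs with only one value of z
print(table(x = d$x, z = d$z))

# The separation warnings are expected on this data, so silence them
fit_binom <- suppressWarnings(glm(z ~ x + y, family = binomial, data = d))

# Fitted probabilities are pushed to (numerically) 0 or 1
print(round(fitted(fit_binom), 6))

# Wald tests: large estimates with much larger standard errors
print(coef(summary(fit_binom)))

# Likelihood ratio tests: drop each term in turn and compare deviances
lrt <- suppressWarnings(drop1(fit_binom, test = "Chisq"))
print(lrt)
```

On this data the likelihood ratio p-value for x should come out far smaller than its Wald p-value, while the one for y stays near 1. With only five observations, treat either number as an illustration of the effect and not as a usable inference.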

Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272860 (secr)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595
