warning associated with Logistic Regression

Sun, Jan 25, 2004 10:06 AM
On 25-Jan-04 Guillem Chust wrote:
This is so. Indeed, there is a sense in which you are experiencing
unusually good fortune, since for values of your predictors in one
region you are perfectly predicting the 0s in your response, and for
values in another region you are perfectly predicting the 1s. What
better could you hope for?

However, you would respond that this is not realistic: your variables
are not (in real life) such that P(Y=1|X=x) is ever exactly 1 or
exactly 0, so this perfect prediction is not realistic.

In that case, you are somewhat stuck. The plain fact is that your
data (in particular the way the values of the X variables are distributed)
are not adequate to tell you what is happening.

There may be manipulative tricks (like penalised regression) which
would inhibit the logistic regression from going all the way to a
perfect fit; but then how would you know how far to let it go?
(It will certainly go as far in that direction as you allow.)

The key parameter in this situation is the dispersion parameter (sigma
in the usual notation). When you get perfect fit in a "completely
separated" situation, this corresponds to sigma=0. If you don't like
this, then there must be reasons why you want sigma>0 and this may
imply that you have reasons for wanting sigma to be at least s0 (say),
or, if you are prepared to be Bayesian about it, you may be satisfied
that there is a prior distribution for sigma which would not allow
sigma=0, and would attach high probability to a range of sigma values
which you consider to be realistic.

Unless you have a fairly firm idea of what sort of values sigma is
likely to have, then you are indeed stuck because you have no reason
to prefer one positive value of sigma to a different positive value
of sigma. In that case you cannot really object if the logistic
regression tries to make it as small as possible!

In the absence of such reasons, you may consider exploring the
effect of fixing sigma at some positive value, and then varying this
value. For each such value, look at the estimates of the coefficients
of the X variables, the goodness of fit, and so on. This may help you
to form an idea of what sort of estimate you should hope for, and
would enable you to design a better dataset (i.e. placement of X values)
which would be capable of supporting a fit which was both realistic
and estimated with adequate precision.
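
A rough sketch of that exploration, in Python, for the one-variable
case. It rests on an assumption about notation: sigma is read as the
scale of the logistic curve, P(Y=1|X=x) = 1/(1 + exp(-(x - mu)/sigma)),
so that the slope is 1/sigma and sigma=0 is the step function of a
perfect fit. The function names, the grid of sigma values, and the use
of bisection to find mu are all illustrative choices; the data are the
six (x,y) pairs given further down, with the second pair read as (-1,0).

```python
import math

def log1pexp(z):
    """Numerically stable log(1 + exp(z))."""
    return z + math.log1p(math.exp(-z)) if z > 0 else math.log1p(math.exp(z))

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def loglik(mu, sigma, xs, ys):
    """Bernoulli log-likelihood with P(Y=1|x) = sigmoid((x - mu) / sigma)."""
    total = 0.0
    for x, y in zip(xs, ys):
        z = (x - mu) / sigma
        total += y * z - log1pexp(z)
    return total

def fit_mu(sigma, xs, ys, iters=200):
    """With sigma held fixed, find mu by solving the score equation
    sum(fitted probabilities) = sum(y) by bisection.
    Assumes ys contains at least one 0 and at least one 1."""
    target = sum(ys)
    lo = min(xs) - 50.0 * sigma   # here every fitted probability is near 1
    hi = max(xs) + 50.0 * sigma   # here every fitted probability is near 0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        fitted = sum(sigmoid((x - mid) / sigma) for x in xs)
        if fitted > target:
            lo = mid              # curve sits too far left, move mu up
        else:
            hi = mid
    return 0.5 * (lo + hi)

xs = [-2, -1, 0, 1, 2, 3]
ys = [0, 0, 0, 1, 1, 1]

profile = []
for sigma in [2.0, 1.0, 0.5, 0.25, 0.1]:
    mu = fit_mu(sigma, xs, ys)
    ll = loglik(mu, sigma, xs, ys)
    profile.append((sigma, mu, ll))
    print(f"sigma={sigma:<5} mu={mu:.4f} loglik={ll:.4f}")
```

On separated data like these the log-likelihood keeps improving as sigma
shrinks, which is the sense in which the data alone give no reason to
stop at any positive value.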

Another approach you should consider, if you have several X variables,
is to look at subsets of these variables, retaining in the first
instance only those few (the fewer the better) which on substantive
grounds you considered to be the most important in the application
to which the data refer. Also look at the multivariate distribution
of the X values and in particular carry out a linear discriminant
analysis on them.

If, however, you have only 1 X variable, then you have a situation
equivalent to the following (pairs of (x,y)):

  (-2,0), (-1,0), (0,0), (1,1), (2,1), (3,1).

Clearly you are not going to get anything out of this unless you
either repeat the experiment many times (so that you have several
Y responses at each value of X, and probabilities between 0 and 1
at each X then have a better chance to express themselves, as so
many 0s and also so many 1s at each X), or you fill in the range
over which P(Y=1|X=x) increases from low to high, e.g. by observing
Y for X = -1.0, -0.9, -0.8, ... , 0.0, 0.1, ... 1.9, 2.0 (say).
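
For the one-variable case, a small Python sketch of both points: a check
for complete separation, and the finer grid of X values suggested above.
The six pairs are taken with the second one read as (-1,0); the function
name is hypothetical, and the check uses strict inequalities, so it does
not detect ties at the boundary (quasi-complete separation).

```python
def is_completely_separated(xs, ys):
    """True if some threshold on x splits the 0s from the 1s exactly.
    Assumes ys contains at least one 0 and at least one 1."""
    x0 = [x for x, y in zip(xs, ys) if y == 0]
    x1 = [x for x, y in zip(xs, ys) if y == 1]
    # Either all 0s lie below all 1s, or all 1s lie below all 0s.
    return max(x0) < min(x1) or max(x1) < min(x0)

xs = [-2, -1, 0, 1, 2, 3]
ys = [0, 0, 0, 1, 1, 1]
print("completely separated:", is_completely_separated(xs, ys))

# Design points filling in the transition region: -1.0, -0.9, ..., 2.0.
# Rounding avoids accumulated floating-point error in the grid values.
grid = [round(-1.0 + 0.1 * i, 1) for i in range(31)]
print(len(grid), "design points from", grid[0], "to", grid[-1])
```

Running the check before fitting tells you whether the warning is to be
expected; the grid is only where to observe Y, the responses themselves
still have to come from the experiment.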

I hope these suggestions help.
Ted.


--------------------------------------------------------------------
E-Mail: (Ted Harding) <Ted.Harding at nessie.mcc.ac.uk>
Fax-to-email: +44 (0)870 167 1972
Date: 25-Jan-04                                       Time: 18:06:16
------------------------------ XFMail ------------------------------