Message-ID: <4937E0BF.2010609@vanderbilt.edu>
Date: 2008-12-04T13:53:03Z
From: Frank E Harrell Jr
Subject: Logistic Regression: variable selection based on p value?
In-Reply-To: <73b1681b0812032230n517f8905paad82dd91ca46fb1@mail.gmail.com>

pufftissue pufftissue wrote:
> Hi,
> 
> When I use logistic regression, each variable has a p value associated with
> it.  Do I only include the variables that have a statistically significant p
> value (<0.05), or are there situations when I should include variables when
> their p values are high?  I had heard that if a variable has a high p value
> but it's not the terminal variable, keep it; otherwise, take it out.  Not
> sure if it's right or even why this is the case.  What about if my p values
> are terrible but this combo of variables yields the highest AUC and
> calibration?  What prevails in this case?
> 
> Thank you!

It depends on your goals, but in general, problems caused by stepwise 
regression arise from using P-value cutoffs that are too small rather 
than cutoffs that are too large.  There are many reasons not to remove 
any variables, if you want valid confidence intervals and P-values and 
discrimination indexes.  Note that AUC is not a great objective 
function; that's why we have the log likelihood.
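
One way to see the point about AUC, as a minimal sketch rather than anything from the original exchange: AUC depends only on the rank order of the predicted probabilities, so two sets of predictions with the same ordering get the same AUC even when one set is compressed toward 0.5. The Bernoulli log likelihood does tell them apart. The data and the names auc, log_likelihood, p_a and p_b below are made up for illustration.

```python
import math


def auc(y, p):
    """Concordance probability: the chance that a randomly chosen event
    has a higher predicted probability than a randomly chosen non-event
    (ties count as 1/2)."""
    pos = [pi for yi, pi in zip(y, p) if yi == 1]
    neg = [pi for yi, pi in zip(y, p) if yi == 0]
    wins = sum(
        1.0 if a > b else 0.5 if a == b else 0.0
        for a in pos
        for b in neg
    )
    return wins / (len(pos) * len(neg))


def log_likelihood(y, p):
    """Bernoulli log likelihood of outcomes y under predicted probabilities p."""
    return sum(
        math.log(pi) if yi == 1 else math.log(1.0 - pi)
        for yi, pi in zip(y, p)
    )


# Hypothetical outcomes and two hypothetical sets of predictions.
y = [0, 0, 1, 0, 1, 1]
p_a = [0.10, 0.30, 0.40, 0.60, 0.70, 0.90]  # spread-out predictions
p_b = [0.45, 0.47, 0.48, 0.52, 0.53, 0.55]  # same ordering, squeezed toward 0.5

auc_a, auc_b = auc(y, p_a), auc(y, p_b)
ll_a, ll_b = log_likelihood(y, p_a), log_likelihood(y, p_b)

print("AUC:           ", auc_a, auc_b)  # identical: only the ranks matter
print("log likelihood:", ll_a, ll_b)    # different: the probabilities matter
```

Because the ordering of p_a and p_b is the same, AUC cannot prefer one over the other, while the log likelihood is higher for p_a. This toy comparison only illustrates the rank-based nature of AUC; it says nothing about which variables to keep in a real model.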

Frank
-- 
Frank E Harrell Jr   Professor and Chair           School of Medicine
                      Department of Biostatistics   Vanderbilt University