Logistic Regression: variable selection based on p value?
pufftissue wrote:
Hi,

When I use logistic regression, each variable has a p-value associated with it. Do I include only the variables with a statistically significant p-value (< 0.05), or are there situations where I should keep variables even though their p-values are high? I had heard that if a variable has a high p-value but is not the terminal variable, you keep it; otherwise, you take it out. I'm not sure whether that is right, or why it would be the case.

What if my p-values are terrible but this combination of variables yields the highest AUC and the best calibration? Which prevails in that case?

Thank you!
It depends on your goals, but in general the problems caused by stepwise regression arise from using P-value cutoffs that are too small rather than cutoffs that are too large. If you want valid confidence intervals, P-values, and discrimination indexes, there are many reasons not to remove any variables at all. Note that AUC is not a great objective function; that's why we have the log likelihood.

Frank
Frank E Harrell Jr, Professor and Chair
Department of Biostatistics, School of Medicine
Vanderbilt University
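
One way to see the point about AUC versus log likelihood is a small sketch, which is an illustration and not something taken from the reply above. AUC depends only on how the predicted probabilities rank the observations, so a badly miscalibrated set of predictions can have exactly the same AUC as a reasonable one, while the log likelihood penalizes the miscalibration. The outcomes `y`, the predictions `p_a`, and the distortion exponent `k = 10` are all made-up values chosen for illustration, and the helper names `auc` and `log_likelihood` are hypothetical.

```python
import math

def auc(y, p):
    """Concordance probability: share of (event, non-event) pairs in which
    the event has the higher predicted probability (ties count as 1/2)."""
    events = [pi for yi, pi in zip(y, p) if yi == 1]
    non_events = [pi for yi, pi in zip(y, p) if yi == 0]
    score = 0.0
    for a in events:
        for b in non_events:
            if a > b:
                score += 1.0
            elif a == b:
                score += 0.5
    return score / (len(events) * len(non_events))

def log_likelihood(y, p):
    """Bernoulli log likelihood of outcomes y given predicted probabilities p."""
    return sum(math.log(pi) if yi == 1 else math.log(1.0 - pi)
               for yi, pi in zip(y, p))

# Made-up outcomes and predicted probabilities (assumption, for illustration).
y   = [0,    0,    1,    0,    1,    1,    0,    1]
p_a = [0.10, 0.30, 0.35, 0.40, 0.60, 0.70, 0.20, 0.90]

# Strictly increasing distortion that pushes every prediction towards 0 or 1
# (equivalent to multiplying the logit by k). The ranking is unchanged.
k = 10
p_b = [pi**k / (pi**k + (1.0 - pi)**k) for pi in p_a]

auc_a, auc_b = auc(y, p_a), auc(y, p_b)
ll_a, ll_b = log_likelihood(y, p_a), log_likelihood(y, p_b)

print("AUC:           ", auc_a, auc_b)   # identical: same ranking
print("log likelihood:", ll_a, ll_b)     # the distorted predictions score lower
```

In this toy data the two sets of predictions are indistinguishable by AUC, yet the overconfident set has a clearly lower log likelihood, because it assigns a very small probability to one event that did occur. This is only one of the reasons AUC can mislead as an objective; it says nothing by itself about which variables to keep.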