Logistic Regression: variable selection based on p value?
3 messages · pufftissue pufftissue, Erik Iverson, Frank E Harrell Jr
Puff - There are many strategies and ideas, and a large literature, on this topic. A great introduction, which also points to many of the interesting references, is Frank Harrell's book, "Regression Modeling Strategies". I would highly recommend it.
pufftissue pufftissue wrote:
Hi, When I use logistic regression, each variable has a p value associated with it. Do I only include the variables that have a statistically significant p value (<0.05), or are there situations when I should include variables when their p values are high? I had heard that if a variable has a high p value but it's not the terminal variable, keep it; otherwise, take it out. Not sure if it's right or even why this is the case. What about if my p values are terrible but this combo of variables yields the highest AUC and calibration? What prevails in this case? Thank you!
It depends on your goals, but in general problems caused by stepwise regression arise from using P-value cutoffs that are too small rather than cutoffs that are too large. There are many reasons not to remove any variables, if you want valid confidence intervals and P-values and discrimination indexes. Note that AUC is not a great objective function; that's why we have the log likelihood. Frank
Frank E Harrell Jr, Professor and Chair
Department of Biostatistics, School of Medicine, Vanderbilt University
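One way to see the point about AUC versus the log likelihood is a small numeric sketch. It is not from the thread: the outcome vector and the two sets of predicted probabilities below are made-up illustration values, the helper names `auc` and `log_likelihood` are hypothetical, and it is written in plain Python rather than R purely for self-containment. AUC depends only on how the predictions rank the observations, so two models that rank cases identically get the same AUC even when one of them is badly overconfident; the Bernoulli log likelihood does tell them apart.

```python
import math

def auc(y, p):
    """Concordance probability: the share of (event, non-event) pairs in
    which the event has the higher predicted probability (ties count 1/2)."""
    events = [pi for yi, pi in zip(y, p) if yi == 1]
    non_events = [pi for yi, pi in zip(y, p) if yi == 0]
    score = 0.0
    for a in events:
        for b in non_events:
            if a > b:
                score += 1.0
            elif a == b:
                score += 0.5
    return score / (len(events) * len(non_events))

def log_likelihood(y, p):
    """Bernoulli log likelihood of outcomes y under predicted probabilities p."""
    return sum(math.log(pi) if yi == 1 else math.log(1.0 - pi)
               for yi, pi in zip(y, p))

# Made-up outcomes and predictions, for illustration only.
y = [0, 0, 1, 0, 1, 1, 0, 1]
p_moderate = [0.10, 0.20, 0.30, 0.40, 0.60, 0.70, 0.80, 0.90]
# Same ordering of cases as p_moderate, but pushed toward 0 and 1.
p_extreme = [0.01, 0.02, 0.03, 0.04, 0.96, 0.97, 0.98, 0.99]

auc_moderate = auc(y, p_moderate)
auc_extreme = auc(y, p_extreme)
ll_moderate = log_likelihood(y, p_moderate)
ll_extreme = log_likelihood(y, p_extreme)

print("AUC:           ", auc_moderate, auc_extreme)
print("log likelihood:", round(ll_moderate, 3), round(ll_extreme, 3))
```

Both sets of predictions give the same AUC, while the log likelihood is clearly lower for the overconfident set, because it penalizes the two cases that were predicted with near certainty and turned out wrong.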