Message-ID: <20040201143134.1dbd5904.feh3k@spamcop.net>
Date: 2004-02-01T19:31:34Z
From: Frank E Harrell Jr
Subject: Stepwise Regression and PLS
In-Reply-To: <20040201190928.41718.qmail@web20805.mail.yahoo.com>

On Sun, 1 Feb 2004 11:09:28 -0800 (PST)
Jinsong Zhao <jinsong_zh at yahoo.com> wrote:

> Dear all,
> 
> I am a newcomer to R. I intend to use R to do
> stepwise regression and PLS with a data set (a 55x20
> matrix, with one dependent and 19 independent
> variables). Based on the same data set, I have done the
> same work using SPSS and SAS. However, the results
> obtained from R differ considerably from those of SPSS
> or SAS.
> 
> In the case of stepwise regression, SPSS produced a model
> with 4 independent variables, but with step(), R produced a
> model with 10 and a much higher R2. Furthermore,
> regsubsets() also indicates that the 10-variable model is
> one of the best regression subsets. How can this difference
> be explained? And for my data set, how many variables
> would it be reasonable to enter into the model?
> 
> In the case of PLS, the results of the mvr function of the
> pls.pcr package also differ from those of SAS.
> Although the optimal number of latent variables is the
> same, the difference in R2 is very large. Why?
> 
> Any comments and suggestions are much appreciated. Thanks
> in advance!
> 
> Best wishes,
> 
> Jinsong Zhao
> 

In your case SPSS, SAS, R, S-Plus, Stata, Systat, Statistica, and every
other package will agree in one sense, because results from all of them
will be virtually meaningless.  Simulate some data from a known model and
you'll quickly find out why stepwise variable selection is often a train
wreck.
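
A minimal sketch of such a simulation, matching the dimensions in the question (55 observations, 19 candidate predictors). It is written in Python with NumPy rather than R, and everything specific in it is an assumption made for illustration: the true model (only the first three predictors matter, each with coefficient 0.5), the unit-variance Gaussian noise, and the selection rule, which is a plain greedy forward search on AIC rather than the exact procedure used by step() or by SPSS or SAS. The function names are hypothetical.

```python
import numpy as np


def rss(X, y, cols):
    """Residual sum of squares of an OLS fit (with intercept) on the given columns."""
    A = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return float(resid @ resid)


def aic(X, y, cols):
    """Gaussian AIC up to an additive constant: n*log(RSS/n) + 2*(number of coefficients)."""
    n = len(y)
    return n * np.log(rss(X, y, cols) / n) + 2 * (len(cols) + 1)


def forward_aic(X, y):
    """Greedy forward selection: add the predictor that lowers AIC most, stop when none does."""
    p = X.shape[1]
    selected = []
    best = aic(X, y, selected)
    while len(selected) < p:
        candidates = [j for j in range(p) if j not in selected]
        score, j = min((aic(X, y, selected + [j]), j) for j in candidates)
        if score >= best:
            break
        selected.append(j)
        best = score
    return selected


def simulate(n_sims=200, n=55, p=19, true_beta=(0.5, 0.5, 0.5), seed=0):
    """Return (share of runs recovering exactly the true predictors,
    mean number of pure-noise predictors selected per run)."""
    rng = np.random.default_rng(seed)
    beta = np.asarray(true_beta)
    k = len(beta)
    truth = set(range(k))
    exact, noise_picked = 0, 0
    for _ in range(n_sims):
        X = rng.standard_normal((n, p))
        # Known model (assumed): only the first k columns affect y.
        y = X[:, :k] @ beta + rng.standard_normal(n)
        chosen = set(forward_aic(X, y))
        exact += chosen == truth
        noise_picked += len(chosen - truth)
    return exact / n_sims, noise_picked / n_sims


if __name__ == "__main__":
    exact_rate, mean_noise = simulate()
    print(f"runs recovering exactly the true model: {exact_rate:.2f}")
    print(f"mean noise variables selected per run:  {mean_noise:.2f}")
```

Because the data-generating model is known, the selected variables can be checked against the truth in every run. Changing n, p, the coefficients, or the stopping rule shows how sensitive the selected model is to each.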

---
Frank E Harrell Jr   Professor and Chair           School of Medicine
                     Department of Biostatistics   Vanderbilt University