Message-ID: <20040201143134.1dbd5904.feh3k@spamcop.net>
Date: 2004-02-01T19:31:34Z
From: Frank E Harrell Jr
Subject: Stepwise Regression and PLS
In-Reply-To: <20040201190928.41718.qmail@web20805.mail.yahoo.com>
On Sun, 1 Feb 2004 11:09:28 -0800 (PST)
Jinsong Zhao <jinsong_zh at yahoo.com> wrote:
> Dear all,
>
> I am a newcomer to R. I intend to use R to do
> stepwise regression and PLS with a data set (a 55x20
> matrix, with one dependent and 19 independent
> variables). Based on the same data set, I have done the
> same work using SPSS and SAS. However, the results
> obtained with R differ considerably from those of SPSS
> or SAS.
>
> In the case of stepwise regression, SPSS produced a model
> with 4 independent variables, but with step(), R produced a
> model with 10 and a much higher R2. Furthermore,
> regsubsets() also indicates that the 10-variable model is
> one of the best regression subsets. How can this
> difference be explained? And for my data set, how many
> variables would it be reasonable to enter into the model?
>
> In the case of PLS, the results of the mvr function of the
> pls.pcr package also differ from those of SAS.
> Although the optimal number of latent variables is the
> same, the difference in R2 is very large. Why?
>
> Any comments and suggestions are much appreciated. Thanks
> in advance!
>
> Best wishes,
>
> Jinsong Zhao
>
In your case SPSS, SAS, R, S-Plus, Stata, Systat, Statistica, and every
other package will agree in one sense, because results from all of them
will be virtually meaningless. Simulate some data from a known model and
you'll quickly find out why stepwise variable selection is often a train
wreck.
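
A minimal sketch of such a simulation in R follows. The dimensions (55
observations, 19 candidate predictors) match the data set described above;
everything else is an assumption made for illustration: the true model uses
only x1, x2 and x3, the effect sizes and the seed are arbitrary, and step()
is run with its default AIC criterion rather than the SPSS or SAS settings.

```r
# Illustrative simulation; true model, effect sizes and seed are assumptions.
set.seed(2004)
n <- 55                           # observations, as in the original data set
p <- 19                           # candidate predictors
true_vars <- c("x1", "x2", "x3")  # assumed: only these affect y

runs <- vector("list", 20)        # selected variable names, one entry per data set
for (i in seq_along(runs)) {
  X <- matrix(rnorm(n * p), n, p,
              dimnames = list(NULL, paste0("x", 1:p)))
  y <- X[, 1] + 0.5 * X[, 2] + 0.5 * X[, 3] + rnorm(n)
  d <- data.frame(y = y, X)

  # Stepwise selection by AIC, starting from the full model
  sel <- step(lm(y ~ ., data = d), direction = "both", trace = 0)
  runs[[i]] <- names(coef(sel))[-1]   # drop the intercept
}

n_selected   <- sapply(runs, length)
n_noise      <- sapply(runs, function(v) length(setdiff(v, true_vars)))
n_true_found <- sapply(runs, function(v) sum(true_vars %in% v))

print(table(n_selected))     # model size across the simulated data sets
print(summary(n_noise))      # pure-noise variables that were retained
print(table(n_true_found))   # how many of the 3 real predictors were kept
```

Since the truth is known here, the selected models can be compared with it
directly: check how the chosen set of variables changes from one simulated
data set to the next, and how often variables with no relation to y are kept.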
---
Frank E Harrell Jr Professor and Chair School of Medicine
Department of Biostatistics Vanderbilt University