Stepwise Regression and PLS
6 messages · Frank E Harrell Jr, Jinsong Zhao, Chris Lawrence

On Sun, 1 Feb 2004 11:09:28 -0800 (PST)
Jinsong Zhao <jinsong_zh at yahoo.com> wrote:

Dear all,

I am a newcomer to R. I intend to use R to do stepwise regression and PLS with a data set (a 55x20 matrix, with one dependent and 19 independent variables). I have already done the same analyses on this data set using SPSS and SAS; however, the results from R differ considerably from those of SPSS and SAS.

In the stepwise case, SPSS produced a model with 4 independent variables, but R's step() produced a model with 10 variables and a much higher R^2. Furthermore, regsubsets() also indicates that the 10-variable model is one of the best regression subsets. How can this difference be explained? And for my data set, how many variables entering the model would be reasonable?

In the PLS case, the results of the mvr function in the pls.pcr package also differ from those of SAS. Although the optimal number of latent variables is the same, the difference in R^2 is large. Why?

Any comments and suggestions are much appreciated. Thanks in advance!

Best wishes,
Jinsong Zhao

=====
(Mr.) Jinsong Zhao
Ph.D. Candidate
School of the Environment
Nanjing University
22 Hankou Road, Nanjing 210093
P.R. China
E-mail: jinsong_zh at yahoo.com
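[Editorial note: one common cause of differing R^2 values between packages is that one reports in-sample fit while the other reports a cross-validated figure. A minimal sketch with the modern pls package (the successor to pls.pcr); the data and names here are illustrative, not the poster's:]

```r
## Illustrative PLS fit: 55 x 19 predictor matrix, only the first
## two columns carry signal. Requires the 'pls' package.
library(pls)
set.seed(5)
n <- 55; p <- 19
X <- matrix(rnorm(n * p), n, p)
y <- X[, 1] - X[, 2] + rnorm(n)
d <- data.frame(y = y, X = I(X))

fit <- plsr(y ~ X, ncomp = 10, data = d, validation = "CV")

## In-sample R^2 always rises with more components; the
## cross-validated figures show where extra components stop helping.
R2(fit, estimate = "train")   # in-sample R^2 per component
R2(fit, estimate = "CV")      # cross-validated R^2 per component
```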
In your case SPSS, SAS, R, S-Plus, Stata, Systat, Statistica, and every
other package will agree in one sense, because results from all of them
will be virtually meaningless. Simulate some data from a known model and
you'll quickly find out why stepwise variable selection is often a train
wreck.
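[Editorial note: a minimal sketch of such a simulation in base R, using illustrative data shaped like the poster's — a response that depends on none of the 19 candidate predictors:]

```r
## Simulate pure-noise data (55 obs, 19 predictors): y is
## independent of every x, yet step() will typically retain
## several "significant" predictors with a non-trivial R^2.
set.seed(1)
n <- 55; p <- 19
X <- as.data.frame(matrix(rnorm(n * p), n, p))
names(X) <- paste0("x", 1:p)
X$y <- rnorm(n)                      # no true relationship at all

full <- lm(y ~ ., data = X)
sel  <- step(lm(y ~ 1, data = X), scope = formula(full),
             direction = "both", trace = 0)

length(coef(sel)) - 1        # number of noise variables retained
summary(sel)$r.squared       # "explained" variance -- all spurious
```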
---
Frank E Harrell Jr
Professor and Chair, Department of Biostatistics
School of Medicine, Vanderbilt University
--- Frank E Harrell Jr <feh3k at spamcop.net> wrote:
[...]
For the stepwise regression case, I have found that the subsets I got using regsubsets() are collinear. However, the variables in SPSS's result are not. I wonder what I should do to get the same or a better linear model. Thanks!
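[Editorial note: collinearity in a chosen subset can be quantified in base R via the condition number of the model matrix and pairwise correlations; a sketch with made-up data and variable names:]

```r
## Build a small data set in which x1 and x2 are nearly collinear.
set.seed(4)
n <- 55
mydata <- data.frame(x1 = rnorm(n))
mydata$x2 <- mydata$x1 + rnorm(n, sd = 0.05)  # almost a copy of x1
mydata$x3 <- rnorm(n)
mydata$y  <- mydata$x1 + mydata$x3 + rnorm(n)

fit <- lm(y ~ x1 + x2 + x3, data = mydata)
kappa(model.matrix(fit))             # large values flag collinearity
cor(mydata[, c("x1", "x2", "x3")])   # x1-x2 correlation near 1
```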
On Sun, 1 Feb 2004 19:13:49 -0800 (PST)
Jinsong Zhao <jinsong_zh at yahoo.com> wrote:
[...]
For the stepwise regression case, I have found that the subsets I got using regsubsets() are collinear. However, the variables in SPSS's result are not. I wonder what I should do to get the same or a better linear model.
I think you missed the point. None of the variable selection procedures
will provide results that have a fair probability of replicating in
another sample.
FH
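[Editorial note: one way to see the replication problem concretely — a sketch, not Dr. Harrell's own procedure — is to rerun stepwise selection on bootstrap resamples of the data and watch the selected subset change; all data here are illustrative:]

```r
## Refit stepwise selection on 20 bootstrap resamples and count
## how often each of 19 candidate predictors is chosen.
set.seed(2)
n <- 55; p <- 19
X <- as.data.frame(matrix(rnorm(n * p), n, p))
names(X) <- paste0("x", 1:p)
X$y <- 2 * X$x1 + rnorm(n)           # only x1 truly matters
full_form <- reformulate(paste0("x", 1:p), response = "y")

picked <- setNames(integer(p), paste0("x", 1:p))
for (b in 1:20) {
  d <- X[sample(n, replace = TRUE), ]
  s <- step(lm(y ~ 1, data = d), scope = full_form,
            direction = "both", trace = 0)
  vars <- setdiff(names(coef(s)), "(Intercept)")
  picked[vars] <- picked[vars] + 1
}
picked   # selection counts out of 20: noise variables come and go
```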
--- Frank E Harrell Jr <feh3k at spamcop.net> wrote:
[...]

I think you missed the point. None of the variable selection procedures will provide results that have a fair probability of replicating in another sample.
Do you mean that different procedures will provide different results? Maybe I don't understand your e-mail correctly. For now, I just hope I can get a reasonable linear model using the stepwise method in R, but I don't know how to deal with the collinearity problem.
Jinsong Zhao wrote:
Do you mean that different procedures will provide different results? Maybe I don't understand your e-mail correctly. For now, I just hope I can get a reasonable linear model using the stepwise method in R, but I don't know how to deal with the collinearity problem.
What Dr. Harrell means (in part) is that stepwise regression leads to models that often "overfit" the observed data pattern--i.e., models that are not generalizable. More elaboration can be found here (including comments from Dr. Harrell):

http://www.gseis.ucla.edu/courses/ed230bc1/notes4/swprobs.html

Key quote: "Personally, I would no more let an automatic routine select my model than I would let some best-fit procedure pack my suitcase."

The bottom-line advice here would be: don't use stepwise regression.

Peter Kennedy, in "A Guide to Econometrics" (pp. 187-89), suggests the following options for dealing with collinearity:

1. "Do nothing." The main problem in OLS when variables are collinear is that the estimated variances of the parameters are often inflated.
2. Obtain more data.
3. Formalize relationships among regressors (for example, in a simultaneous equation model).
4. Specify a relationship among the *parameters*.
5. Drop one or more variables. (In essence, a subset of #4 where coefficients are set to zero.)
6. Incorporate estimates from other studies. (A Bayesian might consider using a strong prior.)
7. Form a principal component from the variables, and use that instead.
8. Shrink the OLS estimates using the ridge or Stein estimators.

Hope this helps.

Chris
Dr. Chris Lawrence <cnlawren at olemiss.edu> - http://blog.lordsutch.com/