Dear all,

I would like to perform a regression tree analysis on a dataset with multicollinear variables (as climate variables often are). The questions that I am asking are:

1- Is there any particular statistical problem in using multicollinear variables in a regression tree?
2- Multicollinear variables should appear as alternate splits. Would it be more accurate to present these alternate splits in the results of the analysis or apply a variable selection or reduction procedure before the regression tree?

Thank you in advance,

Jean-Noel Candau
INRA - Unité de Recherches Forestières Méditerranéennes
Avenue A. Vivaldi
84000 AVIGNON
Tel: (33) 4 90 13 59 22
Fax: (33) 4 90 13 59 59
Recursive partitioning with multicollinear variables
2 messages · Jean-Noel, Frank E Harrell Jr
On Mon, 9 Feb 2004 11:24:39 +0100
"Jean-Noel" <jean-noel.candau at avignon.inra.fr> wrote:
Dear all,

I would like to perform a regression tree analysis on a dataset with multicollinear variables (as climate variables often are). The questions that I am asking are:

1- Is there any particular statistical problem in using multicollinear variables in a regression tree?
2- Multicollinear variables should appear as alternate splits. Would it be more accurate to present these alternate splits in the results of the analysis or apply a variable selection or reduction procedure before the regression tree?

Thank you in advance,
Jean-Noel Candau
A more accurate and stable result would be obtained by performing a data
reduction procedure that ignores the response variable. Combining
collinear variables into an index is often better than arbitrarily
choosing between them. Then use the indexes in a regression model unless
you have tens of thousands of observations for recursive partitioning, or
are using bagging of trees or a related procedure to cancel out the
instability in the tree growing process [which unfortunately will often
result in an average of trees that is more complex in appearance than a
regression model].
Frank
---
Frank E Harrell Jr Professor and Chair School of Medicine
Department of Biostatistics Vanderbilt University