Variable Importance in pls: R or B? (and in glpls?)

Christoph:

I noted that there were not a great number of people leaping to reply. One
reason, I suspect, is that there's really NO GOOD ANSWER to this question.
First, there is a huge literature on this -- it's related to variable
selection in regression and shrinkage estimation and, more generally, to
parsimonious model building. Second, as Ron Wehrens already noted, when
variables are correlated -- which could have as much to do with the vagaries
of the sampling as with real physical causality -- the whole notion of
"variable importance" is problematic. Fact is, **any** attempt to rank the
contributions of particular variables to PLS components from undesigned data
(the usual case) is fraught with hazard. For that reason, it is perhaps best
to view PLS as merely a way of developing a good predictor, not as a way to
uncover causal relationships. I know this is often unsatisfying to
scientists trying to build parsimonious mechanistic models (= physical
theories), especially as there is quite often little likelihood that the
data are representative of any underlying population, and therefore little
likelihood that they can predict anything, but it is the statistical reality.
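
To make the point about correlated variables concrete, here is a minimal
simulation sketch. It is not PLS: it uses ordinary least squares with two
predictors so that it needs nothing beyond the Python standard library, and
everything in it is an assumption made for illustration (the true slopes of
1.0 and 0.5, the sample size, and the helper names `ols_slopes` and
`rank_flip_rate`). It counts how often the fitted coefficients rank the
weaker variable above the stronger one, for uncorrelated and for highly
correlated predictors.

```python
import random


def ols_slopes(x1, x2, y):
    """Least-squares slopes of y on x1 and x2 (intercept handled by centring)."""
    n = len(y)
    m1, m2, my = sum(x1) / n, sum(x2) / n, sum(y) / n
    s11 = sum((a - m1) ** 2 for a in x1)
    s22 = sum((b - m2) ** 2 for b in x2)
    s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    s1y = sum((a - m1) * (c - my) for a, c in zip(x1, y))
    s2y = sum((b - m2) * (c - my) for b, c in zip(x2, y))
    det = s11 * s22 - s12 ** 2
    # Solve the 2x2 normal equations explicitly.
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    return b1, b2


def rank_flip_rate(rho, n=30, reps=500, seed=1):
    """Fraction of simulated samples in which |b2| > |b1|.

    Assumed true model (illustrative only): y = 1.0*x1 + 0.5*x2 + noise,
    where x1 and x2 have unit variance and population correlation rho.
    """
    rng = random.Random(seed)
    flips = 0
    for _ in range(reps):
        x1 = [rng.gauss(0, 1) for _ in range(n)]
        x2 = [rho * a + (1 - rho ** 2) ** 0.5 * rng.gauss(0, 1) for a in x1]
        y = [1.0 * a + 0.5 * b + rng.gauss(0, 1) for a, b in zip(x1, x2)]
        b1, b2 = ols_slopes(x1, x2, y)
        if abs(b2) > abs(b1):
            flips += 1
    return flips / reps


if __name__ == "__main__":
    for rho in (0.0, 0.95):
        print(f"rho = {rho}: ranking reversed in {rank_flip_rate(rho):.0%} of samples")
```

Under these assumptions the ranking is reversed far more often when the
predictors are highly correlated, even though the underlying model is the
same in both cases. The same sampling instability is what makes importance
rankings from undesigned data hazardous.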

For a more informed, more interesting, and more eloquent discussion of these
and related issues, you might look up Leo Breiman's writings on his web site
and his way of trying to assess "variable importance" in his Random Forest
methodology, which is available in the package randomForest on CRAN. (I make
no claim about the effectiveness of this approach -- only that it is a clearly
different way of approaching the issue, one that reveals the dilemmas.)

Cheers,

-- Bert Gunter
Genentech Non-Clinical Statistics
South San Francisco, CA

"The business of the statistician is to catalyze the scientific learning
process."  - George E. P. Box
