Variable Importance in pls: R or B? (and in glpls?)

3 messages · Christoph Lehmann, Ron Wehrens, Bert Gunter

#
Dear R-users, dear Ron

I use pls from the pls.pcr package for classification. Since I need to 
know which variables are most influential for the classification 
performance, which criterion should I look at:

a) B, the array of regression coefficients for a given model (i.e. a 
given number of latent variables) (and: squared or absolute values?)

OR

b) the weight matrix RR (or R in the De Jong publication; in Ding & 
Gentleman this is the P Matrix and called 'loadings')? (and again: 
squared or absolute values?)



and what about glpls (glpls1a) ?
shall I look at the 'coefficients' (regression coefficients)?

Thanks for clarification

Christoph
#
On Sunday 12 September 2004 14:12, Christoph Lehmann wrote:
The regression coefficients give the most direct information on which 
variables influence the classification, although you must be careful with the 
interpretation if the variables are correlated. It is the absolute 
magnitude that is important; why would you look at the squared values?
The object that is returned contains X and Y loadings (which are _not_ equal 
to the RR matrix, btw); these are mainly used for interpretation. The 
regression coefficients give information on your complete model; the loadings 
on individual components of the model.
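
To make the distinction between coefficients, weights and loadings concrete, here is a minimal sketch in plain Python (not R, and not the pls.pcr code) of a one-component PLS1 fit on made-up data. With a single component the weight matrix reduces to one vector w, the regression coefficients are b = w * q, and the X loadings p come from regressing X on the scores t. The data, the variable names and the helper functions are all illustrative assumptions, and scaling conventions differ between PLS algorithms.

```python
# One-component PLS1 on a tiny made-up data set (pure Python, no libraries).
# Everything here is illustrative; it is not the pls.pcr implementation.

def center(col):
    """Subtract the column mean."""
    m = sum(col) / len(col)
    return [v - m for v in col]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Hypothetical predictors: x2 is nearly a copy of x1; x3 has large variance.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [1.1, 1.9, 3.2, 3.9, 5.1, 5.8]
x3 = [3.0, -2.0, 1.0, -3.0, 2.0, -1.0]
y = [1.2, 1.8, 3.1, 4.2, 4.9, 6.1]

X = [center(c) for c in (x1, x2, x3)]  # list of centered columns
yc = center(y)
n = len(yc)

# Weight vector w: proportional to X'y, scaled to unit length.
xty = [dot(col, yc) for col in X]
norm = dot(xty, xty) ** 0.5
w = [v / norm for v in xty]

# Scores t = X w.
t = [sum(w[j] * X[j][i] for j in range(len(X))) for i in range(n)]
tt = dot(t, t)

# X loadings p = X't / t't and Y loading q = y't / t't.
p = [dot(col, t) / tt for col in X]
q = dot(yc, t) / tt

# Regression coefficients for the one-component model: b = w * q.
b = [wj * q for wj in w]

print("weights  w:", w)
print("loadings p:", p)
print("coefs    b:", b)
```

In this toy fit b is just w rescaled by q, while p is a different vector, which is the point about the loadings not being the weight matrix.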

Ron

#
Christoph:

I noted that there were not a great number of people leaping to reply. One
reason, I suspect, is that there's really NO GOOD ANSWER to this question.
First, there is a huge literature on this -- it's related to variable
selection in regression and shrinkage estimates and, more generally, to
parsimonious model building; second, as Ron Wehrens already noted, when
variables are correlated -- which could have as much to do with the vagaries
of the sampling as with real physical causality -- the whole notion of
"variable importance" is problematic. Fact is, **any** attempt to rank the
contributions of particular variables to PLS components from undesigned data
(the usual case) is fraught with hazard. For that reason, it is perhaps best
to view pls as merely a way of developing a good predictor, not as a way to
uncover causal relationships. I know this is often unsatisfying to
scientists trying to build parsimonious mechanistic models (= physical
theories), especially as there is quite often little likelihood that the
data are representative of any underlying population and therefore capable
of predicting anything, but it is the statistical reality.

For a more informed, more interesting, and more eloquent discussion of these
and related issues, you might look up Leo Breiman's writings on his web site
and his way of trying to assess "variable importance" in his Random Forest
methodology, which is available in the package randomForest on CRAN. (I make
no claim about the effectiveness of this approach -- only that it is a clearly
different way of approaching the issue, one that clearly reveals the dilemmas).
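
For readers who have not met it, the core of the permutation idea can be sketched in a few lines of plain Python: shuffle one variable at a time and see how much the prediction error grows. This is only an illustration under stated assumptions. The data are made up, `predict` is a hypothetical stand-in for any fitted model, and `permutation_importance` is a name invented here; the randomForest package computes its measure per tree on out-of-bag samples, which this sketch does not attempt.

```python
import random

def mse(pred, y):
    """Mean squared error between predictions and observed values."""
    return sum((p - t) ** 2 for p, t in zip(pred, y)) / len(y)

def permutation_importance(predict, X, y, n_repeats=20, seed=0):
    """Mean increase in MSE when one column of X is shuffled at a time."""
    rng = random.Random(seed)
    base = mse([predict(row) for row in X], y)
    scores = []
    for j in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            col = [row[j] for row in X]
            rng.shuffle(col)
            # Rebuild the rows with column j replaced by its shuffled copy.
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, col)]
            increases.append(mse([predict(r) for r in X_perm], y) - base)
        scores.append(sum(increases) / n_repeats)
    return scores

# Hypothetical data: x2 is almost a copy of x1, x3 is unrelated to y.
X = [[1.0, 1.1, 3.0], [2.0, 1.9, -2.0], [3.0, 3.2, 1.0],
     [4.0, 3.9, -3.0], [5.0, 5.1, 2.0], [6.0, 5.8, -1.0]]
y = [2.0 * row[0] for row in X]

# Stand-in "fitted model" that happens to use only the first variable.
def predict(row):
    return 2.0 * row[0]

scores = permutation_importance(predict, X, y)
print(scores)
```

The second variable carries almost the same information as the first, yet it scores zero because this particular model never uses it. That is the correlated-variables dilemma in miniature.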

Cheers,

-- Bert Gunter
Genentech Non-Clinical Statistics
South San Francisco, CA
 
"The business of the statistician is to catalyze the scientific learning
process."  - George E. P. Box