Variable Importance in pls: R or B? (and in glpls?)
Christoph:

I noticed that not many people leapt to reply. One reason, I suspect, is that there's really NO GOOD ANSWER to this question.

First, there is a huge literature on this -- it's related to variable selection in regression and to shrinkage estimates and, more generally, to parsimonious model building. Second, as Ron Wehrens already noted, when variables are correlated -- which could have as much to do with the vagaries of the sampling as with real physical causality -- the whole notion of "variable importance" is problematic.

Fact is, **any** attempt to rank the contributions of particular variables to PLS components from undesigned data (the usual case) is fraught with hazard. For that reason, it is perhaps best to view PLS as merely a way of developing a good predictor, not as a way to uncover causal relationships. I know this is often unsatisfying to scientists trying to build parsimonious mechanistic models (= physical theories), especially as there is quite often little likelihood that the data are representative of any underlying population and therefore capable of predicting anything, but it is the statistical reality.

For a more informed, more interesting, and more eloquent discussion of these and related issues, you might look up Leo Breiman's writings on his web site and his way of trying to assess "variable importance" in his Random Forest methodology, which is available in the package randomForest on CRAN. (I make no claim about the effectiveness of this approach -- only that it is clearly a different way of approaching the issue, one that clearly reveals the dilemmas.)

Cheers,

-- Bert Gunter
Genentech Non-Clinical Statistics
South San Francisco, CA

"The business of the statistician is to catalyze the scientific learning process." - George E. P. Box
-----Original Message-----
From: r-help-bounces at stat.math.ethz.ch [mailto:r-help-bounces at stat.math.ethz.ch] On Behalf Of Christoph Lehmann
Sent: Sunday, September 12, 2004 5:13 AM
To: Ron Wehrens; r-help at stat.math.ethz.ch
Subject: [R] Variable Importance in pls: R or B? (and in glpls?)

Dear R-users, dear Ron,

I use pls from the pls.pcr package for classification. Since I need to know which variables are most influential for the classification performance, which criterion should I look at:

a) B, the array of regression coefficients for a given model (that is, a given number of latent variables) (and: squared or absolute values?)

OR

b) the weight matrix RR (or R in the De Jong publication; in Ding & Gentleman this is the P matrix and called 'loadings')? (and again: squared or absolute values?)

And what about glpls (glpls1a)? Should I look at the 'coefficients' (regression coefficients)?

Thanks for clarification,
Christoph
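
The reply's point about correlated predictors can be illustrated with a small simulation. This is only a minimal sketch and it is not PLS: it fits ordinary least squares to two nearly collinear simulated predictors, and every name and number in it (the sample size, the noise levels, the helper functions ols_two_predictors and bootstrap_slopes) is an assumption made for illustration. It bootstraps the data and compares how much the individual coefficients move between resamples with how much their sum, which is what effectively drives the prediction here, moves.

```python
import random
import statistics


def ols_two_predictors(x1, x2, y):
    """Least-squares slopes for y ~ intercept + x1 + x2 (centered normal equations)."""
    n = len(y)
    m1, m2, my = sum(x1) / n, sum(x2) / n, sum(y) / n
    c1 = [v - m1 for v in x1]
    c2 = [v - m2 for v in x2]
    cy = [v - my for v in y]
    s11 = sum(a * a for a in c1)
    s22 = sum(b * b for b in c2)
    s12 = sum(a * b for a, b in zip(c1, c2))
    s1y = sum(a * c for a, c in zip(c1, cy))
    s2y = sum(b * c for b, c in zip(c2, cy))
    det = s11 * s22 - s12 * s12  # close to zero when x1 and x2 are nearly collinear
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    return b1, b2


def bootstrap_slopes(x1, x2, y, n_boot, rng):
    """Refit the model on n_boot resamples (rows drawn with replacement)."""
    n = len(y)
    slopes = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        slopes.append(ols_two_predictors([x1[i] for i in idx],
                                         [x2[i] for i in idx],
                                         [y[i] for i in idx]))
    return slopes


# Simulated data (all values are illustrative assumptions).
rng = random.Random(1)
n = 200
x1 = [rng.gauss(0, 1) for _ in range(n)]
x2 = [a + rng.gauss(0, 0.05) for a in x1]                 # x2 is almost a copy of x1
y = [a + b + rng.gauss(0, 1) for a, b in zip(x1, x2)]     # both truly matter equally

slopes = bootstrap_slopes(x1, x2, y, n_boot=500, rng=rng)
sd_b1 = statistics.stdev(b1 for b1, _ in slopes)
sd_b2 = statistics.stdev(b2 for _, b2 in slopes)
sd_sum = statistics.stdev(b1 + b2 for b1, b2 in slopes)
frac_b1_larger = sum(b1 > b2 for b1, b2 in slopes) / len(slopes)

print("bootstrap sd of b1:     ", round(sd_b1, 3))
print("bootstrap sd of b2:     ", round(sd_b2, 3))
print("bootstrap sd of b1 + b2:", round(sd_sum, 3))
print("share of resamples with b1 > b2:", frac_b1_larger)
```

If the individual coefficients vary far more across resamples than their sum does, then a ranking of the two variables by coefficient size mostly reflects the particular sample, even though the fitted predictor itself is stable. That is the sense in which "a good predictor" and "variable importance" come apart when predictors are correlated.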