Variable Selection for data reduction and discriminant analysis

5 messages · Gareth Campbell, Katharine Mullen, gcam032 +1 more

#
Hi Gareth,
A word of advice: You need to be exceptionally careful when analyzing
compositional data. Taking compositions puts your data values into a
constrained/bounded space (generally called a simplex) so that most standard
statistical procedures (i.e. anything that uses a Euclidean metric, and most
do) deliver erroneous results. Pearson wrote a paper on this long ago, but
it's generally been ignored (except by Aitchison and the Spanish School of
mathematical statisticians).
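
To make the point about the Euclidean metric concrete, here is a small illustrative sketch (not from the original post; the function names and the toy compositions are made up for the example). It compares the ordinary Euclidean distance with the Aitchison distance, computed here from pairwise log-ratios, for two pairs of compositions that differ by the same absolute amount.

```python
import math

def euclidean_distance(x, y):
    """Ordinary Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def aitchison_distance(x, y):
    """Aitchison distance from pairwise log-ratios (all parts must be > 0)."""
    d = len(x)
    total = 0.0
    for i in range(d):
        for j in range(d):
            total += (math.log(x[i] / x[j]) - math.log(y[i] / y[j])) ** 2
    return math.sqrt(total / (2 * d))

# Hypothetical 3-part compositions (each sums to 1).
# Pair 1: a trace part doubles (0.01 -> 0.02).
x1, y1 = [0.01, 0.49, 0.50], [0.02, 0.48, 0.50]
# Pair 2: a major part changes by the same absolute amount (0.30 -> 0.31).
x2, y2 = [0.30, 0.20, 0.50], [0.31, 0.19, 0.50]

euclid_1 = euclidean_distance(x1, y1)
euclid_2 = euclidean_distance(x2, y2)
aitch_1 = aitchison_distance(x1, y1)
aitch_2 = aitchison_distance(x2, y2)

print(euclid_1, euclid_2)  # essentially identical
print(aitch_1, aitch_2)    # pair 1 is much further apart
```

The Euclidean metric treats both changes as equal, whereas the log-ratio metric registers the doubling of the trace part as the much larger change.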

The problem is comparatively well known to geologists, who work with
compositional data much of the time. R has a very good package for analysing
this data type: see the compositions package (a new release seems imminent).
You will be able to get most of the main references from it. (The authors of
the package also have a newly released article in one of the Elsevier
journals; unfortunately my references are elsewhere, so I cannot give details.)

You could start by Wiki'ing your way to "compositional data".

HTH, Mark.
Gareth Campbell wrote:

  
    
#
There are some pointers to packages for variable selection in the task
view for Chemometrics and Computational Physics at
http://cran.r-project.org/web/views/ChemPhys.html
On Sun, 21 Sep 2008, Gareth Campbell wrote:

            
#
Thanks Mark,

I failed to mention that I'm working within a compositional framework.  I
didn't want to confuse things.  My data are transformed with the clr or alr
transformation under Aitchison geometry, so I am essentially working in
Euclidean space.
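
For readers unfamiliar with the two transformations, a minimal sketch follows. It is an illustration only, not the code used for this analysis; the function names and the sample composition are assumptions. The clr (centred log-ratio) subtracts the mean log from each log part, and the alr (additive log-ratio) takes logs of each part relative to one chosen reference part.

```python
import math

def clr(parts):
    """Centred log-ratio: log of each part minus the mean of the logs."""
    logs = [math.log(p) for p in parts]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]

def alr(parts, ref=-1):
    """Additive log-ratio: log of each part over the reference part.

    The reference part itself is dropped, so the result has one fewer
    coordinate than the input.
    """
    ref_index = ref % len(parts)
    denom = parts[ref_index]
    return [math.log(p / denom) for i, p in enumerate(parts) if i != ref_index]

sample = [0.2, 0.3, 0.5]  # hypothetical 3-part composition

clr_coords = clr(sample)
alr_coords = alr(sample)

print(clr_coords)       # three coordinates
print(sum(clr_coords))  # zero up to rounding error
print(alr_coords)       # two coordinates, relative to the last part
```

One thing to keep in mind: clr coordinates always sum to zero, so the covariance matrix of a full set of clr variables is singular. That matters for methods such as LDA that invert a covariance matrix, and is one reason to drop a part or use the alr there.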

Has anyone had experience doing stepwise LDA?  I can't for the life of me
find any help online about where to start.
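
As a rough starting point, one common form of stepwise discriminant analysis is greedy forward selection on Wilks' lambda, det(W)/det(T), where W is the pooled within-group and T the total sum-of-squares-and-cross-products matrix. The sketch below is a minimal pure-Python illustration of that idea under those assumptions. It is not code from this thread or from any package; all function names and the toy data are made up, and it has no stopping rule other than a fixed number of variables.

```python
def sscp(rows, cols):
    """Sum of squares and cross-products about the mean, for chosen columns."""
    n = len(rows)
    means = [sum(r[c] for r in rows) / n for c in cols]
    k = len(cols)
    m = [[0.0] * k for _ in range(k)]
    for r in rows:
        dev = [r[c] - mu for c, mu in zip(cols, means)]
        for i in range(k):
            for j in range(k):
                m[i][j] += dev[i] * dev[j]
    return m

def det(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [row[:] for row in matrix]
    n = len(a)
    result = 1.0
    for i in range(n):
        p = max(range(i, n), key=lambda r: abs(a[r][i]))
        if a[p][i] == 0.0:
            return 0.0
        if p != i:
            a[i], a[p] = a[p], a[i]
            result = -result
        result *= a[i][i]
        for r in range(i + 1, n):
            f = a[r][i] / a[i][i]
            for c in range(i, n):
                a[r][c] -= f * a[i][c]
    return result

def wilks_lambda(rows, labels, cols):
    """det(W) / det(T); smaller values mean better group separation."""
    k = len(cols)
    within = [[0.0] * k for _ in range(k)]
    for g in sorted(set(labels)):
        group_rows = [r for r, lab in zip(rows, labels) if lab == g]
        s = sscp(group_rows, cols)
        for i in range(k):
            for j in range(k):
                within[i][j] += s[i][j]
    total_det = det(sscp(rows, cols))
    if total_det == 0.0:
        return float("inf")  # singular variable set: never preferred
    return det(within) / total_det

def forward_select(rows, labels, n_keep):
    """Greedily add the variable that gives the smallest Wilks' lambda."""
    chosen = []
    remaining = list(range(len(rows[0])))
    while remaining and len(chosen) < n_keep:
        best = min(remaining,
                   key=lambda c: wilks_lambda(rows, labels, chosen + [c]))
        chosen.append(best)
        remaining.remove(best)
    return chosen

# Toy data: column 2 separates the groups strongly, column 0 hardly at all.
rows = [
    [1.0, 2.0, 0.1], [2.0, 2.5, 0.2], [3.0, 1.5, 0.0], [4.0, 2.2, 0.3],
    [1.5, 3.0, 5.0], [2.5, 3.4, 5.2], [3.5, 2.6, 4.9], [4.5, 3.1, 5.1],
]
labels = ["a", "a", "a", "a", "b", "b", "b", "b"]

first_pick = forward_select(rows, labels, 1)
print(first_pick)
```

With clr-transformed data the full set of parts gives singular matrices, so a selection like this should stop short of including every part.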

Thanks

Gareth


#
Hi Gareth,
Great: glad to hear it.
A better option might be this: Trevor Hastie and a student of his have
recently put out a paper on a method that is a step up from penalized
discriminant analysis. It is based, I think, on Trevor's sparse principal
component analysis method (in his elasticnet package).

http://www-stat.stanford.edu/~hastie/Papers/sda_line.pdf

You can get R-code to do the analysis on the first author's website; there's
a link in the paper.

Bye, Mark.
gcam032 wrote: