Variable Selection for data reduction and discriminant analysis

An embedded and charset-unspecified text was scrubbed...
Name: not available
URL: <https://stat.ethz.ch/pipermail/r-help/attachments/20080921/393702a9/attachment.pl>
Hi Gareth,

> If I use the full composition (31 elements or variables), I can get
> reasonable separation of my 6 sources.

A word of advice: You need to be exceptionally careful when analyzing
compositional data. Taking compositions puts your data values into a
constrained/bounded space (generally called a simplex) so that most standard
statistical procedures (i.e. anything that uses a Euclidean metric, and most
do) deliver erroneous results. Pearson wrote a paper on this long ago, but
it's generally been ignored (except by Aitchison and the Spanish School of
mathematical statisticians).
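
As a rough illustration of the usual way out of the simplex, here is a minimal sketch of the centred log-ratio (clr) transform in base R. The helper name `clr`, the toy data, and the element names are invented for the example; the compositions package provides its own, more complete implementation.

```r
# Centred log-ratio (clr) transform: log of each part minus the row mean
# of the logs. Assumes strictly positive parts; zeros need separate handling.
clr <- function(x) {
  x <- as.matrix(x)
  stopifnot(all(x > 0))
  lx <- log(x)
  sweep(lx, 1, rowMeans(lx), "-")
}

# Toy data (invented): 4 samples, 3 parts, closed so each row sums to 1.
set.seed(1)
raw  <- matrix(rexp(12), nrow = 4,
               dimnames = list(NULL, c("Si", "Al", "Fe")))
comp <- raw / rowSums(raw)

z <- clr(comp)
rowSums(z)               # numerically zero: clr rows always sum to 0
all.equal(clr(raw), z)   # TRUE: the clr does not depend on the closure
```

Because clr rows sum to zero, the clr covariance matrix is singular. Methods that invert it, LDA among them, generally need one part dropped (as the alr does) or an ilr basis instead.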

The problem is comparatively well known to geologists, who work with
compositional data much of the time. R has a very good package for analysing
this data type: see the compositions package (a new release seems imminent).
You will be able to get most of the main references from it. (The authors of
the package also have a newly released article in one of the Elsevier journals;
unfortunately my bibliography files are elsewhere, so I cannot give details.)

You could start by Wiki'ing your way to "compositional data".

HTH, Mark.

Gareth Campbell wrote:

Hello all,

I'm dealing with geochemical analyses of some rocks.

If I use the full composition (31 elements or variables), I can get
reasonable separation of my 6 sources.  Then when I go on to do LDA with the
6 groups, I get excellent separation.

I feel like I should be reducing the variables to those that are providing
the most discrimination between the groups, as this is important information
for me.  I struggle to interpret the PCA plot in a way that helps me (due to
the large number of elements).  So I'm trying to do some sort of step-wise
variable selection.

I would love to hear from someone (possibly a geochemist or similar) who
does this regularly to determine the best course of action in R to do this.

Thanks very much

-- 
Gareth Campbell
PhD Candidate
The University of Auckland

P +649 815 3670
M +6421 256 3511
E gareth.campbell at esr.cri.nz
gcam032 at gmail.com

______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

View this message in context: http://www.nabble.com/Variable-Selection-for-data-reduction-and-discriminant-anlaysis-tp19591270p19592695.html
Sent from the R help mailing list archive at Nabble.com.
There are some pointers to packages for variable selection in the task
view for Chemometrics and Computational Physics at
http://cran.r-project.org/web/views/ChemPhys.html

Thanks Mark,

I failed to mention that I'm working within a compositional framework.  I
didn't want to confuse things.  My data is transformed to the clr or alr
under Aitchison geometry, so I am essentially working in Euclidean space.

Has anyone had experience doing stepwise LDA?  I can't for the life of me
find any help online about where to start.
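
One possible starting point, offered as a sketch rather than a recommended procedure: a greedy forward selection that adds, at each step, the variable giving the lowest leave-one-out error from `MASS::lda`. The helpers `loo_error` and `forward_lda` are hypothetical names written for this example, and `iris` stands in for the real data only to exercise the code (it is not compositional). The klaR package has related functions (`stepclass`, `greedy.wilks`) that may be worth a look.

```r
library(MASS)  # provides lda(); MASS ships with standard R distributions

# Leave-one-out error rate of LDA fitted on the columns named in `vars`.
loo_error <- function(x, grouping, vars) {
  fit <- lda(x[, vars, drop = FALSE], grouping = grouping, CV = TRUE)
  mean(as.character(fit$class) != as.character(grouping))
}

# Greedy forward selection: at each step add the variable that gives the
# lowest LOO error; stop when no candidate improves it by more than `tol`.
forward_lda <- function(x, grouping, tol = 0) {
  selected <- character(0)
  best_err <- 1
  repeat {
    candidates <- setdiff(colnames(x), selected)
    if (length(candidates) == 0) break
    errs <- sapply(candidates, function(v)
      loo_error(x, grouping, c(selected, v)))
    if (min(errs) >= best_err - tol) break
    selected <- c(selected, names(which.min(errs)))
    best_err <- min(errs)
  }
  list(vars = selected, loo_error = best_err)
}

# Stand-in data (assumption): iris only demonstrates that the code runs.
res <- forward_lda(as.matrix(iris[, 1:4]), iris$Species)
res$vars       # variables in the order they were selected
res$loo_error  # LOO error of the final subset
```

The reported error is optimistic because the same data drive both the selection and the error estimate; wrapping the whole selection in an outer cross-validation loop gives a fairer figure. With clr-transformed data the full set of parts is collinear, so `lda` may warn once all parts are candidates.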

Thanks

Gareth

Hi Gareth,

> My data is transformed to the clr or alr under Aitchison geometry, so I
> am essentially working in Euclidean space.

Great: glad to hear it.

> Has anyone had experience doing stepwise LDA??  I can't for the life of
> me find any help online about where to start.

A better option might be this: Trevor Hastie and a student of his have
recently put out a paper that takes a step up from penalized discriminant
analysis, based, I think, on Trevor's sparse principal component analysis
method (in his elasticnet package).

http://www-stat.stanford.edu/~hastie/Papers/sda_line.pdf

You can get R-code to do the analysis on the first author's website; there's
a link in the paper.

Bye, Mark.