Principal Component Analysis - Selecting components? + right choice?

Thu, Dec 11, 2008 6:44 AM

Hi all,

I used a rule of thumb as reported by the book quoted, but I am not completely 
happy with it, because it is not really a statistical justification.

I will try the broken stick approach, thanks!

Concerning the interpretation, luckily enough PC1 has a clear interpretation.
PC2 a bit less so, though ... and interpretation gets harder as the
explained variance decreases.

I am using the approach suggested by Kamran: I have boiled down the original
climatic variables to uncorrelated "environmental variables", and I have
chosen the significant ones using a threshold. I do realise spatial
auto-correlation is going to be important (even if my sites are fairly
distant from one another, which reduces the impact), but I do not know
anything about spatial filtering. Whilst I have used distance matrices, I
have never used them to remove spatial auto-correlation!

Could you please point me to some resources?

Best,

On Thursday 11 December 2008 14:29:27 Kamran Safi wrote:

Hi all,

I agree with Ashton. The issue is very complex and far from resolved.
But sometimes we have to go down the PCA path. Among the many possible
solutions is the broken stick approach, for which you find an R solution
(bstick()) in the package vegan. Technically, the broken stick randomly
divides 100% of the variance among your N principal components and
generates a null expectation for the share each component would get if
the original variance were partitioned at random. You then keep all
those PCs whose explained variance is above the broken stick
expectation. This is by no means an agreed-upon approach, but it is at
least reproducible and has some theory behind it; it is and will remain
a rule of thumb.
In terms of spatial analysis you could derive the PCs and then go into
classic spatial analysis. Although the interpretation of PCs is
sometimes complicated or even impossible, you can calculate the
component values for every grid cell and then go into multivariate
analysis, whereby you have to take spatial autocorrelation into account.
At least your principal components are orthogonal, which simplifies your
analysis in contrast to using the original variables. It also allows you
to produce predictive models.
What you could think of doing is to use PCA to derive "environmental"
variables which are uncorrelated and then use the distance matrix and
spatial filtering to "remove" spatial autocorrelation.
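
[Illustration, not part of the original message.] The broken stick rule
described above can be sketched in a few lines. This is a minimal,
language-neutral sketch in plain Python rather than a call to vegan's
bstick(); it uses the standard closed form for the expected share of the
k-th largest piece of a stick broken at random into p pieces,
b_k = (1/p) * sum(1/i for i = k..p). The function names and the example
variances are hypothetical, and stopping at the first component that
falls below its expectation is one common convention, assumed here.

```python
def broken_stick(p):
    """Expected variance shares of p components under the broken-stick null.

    b_k = (1/p) * sum_{i=k}^{p} 1/i for k = 1..p; the shares sum to 1.
    """
    return [sum(1.0 / i for i in range(k, p + 1)) / p for k in range(1, p + 1)]


def components_above_broken_stick(variances):
    """Count the leading components whose observed share of variance
    exceeds the broken-stick expectation.

    `variances` are the component variances (eigenvalues) sorted in
    decreasing order. Counting stops at the first component that does not
    exceed its expectation (an assumed convention).
    """
    total = sum(variances)
    observed = [v / total for v in variances]
    expected = broken_stick(len(variances))
    kept = 0
    for obs, exp in zip(observed, expected):
        if obs <= exp:
            break
        kept += 1
    return kept


# Hypothetical component variances, made up for illustration only.
example_variances = [20.5, 8.1, 3.2, 1.9, 1.3, 1.0]
print(components_above_broken_stick(example_variances))
```

With these made-up variances only the first component exceeds its
broken stick expectation (about 0.57 observed against about 0.41
expected), while the second falls just short (0.225 against about 0.24).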

Hope this helps,

Kami



------------------------
Kamran Safi

Postdoctoral Research Fellow
Institute of Zoology
Zoological Society of London
Regent's Park
London NW1 4RY

http://www.zoo.cam.ac.uk/ioz/people/safi.htm

http://spatialr.googlepages.com
http://asapi.wetpaint.com

-----Original Message-----
From: r-sig-geo-bounces at stat.math.ethz.ch
[mailto:r-sig-geo-bounces at stat.math.ethz.ch] On Behalf Of Ashton
Shortridge
Sent: 11 December 2008 14:11
To: r-sig-geo at stat.math.ethz.ch
Cc: Corrado
Subject: Re: [R-sig-Geo] Principal Component Analysis -
Selectingcomponents? + right choice?

Hi Corrado,

> I ran the PCA using prcomp, quite successfully. Now I need to use a
> criterion to select the right number of PCs (that is: is it 1, 2, 3, 4?).
>
> What criterion would you suggest?

That's an interesting and probably controversy-generating question. It's
probably not an R-sig-geo question, either. I am not a PCA person, but
the rule of thumb I am aware of is to plot the variability each
component 'explains' and look for a clear breakpoint. I would think
about any multivariate analysis text would have a better explanation
than I can give, though.

As for something more rigorous, I think a lot of people are reluctant to
use PCA as a modeling approach not so much because it's hard to choose a
threshold for selecting components, but because the interpretation of
the meaning of each component is pretty subjective. If you want an
explanatory model, be careful about using PCA. You would be better
served by deciding, based perhaps on expert knowledge about the
variables, which ones to use in the model and which ones not to.

To try to make this a bit more spatial, and therefore more relevant to
the list, I will also warn you that your various climate variables are
almost certainly spatially autocorrelated - that is, neighboring and
nearby observations in the grid are not independent. That has serious
implications for standard multivariate analysis techniques and
diagnostics.

Yours,

Ashton

On Thursday 11 December 2008 06:46:37 am Corrado wrote:

Dear R gurus,

I have some climatic data for a region of the world. They are monthly
averages 1950-2000 of precipitation (12 months), minimum temperature
(12 months), maximum temperature (12 months). I have scaled them to
2 km x 2 km cells, and I have around 75,000 cells.

I need to feed them into a statistical model as co-variates, to use
them to predict a response variable.

The climatic data are obviously correlated: precipitation for January
is correlated to precipitation for February and so on ... even
precipitation and temperature are heavily correlated. I did some
correlation analysis and they are all strongly correlated.

I thought of running PCA on them, in order to reduce the number of
co-variates I feed into the model.

I ran the PCA using prcomp, quite successfully. Now I need to use a
criterion to select the right number of PCs (that is: is it 1, 2, 3, 4?).

What criterion would you suggest?

At the moment, I am using a criterion based on a threshold, but that is
highly subjective, even if there are some rules of thumb (Jolliffe,
Principal Component Analysis, 2nd Edition, Springer Verlag, 2002).
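
[Illustration, not part of the original message.] The message does not
say which threshold is meant. One common reading is a cut-off on the
cumulative proportion of explained variance, and the sketch below
assumes that reading. The function name, the 0.9 default and the example
variances are all hypothetical; the variances stand in for the squared
standard deviations that a PCA routine such as prcomp reports.

```python
def components_for_threshold(variances, threshold=0.9):
    """Smallest number of leading components whose cumulative share of
    the total variance reaches `threshold`.

    `variances` are component variances sorted in decreasing order;
    `threshold` is a proportion between 0 and 1 (0.9 is an arbitrary,
    assumed default, which is exactly the subjective part).
    """
    total = sum(variances)
    cumulative = 0.0
    for k, v in enumerate(variances, start=1):
        cumulative += v / total
        if cumulative >= threshold:
            return k
    # Guards against floating-point round-off when threshold is 1.0.
    return len(variances)


# Hypothetical component variances, made up for illustration only.
example_variances = [20.5, 8.1, 3.2, 1.9, 1.3, 1.0]
print(components_for_threshold(example_variances, threshold=0.9))
print(components_for_threshold(example_variances, threshold=0.8))
```

Moving the cut-off from 0.9 to 0.8 changes the answer from four
components to three for these made-up numbers, which shows how sensitive
the choice is to the threshold.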

Could you suggest something more rigorous?

By the way, do you think I would have been better off using something
different from PCA?

Best,

Corrado Topi

Global Climate Change & Biodiversity Indicators
Area 18, Department of Biology
University of York, York, YO10 5YW, UK
Phone: + 44 (0) 1904 328645, E-mail: ct529 at york.ac.uk
