Principal Component Analysis - Selecting components? + right choice?
Hi all,

I agree with Ashton. The issue is very complex and far from resolved. But sometimes we have to go down the PCA path. Among the many possible solutions is the broken stick approach, for which you find an R solution (bstick()) in the package vegan. Technically, the broken stick randomly divides 100% of the variance among your N principal components and so generates a null expectation for the share each component would receive if the original variance were partitioned at random. You then keep all those components whose variance lies above the broken stick expectation. This is by no means an agreed-upon approach. It is at least reproducible and has some theory behind it, but it is and will remain a rule of thumb.

In terms of spatial analysis, you could derive the principal components and then go into classic spatial analysis. Although the interpretation of the components is sometimes complicated or even impossible, you can calculate their values for every grid cell and then go into multivariate analysis, whereby you have to take spatial autocorrelation into account. At least your PCA components are orthogonal, which simplifies your analysis in contrast to using the original variables. It also allows you to produce predictive models.

What you could think of doing is using PCA to derive "environmental" variables which are uncorrelated, and then using the distance matrix and spatial filtering to "remove" spatial autocorrelation.

Hope this helps,
Kami

------------------------
Kamran Safi
Postdoctoral Research Fellow
Institute of Zoology
Zoological Society of London
Regent's Park
London NW1 4RY
http://www.zoo.cam.ac.uk/ioz/people/safi.htm
http://spatialr.googlepages.com
http://asapi.wetpaint.com

-----Original Message-----
From: r-sig-geo-bounces at stat.math.ethz.ch [mailto:r-sig-geo-bounces at stat.math.ethz.ch] On Behalf Of Ashton Shortridge
Sent: 11 December 2008 14:11
To: r-sig-geo at stat.math.ethz.ch
Cc: Corrado
Subject: Re: [R-sig-Geo] Principal Component Analysis - Selecting components? + right choice?

Hi Corrado,
I ran the PCA using prcomp, quite successfully. Now I need a criterion to select the right number of PCs (that is: is it 1, 2, 3, 4?). What criterion would you suggest?
That's an interesting and probably controversy-generating question. It's probably not an R-sig-geo question, either. I am not a PCA person, but the rule of thumb I am aware of is to plot the variability each component 'explains' and look for a clear breakpoint. I would think just about any multivariate analysis text would have a better explanation than I can give, though.

As for something more rigorous, I think a lot of people are reluctant to use PCA as a modeling approach not so much because it's hard to choose a threshold for selecting components, but because the interpretation of the meaning of each component is pretty subjective. If you want an explanatory model, be careful about using PCA. You would be better served by deciding, based perhaps on expert knowledge about the variables, which ones to use in the model and which ones not to.

To try to make this a bit more spatial, and therefore more relevant to the list, I will also warn you that your various climate variables are almost certainly spatially autocorrelated, that is, neighboring and nearby observations in the grid are not independent. That has serious implications for standard multivariate analysis techniques and diagnostics.

Yours,
Ashton
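The two rules of thumb raised in this thread, looking for a breakpoint in the variance each component 'explains' and comparing that variance with a broken stick expectation, could be sketched in base R roughly as follows. This is only an illustration under stated assumptions: the data are simulated, the names `climate`, `prop_var`, `broken_stick` and `n_keep` are made up for the example, and the broken stick is computed by hand from its usual closed form (b_k = (1/p) * sum of 1/i for i = k..p) instead of calling vegan's bstick().

```r
# Simulated stand-in for a climate table: 500 cells x 12 correlated variables.
# (Hypothetical data; the real grid in this thread has ~75,000 cells.)
set.seed(1)
n <- 500
p <- 12
signal   <- matrix(rnorm(n * 2), n, 2)          # two assumed latent gradients
loadings <- matrix(runif(2 * p, -1, 1), 2, p)
climate  <- signal %*% loadings + matrix(rnorm(n * p, sd = 0.3), n, p)

pca <- prcomp(climate, center = TRUE, scale. = TRUE)

# Proportion of variance 'explained' by each component
# (the quantity a scree plot displays).
prop_var <- pca$sdev^2 / sum(pca$sdev^2)

# Broken stick expectation for p pieces: b_k = (1/p) * sum_{i = k}^{p} 1/i
broken_stick <- rev(cumsum(rev(1 / seq_len(p)))) / p

# Keep the leading components whose share exceeds the broken stick expectation,
# stopping at the first component that falls below it.
above  <- prop_var > broken_stick
n_keep <- if (all(above)) p else which(!above)[1] - 1

print(round(cbind(prop_var, broken_stick), 3))
print(n_keep)
```

Calling screeplot(pca) on the same object draws the variances for a visual check of the breakpoint. Both criteria remain rules of thumb, and neither accounts for spatial autocorrelation among the grid cells.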
On Thursday 11 December 2008 06:46:37 am Corrado wrote:
Dear R gurus,

I have some climatic data for a region of the world. They are monthly averages 1950-2000 of precipitation (12 months), minimum temperature (12 months), maximum temperature (12 months). I have scaled them to 2 km x 2 km cells, and I have around 75,000 cells.

I need to feed them into a statistical model as covariates, to use them to predict a response variable.

The climatic data are obviously correlated: precipitation for January is correlated to precipitation for February and so on .... even precipitation and temperature are heavily correlated. I did some correlation analysis and they are all strongly correlated.

I thought of running PCA on them, in order to reduce the number of covariates I feed into the model.

I ran the PCA using prcomp, quite successfully. Now I need a criterion to select the right number of PCs (that is: is it 1, 2, 3, 4?). What criterion would you suggest?

At the moment, I am using a criterion based on a threshold, but that is highly subjective, even if there are some rules of thumb (Jolliffe, Principal Component Analysis, II Edition, Springer Verlag, 2002).

Could you suggest something more rigorous?

By the way, do you think I would have been better off using something different from PCA?

Best,
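For readers following along, the workflow described in this message (36 correlated monthly variables, prcomp, a threshold criterion, component scores used as covariates) might look roughly like the sketch below. Everything in it is assumed for illustration: the data are simulated with far fewer cells, the 90% cumulative-variance cut-off is arbitrary, and the names `climate`, `threshold` and `covariates` are hypothetical.

```r
# Simulated stand-in: 1,000 cells x 36 variables
# (12 months each of precipitation, minimum and maximum temperature).
set.seed(42)
n_cells <- 1000
n_vars  <- 36
gradient <- rnorm(n_cells)                       # assumed shared climatic gradient
climate  <- sapply(seq_len(n_vars),
                   function(j) gradient + rnorm(n_cells, sd = 0.5))
colnames(climate) <- paste0("v", seq_len(n_vars))

# scale. = TRUE puts variables measured in different units on a common scale.
pca <- prcomp(climate, center = TRUE, scale. = TRUE)

# Threshold rule: smallest number of components reaching the chosen share
# of total variance. The 0.90 value is an arbitrary illustration.
cum_var   <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
threshold <- 0.90
n_keep    <- which(cum_var >= threshold)[1]

# Component scores per cell: mutually uncorrelated covariates for a later model.
covariates <- pca$x[, seq_len(n_keep), drop = FALSE]
print(dim(covariates))
```

The scores in `covariates` are uncorrelated with each other by construction, but they inherit whatever spatial autocorrelation the original grid has, so that still needs separate handling.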
Ashton Shortridge
Associate Professor       ashton at msu.edu
Dept of Geography         http://www.msu.edu/~ashton
235 Geography Building    ph (517) 432-3561
Michigan State University fx (517) 432-1671

_______________________________________________
R-sig-Geo mailing list
R-sig-Geo at stat.math.ethz.ch
https://stat.ethz.ch/mailman/listinfo/r-sig-geo