how to tell if its better to standardize your data matrix first when you do principal

10 messages · masterinex, Hadley Wickham, Uwe Ligges +1 more

#
Hi guys,

I'm trying to do principal component analysis in R. There are two ways of doing
it, I believe.
One is doing principal component analysis right away; the other is
standardizing the matrix first using s = scale(m) and then applying principal
component analysis.
How do I tell which result is better? What values in particular should I
look at? I have already managed to find the eigenvalues and eigenvectors, and
the proportion of variance for each eigenvector, using both methods.

I noticed that the proportion of variance for the first principal component
without standardizing had a larger value. Is there a meaning to it? Isn't this
always the case?
Lastly, if I am supposed to predict a variable, e.g. weight, should I drop
that variable from my data matrix when I do principal component
analysis?
#
masterinex wrote:
Generally, it is better to standardize. But in some cases, e.g. when your
variables share the same units and their scale also indicates their
importance, it might make sense not to do so.
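
A minimal sketch of the two approaches, using simulated data because the real matrix is not available here. The matrix `m`, its column names, and its scales are made up for illustration. `prcomp()` centers the columns by default; `scale. = TRUE` additionally divides each column by its standard deviation, which should match calling `scale(m)` first.

```r
# Simulated data: three correlated measurements on very different scales.
# Names and scales are invented for illustration only.
set.seed(1)
n <- 200
size <- rnorm(n)
m <- cbind(
  weight = 180 + 30 * size + rnorm(n, sd = 10),   # large variance
  height = 70 + 2.5 * size + rnorm(n, sd = 1.5),
  wrist  = 18 + 0.6 * size + rnorm(n, sd = 0.6)   # small variance
)

pca_raw <- prcomp(m)                 # centered only (covariance matrix)
pca_std <- prcomp(m, scale. = TRUE)  # centered and scaled (correlation matrix)

# Standardizing by hand first gives the same component variances.
pca_manual <- prcomp(scale(m))
all.equal(pca_std$sdev, pca_manual$sdev)

# Proportion of variance explained by each component.
prop_raw <- pca_raw$sdev^2 / sum(pca_raw$sdev^2)
prop_std <- pca_std$sdev^2 / sum(pca_std$sdev^2)
round(rbind(raw = prop_raw, standardized = prop_std), 3)
```

In this simulated example the unscaled PC1 takes a larger share of the variance simply because one column has a much larger variance than the others; a larger share by itself does not make that result "better".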
You should think about the analysis; you cannot know which result is
`better' unless you have an interpretation.
This sounds a bit like homework. If that is the case, please ask your 
teacher rather than this list.
Anyway, it does not make sense to predict weight using a linear
combination (principal component) that contains weight, does it?
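
One way to act on this, sketched under assumptions: leave the response out of the PCA and then regress it on the component scores. This is usually called principal component regression and is a separate modelling step on top of PCA, not something PCA does by itself. The data frame `d` and its column names below are hypothetical.

```r
# Simulated data; the column names are hypothetical.
set.seed(2)
n <- 150
size <- rnorm(n)
d <- data.frame(
  weight = 180 + 30 * size + rnorm(n, sd = 10),
  chest  = 100 + 8 * size + rnorm(n, sd = 4),
  hip    = 100 + 7 * size + rnorm(n, sd = 4),
  wrist  = 18 + 0.6 * size + rnorm(n, sd = 0.6)
)

# Run PCA on the predictors only, leaving the response out.
predictors <- d[, setdiff(names(d), "weight")]
pca <- prcomp(predictors, scale. = TRUE)

# Regress the response on the scores of the first component.
pcr_data <- data.frame(weight = d$weight, pca$x)
fit <- lm(weight ~ PC1, data = pcr_data)
summary(fit)$r.squared
```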

Uwe Ligges
#
So in which cases is it better to standardize the data matrix first?
Also, is PCA generally used to predict the response variable, and should I
keep that variable in my data matrix?
Uwe Ligges-3 wrote:

  
    
#
You've asked the same question on stackoverflow.com and received the
same answer.  This is rude because it duplicates effort.  If you
urgently need a response to a question, perhaps you should consider
paying for it.

Hadley
On Sun, Nov 22, 2009 at 12:04 PM, masterinex <xevilgang79 at hotmail.com> wrote:

  
    
#
Hi Hadley,

I really appreciate the suggestions you gave. They were helpful, but I still
didn't quite get it all, and I really want to do a good job, so any
comments would be very helpful. Please understand.
hadley wrote:

  
    
#
This is what my data matrix looks like. These are just the first 10
observations, but the pattern is similar for the other observations.


1    12.3 154.25  67.75 36.2  93.1  85.2  94.5  59.0 37.3  21.9   32.0   27.4  17.1
2     6.1 173.25  72.25 38.5  93.6  83.0  98.7  58.7 37.3  23.4   30.5   28.9  18.2
3    25.3 154.00  66.25 34.0  95.8  87.9  99.2  59.6 38.9  24.0   28.8   25.2  16.6
4    10.4 184.75  72.25 37.4 101.8  86.4 101.2  60.1 37.3  22.8   32.4   29.4  18.2
5    28.7 184.25  71.25 34.4  97.3 100.0 101.9  63.2 42.2  24.0   32.2   27.7  17.7
6    20.9 210.25  74.75 39.0 104.5  94.4 107.8  66.0 42.0  25.6   35.7   30.6  18.8
7    19.2 181.00  69.75 36.4 105.1  90.7 100.3  58.4 38.3  22.9   31.9   27.8  17.7
8    12.4 176.00  72.50 37.8  99.6  88.5  97.1  60.0 39.4  23.2   30.5   29.0  18.8
9     4.1 191.00  74.00 38.1 100.9  82.5  99.9  62.9 38.3  23.8   35.9   31.1  18.2
10   11.7 198.25  73.50 42.1  99.6  88.6 104.1  63.1 41.7  25.0   35.6   30.0  19.2


And after standardizing it:

1   -0.831228836 -0.898881671 -0.98330178 -0.77420686 -0.952294055 -0.712961621 -0.814552365 -0.0625400993 -0.53901713 -0.825399059 -0.08244945
2   -1.588060506 -0.185928394  0.75868364  0.23560461 -0.889886435 -0.931523054 -0.155497233 -0.1252522485 -0.53901713  0.295114747 -0.59529632
3    0.755676279 -0.908262635 -1.56396359 -1.74011349 -0.615292906 -0.444727135 -0.077038289  0.0628841989  0.15515266  0.743320270 -1.17652277
4   -1.063161122  0.245595958  0.75868364 -0.24734870  0.133598535 -0.593746294  0.236797489  0.1674044475 -0.53901713 -0.153090775  0.05430971
5    1.170713001  0.226834030  0.37157577 -1.56449410 -0.428070046  0.757360745  0.346640011  0.8154299886  1.58687786  0.743320270 -0.01406987
6    0.218569932  1.202454304  1.72645331  0.45512884  0.470599683  0.201022552  1.272455554  1.4007433805  1.50010664  1.938534997  1.18257281
7    0.011051571  0.104881496 -0.20908604 -0.68639717  0.545488828 -0.166558039  0.095571389 -0.1879643976 -0.10516101 -0.078389855 -0.11663925
8   -0.819021874 -0.082737788  0.85546060 -0.07172932 -0.140994994 -0.385119472 -0.406565855  0.1465003978  0.37208072  0.145712907 -0.59529632
9   -1.832199755  0.480120063  1.43612241  0.05998522  0.021264819 -0.981196107  0.032804234  0.7527178395 -0.10516101  0.593918429  1.25095239
10  -0.904470611  0.752168024  1.24256848  1.81617909 -0.140994994 -0.375184861  0.691859366  0.7945259389  1.36994980  1.490329474  1.14838302



this is the result of applying PCA to the data matrix

Standard deviations:
 [1] 30.6645414  7.5513852  3.6927427  2.8703435  2.5363007  1.9136933  1.5624131  1.3689630  1.2976189
[10]  1.1633458  1.1118231  0.7847148  0.4802303

Rotation:
             PC1         PC2         PC3          PC4          PC5          PC6          PC7         PC8
var1  0.18110712 -0.74864138 -0.46070566 -0.365658769  0.192810075 -0.132529979  0.023764851  0.03674873
var2  0.86458284  0.34243386 -0.05766909 -0.235504989 -0.046075934  0.001493006 -0.024535011  0.13439659
var3  0.03765598  0.20097537 -0.15709612 -0.343218776 -0.295201121 -0.073295697 -0.086930370 -0.54389141
var4  0.05965733  0.01737951  0.09854179 -0.030801791  0.125735684  0.341795876 -0.001735808  0.37152696
var5  0.23845698 -0.20616399  0.68948870  0.025904812  0.391188182 -0.428933369 -0.101780281 -0.16965893
var6  0.29928369 -0.47394636  0.24791449  0.341235161 -0.511378719  0.447071255 -0.077534385 -0.13198544
var7  0.19503685  0.01385823 -0.24126047  0.531403827 -0.127426510 -0.410568454  0.608163973 -0.01265457
var8  0.13261863  0.06839078 -0.37740589  0.535332339  0.366103479  0.032376851 -0.574484605 -0.05645694
var9  0.06246705  0.04407384 -0.09545362  0.037993146 -0.036651080  0.012347288 -0.192976142 -0.13027876
var10 0.03027791  0.05533988 -0.03749859 -0.009257423  0.011026593 -0.010770032 -0.104041067  0.12125263
var11 0.07435322  0.04334969 -0.02666944  0.032036374  0.464035624  0.454970952  0.347507539 -0.60527541
var12 0.04328710  0.04731771  0.00360668 -0.054200633  0.275901346  0.297800123  0.324323749  0.30487145
var13 0.02095652  0.02146485  0.03598618 -0.022510780  0.005192075  0.103988977  0.031541374  0.07877455

               PC9         PC10         PC11        PC12         PC13
var1  -0.005328345  0.030549780 -0.049283616 -0.02211988  0.015660892
var2   0.170766596 -0.144031738  0.028862963  0.06984674  0.006293703
var3  -0.282549313  0.548650592  0.131284937 -0.14740722 -0.002384605
var4   0.024070488  0.614154008 -0.551480394 -0.03446124 -0.178123011
var5  -0.157551008  0.147685248  0.008044148 -0.04068258  0.007778992
var6  -0.058675551  0.006344813  0.130814072 -0.04088919 -0.028655330
var7  -0.099243751  0.171852216 -0.149231752 -0.06690208 -0.014693444
var8   0.006629025  0.199158097  0.187226774 -0.02511968  0.070896819
var9  -0.658214712 -0.320120384 -0.500003990  0.37630539 -0.023642902
var10 -0.259704149 -0.273030750 -0.074006053 -0.83676032 -0.348034215
var11  0.157450716 -0.148991117 -0.153561998 -0.08742543 -0.056513679
var12 -0.560837576  0.098418477  0.542670501  0.10593629 -0.007670188
var13 -0.110526479 -0.012776152 -0.165279275 -0.32037870  0.914832392





this is the result of applying PCA to the standardized data matrix

Standard deviations:
 [1] 2.9252556 1.1792994 0.8623322 0.7219158 0.6812740 0.5863879 0.4981330 0.4630637 0.4414004 0.4212403
[11] 0.2776168 0.2208503 0.1366760

Rotation:
            PC1          PC2         PC3           PC4         PC5         PC6         PC7         PC8
var1  0.2214240 -0.528940022 -0.22438633 -0.0324934310  0.10237112 -0.47563754  0.33100129 -0.19102715
var2  0.3345528  0.023162612 -0.10713782 -0.0001760222  0.11352232  0.04469088 -0.10098447  0.18643834
var3  0.1517554  0.605551504 -0.38237721  0.0314469316  0.59507576 -0.18321494  0.08116801  0.08111090
var4  0.2862444 -0.018344029  0.34874004 -0.1945368511  0.29590927  0.30061030 -0.39160283 -0.20869249
var5  0.3027658 -0.244481933  0.03265146 -0.1559266926  0.12932226  0.02393963 -0.16226550  0.45698236
var6  0.3005716 -0.329554056 -0.13879142 -0.1626911071  0.11072123 -0.05063054 -0.06388229  0.08496036
var7  0.3160710 -0.061820244 -0.23144824  0.1247108501 -0.06038088  0.16065274 -0.18772748  0.07057902
var8  0.2973041  0.006421036 -0.17862551  0.3873606332 -0.28005086  0.34119818 -0.13590921 -0.16267799
var9  0.2955016  0.144234590 -0.26323414 -0.0068912717 -0.18117677 -0.01771120  0.03379585 -0.62830066
var10 0.2552571  0.326437989 -0.09749610 -0.2291093560 -0.61898234 -0.22847105  0.01411768  0.38312210
var11 0.2822210  0.016911093  0.28838652  0.4287108516  0.07554337  0.28403417  0.66673623  0.19445840
var12 0.2491444  0.135956228  0.53597029  0.3883062869 -0.01492335 -0.60228918 -0.26232244 -0.08966993
var13 0.2637809  0.185151550  0.33956904 -0.5971722620 -0.04476545  0.08083909  0.34854493 -0.20909842

              PC9        PC10         PC11        PC12         PC13
var1  -0.40247469  0.05379733  0.063919267  0.26040567  0.015743241
var2   0.07150091  0.02906931 -0.009540692  0.02481489  0.899751898
var3  -0.11290113  0.06735920  0.100968481 -0.03902708 -0.182276335
var4  -0.52110479 -0.28262405 -0.150175234  0.06709027 -0.070349152
var5   0.36282385 -0.25907897  0.461043958  0.30566521 -0.256838644
var6   0.13245560  0.04742256 -0.174886071 -0.81057186 -0.147622115
var7   0.17950233  0.40472605 -0.602790052  0.38468466 -0.223865462
var8  -0.24062368  0.33426221  0.545545641 -0.12880676 -0.077404092
var9   0.37912190 -0.49731546 -0.023067506  0.04355862 -0.002718371
var10 -0.34729467 -0.21088629 -0.112243026 -0.03892369 -0.069031092
var11 -0.01252875 -0.22996539 -0.162156246 -0.04827985 -0.052013577
var12  0.14733228  0.12821614  0.009932520 -0.05164105 -0.025625894
var13  0.15194616  0.45367703  0.139390086  0.04590545 -0.004970894

In this case, is it better to standardize the matrix or leave it as it is?
Also, how do I compare which method gives the better result?
I also found that the proportion of variance of the first principal component
was reduced a lot after standardizing. Would that mean that it is a bad idea
to standardize the matrix?
#
masterinex wrote:
Well, we try to understand you, but we do not understand you either. I think
you really need to consult a statistics textbook on PCA if my answer was not
sufficient. Given your questions, I doubt you understand what PCA does
and how it works. It does not predict anything.

Uwe Ligges
#
masterinex wrote:
I told you that you need to do the interpretation and that it depends on the
variables (are they measured in the same units?). Nobody can give you an
answer without knowledge about the variables.
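
A rough sketch of why the units matter, again on simulated data with made-up column names. Without scaling, PC1 tends to point along whichever column has the largest variance, and a mere change of units changes the result. The unstandardized output posted earlier in this thread shows the same pattern: PC1 has a standard deviation of about 30.7 and its largest loading (0.86) is on var2.

```r
# Simulated data; column names and scales are invented for illustration.
set.seed(3)
n <- 200
size <- rnorm(n)
m <- cbind(
  var_a = 180 + 30 * size + rnorm(n, sd = 10),
  var_b = 70 + 2.5 * size + rnorm(n, sd = 1.5),
  var_c = 18 + 0.6 * size + rnorm(n, sd = 0.6)
)

# Per-column standard deviations next to the PC1 loadings of both analyses.
col_sd   <- apply(m, 2, sd)
load_raw <- prcomp(m)$rotation[, "PC1"]
load_std <- prcomp(m, scale. = TRUE)$rotation[, "PC1"]
round(cbind(sd = col_sd, raw = load_raw, standardized = load_std), 3)

# Express var_a in different units (divide by 1000) and repeat.
m2 <- m
m2[, "var_a"] <- m2[, "var_a"] / 1000
load_raw2 <- prcomp(m2)$rotation[, "PC1"]

# The unscaled PCA now loads on a different column,
# while the standardized PCA has the same component variances as before.
names(which.max(abs(load_raw)))
names(which.max(abs(load_raw2)))
all.equal(prcomp(m, scale. = TRUE)$sdev, prcomp(m2, scale. = TRUE)$sdev)
```

If the variables are in different units, this dependence on the choice of units is the usual argument for standardizing; if they share units and the differences in spread are meaningful, the unscaled analysis can be the more interpretable one.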

And note, as Hadley mentioned before, that there are certainly
statistical consultants available in your area, which is unknown to us
given that you post anonymously from some hotmail.com account - which is also
not very helpful for getting further answers.

Uwe Ligges
#
On Nov 22, 2009, at 10:22 AM, Uwe Ligges wrote:

            
It's likely to have been homework: A quick search on "masterinex" "xevilgang79" reveals which university this undergraduate student is at. It also produces a phone number, which can be used to look up an address, and a cell phone number.

MK
#
Actually, it's for an assignment, Michael.
All I'm looking for is some help and suggestions; please don't get me wrong.
I do believe that this is a helpful community.
______________________________________________
R-help at r-project.org mailing list

PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.