Principal component analysis - R-help

Mon, Dec 9, 2002 3:39 AM

Dear R users,

I'm trying to cluster 30 gene chips using principal component analysis with
prcomp from package mva. Each chip is a point with 1,000 dimensions. PCA is
probably just one of several methods to cluster the 30 chips. However, I
don't know how to run prcomp, and I don't know how to interpret its output.

If there are 30 data points in 1,000 dimensions each, do I have to provide
the data in a 1,000x30 matrix or data frame (i.e. 1000 columns)?

  x.HU.04h.Ctr.118.01.4.ctrl x.HU.04h.010.118.04.4.0.1
1                         21                        45
2                         24                        35
3                        109                       173
4                         86                        99
5                        130                       204
  x.HU.04h.050.118.05.4.0.5 x.HU.04h.100.118.06.4.1
1                        24                      28
2                        25                      25
3                       107                     125
4                        72                      79
5                       126                     166
  x.HU.24h.Ctr.118.07.24.ctrl
1                          22
2                          20
3                          95
4                          61
5                         128

                             1  2   3  4   5
x.HU.04h.Ctr.118.01.4.ctrl  21 24 109 86 130
x.HU.04h.010.118.04.4.0.1   45 35 173 99 204
x.HU.04h.050.118.05.4.0.5   24 25 107 72 126
x.HU.04h.100.118.06.4.1     28 25 125 79 166
x.HU.24h.Ctr.118.07.24.ctrl 22 20  95 61 128

There are 30 "PC"s displayed (I've truncated the output). Shouldn't there be
1000 PCs, with the 1st PC being the most discriminative PC? In a principal
component analysis, aren't there as many PCs as dimensions? On the other hand I
thought that PCA somehow collapses dimensionality ... What are the PCs for
my 30 data points? Afterwards I'd also like to display the results in a
diagram, e.g. in 2 or 3 dimensions, to visualise clusters. I'm not sure I'm
doing the right thing.

	I'm happy for any comments and explanations,

	kind regards,

	Arne

$x
                                     PC1          PC2         PC3         PC4        PC5         PC6
  x.HU.04h.Ctr.118.01.4.ctrl  -1272.1203  -249.465634 -2185.20558  1083.15814  421.67755   100.26612
  x.HU.04h.010.118.04.4.0.1   -1493.8623  1483.260490 -1090.31102  -286.70562 1274.34804    37.88463
  x.HU.04h.050.118.05.4.0.5   -2688.5157  2055.336930   -83.70279   154.24116 1202.58763  -604.08124
  x.HU.04h.100.118.06.4.1     -2477.3271  2029.248507   -14.37922  -314.08755 1422.88800  -509.37791
  x.HU.24h.Ctr.118.07.24.ctrl -3198.7071 -2264.516725   209.04504   763.56664 -762.61481  -542.35302
  x.HU.24h.010.118.10.24.0.1  -3370.0556 -2190.205040   298.17498   702.80862 -783.48849  -509.22595
  x.HU.24h.050.118.11.24.0.5  -2662.8329 -1436.400955  1478.81635   129.83910  406.10451   337.88507
  x.HU.24h.100.118.12.24.1    -4193.3836 -1210.594052  1844.22923   914.84373  -11.33207    11.58916
  x.HU.04h.Ctr.206.13.4.ctrl   2305.5848  -180.584730 -2017.05340  1274.07436  132.14756   930.35799
  x.HU.04h.010.206.14.4.0.1    1703.4976  2032.883878   -78.67578  1697.50799 -301.93647   234.25139
  x.HU.04h.025.206.15.4.0.25   1294.1932  2876.862370   534.11002  1229.73355  -68.31220   226.47566
  x.HU.04h.050.206.16.4.0.5    3666.8441  3520.249397  1187.37289   -45.83772 -271.06706   145.75181
  x.HU.04h.100.206.17.4.1      3657.9687  3432.347857  1318.94834  -484.73817 -405.36077   349.88323
  x.HU.24h.Ctr.206.18.24.ctrl  5796.1801 -2985.085353 -1052.08033  -306.45667  265.22940  -732.59152
  x.HU.24h.010.206.19.24.0.1   4429.6809 -2685.801572 -1027.66157   822.76848  171.15959 -1118.12987
  x.HU.24h.025.206.20.24.0.25  5672.4279 -1559.896071  1177.74742  -734.37026  336.46183  -132.25625
  x.HU.24h.050.206.21.24.0.5   4855.8534  -809.112994  1825.99459  -594.09109  190.00907  -234.33254
  x.HU.24h.100.206.22.24.1     4015.2594  -166.349964  1015.96643   622.86202 -267.17075   400.45741
  x.HU.04h.Ctr.821.23.4.ctrl   -485.9779    91.410337 -2446.35100  -263.83351 -453.89005   491.14145
  x.HU.04h.Ctr.821.24.4.ctrl    390.5580    -8.264721 -2707.56580 -1265.35762 -156.67885   555.41157
  x.HU.04h.010.821.25.4.0.1   -1138.4096  1733.090222  -885.89460  -460.04065 -276.68619  -200.20132
  x.HU.04h.025.821.26.4.0.25  -1622.0565  2333.333749  -297.50664  -838.12742 -783.19740  -206.76327
  x.HU.04h.050.821.27.4.0.5   -1920.9992  2462.596326  -213.80507  -463.02219 -683.90138  -731.04753
  x.HU.04h.100.821.28.4.1     -2288.0687  2251.971783   223.28215  -472.78173 -668.16917  -623.88411
  x.HU.24h.Ctr.821.29.24.ctrl  -599.7405 -2105.800732  -792.89966  -902.43731 -158.37800   314.34868
  x.HU.24h.Ctr.821.30.24.ctrl  -743.5533 -2154.937309  -350.37118  -744.69040 -479.01087   172.03340
  x.HU.24h.010.821.31.24.0.1  -2240.3848 -1963.626249   306.05426  -178.59331 -166.16473   266.24216
  x.HU.24h.025.821.32.24.0.25 -1840.1627 -1667.075636  1271.79029  -333.21614 -178.28014   477.06373
  x.HU.24h.050.821.33.24.0.5  -1575.7248 -1431.615872  1059.90748  -531.84286  537.76332   502.46140
  x.HU.24h.100.821.34.24.1    -1976.1656 -1233.258236  1492.02417  -175.17357  515.26288   590.73966

[...]

Jonathan Baron

Mon, Dec 9, 2002 3:50 AM

On 12/09/02 11:38, Arne.Muller at aventis.com wrote:

PCA is almost certainly not what you want.  Kmeans might work (or
other functions designed for clustering).
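As a rough illustration of that suggestion, here is a minimal sketch of clustering chips directly with kmeans and, as one alternative, hclust. The data are simulated stand-ins (the real chip values are not available here), and the object names and the choice of two clusters are assumptions made only for the example:

```r
set.seed(1)

# Simulated stand-in for the real data: 30 chips (rows) x 1000 genes (columns).
# The first 15 chips are shifted in the first 50 genes to create two groups.
chips <- matrix(rnorm(30 * 1000), nrow = 30,
                dimnames = list(paste0("chip", 1:30), NULL))
chips[1:15, 1:50] <- chips[1:15, 1:50] + 3

# k-means with an assumed number of clusters; nstart repeats the random start.
km <- kmeans(chips, centers = 2, nstart = 10)
table(km$cluster)

# Hierarchical clustering on Euclidean distances between chips.
hc <- hclust(dist(chips), method = "average")
groups <- cutree(hc, k = 2)   # cut the tree into the same assumed 2 groups
table(groups)
```

Whether two clusters, Euclidean distance or average linkage suit real chip data is a separate question; the sketch only shows the shape of the calls.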

The reason your output is limited to 30 components is (roughly)
that, once you have this many, all the other 970 are predictable
from these, because you have only 30 observations.
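The point about the number of components can be checked on simulated data. The sketch below assumes a current R version, where prcomp is in the stats package, and uses random numbers in place of the real chips:

```r
set.seed(1)

# Simulated stand-in: 30 observations (rows) in 1000 dimensions (columns).
x <- matrix(rnorm(30 * 1000), nrow = 30)

pca <- prcomp(x)               # columns are centred by default

length(pca$sdev)               # number of components returned
dim(pca$rotation)              # loadings: one row per variable, one column per PC
dim(pca$x)                     # scores: one row per observation

# After centring, 30 points span at most 29 directions, so the last
# standard deviation is zero up to rounding error.
signif(tail(pca$sdev, 3), 3)
```

So even among the 30 components that are returned, only 29 carry any variance once the column means have been removed.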

Jonathan Baron, Professor of Psychology, University of Pennsylvania
R page:               http://finzi.psych.upenn.edu/

Brian Ripley

Mon, Dec 9, 2002 4:15 AM

On Mon, 9 Dec 2002 Arne.Muller at aventis.com wrote:

None of those. A 30x1000 matrix.
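In other words, prcomp takes observations (chips) as rows and variables (genes) as columns. A minimal sketch using only the five values per chip quoted in the original question; the object names `genes_by_chip`, `chips` and `pca` are made up for illustration, and in current R versions prcomp is found in the stats package rather than mva:

```r
# Excerpt from the question: 5 genes (rows) x 5 chips (columns).
genes_by_chip <- cbind(
  x.HU.04h.Ctr.118.01.4.ctrl  = c(21, 24, 109, 86, 130),
  x.HU.04h.010.118.04.4.0.1   = c(45, 35, 173, 99, 204),
  x.HU.04h.050.118.05.4.0.5   = c(24, 25, 107, 72, 126),
  x.HU.04h.100.118.06.4.1     = c(28, 25, 125, 79, 166),
  x.HU.24h.Ctr.118.07.24.ctrl = c(22, 20, 95, 61, 128)
)

# prcomp expects one row per observation, so transpose to chips x genes.
chips <- t(genes_by_chip)

pca <- prcomp(chips)   # centres each column (gene) by default
dim(pca$x)             # one row of scores per chip
summary(pca)           # standard deviation and variance share per component
```

With the full data the same pattern would give a 30x1000 matrix going into prcomp.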

No.  970 of them span the null space: you have massive over-fitting.

Well, statistically neither am I.  But mathematically at least, the PCs
for your 30 data points are the `x' component of the result, and you can
plot them via

plot(pca$x[, 1:2])

in two dimensions, or use scatterplot3d (a package) or (preferably as it
is dynamic) the ggobi or xgobi interfaces in 3D.
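A slightly fuller version of the two-dimensional plot might look like the sketch below. It runs on simulated data standing in for the 30x1000 chip matrix, uses base graphics only, and the names `chips`, `scores` and `var_explained` are assumptions for the example:

```r
set.seed(1)

# Simulated stand-in for the 30 x 1000 chip matrix (rows = chips).
chips <- matrix(rnorm(30 * 1000), nrow = 30,
                dimnames = list(paste0("chip", 1:30), NULL))
pca <- prcomp(chips)

# Scores on the first two principal components (note the comma: all rows).
scores <- pca$x[, 1:2]

# Draw an empty frame, then label each chip at its position.
plot(scores, type = "n", xlab = "PC1", ylab = "PC2")
text(scores, labels = rownames(scores), cex = 0.7)

# Share of the total variance captured by the two plotted components.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
sum(var_explained[1:2])
```

Labelling the points by chip name makes it easier to see whether chips from the same time point or dose fall near each other; the variance share indicates how much a 2D view leaves out.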

This sort of thing *is* covered in many of the texts about S (or S-PLUS or
R).

Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272860 (secr)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595