Dear R users,
I'm trying to cluster 30 gene chips using principal component analysis in
package mva.prcomp. Each chip is a point with 1,000 dimensions. PCA is
probably just one of several methods to cluster the 30 chips. However, I
don't know how to run prcomp, and I don't know how to interpret its output.
If there are 30 data points in 1,000 dimensions each, do I have to provide
the data in a 1,000x30 matrix or data frame (i.e. 1000 columns)?
There are 30 "PC"s displayed (I've truncated the output). Shouldn't there be
1000 PCs, with the 1st PC being the most discriminative PC? In a principal
component analysis, aren't there as many PCs as dimensions? On the other hand I
thought that PCA somehow collapses dimensionality ... . What are the PCs for
my 30 data points? Afterwards I'd also like to display the results in a
diagram, e.g. in 2 or 3 dimensions, to visualise clusters. I'm not sure I'm
doing the right thing.
I'm happy for any comments and explanations,
kind regards,
Arne
On 12/09/02 11:38, Arne.Muller at aventis.com wrote:
> Dear R users,
> I'm trying to cluster 30 gene chips using principal component analysis in
> package mva.prcomp. Each chip is a point with 1,000 dimensions. PCA is
> probably just one of several methods to cluster the 30 chips. However, I
> don't know how to run prcomp, and I don't know how to interpret its output.
PCA is almost certainly not what you want. Kmeans might work (or
other functions designed for clustering).
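A minimal sketch of the k-means suggestion, run on simulated data rather than the poster's chips. The matrix `chips`, the three shifted groups and the choice of `centers = 3` are assumptions made purely for illustration. `kmeans` expects one observation (chip) per row, and in current R versions it lives in the stats package.

```r
set.seed(1)

# Simulated stand-in: 30 chips (rows) x 1000 genes (columns),
# three groups of 10 chips whose means are shifted apart.
group <- rep(1:3, each = 10)
chips <- matrix(rnorm(30 * 1000), nrow = 30) + 2 * group

# k-means with 3 clusters; nstart reruns it from several random starts
# and keeps the solution with the smallest within-cluster sum of squares.
km <- kmeans(chips, centers = 3, nstart = 25)

# Cross-tabulate the simulated groups against the clusters found.
table(group, km$cluster)
```

With real chips the number of clusters is not known in advance, so `centers` would have to be chosen by other means.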
The reason your output is limited to 30 components is (roughly)
that, once you have this many, all the other 970 are predictable
from these, because you have only 30 observations.
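The point about 30 observations can be checked on simulated data. This is only a sketch: `chips` is a made-up 30 x 1000 matrix of random numbers, with chips as rows and genes as columns, which is the orientation `prcomp` expects (observations in rows, variables in columns).

```r
set.seed(1)

# Simulated stand-in: 30 chips as rows, 1000 genes as columns.
chips <- matrix(rnorm(30 * 1000), nrow = 30, ncol = 1000)

pca <- prcomp(chips)

length(pca$sdev)    # number of components returned
dim(pca$rotation)   # loadings: one row per gene, one column per PC
dim(pca$x)          # scores: one row per chip, one column per PC

# Centering the columns removes one more dimension, so the last
# standard deviation is zero up to rounding error.
pca$sdev[30]
```

Only as many components as there are observations come back, and after centering the last of those carries no variance either.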
On Mon, 9 Dec 2002 Arne.Muller at aventis.com wrote:
> Dear R users,
> I'm trying to cluster 30 gene chips using principal component analysis in
> package mva.prcomp. Each chip is a point with 1,000 dimensions. PCA is
> probably just one of several methods to cluster the 30 chips. However, I
> don't know how to run prcomp, and I don't know how to interpret its output.
> If there are 30 data points in 1,000 dimensions each, do I have to provide
> the data in a 1,000x30 matrix or data frame (i.e. 1000 columns)?
> There are 30 "PC"s displayed (I've truncated the output). Shouldn't there be
> 1000 PCs, with the 1st PC being the most discriminative PC? In a principal
No. 970 of them span the null space: you have massive over-fitting.
> component analysis, aren't there as many PCs as dimensions? On the other hand I
> thought that PCA somehow collapses dimensionality ... . What are the PCs for
> my 30 data points? Afterwards I'd also like to display the results in a
> diagram, e.g. in 2 or 3 dimensions, to visualise clusters. I'm not sure I'm
> doing the right thing.
Well, statistically neither am I. But mathematically at least, the PCs
for your 30 data points are the `x' component of the result, and you can
plot them via
plot(pca$x[, 1:2])
in two dimensions, or use scatterplot3d (a package) or (preferably as it
is dynamic) the ggobi or xgobi interfaces in 3D.
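A short sketch of the two-dimensional plot, again on simulated data. The matrix `chips` and its three shifted groups are assumptions for illustration, and labelling each point with its chip number is just one possible choice.

```r
set.seed(1)

# Simulated stand-in: 30 chips (rows) x 1000 genes (columns),
# three groups of 10 chips whose means are shifted apart.
group <- rep(1:3, each = 10)
chips <- matrix(rnorm(30 * 1000), nrow = 30) + 2 * group

pca <- prcomp(chips)

# Scores of the 30 chips on the first two principal components.
scores <- pca$x[, 1:2]

# Set up empty axes, then draw each chip as its row number.
plot(scores, type = "n", xlab = "PC1", ylab = "PC2")
text(scores, labels = seq_len(nrow(scores)))
```

Note the comma in `pca$x[, 1:2]`: it selects the first two columns of the score matrix, one row per chip.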
This sort of thing *is* covered in many of the texts about S (or S-PLUS or
R).
Brian D. Ripley, ripley at stats.ox.ac.uk
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel: +44 1865 272861 (self)
1 South Parks Road, +44 1865 272860 (secr)
Oxford OX1 3TG, UK Fax: +44 1865 272595