I have a decent sized matrix (36 x 11,000) that I have preformed a PCA on with prcomp(), but due to the large number of variables I can't plot the result with biplot(). How else can I plot the PCA output? I tried posting this before, but got no responses so I'm trying again. Surely this is a common problem, but I can't find a solution with google? The University of Dundee is a registered Scottish Charity, No: SC015096
How to plot PCA output?
14 messages · Christian Cole, Jessica Streicher, Bryan Hanson +2 more
That depends on what you want to plot there. Basically, you could just use plot() with pcaResult$x. You might need to define which PCs you want to plot there though. pcaResult<-prcomp(iris[,1:4]) plot(pcaResult$x) # gives the first 2 PCs plot(pcaResult$x[,2:3]) #gives the second vs the 3rd PC or if you want to see more you can use pairs() pairs(pcaResult$x) if you want things colored, theres the col parameter that works for both functions: pairs(pcaResult$x,col=iris[,5]) Does this help? Am 07.05.2012 um 12:22 schrieb Christian Cole:
I have a decent sized matrix (36 x 11,000) that I have preformed a PCA on with prcomp(), but due to the large number of variables I can't plot the result with biplot(). How else can I plot the PCA output? I tried posting this before, but got no responses so I'm trying again. Surely this is a common problem, but I can't find a solution with google? The University of Dundee is a registered Scottish Charity, No: SC015096
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
To add: If thats not it, maybe you could be a bit more specific about what you consider the "result", and how you want it visualized. Am 07.05.2012 um 15:24 schrieb Jessica Streicher:
That depends on what you want to plot there. Basically, you could just use plot() with pcaResult$x. You might need to define which PCs you want to plot there though. pcaResult<-prcomp(iris[,1:4]) plot(pcaResult$x) # gives the first 2 PCs plot(pcaResult$x[,2:3]) #gives the second vs the 3rd PC or if you want to see more you can use pairs() pairs(pcaResult$x) if you want things colored, theres the col parameter that works for both functions: pairs(pcaResult$x,col=iris[,5]) Does this help? Am 07.05.2012 um 12:22 schrieb Christian Cole:
I have a decent sized matrix (36 x 11,000) that I have preformed a PCA on with prcomp(), but due to the large number of variables I can't plot the result with biplot(). How else can I plot the PCA output? I tried posting this before, but got no responses so I'm trying again. Surely this is a common problem, but I can't find a solution with google? The University of Dundee is a registered Scottish Charity, No: SC015096
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Christian, is that 36 samples x 11K variables? Sounds like it. Is this spectroscopic data? In any case, the scores are in the list element $x as follows: answer <- prcomp(your matrix) answer$x contains the scores, so if you want to plot the 1st 2 pcs, you could do plot(answer$x[,1], answer$x[,2]) Because the columns of answer$x contain the scores of the PCs in order. [I see Jessica just answered...] If you want the loading plot, it's going to be interesting with all those variables, but this will do it: plot(1:11000, answer$rotation[,1], type = "l") # for the loadings of the 1st PC Depending upon what kind of data this is, the 1:11000 could be replaced by something more sensible. If it is spectroscopic data, then replace it with your frequency values. By the way, plot(answer) will give you the scree plot to determine how many PCs are worthy. Good luck. Bryan *********** Bryan Hanson Professor of Chemistry & Biochemistry DePauw University
On May 7, 2012, at 6:22 AM, Christian Cole wrote:
I have a decent sized matrix (36 x 11,000) that I have preformed a PCA on with prcomp(), but due to the large number of variables I can't plot the result with biplot(). How else can I plot the PCA output? I tried posting this before, but got no responses so I'm trying again. Surely this is a common problem, but I can't find a solution with google? The University of Dundee is a registered Scottish Charity, No: SC015096
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Hi Bryan, Many thanks for the replies. The data is gene expression data for 36 samples over 11k genes. I see that I can plot PC1 vs PC2 by using $x, but compared to biplot() I can see that the range of values are different. For example, if I use plot() the PC1 scale ranges from -150 to 150 whereas in biplot() it scales from -0.4 to 0.4. Do you know what scaling biplot() uses? Does it even matter? Cheers, Chris
On 07/05/2012 14:36, "Bryan Hanson" <hanson at depauw.edu> wrote:
Christian, is that 36 samples x 11K variables? Sounds like it. Is this spectroscopic data? In any case, the scores are in the list element $x as follows: answer <- prcomp(your matrix) answer$x contains the scores, so if you want to plot the 1st 2 pcs, you could do plot(answer$x[,1], answer$x[,2]) Because the columns of answer$x contain the scores of the PCs in order. [I see Jessica just answered...] If you want the loading plot, it's going to be interesting with all those variables, but this will do it: plot(1:11000, answer$rotation[,1], type = "l") # for the loadings of the 1st PC Depending upon what kind of data this is, the 1:11000 could be replaced by something more sensible. If it is spectroscopic data, then replace it with your frequency values. By the way, plot(answer) will give you the scree plot to determine how many PCs are worthy. Good luck. Bryan *********** Bryan Hanson Professor of Chemistry & Biochemistry DePauw University On May 7, 2012, at 6:22 AM, Christian Cole wrote:
I have a decent sized matrix (36 x 11,000) that I have preformed a PCA on with prcomp(), but due to the large number of variables I can't plot the result with biplot(). How else can I plot the PCA output? I tried posting this before, but got no responses so I'm trying again. Surely this is a common problem, but I can't find a solution with google? The University of Dundee is a registered Scottish Charity, No: SC015096
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
The University of Dundee is a registered Scottish Charity, No: SC015096
Hi Jessica, Yes, that does help. It confirms my digging around in the prcomp object. I was plotting $x, but wasn't sure whether this was appropriate. Mainly because the data ranges are different in $x than when plotted by biplot() - as I mentioned my reply to Bryan. Do you know if this difference is data range matters? Many thanks, Chris
On 07/05/2012 14:24, "Jessica Streicher" <j.streicher at micromata.de> wrote:
That depends on what you want to plot there. Basically, you could just use plot() with pcaResult$x. You might need to define which PCs you want to plot there though. pcaResult<-prcomp(iris[,1:4]) plot(pcaResult$x) # gives the first 2 PCs plot(pcaResult$x[,2:3]) #gives the second vs the 3rd PC or if you want to see more you can use pairs() pairs(pcaResult$x) if you want things colored, theres the col parameter that works for both functions: pairs(pcaResult$x,col=iris[,5]) Does this help? Am 07.05.2012 um 12:22 schrieb Christian Cole:
I have a decent sized matrix (36 x 11,000) that I have preformed a PCA on with prcomp(), but due to the large number of variables I can't plot the result with biplot(). How else can I plot the PCA output? I tried posting this before, but got no responses so I'm trying again. Surely this is a common problem, but I can't find a solution with google? The University of Dundee is a registered Scottish Charity, No: SC015096
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
The University of Dundee is a registered Scottish Charity, No: SC015096
An embedded and charset-unspecified text was scrubbed... Name: not available URL: <https://stat.ethz.ch/pipermail/r-help/attachments/20120507/3fe023ff/attachment.pl>
And i always forget the question.. I haven't understood biplots a 100%, but from what i gleaned this scaling is done so it looks better/is easier to read, while the scaling retains certain properties of the biplot (something about projecting). If you want to use the data for anything else, i wouldn't use that scaling, just use what the prcomp() or princomp() function returns to you. Am 07.05.2012 um 16:11 schrieb Jessica Streicher:
Biplot, depending on what parameters you give it, scales the data in a certain way. See http://stat.ethz.ch/R-manual/R-patched/library/stats/html/biplot.princomp.html scale The variables are scaled by lambda ^ scale and the observations are scaled by lambda ^ (1-scale) where lambda are the singular values as computed by princomp. Normally 0 <= scale <= 1, and a warning will be issued if the specified scale is outside this range. Am 07.05.2012 um 16:01 schrieb Christian Cole:
Hi Jessica, Yes, that does help. It confirms my digging around in the prcomp object. I was plotting $x, but wasn't sure whether this was appropriate. Mainly because the data ranges are different in $x than when plotted by biplot() - as I mentioned my reply to Bryan. Do you know if this difference is data range matters? Many thanks, Chris On 07/05/2012 14:24, "Jessica Streicher" <j.streicher at micromata.de> wrote:
That depends on what you want to plot there. Basically, you could just use plot() with pcaResult$x. You might need to define which PCs you want to plot there though. pcaResult<-prcomp(iris[,1:4]) plot(pcaResult$x) # gives the first 2 PCs plot(pcaResult$x[,2:3]) #gives the second vs the 3rd PC or if you want to see more you can use pairs() pairs(pcaResult$x) if you want things colored, theres the col parameter that works for both functions: pairs(pcaResult$x,col=iris[,5]) Does this help? Am 07.05.2012 um 12:22 schrieb Christian Cole:
I have a decent sized matrix (36 x 11,000) that I have preformed a PCA on with prcomp(), but due to the large number of variables I can't plot the result with biplot(). How else can I plot the PCA output? I tried posting this before, but got no responses so I'm trying again. Surely this is a common problem, but I can't find a solution with google? The University of Dundee is a registered Scottish Charity, No: SC015096
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
The University of Dundee is a registered Scottish Charity, No: SC015096
[[alternative HTML version deleted]]
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
I don't know the answer, Jessica gave some insight. I avoid the biplot at all costs, because IMHO it violates one of the tenets of good graphic design: It has two entirely different scales on axes. These are maximally confusing to the end-user. So I never use it. If it is gene expression data, have you looked in Bioconductor for something that will help you? Maybe runPCA in package EMA? Bryan
On May 7, 2012, at 9:57 AM, Christian Cole wrote:
Hi Bryan, Many thanks for the replies. The data is gene expression data for 36 samples over 11k genes. I see that I can plot PC1 vs PC2 by using $x, but compared to biplot() I can see that the range of values are different. For example, if I use plot() the PC1 scale ranges from -150 to 150 whereas in biplot() it scales from -0.4 to 0.4. Do you know what scaling biplot() uses? Does it even matter? Cheers, Chris On 07/05/2012 14:36, "Bryan Hanson" <hanson at depauw.edu> wrote:
Christian, is that 36 samples x 11K variables? Sounds like it. Is this spectroscopic data? In any case, the scores are in the list element $x as follows: answer <- prcomp(your matrix) answer$x contains the scores, so if you want to plot the 1st 2 pcs, you could do plot(answer$x[,1], answer$x[,2]) Because the columns of answer$x contain the scores of the PCs in order. [I see Jessica just answered...] If you want the loading plot, it's going to be interesting with all those variables, but this will do it: plot(1:11000, answer$rotation[,1], type = "l") # for the loadings of the 1st PC Depending upon what kind of data this is, the 1:11000 could be replaced by something more sensible. If it is spectroscopic data, then replace it with your frequency values. By the way, plot(answer) will give you the scree plot to determine how many PCs are worthy. Good luck. Bryan *********** Bryan Hanson Professor of Chemistry & Biochemistry DePauw University On May 7, 2012, at 6:22 AM, Christian Cole wrote:
I have a decent sized matrix (36 x 11,000) that I have preformed a PCA on with prcomp(), but due to the large number of variables I can't plot the result with biplot(). How else can I plot the PCA output? I tried posting this before, but got no responses so I'm trying again. Surely this is a common problem, but I can't find a solution with google? The University of Dundee is a registered Scottish Charity, No: SC015096
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
The University of Dundee is a registered Scottish Charity, No: SC015096
Hi Jessica, THanks for pointing that out. The scaling in biplot() doesn't seem to make sense to me, however. The default value for scale=1 therefore lambda ^ (1-scale) -> lambda ^ 0 which is 1 regardless of what lambda is. Which can't be right? Anyway, I won't worry about it anymore as you and Bryan have confirmed that I am doing the right thing by plotting the $x data and will ignore biplot(). Many thanks, Chris
On 07/05/2012 15:25, "Jessica Streicher" <j.streicher at micromata.de> wrote:
And i always forget the question.. I haven't understood biplots a 100%, but from what i gleaned this scaling is done so it looks better/is easier to read, while the scaling retains certain properties of the biplot (something about projecting). If you want to use the data for anything else, i wouldn't use that scaling, just use what the prcomp() or princomp() function returns to you. Am 07.05.2012 um 16:11 schrieb Jessica Streicher:
Biplot, depending on what parameters you give it, scales the data in a certain way. See http://stat.ethz.ch/R-manual/R-patched/library/stats/html/biplot.princomp .html scale The variables are scaled by lambda ^ scale and the observations are scaled by lambda ^ (1-scale) where lambda are the singular values as computed by princomp. Normally 0 <= scale <= 1, and a warning will be issued if the specified scale is outside this range. Am 07.05.2012 um 16:01 schrieb Christian Cole:
Hi Jessica, Yes, that does help. It confirms my digging around in the prcomp object. I was plotting $x, but wasn't sure whether this was appropriate. Mainly because the data ranges are different in $x than when plotted by biplot() - as I mentioned my reply to Bryan. Do you know if this difference is data range matters? Many thanks, Chris On 07/05/2012 14:24, "Jessica Streicher" <j.streicher at micromata.de> wrote:
That depends on what you want to plot there. Basically, you could just use plot() with pcaResult$x. You might need to define which PCs you want to plot there though. pcaResult<-prcomp(iris[,1:4]) plot(pcaResult$x) # gives the first 2 PCs plot(pcaResult$x[,2:3]) #gives the second vs the 3rd PC or if you want to see more you can use pairs() pairs(pcaResult$x) if you want things colored, theres the col parameter that works for both functions: pairs(pcaResult$x,col=iris[,5]) Does this help? Am 07.05.2012 um 12:22 schrieb Christian Cole:
I have a decent sized matrix (36 x 11,000) that I have preformed a PCA on with prcomp(), but due to the large number of variables I can't plot the result with biplot(). How else can I plot the PCA output? I tried posting this before, but got no responses so I'm trying again. Surely this is a common problem, but I can't find a solution with google? The University of Dundee is a registered Scottish Charity, No: SC015096
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
The University of Dundee is a registered Scottish Charity, No: SC015096
[[alternative HTML version deleted]]
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
The University of Dundee is a registered Scottish Charity, No: SC015096
Hi Bryan,
On 07/05/2012 15:33, "Bryan Hanson" <hanson at depauw.edu> wrote:
I don't know the answer, Jessica gave some insight. I avoid the biplot at all costs, because IMHO it violates one of the tenets of good graphic design: It has two entirely different scales on axes. These are maximally confusing to the end-user. So I never use it.
I couldn't agree more :)
If it is gene expression data, have you looked in Bioconductor for something that will help you? Maybe runPCA in package EMA?
I may do, but you've answered my question and I've got a PCA plot that works. Many thanks, Chris
Bryan On May 7, 2012, at 9:57 AM, Christian Cole wrote:
Hi Bryan, Many thanks for the replies. The data is gene expression data for 36 samples over 11k genes. I see that I can plot PC1 vs PC2 by using $x, but compared to biplot() I can see that the range of values are different. For example, if I use plot() the PC1 scale ranges from -150 to 150 whereas in biplot() it scales from -0.4 to 0.4. Do you know what scaling biplot() uses? Does it even matter? Cheers, Chris On 07/05/2012 14:36, "Bryan Hanson" <hanson at depauw.edu> wrote:
Christian, is that 36 samples x 11K variables? Sounds like it. Is this spectroscopic data? In any case, the scores are in the list element $x as follows: answer <- prcomp(your matrix) answer$x contains the scores, so if you want to plot the 1st 2 pcs, you could do plot(answer$x[,1], answer$x[,2]) Because the columns of answer$x contain the scores of the PCs in order. [I see Jessica just answered...] If you want the loading plot, it's going to be interesting with all those variables, but this will do it: plot(1:11000, answer$rotation[,1], type = "l") # for the loadings of the 1st PC Depending upon what kind of data this is, the 1:11000 could be replaced by something more sensible. If it is spectroscopic data, then replace it with your frequency values. By the way, plot(answer) will give you the scree plot to determine how many PCs are worthy. Good luck. Bryan *********** Bryan Hanson Professor of Chemistry & Biochemistry DePauw University On May 7, 2012, at 6:22 AM, Christian Cole wrote:
I have a decent sized matrix (36 x 11,000) that I have preformed a PCA on with prcomp(), but due to the large number of variables I can't plot the result with biplot(). How else can I plot the PCA output? I tried posting this before, but got no responses so I'm trying again. Surely this is a common problem, but I can't find a solution with google? The University of Dundee is a registered Scottish Charity, No: SC015096
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
The University of Dundee is a registered Scottish Charity, No: SC015096
The University of Dundee is a registered Scottish Charity, No: SC015096
-----Original Message----- I avoid the biplot at all costs, because IMHO it violates one of the tenets of good graphic design: It has two entirely different scales on axes. These are maximally confusing to the end-user. So I never use it.
I think you're being unnecessarily restrictive there. The confusion that arises when using multiple scales in the same graphical dimension arises from a tendency to read distances and locations on the wrong scale. In a biplot, the PC's have essentially no intuitive physical interpretation (by which I mean a 1:1 mapping onto an identifiable variable) so this doesn't matter much even if it happens (in fact you cold probably lose the scales entirely in a biplot without compromising its interpretation much). And the alternative - sticking rigidly to the 'one axis per dimension' rule and to plot them with the _same_ scales - often leads to unreadable plots: invisibly tiny arrows or an invisibly tiny cloud of data points.
But having indicated that I don't see a biplot's multiple scales as particularly likely to confuse or mislead, I'm always interested in alternatives. The interesting question is 'given the same objective - a qualitative indication of which variables have most influenced the location of particular data points (or vice versa) and in which general direction - what do you suggest instead?'
Steve Ellison
*******************************************************************
This email and any attachments are confidential. Any use...{{dropped:8}}
I think the question on your mind should be: 'what do I want to do with this plot'? Just producing output from the PCA is easy - plotting the output$sd is probably quite informative. From the sounds of it, though, you want to do clustering with the PCA component loadings? (Since that's mostly what the biplot accomplishes using the first two PCs.) The first thing to note, then is that you might not want to plot all 36 PCs, then! Once you go higher than the first few, your results will likely become remarkably awful in ways that might not be obvious. A biplot with PCs 1 & 2, or 2 & 3, for example, could be easily sufficient. If you want to still plot many PCs, from an exploratory point of view, something like a parallel coordinates plot might be helpful. Alternatively, you could look at rgl for general plotting of 3d points (so you can do a 3d version of the biplot), or apply more systematic clustering algorithms. Zhou -- View this message in context: http://r.789695.n4.nabble.com/How-to-plot-PCA-output-tp4614732p4617165.html Sent from the R help mailing list archive at Nabble.com.
[...]
But having indicated that I don't see a biplot's multiple scales as particularly likely to confuse or mislead, I'm always interested in alternatives. The interesting question is 'given the same objective - a qualitative indication of which variables have most influenced the location of particular data points (or vice versa) and in which general direction - what do you suggest instead?' Steve Ellison
Steve is probably looking for answers from others, but if the variables are relatively few, I plot the loadings vs variables for each of the first few PCs, using something like a bar plot which can go positive or negative (not the best ink to data ratio however). So PC1 loadings vs variable names, PC2 loadings vs variable names etc.
If there are a lot of variables, I use a dot rather than a bar, or more generally, a line (for instance, spectroscopic data where there are PC1 loadings vs thousands of frequencies).
The magnitude and sign of the loading for each variable gives you a sense of the contribution of that variable to the given PC.
I suspect this is not what Steve had in mind (he no doubt knows these things well already) but I'm also always on the lookout for good displays. Share 'em if you got 'em.
Bryan
i.pca <- prcomp(iris[,1:4])
library("ggplot2")
# plot scores
scores <- as.data.frame(i.pca$x)
qplot(x = PC1, y = PC2, data = scores, geom = "point", col = iris[,5])
# Loadings on PC1 (few variables)
loadings <- as.data.frame(i.pca$rotation)
loadings$var <- colnames(iris[,1:4])
qplot(x = var, y = PC1, data = loadings, geom = "bar")
# Could also use geom = "point" but when there are many variables you may wish to connect the points too.
# Compare to
biplot(i.pca)
And you can see the biplot has some additional information compared to the simple loading plot, but I'd have to dig out exactly what it is and if it is especially useful.