Error in princomp?

1 message · Jari Oksanen

tlumley at u.washington.edu said:
A more intriguing feature in R is that its two alternative PCAs in mva differ 
slightly and report slightly different square roots of eigenvalues:

(here I use my own data set, but the results are similar with any data you have:)
Comp.1    Comp.2    Comp.3    Comp.4    Comp.5 
30.692367 21.094028 11.257890  8.417428  6.811818
[1] 31.352493 21.547715 11.500023  8.598469  6.958325
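The discrepancy can be reproduced with any data set; a sketch using R's built-in USArrests data (the numbers above came from my own data, so they differ):

```r
## square roots of eigenvalues from the two alternative PCA functions
p1 <- princomp(USArrests)   # divides the covariance by n
p2 <- prcomp(USArrests)     # divides by n - 1
p1$sdev
p2$sdev
## the two differ exactly by the factor sqrt((n - 1)/n)
n <- nrow(USArrests)
all.equal(unname(p1$sdev), p2$sdev * sqrt((n - 1)/n))
```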

The reason is, of course, that `prcomp' opts for unbiased variance whereas 
`princomp' works hard to get biased variances, but of course we can turn that 
off:
Comp.1    Comp.2    Comp.3    Comp.4    Comp.5 
31.352493 21.547715 11.500023  8.598469  6.958325 
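One way to turn it off (a sketch: a covariance matrix supplied directly through `covmat` is used as-is, bypassing the `(1 - 1/n.obs)` multiplication in `princomp.default`):

```r
## an unbiased covariance matrix passed in gives the same sdev as prcomp
p <- princomp(covmat = cov(USArrests))
p$sdev
all.equal(unname(p$sdev), prcomp(USArrests)$sdev)
```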

The practice in prcomp seems to be the normal one, so that its
eigenvalues add up to the total variance:
[1] 1825.659
[1] 1749.590
[1] 1825.659
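That is, the prcomp eigenvalues sum to the total (unbiased) variance of the data, while the princomp eigenvalues fall short by the factor (n - 1)/n; for instance:

```r
x <- USArrests
sum(apply(x, 2, var))      # total unbiased variance
sum(prcomp(x)$sdev^2)      # the same
sum(princomp(x)$sdev^2)    # smaller by the factor (n - 1)/n
```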

However, princomp seems to work hard to get biased estimates of covariance:

 cv <- covmat$cov * (1 - 1/n.obs) # in princomp.default
(here covmat$cov is the matrix of unbiased covariances)

whereas it doesn't do the same if the analysis is based on correlations:

    if (cor) { # in princomp.default
        sds <- sqrt(diag(cv))
        cv <- cv/(sds %o% sds)
    }
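So with `cor = TRUE` the `(1 - 1/n.obs)` factor cancels when `cv` is divided by its own diagonal, and the two functions agree on correlation-based results; for example:

```r
## the biasing factor cancels in the rescaling to a correlation matrix,
## so correlation-based sdev agree between princomp and prcomp
c1 <- princomp(USArrests, cor = TRUE)$sdev
c2 <- prcomp(USArrests, scale. = TRUE)$sdev
all.equal(unname(c1), c2)
```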

prcomp, in contrast, is just as explicit in opting for unbiased variances, although
it has to do this after extracting the singular values (s$d), which would otherwise
be related to sums of squares instead of variances:

    s$d <- s$d/sqrt(max(1, nrow(x) - 1))
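Which indeed matches taking the svd of the centred data by hand (a sketch):

```r
x <- scale(USArrests, center = TRUE, scale = FALSE)
s <- svd(x)
## the raw singular values relate to sums of squares;
## dividing by sqrt(n - 1) turns them into standard deviations
all.equal(s$d / sqrt(nrow(x) - 1), prcomp(USArrests)$sdev)
```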

Personally, I'd opt for the prcomp alternative (which is the official 
recommendation anyhow).

Except for sign, the PC solutions are indeterminate for scale (the scores at least;
the loadings are habitually orthonormal, i.e. a rotation matrix). It seems that R
opts for a slight inconsistency, though: the square roots of the eigenvalues are
scaled for variances, but the scores for sums of squares.
[1] 1825.659
[1] 41990.17
[1] 1825.659

The current scaling is just as correct as the alternatives, of course, but not 
concordant with the square roots of the eigenvalues: the eigenvalues are divided by 
the df after extraction, but the scores are not. This can be turned off by dividing 
the scores by sqrt(nrow(x)-1) -- which of course is a constant for any matrix x.
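The inconsistency is easy to see: the squared scores sum to (n - 1) times the sum of the eigenvalues (a sketch with USArrests):

```r
p <- prcomp(USArrests)
sum(p$sdev^2)                         # eigenvalues: divided by df
sum(p$x^2)                            # scores: plain sums of squares
sum(p$x^2) / (nrow(USArrests) - 1)    # dividing by df restores concordance
```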

Peter Vikström's message alone didn't make me study PCA output in R this
closely, but I had just confronted a really weird scaling in another software, and
trying to understand that one I looked at R as well. I am just implementing Cajo
ter Braak's "Redundancy Analysis" for ecological community data in R (a spin-off
from his "Canonical Correspondence Analysis"), but I couldn't get the same
species scores as in his software Canoco, although my results were similar to the
PCAs in R. Canoco behaves as if it used the covariance matrix or an svd of centred
data -- it does neither, but something more obscure, although it looks like that from
the surface. However, the species scores (which are relative to the eigenvalues:
the site scores are orthonormal) are divided by the standard deviation of each
species. So they are the square roots of the ``variance explained by axis'' for each
species, or look like they come from a correlation matrix although the analysis is
based on the covariance matrix. I won't implement that behaviour, but it's 
an interesting alternative.

This message really was a digression...

cheers, jari oksanen