
bootstrapped eigenvector method following prcomp

5 messages · Axel Strauß, Gavin Simpson, Stas Kolenikov

#
G'Day R users!

Following an ordination using prcomp, I'd like to test which variables 
significantly contribute to a principal component. There is a method 
suggested by Peres-Neto et al. 2003, Ecology 84:2347-2363, called the 
"bootstrapped eigenvector" method. Jérôme Lemaître asked about it in 
this forum in January 2005:
"1) Resample 1000 times with replacement entire rows from the original 
data sets []
2) Conduct a PCA on each bootstrapped sample
3) To prevent axis reflection and/or axis reordering in the bootstrap, 
here are two more steps for each bootstrapped sample
3a) calculate the correlation matrix between the PCA scores of the 
original sample and those of the bootstrapped sample
3b) Examine whether the highest absolute correlation is between the 
corresponding axes of the original and bootstrapped samples. When it is 
not the case, reorder the eigenvectors. This means that if the highest 
correlation is between the first original axis and the second 
bootstrapped axis, the loadings of the second bootstrapped axis are 
used to estimate the confidence interval for the original first PC axis.
4) Determine the p value for each loading, obtained as follows: the 
number of loadings >= 0 for loadings that were positive in the original 
matrix divided by the number of bootstrap samples (1000), and/or the 
number of loadings <= 0 for loadings that were negative in the original 
matrix divided by the number of bootstrap samples (1000)."

(see https://stat.ethz.ch/pipermail/r-help/2005-January/065139.html ).

The suggested solution (by Jari Oksanen) was


bootloadings <- function (x, permutations = 1000, ...)
{
    pcnull <- princomp(x, ...)
    res <- pcnull$loadings
    out <- matrix(0, nrow = nrow(res), ncol = ncol(res))
    N <- nrow(x)
    for (i in 1:permutations) {
        ## steps 1-2: PCA on a bootstrap sample of the rows
        pc <- princomp(x[sample(N, replace = TRUE), ], ...)
        ## project the original data onto the bootstrapped axes
        pred <- predict(pc, newdata = x)
        ## step 3: match each original axis to the most strongly
        ## correlated bootstrapped axis (reordering) and align the
        ## signs (reflection)
        r <- cor(pcnull$scores, pred)
        k <- apply(abs(r), 2, which.max)
        reve <- sign(diag(r[k, ]))
        sol <- pc$loadings[, k]
        sol <- sweep(sol, 2, reve, "*")
        ## step 4: count bootstrapped loadings whose sign disagrees
        ## with the original loading
        out <- out + ifelse(res > 0, sol <= 0, sol >= 0)
    }
    out / permutations
}
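If the function above is assigned to a name (say `bootloadings`, an 
assumed name), it can be run on any numeric data matrix, e.g. the 
built-in USArrests data:

```r
## assumes Jari's function above has been assigned to `bootloadings`
p <- bootloadings(scale(USArrests), permutations = 500)
round(p, 2)   # rows = variables, columns = components
```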

However, in a post from March 2005 ( http://r-help.com/msg/6125.html ) 
Jari himself mentioned that there is a bug in this method.

I was wondering whether someone could tell me where the bug is, or 
whether there is a better method in R to test for the significance of 
loadings (not the significance of the PCs).
Maybe it is not a good idea to do this at all, but I would prefer to 
have some guideline for interpretation rather than making decisions 
arbitrarily. I tried to look everywhere before posting here.

I would be very thankful for any help,

Axel
#
Axel Strauß wrote:
Sorry, correction. I meant "using princomp".

  
    
#
Hi Axel,

I presume you've asked Jari what the bug is? He obviously knows, and
you are asking a lot for the rest of us to i) know the method you speak
of and ii) debug another person's code.

As an alternative, try function testdim() in the ade4 package. This
implements the method of Dray:

Dray, S. (2008) On the number of principal components: A test of
dimensionality based on measurements of similarity between matrices.
Computational Statistics and Data Analysis, 52:2228-2237.

G
On Mon, 2009-01-19 at 09:53 +0100, Axel Strauß wrote:
#
I don't know if there are bugs in the code, but step 4) does not
compute significance... at least not the way statisticians know it. The
fractions above or below 0 are not significance levels. I don't even
know what to call those... probably the cdf of the bootstrap
distribution evaluated at zero.

Let's put the flies and the sausages separately, as a Russian saying
goes. If you bootstrap from your original data, you can get the
confidence intervals, but not the test statistics. What you can do
with your bootstrap from the raw data is accumulate the distribution
of the eigenvectors, along the lines of (assuming you are testing for
the significance of variables in your first component):

bootci <- function (x, permutations = 1000, ...)
{
    pcnull <- princomp(x, ...)
    res <- pcnull$loadings[, 1]
    bsresults <- matrix(NA_real_, nrow = permutations, ncol = NROW(res))
    N <- nrow(x)
    for (i in 1:permutations) {
        pc <- princomp(x[sample(N, replace = TRUE), ], ...)
        pred <- predict(pc, newdata = x)
        ## match and sign-align the bootstrapped axes, as above
        r <- cor(pcnull$scores, pred)
        k <- apply(abs(r), 2, which.max)
        reve <- sign(diag(r[k, ]))
        sol <- pc$loadings[, k]
        sol <- sweep(sol, 2, reve, "*")
        ## store the aligned loadings of the first component
        bsresults[i, ] <- sol[, 1]
    }
    ## marginal 90% bootstrap intervals, one column per variable
    apply(bsresults, 2, quantile, probs = c(0.05, 0.95))
}

if I am not messing up the dimensions and other stuff too much.
However, as a result you will get an approximately degenerate
distribution sitting on the unit sphere, since the eigenvectors are
always normalized to have length 1. You can still compute marginal
confidence intervals with that, though, and see whether 0 is covered.
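The coverage check on such intervals could look like this (a minimal
sketch: `bs` stands in for the permutations-by-variables matrix of
aligned first-axis loadings, and is filled with fake illustrative
numbers here, not real output):

```r
## fake stand-in for the bootstrap loadings matrix
## (rows = replicates, columns = variables)
set.seed(1)
bs <- matrix(rnorm(1000 * 4, mean = c(0.6, 0.5, -0.5, 0.02), sd = 0.1),
             nrow = 1000, ncol = 4, byrow = TRUE)
## marginal 90% intervals, then flag loadings whose interval excludes 0
ci <- apply(bs, 2, quantile, probs = c(0.05, 0.95))
excludes.zero <- ci[1, ] > 0 | ci[2, ] < 0
```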

The main problem here is I am not entirely sure the bootstrap is
applicable for the problem at hand. In other words, it is not clear
whether the bootstrap estimates are consistent for the true
variability. Eigenproblems are quite difficult and prone to
non-regularities (the above-mentioned degeneracy is just one of them,
and probably not the worst one). There are different asymptotics
(Anderson's of fixed p and n \to \infty,
http://www.citeulike.org/user/ctacmo/article/3908837, and Johnstone
joint p and n \to \infty with p/n \to const,
http://www.citeulike.org/user/ctacmo/article/3908846), the
distributions are often non-standard, while the bootstrap works well
when the distributions of the estimates are normal. When you have
something weird, bootstrap may easily break down, and in a lot of
other situations, you need to come up with special schemes. See my
oh-so-favorite paper on the bootstrap,
http://www.citeulike.org/user/ctacmo/article/575126.

One of those special schemes (back to our muttons, or rather flies and
sausages) -- to set up the bootstrap for hypothesis testing and get
the p-values, you need to bootstrap from the distribution that
corresponds to the null. Beran and Srivastava (1985 Annals,
http://www.citeulike.org/user/ctacmo/article/3015345) discuss how to
rotate your data to conform to the null hypothesis of interest for
inference with covariance matrices and their functions (such as
eigenvalues, for instance). Whether you need to go into all this
trouble, I don't really know.
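For the record, the core rotation idea can be sketched as follows (a
bare-bones illustration of transforming data to exactly satisfy a null
covariance, not Beran and Srivastava's full procedure; `Sigma0` is
whatever positive definite matrix your null hypothesis specifies):

```r
## transform x so its sample covariance equals Sigma0, then bootstrap
## from the transformed data to get a null distribution
rotate.to.null <- function (x, Sigma0) {
    xc <- scale(x, center = TRUE, scale = FALSE)
    S <- cov(xc)
    ## symmetric matrix square root (or inverse root) via eigen()
    msqrt <- function (A, inv = FALSE) {
        e <- eigen(A, symmetric = TRUE)
        d <- if (inv) 1 / sqrt(e$values) else sqrt(e$values)
        e$vectors %*% diag(d) %*% t(e$vectors)
    }
    ## cov of the result is Sigma0^(1/2) S^(-1/2) S S^(-1/2) Sigma0^(1/2)
    ## = Sigma0
    xc %*% msqrt(S, inv = TRUE) %*% msqrt(Sigma0)
}
```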

If you have an inferential problem of testing whether a particular
variable contributes to the overall index, and have a pretty good idea
of where each variable goes, maybe you need to shift your paradigm
and look at confirmatory factor analysis models instead, estimable in
R with John Fox's sem package.
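For what it's worth, a one-factor confirmatory model in sem has roughly
this shape (syntax from memory, so treat it as a sketch and check
?specifyModel; `dat`, x1..x4 and the lam*/e* parameter names are all
placeholders):

```r
library(sem)
## hypothetical one-factor model for four observed variables x1..x4;
## factor variance fixed at 1 to identify the model
model <- specifyModel(text = "
  F -> x1, lam1, NA
  F -> x2, lam2, NA
  F -> x3, lam3, NA
  F -> x4, lam4, NA
  x1 <-> x1, e1, NA
  x2 <-> x2, e2, NA
  x3 <-> x3, e3, NA
  x4 <-> x4, e4, NA
  F <-> F, NA, 1
")
fit <- sem(model, S = cov(dat), N = nrow(dat))  # `dat`: your data frame
summary(fit)  # z tests on the loadings lam1..lam4
```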
On 1/19/09, Axel Strauß <a.strauss at tu-bs.de> wrote:

  
    
#
@ Stas

Thanks for the extensive answer! I squeezed my data into your function 
but still need to mull over it and your comments for some time.

Спасибо,
Axel




Stas Kolenikov wrote: