Hi,
I want to simulate a data set with a covariance structure similar to that of my
observed data, and have calculated a covariance matrix (dimensions
8368 x 8368). So far I've tried two approaches to simulating data:
rmvnorm from the mvtnorm package, and the Cholesky
decomposition (http://www.cerebralmastication.com/2010/09/cholesk-post-on-correlated-random-normal-generation/).
The problem is that the resulting covariance structure in my simulated
data is very different from the original supplied covariance matrix.
Let's just look at some of the values:
[,1] [,2] [,3] [,4]
[1,] 0.1873402987 0.001837229 0.0009009272 0.010324521
[2,] 0.0018372286 0.188665853 0.0124216535 -0.001755035
[3,] 0.0009009272 0.012421654 0.1867835412 -0.000142395
[4,] 0.0103245214 -0.001755035 -0.0001423950 0.192883488
So the distribution of the observed covariances is very narrow compared
with that of the simulated data.
None of the eigenvalues of the observed covariance matrix are
negative, and it appears to be a positive definite matrix. Here is
what I did to create the simulated data:
Chol <- chol(CEUcovar)
Z <- matrix(rnorm(20351 * 8368), 8368)
X <- t(Chol) %*% Z
sample8 <- data.frame(as.matrix(t(X)))
dim(sample8)
[1] 20351  8368
cov8 <- cov(sample8, method = "spearman")
[Earlier I also tried sample8 <- rmvnorm(1000,
mean = rep(0, ncol(CEUcovar)), sigma = CEUcovar, method = "eigen"), with
equally 'bad' results: much larger covariance values in the simulated data.]
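The Cholesky recipe itself is sound when the check uses an ordinary (Pearson) covariance. Here is a minimal sketch of the same construction (in Python/NumPy for illustration; the small target matrix is made up, standing in for CEUcovar):

```python
import numpy as np

rng = np.random.default_rng(0)

# Small made-up target covariance, standing in for CEUcovar.
target = np.array([[0.19, 0.01, 0.00],
                   [0.01, 0.19, 0.01],
                   [0.00, 0.01, 0.19]])

L = np.linalg.cholesky(target)        # lower triangular, target = L @ L.T
z = rng.standard_normal((3, 20000))   # independent N(0, 1) draws
x = (L @ z).T                         # rows are correlated observations

sample_cov = np.cov(x, rowvar=False)  # ordinary (Pearson) covariance
print(np.max(np.abs(sample_cov - target)))  # small, shrinking like 1/sqrt(n)
```

Note that NumPy's cholesky() returns the lower-triangular factor, while R's chol() returns the upper one, which is why the R code above needs t(Chol).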
Any ideas WHY the simulated data have such a different covariance?
Any experience with similar issues? I would be happy to supply the
covariance matrix if anyone wants to give it a try.
Any suggestions? Anything apparent that I left out or neglected?
Any advice would be highly appreciated.
Best,
Bo
Sampling error? Do you realize how large a sample size you would
need to estimate an 8000 x 8000 covariance matrix precisely? It probably
exceeds the number of stars in our galaxy...
Numerical issues may also play a role, but I am too ignorant on this
aspect to offer advice.
Finally, this is really not an R question, so you would probably do
better to post on a stats site like stats.stackexchange.com rather
than here.
-- Bert
On Sat, Aug 11, 2012 at 7:17 AM, Boel Brynedal <brynedal at gmail.com> wrote:
Hi, thanks for the reply.
I am not assuming that the supplied covariance matrix in any way
captures the 'true' covariance matrix of the population, but that's not
what I am after either. I just want to simulate data that have a
covariance similar to that covariance matrix. And the numbers are so
hugely different! Could sampling error cause that? Could the
covariance structure be too complicated to simulate?
And yes, I should probably post this on a stat-list, true.
Thanks,
Bo
2012/8/11 Bert Gunter <gunter.berton at gene.com>:
On Sun, Aug 12, 2012 at 6:54 AM, Michael Dewey wrote:
It is, of course, not guaranteed to be the same, as you are only
sampling from the distribution. In your rmvnorm example you draw a
sample of size 1000 from an 8368-variable distribution, so it is
almost sure to be different, although I am surprised by how different.
What happens if you increase the sample size?
cov8 <- cov(sample8, method = "spearman")
There's your problem. I'm surprised that nobody seems to have picked up on this, but Spearman covariances are covariances of the ranks, not of the data. Try method = "pearson".
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Email: pd.mes at cbs.dk Priv: PDalgd at gmail.com
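Peter's point, made concrete: cov(x, method = "spearman") is the ordinary covariance of the ranks of each column, so for n untied observations its diagonal is n*(n+1)/12 (with the sample-covariance denominator), regardless of the data's actual variances. A small illustration (Python/SciPy, invented numbers):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 2000
x = rng.standard_normal(n)
y = 0.5 * x + rng.standard_normal(n)   # true cov(x, y) = 0.5

# What cov(..., method = "spearman") computes: Pearson covariance of the ranks.
rx, ry = stats.rankdata(x), stats.rankdata(y)
rank_cov = np.cov(rx, ry)

data_cov = np.cov(x, y)

print(data_cov[0, 1])   # near the true covariance, 0.5
print(rank_cov[0, 0])   # n*(n+1)/12 = 333,500: a variance of ranks, not of data
```

This is why the rank-based "covariances" come out on a wildly different scale from the data covariances.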
Thanks for these replies.
@Peter - are these methods only suitable for Pearson covariances? That
would definitely explain my issues. Sorry for my ignorance, but I would
highly appreciate an explanation. My original covariance matrix is
calculated using Spearman as well (which is suitable for the data).
@Michael - I am simulating a sample of 20351 observations on 8368
variables, so I do not think that sample size is the issue here.
2012/8/12 peter dalgaard <pdalgd at gmail.com>:
A clarification - yes, calculating the Pearson covariance does give
the expected results. I don't fully understand why yet, but many thanks
for this help!
On Sun, Aug 12, 2012 at 1:46 PM, Boel Brynedal <brynedal at gmail.com> wrote:
I'm not sure that the Spearman correlation is an appropriate estimator
for the covariance matrix of a multivariate normal, which is defined
in terms of the Pearson correlation matrix. (More bluntly: Pearson and
Spearman are different measures, and one won't converge to the other.)
A quick unscientific test:
#################
set.seed(1)
covMat <- matrix(c(1, 0.4875, 0.4875, 1), 2, 2)  # Arbitrary
library(MASS)
library(TTR)  # For fast Pearson cor
n <- 5000
# I'm pretty sure we want empirical = TRUE
rands <- mvrnorm(n, c(0, 0), covMat, empirical = TRUE)
runPearson <- runCor(rands[, 1], rands[, 2], cumulative = TRUE)
# This takes a little while but I'm doing my best to make it fast ;-)
runSpearman <- vapply(seq(10, n), function(n)
  cor(rands[seq_len(n), ], method = "spearman")[2], numeric(1))
plot(runPearson, type = "l")
lines(runSpearman, col = 2)
# Show that we get good/decent convergence
abline(h = covMat[2], col = 3)
##############
That long, stable difference suggests to me that you don't want to use
cor(..., method = "spearman") to estimate a quantity defined in terms of
cor(..., method = "pearson").
I am not sure whether this is a general/fixed bias in the Spearman
estimator or just a function of the covMat I randomly chose.
Prof. Dalgaard and many others on this list must know.
Cheers,
Michael
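For bivariate normal data the limit of the sample Spearman rho has a classical closed form, (6/pi)*arcsin(r/2), which is not r. A quick check at the r = 0.4875 used above (Python/SciPy sketch of the same experiment):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
r = 0.4875                              # same target correlation as covMat above
L = np.linalg.cholesky(np.array([[1.0, r], [r, 1.0]]))
x = (L @ rng.standard_normal((2, 200000))).T

# Sample Spearman rho from a large bivariate normal sample.
rho_s, _ = stats.spearmanr(x[:, 0], x[:, 1])
print(rho_s)   # settles near (6/pi)*arcsin(r/2), about 0.47, not near r = 0.4875
```

With 200,000 draws the Spearman estimate sits visibly short of 0.4875, so the gap in the running-correlation plot is a population difference, not noise.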
On Sun, Aug 12, 2012 at 7:52 PM, R. Michael Weylandt
<michael.weylandt at gmail.com> wrote:
To somewhat answer myself, I restricted my attention to 2x2 random
normals and played around with variations of this script, using
standard normals for the marginals:
######################
library(MASS)
x <- seq(-1, 1, length.out = 1000)
bias <- numeric(length(x))
n <- 100
pb <- txtProgressBar(0, length(x))
for (i in seq_along(x)) {
  rands <- mvrnorm(n, c(0, 0), matrix(c(1, x[i], x[i], 1), 2), empirical = TRUE)
  # Change "s" -> "k" for Kendall's tau below
  bias[i] <- cor(rands, method = "s")[2] - cor(rands, method = "p")[2]
  if (!(i %% 50)) setTxtProgressBar(pb, i)
}
plot(x, bias, type = "l", xlab = "True Correlation",
     ylab = "Est. Bias: Spearman - Pearson")
#######################
# For n <- 100
summary(bias)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-0.1112000 -0.0186000 -0.0007219 -0.0007062 0.0172600 0.1177000
summary(abs(bias))
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.000000 0.008928 0.017900 0.024240 0.035650 0.117700
# For n <- 10000
summary(bias)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-2.642e-02 -1.272e-02 -5.491e-04 -8.178e-05 1.277e-02 2.372e-02
summary(abs(bias))
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.000000 0.006904 0.012750 0.011840 0.016490 0.026420
where the general trend is sinusoidal with zeros at x = -1, 0, 1. The
sinusoid is more definite for large n, but has a smaller amplitude.
I then moved to Kendall's tau, which gives a very different shape
(sort of like a Nike swoosh, but also an odd function). Notably, the
estimated difference is of a different order of magnitude here, even
for a large sample:
# n <- 50
summary(bias)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-0.2966000 -0.1490000 0.0047720 0.0004882 0.1479000 0.3003000
summary(abs(bias))
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.00000 0.08273 0.14850 0.13690 0.19210 0.30030
# n <- 5000
summary(bias)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-2.131e-01 -1.525e-01 0.000e+00 8.379e-05 1.531e-01 2.137e-01
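The shapes described above match the classical closed forms for bivariate normal data, rho_S = (6/pi) arcsin(r/2) and tau = (2/pi) arcsin(r): both gaps are odd functions of r vanishing at r = -1, 0, 1, and their maximum sizes (about 0.018 for Spearman, about 0.211 for Kendall) line up with the empirical summaries. A short check (Python/NumPy):

```python
import numpy as np

r = np.linspace(-1, 1, 2001)   # grid of true Pearson correlations

# Closed-form population values for bivariate normal data:
rho_s = (6 / np.pi) * np.arcsin(r / 2)   # Spearman's rho
tau = (2 / np.pi) * np.arcsin(r)         # Kendall's tau

print(np.max(np.abs(rho_s - r)))   # about 0.018, cf. the Spearman summaries
print(np.max(np.abs(tau - r)))     # about 0.211, cf. the Kendall min of -0.213
```

So the "bias" curves are not sampling artifacts: they are the deterministic gaps between these measures and the Pearson correlation.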
There is also the chance that your sampling code is not correct.
Have you tried it out on, say, 5-dimensional data with increasing
numbers of samples?
Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com
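Bill's sanity check is quick to script. One version (Python/NumPy; the random positive-definite target is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
p = 5

# Arbitrary positive-definite target: A @ A.T plus a small ridge.
A = rng.standard_normal((p, p))
target = A @ A.T + 0.1 * np.eye(p)

L = np.linalg.cholesky(target)
for n in (100, 1000, 10000, 100000):
    x = (L @ rng.standard_normal((p, n))).T     # n draws from N(0, target)
    err = np.max(np.abs(np.cov(x, rowvar=False) - target))
    print(n, float(err))   # worst entry error shrinks roughly like 1/sqrt(n)
```

If the worst-case entry error does not shrink as n grows, the sampling code (not the covariance matrix) is the problem.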
-----Original Message-----
From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf Of Michael Dewey
Sent: Sunday, August 12, 2012 6:54 AM
To: Boel Brynedal; R-help Mailing List
Subject: Re: [R] Problem when creating matrix of values based on covariance matrix