Hi everyone. Based on a dependent variable (y), I'm trying to generate some independent variables with a specified correlation. For this there's no problems. However, I would like that have all my "regressors" to be orthogonal (i.e. no correlation among them. For example, y = x1 + x2 + x3 where the correlation between y x1 = 0.7, x2 = 0.4 and x3 = 0.8. However, x1, x2 and x3 should not be correlated to each other. Anyone can help me? Regards, Phil -- View this message in context: http://r.789695.n4.nabble.com/Generation-of-correlated-variables-tp4475799p4475799.html Sent from the R help mailing list archive at Nabble.com.
Generation of correlated variables
10 messages · Philippe Massicotte, Rui Barradas, Bert Gunter +2 more
Hello,
However, I would like that have all my "regressors" to be orthogonal (i.e. no correlation among them.
?poly poly(cbind(x1, x2, x3), degree=1) Hope this helps, Rui Barradas -- View this message in context: http://r.789695.n4.nabble.com/Generation-of-correlated-variables-tp4475799p4475821.html Sent from the R help mailing list archive at Nabble.com.
Hi there. This is not really working. Here what I have for the moment. library(ecodist) x <- 1:100 y1 <- corgen(x=x, r=.85, epsilon=.01)$y y2 <- corgen(x=x, r=.5, epsilon=.01)$y y3 <- corgen(x=x, r=.25, epsilon=.01)$y a = poly(cbind(y1, y2, y3), degree=1) cor(a[,1], a[,2]) In that case, the correlation between y1 and y2 is the same as a[,1] and a[,2]. Regards, Phil -- View this message in context: http://r.789695.n4.nabble.com/Generation-of-correlated-variables-tp4475799p4475891.html Sent from the R help mailing list archive at Nabble.com.
On Thu, Mar 15, 2012 at 10:48:48AM -0700, Filoche wrote:
Hi everyone. Based on a dependent variable (y), I'm trying to generate some independent variables with a specified correlation. For this there's no problems. However, I would like that have all my "regressors" to be orthogonal (i.e. no correlation among them. For example, y = x1 + x2 + x3 where the correlation between y x1 = 0.7, x2 = 0.4 and x3 = 0.8. However, x1, x2 and x3 should not be correlated to each other.
Hi. If the following computation is correct, then there is no solution for the required correlations, but there is one, if the vector of the required correlations is normalized to have sum of squares 1. Assume, variables x1, x2, x3 have mean zero and denote s1^2 = var(x1), s2^2 = var(x2), s3^2 = var(x3) and assume zero correlations among x1, x2, x3, so also zero covariances. Then var(y) = s1^2 + s2^2 + s3^2 E y x1 = E x1^2 + E x1 x2 + E x1 x3 = E x1^2 = s1^2 and similarly E y x2 = var(x2) = s2^2 E y x3 = var(x3) = s3^2 So, the correlation cor(y, x1) is s1^2/s1/sqrt(s1^2 + s2^2 + s3^2) = s1/sqrt(s1^2 + s2^2 + s3^2) Expressing all the correlations in this way, we get cor(y, x1) = s1/sqrt(s1^2 + s2^2 + s3^2) cor(y, x2) = s2/sqrt(s1^2 + s2^2 + s3^2) cor(y, x3) = s3/sqrt(s1^2 + s2^2 + s3^2) Clearly, we have cor(y, x1)^2 + cor(y, x2)^2 + cor(y, x3)^2 = 1. For your numbers, we get r <- c(0.7, 0.4, 0.8) sum(r^2) # [1] 1.29 So, for these numbers, the conditions are contradictory. However, a solution may be found for the vector of correlations r/sqrt(1.29) [1] 0.6163156 0.3521804 0.7043607 which are the original correlations normalized to have sum of squares 1. In this case, independent normal variables with the standard deviations (s1, s2, s3) == r/sqrt(1.29) will satisfy your conditions. I hope that other members of the list correct me, if i overlooked something. Petr Savicky.
Hello, again.
This is not really working. Here what I have for the moment.
Right, I've read only half of the description line of function 'poly'. The
other half states that
"These are all orthogonal to the constant polynomial of degree 0."
But not pairwise orthogonal.
You can look for packages implementing Gram-Schmidt orthogonalization
library(sos)
r1 <- findFn('Gram')
r2 <- findFn('Schmidt')
r1 & r2
or use a (time consuming) trick: form a matrix with the first column as x1,
and the others as the
residuals, which are orthogonal, of the regressions of that column with all
the previous ones.
Example:
orthogonal <- function(x){
x <- cbind(1, x)
nc <- ncol(x)
res <- matrix(NA, nrow(x), nc - 1)
res[, 1] <- x[, 2]
if(nc > 2){
for(j in 3:nc)
res[, j - 1] <- lm.fit(x[, 1:(j - 1)], x[, j])$residuals
}
res
}
set.seed(1)
X <- matrix(1:10, ncol=1)
X <- cbind(X, X + rnorm(10), X + rnorm(10))
cor(X)
[,1] [,2] [,3]
[1,] 1.0000000 0.9726364 0.9485793
[2,] 0.9726364 1.0000000 0.8917848
[3,] 0.9485793 0.8917848 1.0000000
Y <- orthogonal(X)
cor(Y)
[,1] [,2] [,3]
[1,] 1.000000e+00 2.804232e-17 -3.758236e-17
[2,] 2.804232e-17 1.000000e+00 -1.492894e-17
[3,] -3.758236e-17 -1.492894e-17 1.000000e+00
This works but I would use a Gram-Schmidt algorithm.
Rui Barradas
--
View this message in context: http://r.789695.n4.nabble.com/Generation-of-correlated-variables-tp4475799p4476257.html
Sent from the R help mailing list archive at Nabble.com.
Thank you everyone for your precious advice. I'll take time to look at it and try to make it work. Regards, Phil -- View this message in context: http://r.789695.n4.nabble.com/Generation-of-correlated-variables-tp4475799p4476346.html Sent from the R help mailing list archive at Nabble.com.
Rui: Try reading it again. Orthogonal polynomials are generated (subject to the caveats regarding machine precision stated therein). Note, especially, the "raw" argument. Cheers, Bert
On Thu, Mar 15, 2012 at 1:06 PM, Rui Barradas <rui1174 at sapo.pt> wrote:
Hello, again.
This is not really working. Here what I have for the moment.
Right, I've read only half of the description line of function 'poly'. The
other half states that
"These are all orthogonal to the constant polynomial of degree 0."
But not pairwise orthogonal.
You can look for packages implementing Gram-Schmidt orthogonalization
library(sos)
r1 <- findFn('Gram')
r2 <- findFn('Schmidt')
r1 & r2
or use a (time consuming) trick: form a matrix with the first column as x1,
and the others as the
residuals, which are orthogonal, of the regressions of that column with all
the previous ones.
Example:
orthogonal <- function(x){
? ? ? ?x <- cbind(1, x)
? ? ? ?nc <- ncol(x)
? ? ? ?res <- matrix(NA, nrow(x), nc - 1)
? ? ? ?res[, 1] <- x[, 2]
? ? ? ?if(nc > 2){
? ? ? ? ? ? ? ?for(j in 3:nc)
? ? ? ? ? ? ? ? ? ? ? ?res[, j - 1] <- lm.fit(x[, 1:(j - 1)], x[, j])$residuals
? ? ? ?}
? ? ? ?res
}
set.seed(1)
X <- matrix(1:10, ncol=1)
X <- cbind(X, X + rnorm(10), X + rnorm(10))
cor(X)
? ? ? ? ?[,1] ? ? ?[,2] ? ? ?[,3]
[1,] 1.0000000 0.9726364 0.9485793
[2,] 0.9726364 1.0000000 0.8917848
[3,] 0.9485793 0.8917848 1.0000000
Y <- orthogonal(X)
cor(Y)
? ? ? ? ? ? ?[,1] ? ? ? ? ?[,2] ? ? ? ? ?[,3]
[1,] ?1.000000e+00 ?2.804232e-17 -3.758236e-17
[2,] ?2.804232e-17 ?1.000000e+00 -1.492894e-17
[3,] -3.758236e-17 -1.492894e-17 ?1.000000e+00
This works but I would use a Gram-Schmidt algorithm.
Rui Barradas
--
View this message in context: http://r.789695.n4.nabble.com/Generation-of-correlated-variables-tp4475799p4476257.html
Sent from the R help mailing list archive at Nabble.com.
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Bert Gunter Genentech Nonclinical Biostatistics Internal Contact Info: Phone: 467-7374 Website: http://pharmadevelopment.roche.com/index/pdb/pdb-functional-groups/pdb-biostatistics/pdb-ncb-home.htm
On 15-Mar-2012 Filoche wrote:
Hi everyone. Based on a dependent variable (y), I'm trying to generate some independent variables with a specified correlation. For this there's no problems. However, I would like that have all my "regressors" to be orthogonal (i.e. no correlation among them). For example, y = x1 + x2 + x3 where the correlation between y x1 = 0.7, x2 = 0.4 and x3 = 0.8. However, x1, x2 and x3 should not be correlated to each other. Anyone can help me? Regards, Phil
Your fundamental problem here (with the correlations you specify)
is the following.
Your desired correlation matrix can be constructed by
C <- cbind( c(1.0,0.7,0.4,0.8),c(0.7,1.0,0.0,0.0),
c(0.4,0.0,1.0,0.0),c(0.8,0.0,0.0,1.0) )
rownames(C) <- c("y","x1","x2","x3")
colnames(C) <- c("y","x1","x2","x3")
C
# y x1 x2 x3
# y 1.0 0.7 0.4 0.8
# x1 0.7 1.0 0.0 0.0
# x2 0.4 0.0 1.0 0.0
# x3 0.8 0.0 0.0 1.0
And now:
det(C)
# [1] -0.29
and it is impossible for the determinant of a correlation
matrix to have a negative determinant: a correlation matyrix
must be positive-semidefinite, and therefore have a non-negative
determinant.
An alternative check is to look at the eigen-structure of C:
eigen(C)
# $values
# [1] 2.1357817 1.0000000 1.0000000 -0.1357817
#
# $vectors
# [,1] [,2] [,3] [,4]
# [1,] 0.7071068 0.000000e+00 0.0000000 0.7071068
# [2,] 0.4358010 -1.172802e-16 0.7874992 -0.4358010
# [3,] 0.2490291 -8.944272e-01 -0.2756247 -0.2490291
# [4,] 0.4980582 4.472136e-01 -0.5512495 -0.4980582
so one of the eigenvalues (-0.1357817) is negative, again
impossible for a correlation matrix.
The suggestions made by the other respondents amount to
messing with the correlations so as to achieve a possible
situation; but then this does not achieve you goal (which
is impossible) but a different one, whose relationship
with your question may or may not suit what you hope to
achieve.
Ted.
-------------------------------------------------
E-Mail: (Ted Harding) <Ted.Harding at wlandres.net>
Date: 15-Mar-2012 Time: 23:23:24
This message was sent by XFMail
On Thu, Mar 15, 2012 at 11:23:28PM -0000, Ted Harding wrote:
On 15-Mar-2012 Filoche wrote:
Hi everyone. Based on a dependent variable (y), I'm trying to generate some independent variables with a specified correlation. For this there's no problems. However, I would like that have all my "regressors" to be orthogonal (i.e. no correlation among them). For example, y = x1 + x2 + x3 where the correlation between y x1 = 0.7, x2 = 0.4 and x3 = 0.8. However, x1, x2 and x3 should not be correlated to each other. Anyone can help me? Regards, Phil
Your fundamental problem here (with the correlations you specify)
is the following.
Your desired correlation matrix can be constructed by
C <- cbind( c(1.0,0.7,0.4,0.8),c(0.7,1.0,0.0,0.0),
c(0.4,0.0,1.0,0.0),c(0.8,0.0,0.0,1.0) )
rownames(C) <- c("y","x1","x2","x3")
colnames(C) <- c("y","x1","x2","x3")
C
# y x1 x2 x3
# y 1.0 0.7 0.4 0.8
# x1 0.7 1.0 0.0 0.0
# x2 0.4 0.0 1.0 0.0
# x3 0.8 0.0 0.0 1.0
And now:
det(C)
# [1] -0.29
and it is impossible for the determinant of a correlation
matrix to have a negative determinant: a correlation matyrix
must be positive-semidefinite, and therefore have a non-negative
determinant.
An alternative check is to look at the eigen-structure of C:
eigen(C)
# $values
# [1] 2.1357817 1.0000000 1.0000000 -0.1357817
#
# $vectors
# [,1] [,2] [,3] [,4]
# [1,] 0.7071068 0.000000e+00 0.0000000 0.7071068
# [2,] 0.4358010 -1.172802e-16 0.7874992 -0.4358010
# [3,] 0.2490291 -8.944272e-01 -0.2756247 -0.2490291
# [4,] 0.4980582 4.472136e-01 -0.5512495 -0.4980582
so one of the eigenvalues (-0.1357817) is negative, again
impossible for a correlation matrix.
Thank you for this analysis. For general correlations,
say, s1, s2, s3, the matrix is
y x1 x2 x3
y 1 s1 s2 s3
x1 s1 1 0 0
x2 s2 0 1 0
x3 s3 0 0 1
and its determinant is 1 - s1^2 - s2^2 - s3^2. Since there
was also a requirement that y = x1 + x2 + x3, the correlation
matrix should be singular. Hence, the required correlation
structure implies s1^2 + s2^2 + s3^2 = 1.
If this condition is satisfied, then a multivariate
distribution obtained by multiplying a vector from
three-dimensional N(0, I) by the matrix
(s1 s2 s3)
(s1 0 0)
( 0 s2 0)
( 0 0 s3)
has the required correlation structure.
However, this is still not a solution of the original question,
since the original requirement was to find x1, x2, x3, when y is
given. I do not know, whether a solution for an arbitrary y exists,
even if the above condition on the correlations is satisfied.
Petr Savicky.
On Thu, Mar 15, 2012 at 10:48:48AM -0700, Filoche wrote:
Hi everyone. Based on a dependent variable (y), I'm trying to generate some independent variables with a specified correlation. For this there's no problems. However, I would like that have all my "regressors" to be orthogonal (i.e. no correlation among them. For example, y = x1 + x2 + x3 where the correlation between y x1 = 0.7, x2 = 0.4 and x3 = 0.8. However, x1, x2 and x3 should not be correlated to each other.
Hi.
The previous discussion shows that the required correlations
should have sum of squares 1. It is not clear, whether your
expected application allows to normalize the correlations to
satisfy this condition. If this normalization is possible, then
try the following approach.
It creates a matrix trans, which satisfies the following.
If Z is a matrix whose columns have zero mean and the identity covariance
matrix, then Z %*% trans has the required correlation structure.
This may be checked by looking at cov2cor(t(trans) %*% trans).
Moreover, trans should have the first column c(1, 0, 0) so that
the first column of Z is not changed by the transformation.
# normalize the correlations (see the discussion above)
rho <- c(0.7, 0.4, 0.8)
rho <- rho/sqrt(sum(rho^2))
(required <- cbind(c(1, rho), rbind(c(rho), diag(3))))
trans <- cbind(
y=rho,
x1=c(rho[1], 0, 0),
x2=c(0, rho[2], 0),
x3=c(0, 0, rho[3]))
# check that trans generates the required correlations
max(abs(cov2cor(t(trans) %*% trans) - required)) # [1] 1.110223e-16
# get orthonormal columns
gram.schmidt <- function(a)
{
n <- ncol(a)
for (i in seq.int(length=n)) {
for (j in seq.int(length=i-1)) {
a[, i] <- a[, i] - sum(a[, j]*a[, i])*a[, j]
}
a[, i] <- a[, i]/sqrt(sum(a[, i]^2))
}
a
}
# multiply trans by a unitary matrix, so that the first column becomes (1, 0, 0)
U <- gram.schmidt(trans[, 1:3])
trans <- t(U) %*% trans
# check that trans still generates the required correlations
max(abs(cov2cor(t(trans) %*% trans) - required)) # [1] 1.110223e-16
# choose an example of the real vector y and normalize
y <- rnorm(100)
ynorm <- y - mean(y)
ynorm <- ynorm/sqrt(sum(ynorm^2))
# find two more normalized vectors orthogonal to ynorm
u <- gram.schmidt(cbind(1, ynorm, rnorm(length(y)), rnorm(length(y))))[, 3:4]
Z <- cbind(ynorm, u)
# get the solution as suggested at the beginning
S <- Z %*% trans
# check that S is a solution for the normalized y (ynorm)
max(abs(cor(S) - required)) # [1] 1.110223e-16
max(abs(S[, 1] - ynorm)) # [1] 8.326673e-17
max(abs(S[, 2] + S[, 3] + S[, 4] - ynorm)) # [1] 1.110223e-16
This solution contains "ynorm". In order to get "y" as required, a
linear transformation, which is inverse to the normalization, should
be done. Since it is linear, it does not change the correlations.
Hope this helps.
Petr Savicky.