find high correlated variables in a big matrix
Look at varclus() in package Hmisc or package ClustOfVar.

-------------------------------------
David L Carlson
Department of Anthropology
Texas A&M University
College Station, TX 77840-4352

-----Original Message-----
From: R-help [mailto:r-help-bounces at r-project.org] On Behalf Of Lida Zeighami
Sent: Tuesday, May 10, 2016 11:30 AM
To: David Winsemius; clint at ecy.wa.gov
Cc: r-help
Subject: Re: [R] find high correlated variables in a big matrix

Thank you David for your reply, but I still couldn't get my answer. I've already used rcorr, created the correlation matrix, and found the highly correlated variables, but only between two variables at a time; that is, I could find the pairs of variables with high correlation. I couldn't get, for example, 100 variables that are all highly correlated with one another.

Dear Clint,
I think you are right! It's better to say that I'm trying to find clusters of variables according to some distance metric. Would you please let me know how I can solve it?

Thanks

On Fri, May 6, 2016 at 4:32 PM, David Winsemius <dwinsemius at comcast.net> wrote:
On May 6, 2016, at 2:12 PM, Lida Zeighami <lid.zigh at gmail.com> wrote:

Hi there,

Is there any way to find highly correlated variables in a big matrix?

For example, I have a matrix called data (2000 x 5000) and I need to find the highly correlated variables among the columns (I need 100 highly correlated variables out of the 5000 columns). I could calculate the correlation matrix and pick the highly correlated ones, but my problem is that I can only pick pairs of variables with high correlation, and there may be low correlation across the pairs. That is, in my 100 x 100 correlation matrix there are some pairs with low correlation, and I couldn't find 100 variables that all have high correlation with one another.

Would you please let me know if there is any way?

The rcorr function in Hmisc will return a list whose first element is a correlation matrix:

  library(Hmisc)   # provides rcorr()
  base <- rnorm(100)
  # three columns, each equal to base plus a little independent noise
  test <- matrix(base + 0.2*rnorm(300), 100)
  rcorr(test)[[1]]

            [,1]      [,2]      [,3]
  [1,] 1.0000000 0.9631220 0.9721688
  [2,] 0.9631220 1.0000000 0.9666564
  [3,] 0.9721688 0.9666564 1.0000000

You can use which to find the locations meeting a criterion (or two):

  # .Last.value holds the result of the previous top-level call (interactive use)
  mycorr <- .Last.value
  which(mycorr > 0.97 & mycorr != 1, arr.ind=TRUE)

       row col
  [1,]   3   1
  [2,]   1   3

--
David Winsemius
Alameda, CA, USA
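
For the follow-up question in this thread (groups of variables that are all highly correlated with one another, not just pairs), one possible approach is hierarchical clustering of the columns on the distance 1 - |r|. The sketch below uses only base R (cor, hclust, cutree) and is not taken from any of the posts above; the simulated data, the 0.8 threshold, and the names x, groups and biggest are assumptions made for illustration. With complete linkage, cutting the tree at height h gives groups in which every pair of members is at distance h or less, so every pairwise |r| within a group is at least 1 - h.

```r
set.seed(1)
n <- 200

# Hypothetical data: two latent factors each drive a block of 5 columns,
# followed by 5 columns of unrelated noise.
f1 <- rnorm(n)
f2 <- rnorm(n)
x <- cbind(
  sapply(1:5, function(i) f1 + 0.2 * rnorm(n)),
  sapply(1:5, function(i) f2 + 0.2 * rnorm(n)),
  matrix(rnorm(n * 5), n)
)
colnames(x) <- paste0("V", seq_len(ncol(x)))

r <- cor(x)                            # column-by-column correlation matrix
d <- as.dist(1 - abs(r))               # distance: small when |r| is large
hc <- hclust(d, method = "complete")   # complete linkage bounds within-group distance

# Cut at height 0.2: within each group, all pairwise |r| >= 0.8
groups <- cutree(hc, h = 0.2)

# Pick the largest group and check its weakest pairwise correlation
sizes <- table(groups)
biggest <- names(groups)[groups == as.integer(names(sizes)[which.max(sizes)])]
biggest
min(abs(r[biggest, biggest]))
```

To look for a group of a given size (for example 100 variables), one could raise or lower the cut height until the largest group is big enough, at the cost of a weaker guaranteed minimum correlation. For a 2000 x 5000 matrix the correlation matrix is 5000 x 5000, which should fit in memory on most current machines, but that has not been tested here.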