
Correlate

9 messages · Val, John Fox, Bert Gunter

#
Dear Val,
On 2022-08-22 1:33 p.m., Val wrote:
This seems backwards to me, but I'll refrain from commenting further on 
whether what you want to do makes sense and instead address how to do it 
(not, BTW, because I disagree with Bert's and Tim's remarks).

Please see below:
Because of missing data, this computes the correlations on different 
subsets of the data. A simple solution is to filter the data for NAs:

D <- na.omit(dat)

More comments below:
This data manipulation is unnecessary. Just specify the argument 
na.strings="." to read.table().
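As a sketch of that suggestion (the file contents below are illustrative only), writing a small file that codes missing values as "." and reading it back:

```r
# A tiny file that codes missing values as "." (illustrative data)
tf <- tempfile(fileext = ".txt")
writeLines(c("x1 x2", "1.2 .", ". 3.4"), tf)

# na.strings="." turns the "." entries into NA at read time,
# so the columns come in as numeric rather than character
dat <- read.table(tf, header = TRUE, na.strings = ".")
```

This avoids any post-hoc recoding of "." entries after the data are read.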
Taking a somewhat different approach from cor.test(), you can apply 
Fisher's z-transformation (recall that D is the data filtered for NAs):

 > 2*pnorm(abs(atanh(data_cor)), sd=1/sqrt(nrow(D) - 3), lower.tail=FALSE)
         [,1]
x2 0.2462807
x3 0.3812854
x4 0.1156939
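For readers following along, the z-transform p-value can be sanity-checked against cor.test() on simulated complete data (a sketch; the seed and simulated data are illustrative, and the two p-values agree only approximately, since cor.test() uses an exact t reference distribution):

```r
set.seed(123)          # illustrative simulated data
x <- rnorm(50)
y <- x + rnorm(50)
n <- length(x)
r <- cor(x, y)

# Fisher's z: atanh(r) is approximately normal with sd 1/sqrt(n - 3)
p_z <- 2 * pnorm(abs(atanh(r)), sd = 1 / sqrt(n - 3), lower.tail = FALSE)

# Exact t-based p-value for comparison
p_t <- cor.test(x, y)$p.value
```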

I hope this helps,
  John
3 days later
Val
#
Hi John and Timothy

Thank you for your suggestion and help. Using the sample data, I
carried out a test run and found a difference in the correlation results.

Option 1.
data_cor <- cor(dat[ , colnames(dat) != "x1"],  # Calculate correlations
                    dat$x1, method = "pearson", use = "complete.obs")
which resulted in:
                 [,1]
    x2 -0.5845835
    x3 -0.4664220
    x4  0.7202837

Option 2.
 for(i in colnames(dat)){
      print(cor.test(dat[,i], dat$x1, method = "pearson", use =
"complete.obs")$estimate)
    }
           [,1]
x2  -0.7362030
x3  -0.04935132
x4   0.85766290

This was cross-checked using Excel and other software, and everything
matches option 2.
One of the factors that contributed to this difference is the loss of
information from casewise deletion: if x2 has a missing value but x3 and
x4 don't, the entire row is removed, including the x3 and x4 values.
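The discrepancy Val describes can be reproduced on a toy data set (values are illustrative): when the first argument to cor() is a matrix or data frame, use="complete.obs" deletes every row with an NA in any column, while separate two-column calls keep each pair's own complete rows:

```r
dat <- data.frame(x1 = c(1, 2, 3, 4, 5),
                  x2 = c(2, NA, 4, 5, 7),
                  x3 = c(5, 4, NA, 2, 1))

# Casewise deletion: rows 2 and 3 are dropped for *every* correlation
r_all <- cor(dat[, c("x2", "x3")], dat$x1, use = "complete.obs")

# Pairwise: the x2-x1 correlation keeps row 3, which has no NA in x2 or x1
r_x2 <- cor(dat$x2, dat$x1, use = "complete.obs")
```

The two x2-x1 correlations differ, which is exactly the option 1 vs. option 2 discrepancy.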

My question: is there a way to extract the number of rows (N) used in
the correlation analysis?
Thank you,
On Mon, Aug 22, 2022 at 1:00 PM John Fox <jfox at mcmaster.ca> wrote:
#
Dear Val,
On 2022-08-26 10:41 a.m., Val wrote:
Yes, I already explained that in my previous message.

As well, cor() is capable of computing pairwise-complete correlations -- 
see ?cor.

There's not an obvious right answer here, however. Using 
pairwise-complete correlations can produce inconsistent (i.e., 
non-positive semi-definite) correlation matrices because correlations 
are computed on different subsets of the data.

There are much better ways to deal with missing data.
I'm sure that there are many ways, but here is one that is very 
simple-minded and should be reasonably efficient for ~250 variables:

 > (nc <- ncol(dat))
[1] 4

 > R <- N <- matrix(NA, nc, nc)
 > diag(R) <- 1
 > for (i in 1:(nc - 1)){
+   for (j in (i + 1):nc){
+     R[i, j] <- R[j, i] <- cor(dat[, i], dat[, j], use="complete.obs")
+     N[i, j] <- N[j, i] <- nrow(na.omit(dat[, c(i, j)]))
+   }
+ }

 > round(R, 3)
        [,1]   [,2]   [,3]   [,4]
[1,]  1.000 -0.736 -0.049  0.858
[2,] -0.736  1.000  0.458 -0.428
[3,] -0.049  0.458  1.000  0.092
[4,]  0.858 -0.428  0.092  1.000

 > N
      [,1] [,2] [,3] [,4]
[1,]   NA    8    8    8
[2,]    8   NA    8    8
[3,]    8    8   NA    8
[4,]    8    8    8   NA

 > round(cor(dat, use="pairwise.complete.obs"), 3) # check
        x1     x2     x3     x4
x1  1.000 -0.736 -0.049  0.858
x2 -0.736  1.000  0.458 -0.428
x3 -0.049  0.458  1.000  0.092
x4  0.858 -0.428  0.092  1.000
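The pairwise counts in N (including the diagonal) can also be computed in one step, without loops, by a cross-product of the non-missingness indicator matrix (a sketch on toy data; values are illustrative):

```r
dat <- data.frame(x1 = c(1, NA, 3, 4),
                  x2 = c(1, 2, NA, 4),
                  x3 = c(NA, 2, 3, 4))

# For each pair (i, j), crossprod counts the rows where both columns are
# observed; the diagonal gives each column's own non-missing count
N <- crossprod(!is.na(dat))
```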

More generally, I think that it's a good idea to learn a little bit 
about R programming if you intend to use R in your work. You'll then be 
able to solve problems like this yourself.

I hope this helps,
  John
Val
#
Thank you John for your help and advice.
On Fri, Aug 26, 2022 at 11:04 AM John Fox <jfox at mcmaster.ca> wrote:
#
Dear Val,
On 2022-08-26 4:06 p.m., Val wrote:
You're welcome, and it occurred to me that if you want the number of 
non-missing cases for each individual variable, you could add

	diag(N) <- colSums(!is.na(dat))

to the code I sent earlier.
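A minimal check of that line on toy data (values illustrative; N here is just a placeholder matrix):

```r
dat <- data.frame(x1 = c(1, NA, 3), x2 = c(1, 2, NA), x3 = c(NA, NA, 3))
N <- matrix(NA, 3, 3)
diag(N) <- colSums(!is.na(dat))  # per-variable non-missing counts
```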

Best,
  John
#
And just for fun, here is a more "functional" version without the
explicit 'for()' loops that gives the lower triangular values of the
counts of nonmissing values for each correlation:

## (using John's notation)
w <- expand.grid(1:nc,1:nc)
f <- \(x) ifelse(x[1] <= x[2], NA, nrow(na.omit(dat[,x])))
ans <- matrix(apply(w,1,f), ncol = nc)
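Run on a toy data set (illustrative values), the apply() version fills the lower triangle with the same pairwise counts as the loop version (note the \(x) lambda needs R >= 4.1):

```r
dat <- data.frame(x1 = c(1, NA, 3, 4),
                  x2 = c(1, 2, NA, 4),
                  x3 = c(NA, 2, 3, 4))
nc <- ncol(dat)

w <- expand.grid(1:nc, 1:nc)               # all (i, j) index pairs
f <- \(x) ifelse(x[1] <= x[2], NA,         # skip diagonal and upper triangle
                 nrow(na.omit(dat[, x])))  # pairwise complete-row count
ans <- matrix(apply(w, 1, f), ncol = nc)
```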

Note: This is just a matter of taste. I would be amazed if it's any
faster than using the explicit loops (they are implicit here, but
still there -- and it might be slower!). But maybe it reinforces
John's comment about the value of spending some (more) effort to learn
how to program in R. It gives you the flexibility to program however
you like, which usually reduces bugs in one's code, I believe.

Cheers,
Bert
On Fri, Aug 26, 2022 at 9:04 AM John Fox <jfox at mcmaster.ca> wrote:
Val
#
Thank you again, John. I got my result in the form shown below:
      Variable     Corr   Pvalue     Nobs
            x1    1.000        0   165425
On Fri, Aug 26, 2022 at 3:59 PM John Fox <jfox at mcmaster.ca> wrote:
#
Dear Val,
On 2022-08-26 6:27 p.m., Val wrote:
This raises two questions: (1) Are p-values interesting when you have
n > 165K cases? (2) What's the point of testing a correlation between a
variable and itself?

Best,
  John
Val
#
Hi John,

Please see my response below your question,
There are some pairwise comparisons that have only a small number of
observations; I need that part of the information.

 > (2) What's the point of testing a correlation between a variable and itself?
  There is none; that part will be removed.

Thank you
On Fri, Aug 26, 2022 at 5:57 PM John Fox <jfox at mcmaster.ca> wrote: