
cor(data.frame) infelicities

5 messages · Gabor Grothendieck, Liaw, Andy, Michael Friendly

#
In using cor(data.frame), it is annoying that you have to explicitly 
filter out non-numeric columns, and when you don't, the error message
is misleading:

 > cor(iris)
Error in cor(iris) : missing observations in cov/cor
In addition: Warning message:
In cor(iris) : NAs introduced by coercion

It would be nicer if stats::cor() did the equivalent *itself* of the 
following for a data.frame:
 > cor(iris[,sapply(iris, is.numeric)])
              Sepal.Length Sepal.Width Petal.Length Petal.Width
Sepal.Length    1.0000000  -0.1175698    0.8717538   0.8179411
Sepal.Width    -0.1175698   1.0000000   -0.4284401  -0.3661259
Petal.Length    0.8717538  -0.4284401    1.0000000   0.9628654
Petal.Width     0.8179411  -0.3661259    0.9628654   1.0000000
 >

A change could be implemented here:
     if (is.data.frame(x))
         x <- as.matrix(x)
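
A minimal sketch of what that filtering could look like, written as a user-level wrapper rather than a patch to stats. The name cor_numeric and the wording of its warning are hypothetical and appear nowhere in R itself:

```r
# Hypothetical wrapper (not part of stats): drop non-numeric columns
# from a data frame, warn about what was dropped, then call cor().
cor_numeric <- function(x, ...) {
  if (is.data.frame(x)) {
    is_num <- vapply(x, is.numeric, logical(1))
    if (!all(is_num)) {
      warning("dropping non-numeric column(s): ",
              paste(names(x)[!is_num], collapse = ", "))
    }
    x <- as.matrix(x[, is_num, drop = FALSE])
  }
  cor(x, ...)
}

# Species is dropped (with a warning); the four measurements remain.
res <- suppressWarnings(cor_numeric(iris))
```

Warning rather than dropping silently keeps the user aware that the result covers fewer columns than the input.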

Second, the default, use="all.obs", throws an error if there are any
NAs.  It would be nicer if the default was use="complete.obs",
which would generate warnings instead.  Most other statistical
software is more tolerant of missing data.

 > library(corrgram)
 > data(auto)
 > cor(auto[,sapply(auto, is.numeric)])
Error in cor(auto[, sapply(auto, is.numeric)]) :
   missing observations in cov/cor
 > cor(auto[,sapply(auto, is.numeric)],use="complete")
# works; output elided
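
For reference, a small sketch of how the documented `use` options differ on a toy data frame. The data are made up for illustration, and `use` is passed explicitly each time so the example does not depend on whatever the default happens to be:

```r
# Toy data with one NA in column a and one in column c (illustrative only).
df <- data.frame(a = c(1, 2, 3, 4, NA),
                 b = c(2, 1, 4, 3, 5),
                 c = c(5, NA, 3, 2, 1))

# "all.obs": any NA is an error; capture the message instead of stopping.
strict <- tryCatch(cor(df, use = "all.obs"),
                   error = function(e) conditionMessage(e))

# "complete.obs": only rows with no NA in any column are used.
complete <- cor(df, use = "complete.obs")

# "pairwise.complete.obs": each pair of columns uses the rows
# that are complete for that pair.
pairwise <- cor(df, use = "pairwise.complete.obs")
```

The complete-case and pairwise results generally differ, because each pairwise entry may be computed from a different subset of rows.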

-Michael
#
You can calculate the Kendall rank correlation with such a matrix
so you would not want to exclude factors in that case:
             Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
Sepal.Length   1.00000000 -0.07699679    0.7185159   0.6553086  0.6704444
Sepal.Width   -0.07699679  1.00000000   -0.1859944  -0.1571257 -0.3376144
Petal.Length   0.71851593 -0.18599442    1.0000000   0.8068907  0.8229112
Petal.Width    0.65530856 -0.15712566    0.8068907   1.0000000  0.8396874
Species        0.67044444 -0.33761438    0.8229112   0.8396874  1.0000000
On Dec 3, 2007 9:27 AM, Michael Friendly <friendly at yorku.ca> wrote:
#
I'd call that another infelicity.  Species is supposed to be nominal,
not ordinal, so rank correlation wouldn't make much sense.  So what does
cor(, method="kendall") do?  It looks like it simply uses the underlying
numeric code.  (Change Species to numerics and you'll see the same
answer.)  However, reordering the levels changes the result:

R> iris2 <- iris
R> levels(iris2$Species) <- levels(iris2$Species)[c(2, 1, 3)]
R> cor(iris2, method = "kendall")
             Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
Sepal.Length   1.00000000 -0.07699679    0.7185159   0.6553086 0.1897778
Sepal.Width   -0.07699679  1.00000000   -0.1859944  -0.1571257 0.1439793
Petal.Length   0.71851593 -0.18599442    1.0000000   0.8068907 0.2677154
Petal.Width    0.65530856 -0.15712566    0.8068907   1.0000000 0.2724843
Species        0.18977778  0.14397927    0.2677154   0.2724843 1.0000000

To me, this is dangerous!
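
A tiny illustration of the same effect, using made-up vectors rather than iris, and converting the factor to integer codes explicitly so that it does not rely on how cor() treats a factor column:

```r
# Illustrative data: a numeric variable and a three-level grouping.
y <- c(1, 2, 3, 4, 5, 6)
g <- c("a", "a", "b", "b", "c", "c")

# Same labels, two different level orders.
f1 <- factor(g, levels = c("a", "b", "c"))
f2 <- factor(g, levels = c("b", "a", "c"))

as.integer(f1)  # 1 1 2 2 3 3
as.integer(f2)  # 2 2 1 1 3 3

# Kendall's tau computed on the codes depends on the arbitrary level order.
tau1 <- cor(y, as.integer(f1), method = "kendall")
tau2 <- cor(y, as.integer(f2), method = "kendall")
```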

Andy
 

#
You are right but I was just trying to stick to the same example.
In reality it would be ok as long as it's an ordered factor.  One could
restrict it to those of class "ordered".
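
A rough sketch of that restriction, assuming a user-level helper (the name rank_cor is hypothetical) that keeps numeric columns and ordered factors, and converts the latter to their integer codes with data.matrix():

```r
# Hypothetical helper: rank correlation over numeric columns and ordered
# factors only; unordered factors and character columns are dropped.
rank_cor <- function(df, method = "kendall") {
  keep <- vapply(df, function(col) is.numeric(col) || is.ordered(col),
                 logical(1))
  # data.matrix() turns ordered factors into their integer level codes.
  m <- data.matrix(df[, keep, drop = FALSE])
  cor(m, method = method)
}

# Illustrative data: 'size' is ordered, 'group' is nominal.
d <- data.frame(
  x     = c(1.2, 2.5, 3.1, 4.8, 5.0, 6.3),
  size  = factor(c("small", "small", "medium", "medium", "large", "large"),
                 levels = c("small", "medium", "large"), ordered = TRUE),
  group = factor(c("a", "b", "a", "b", "a", "b"))
)
tau <- rank_cor(d)  # 2 x 2 matrix over x and size; group is excluded
```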
On Dec 3, 2007 1:58 PM, Liaw, Andy <andy_liaw at merck.com> wrote:
#
Returning to my original post, I still believe that a basic work-horse
like cor(data.frame) with the default method="pearson" should try to do 
something more useful in this case than barf with a misleading error
message if the data frame contains character variables.

To paraphrase Einstein,
``Things [in R] should be made as simple as possible, but not any simpler''

The case that Andy Liaw cited is a good example of the 'not any
simpler' part.

-Michael