cor() fails with big dataframe

6 messages · Martin Maechler, Brian Ripley, Mayeul Kauffmann

#
Hello,

I have a big dataframe with *NO* na's (9 columns, 293380 rows).

# doing
memory.limit(size = 1000000000)
cor(x)
#gives
Error in cor(x) : missing observations in cov/cor
In addition: Warning message:
NAs introduced by coercion

#I found the obvious workaround:
COR <- matrix(rep(0, 81),9,9)
for (i in 1:9) for (j in 1:9) {if (i>j) COR[i,j] <- cor (x[,i],x[,j])}
#which works fine, with no warning

#looks like a "cor()" bug.

#I checked absence of NA's by
x <- x[complete.cases(x),]
summary(x)
apply(x,2, function (x) (sum(is.na(x))))

#I use R 1.9.1

Cheers,
Mayeul KAUFFMANN
Université Pierre Mendès France
Grenoble - France
#
Mayeul> Hello,
    Mayeul> I have a big dataframe with *NO* na's (9 columns, 293380 rows).

    Mayeul> # doing
    Mayeul> memory.limit(size = 1000000000)
    Mayeul> cor(x)
    Mayeul> #gives
    Mayeul> Error in cor(x) : missing observations in cov/cor
    Mayeul> In addition: Warning message:
    Mayeul> NAs introduced by coercion

"by coercion" means there were other things *coerced* to NAs!

One of the biggest problems with R users (and other S users for
that matter) is that if they get an error, they throw their hands up
and ask for help - assuming the error message to be
non-intelligible.  Whereas it *is* intelligible (slightly ? ;-)
more often than not ...


    Mayeul> #I found the obvious workaround:
    Mayeul> COR <- matrix(rep(0, 81),9,9)
    Mayeul> for (i in 1:9) for (j in 1:9) {if (i>j) COR[i,j] <- cor (x[,i],x[,j])}
    Mayeul> #which works fine, with no warning

    Mayeul> #looks like a "cor()" bug.

quite improbably.

The following works flawlessly for me
and the only thing that takes a bit of time is the construction of
x, not cor():

  > n <- 300000
  > set.seed(1)
  > x <- as.data.frame(matrix(rnorm(n*9), n,9))
  > cx <- cor(x)
  > str(cx)
   num [1:9, 1:9]  1.00000 -0.00039  0.00113  0.00134 -0.00228 ...
   - attr(*, "dimnames")=List of 2
    ..$ : chr [1:9] "V1" "V2" "V3" "V4" ...
    ..$ : chr [1:9] "V1" "V2" "V3" "V4" ...


    Mayeul> #I checked absence of NA's by
    Mayeul> x <- x[complete.cases(x),]
    Mayeul> summary(x)
    Mayeul> apply(x,2, function (x) (sum(is.na(x))))

    Mayeul> #I use R 1.9.1

What does
    sapply(x, function(u)all(is.finite(u)))
return ?
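
A related check is to look at the class of each column, because
is.finite() is TRUE for every non-missing logical value and so cannot
tell a logical column from a numeric one. The sketch below is only an
illustration on a small made-up data frame; the column names a, b and d
are hypothetical stand-ins, not the poster's data.

```r
# Hypothetical stand-in data: two numeric columns and one logical column.
x <- data.frame(a = c(1.5, 2.5, 3.5, 5.0),
                b = c(TRUE, FALSE, TRUE, TRUE),
                d = c(10, 20, 15, 40))

# is.finite() is TRUE for every non-missing numeric or logical value,
# so this check passes even though column b is not numeric.
all_finite <- sapply(x, function(u) all(is.finite(u)))

# The class of each column shows which ones are not numeric.
col_classes <- sapply(x, class)

print(all_finite)
print(col_classes)
```

If col_classes reports anything other than "numeric" or "integer", the
data frame may not convert to a numeric matrix the way cor() needs.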
#
Thanks all for your answers.

#The difference between the following 2 commands might be a puzzle even
#for intermediate users. (I give the explanation below.)
cor(x[,4],x[,5])
[1] -0.4352342
cor(x[,4:5])
Error in cor(x[, 4:5]) : missing observations in cov/cor
In addition: Warning message:
NAs introduced by coercion

Martin Maechler wrote:
If it is wrong, can you say what is wrong and then propose an alternative
workaround? (Or should I ask on r-help?)
sapply(x2, function(u)all(is.finite(u)))
  jntdem smldepnp lrgdepnp contigkb logdstab  majdyds  alliesr  lncaprt     GATT
    TRUE     TRUE     TRUE     TRUE     TRUE     TRUE     TRUE     TRUE     TRUE


But I now have the explanation. It is not due to size.
#Tony Plate wrote:
#I would suspect that your dataframe has columns that result in NA's when it
#is coerced to a matrix

That's not yet the explanation, but you are close to it.

All columns are numeric, except 3 that are logical (I thought they would
be coerced to 0 and 1, which they are with cor(x[,4],x[,5]) but not with
cor(x[,4:5])).
They do not change to NA's or infinite values; they ALL change to TEXT:

?as.matrix
 'as.matrix' is a generic function. The method for data frames will
     convert any non-numeric/complex column into a character vector
     using 'format' and so return a character matrix, except that
     all-logical data frames will be coerced to a logical matrix.
  jntdem smldepnp     lrgdepnp    contigkb logdstab   majdyds alliesr
1 "400"  "0.01420874" "0.2156945" "TRUE"   "5.820108" "TRUE"  "TRUE"
2 "400"  "0.01534535" "0.2496879" "TRUE"   "5.820108" "TRUE"  "TRUE"
3 "400"  "0.01585586" "0.2570493" "TRUE"   "5.820108" "TRUE"  "TRUE"
  lncaprt    GATT
1 "2.883204" "1"
2 "2.906521" "1"
3 "2.833357" "1"
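
The same kind of coercion can be reproduced on a small made-up data
frame. The sketch below uses a character column, which as.matrix() turns
into a character matrix together with the numbers next to it. Whether a
logical column mixed with numeric ones has the same effect depends on
the R version (the output above is from R 1.9.1, and as far as I can
tell later releases document the rule differently), so the mixed
logical/numeric case is deliberately left out. All names are
hypothetical.

```r
# Made-up data: one numeric column and one character column.
df_mixed <- data.frame(num = c(0.5, 1.25, 2),
                       txt = c("u", "v", "w"),
                       stringsAsFactors = FALSE)
m_mixed <- as.matrix(df_mixed)
print(mode(m_mixed))   # the whole matrix is character, numbers included

# An all-logical data frame keeps its type, as the help text says.
df_lgl <- data.frame(p = c(TRUE, FALSE, TRUE), q = c(FALSE, FALSE, TRUE))
m_lgl <- as.matrix(df_lgl)
print(mode(m_lgl))

# Turning the character matrix back into numbers makes the text entries NA.
# The coercion warning is suppressed here only to keep the output short.
back <- suppressWarnings(as.numeric(m_mixed))
print(sum(is.na(back)))
```

This is one way NAs can appear in data that contained none: they are
created when text is converted to numbers, not read from the data frame.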

?cor says it accepts a data frame. In fact, it does only if the data frame
has no logical columns, or only logical columns (cor(x[,6:7]) works).
Computing a correlation between a logical (a dummy variable) and a numeric
is maybe not as sensible as computing it between 2 numerics,
but it may still be useful for exploring data.

Maybe one may want either to change the documentation of ?cor, or not to
rely on as.matrix to convert the data frame if some columns are logical.


Cheers,
Mayeul
#
On Thu, 16 Sep 2004, Mayeul KAUFFMANN claimed:
It actually says

       x: a numeric vector, matrix or data frame.
            ^^^^^^^

If you want to do the conversions as you say, you should be calling
data.matrix.
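
A minimal sketch of that suggestion, on made-up data with hypothetical
column names: data.matrix() converts logical columns to numbers, so its
result can be passed to cor() without depending on how as.matrix()
treats a data frame with mixed column types.

```r
# Hypothetical stand-in data: two numeric columns and one logical column.
x <- data.frame(a = c(1.5, 2.5, 3.5, 5.0),
                b = c(TRUE, FALSE, TRUE, TRUE),
                d = c(10, 20, 15, 40))

# data.matrix() returns a numeric matrix; TRUE/FALSE become 1/0.
m <- data.matrix(x)

# cor() now receives a purely numeric matrix.
cm <- cor(m)
print(cm)
```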
#
On Thu, 16 Sep 2004, Mayeul KAUFFMANN claimed:
It actually says
       x: a numeric vector, matrix or data frame.
            ^^^^^^^
If you want to do the conversions as you say, you should be calling
data.matrix.

@@@@@@@@@@@@@@@@@@@@@@@@

Thanks a lot !!!
When I first read it, I mistranslated it in my mind into a phrase that
would mean
"a numeric vector, a matrix or a data frame." (I'm not a native English
speaker). Sorry for all that stuff....

*But* let's admit that the following two calls are not treated identically:
cor(x[,4],x[,5])
cor(x[,4:5])
In the first case, the non-numeric vector is transformed to a numeric one;
in the second case, the (partially) non-numeric data frame is not
transformed to a numeric one.

To be more exact,
the doc should not say
       x: a numeric vector, matrix or data frame.
            ^^^^^^^
but
       x: a vector that can be coerced to numeric, a numeric matrix or a numeric data frame.
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^    ^^^^^^^             ^^^^^^^


Cheers,
Mayeul

PS:
By the way, if someone changes the doc,
the claim 'The default is equivalent to 'y = x' (but more efficient).'
is inexact, as evidenced by the following:
X <- (data.frame(x=rep(1,5),y=1:5))
   x  y
x NA NA
y NA  1
Warning message:
The standard deviation is zero in: cor(x, y, na.method, method == "kendall")
   x  y
x  1 NA
y NA  1
Warning message:
The standard deviation is zero in: cor(x, y, na.method, method == "kendall")
#
We do not in general say things like `can be coerced'.  It's taken for 
granted, and hard to be precise (your phrase is not precise, for there are 
non-numeric matrices that will be coerced, too).

We do expect what is stated as valid input to work, and do encourage users
to coerce objects themselves.
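
In practice, coercing the object oneself might look like the sketch
below, on made-up data with hypothetical column names. Each column is
converted with as.numeric() before cor() is called, so the result does
not depend on what cor() or as.matrix() do with logical columns.

```r
# Hypothetical stand-in data: two numeric columns and one logical column.
x <- data.frame(a = c(1.5, 2.5, 3.5, 5.0),
                b = c(TRUE, FALSE, TRUE, TRUE),
                d = c(10, 20, 15, 40))

# Convert every column explicitly; assigning into x_num[] keeps the
# data frame structure and the column names.
x_num <- x
x_num[] <- lapply(x, as.numeric)

# Correlation matrix of the coerced data frame.
cm <- cor(x_num)

# The pairwise call on explicitly coerced vectors gives the same value.
r_ab <- cor(as.numeric(x$a), as.numeric(x$b))
print(cm)
print(r_ab)
```

Note that as.numeric() applied to a factor returns its internal level
codes, so factor columns would need separate handling.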
On Thu, 16 Sep 2004, Mayeul KAUFFMANN wrote:

            
Where does it say that would be?