Hello,
I have a big dataframe with *NO* na's (9 columns, 293380 rows).
# doing
memory.limit(size = 1000000000)
cor(x)
#gives
Error in cor(x) : missing observations in cov/cor
In addition: Warning message:
NAs introduced by coercion
#I found the obvious workaround:
COR <- matrix(rep(0, 81),9,9)
for (i in 1:9) for (j in 1:9) {if (i>j) COR[i,j] <- cor (x[,i],x[,j])}
#which works fine, with no warning
#looks like a "cor()" bug.
#I checked absence of NA's by
x <- x[complete.cases(x),]
summary(x)
apply(x,2, function (x) (sum(is.na(x))))
#I use R 1.9.1
Cheers,
Mayeul KAUFFMANN
Université Pierre Mendès France
Grenoble - France
cor() fails with big dataframe
6 messages · Martin Maechler, Brian Ripley, Mayeul Kauffmann
"Mayeul" == Mayeul KAUFFMANN <mayeul.kauffmann@tiscali.fr>
on Thu, 16 Sep 2004 01:23:09 +0200 writes:
Mayeul> Hello,
Mayeul> I have a big dataframe with *NO* na's (9 columns, 293380 rows).
Mayeul> # doing
Mayeul> memory.limit(size = 1000000000)
Mayeul> cor(x)
Mayeul> #gives
Mayeul> Error in cor(x) : missing observations in cov/cor
Mayeul> In addition: Warning message:
Mayeul> NAs introduced by coercion
"by coercion" means there were other things *coerced* to NAs!
One of the biggest problems with R users (and other S users for
that matter) is that if they get an error, they throw hands up
and ask for help - assuming the error message to be
non-intelligible. Whereas it *is* intelligible (slightly ? ;-)
more often than not ...
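A minimal illustration of that warning, using made-up character values rather than the poster's data: when a string that does not look like a number is converted to numeric, R returns NA for it and emits a warning. The vector `v` below is purely hypothetical.

```r
# Made-up character values; "TRUE" is text here, not a logical
v <- c("0.0142", "TRUE", "5.82")

# Capture the warning text instead of letting it print
msg <- tryCatch(as.numeric(v), warning = function(w) conditionMessage(w))
print(msg)

# The conversion itself: the unparseable entry becomes NA
num <- suppressWarnings(as.numeric(v))
print(is.na(num))
```

So an NA reported "by coercion" need not have been present in the original data; it can be created during the conversion.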
Mayeul> #I found the obvious workaround:
Mayeul> COR <- matrix(rep(0, 81),9,9)
Mayeul> for (i in 1:9) for (j in 1:9) {if (i>j) COR[i,j] <- cor (x[,i],x[,j])}
Mayeul> #which works fine, with no warning
Mayeul> #looks like a "cor()" bug.
quite improbably.
The following works flawlessly for me
and the only thing that takes a bit of time is construction of
x, not cor():
> n <- 300000
> set.seed(1)
> x <- as.data.frame(matrix(rnorm(n*9), n,9))
> cx <- cor(x)
> str(cx)
 num [1:9, 1:9] 1.00000 -0.00039 0.00113 0.00134 -0.00228 ...
 - attr(*, "dimnames")=List of 2
  ..$ : chr [1:9] "V1" "V2" "V3" "V4" ...
  ..$ : chr [1:9] "V1" "V2" "V3" "V4" ...
Mayeul> #I checked absence of NA's by
Mayeul> x <- x[complete.cases(x),]
Mayeul> summary(x)
Mayeul> apply(x,2, function (x) (sum(is.na(x))))
Mayeul> #I use R 1.9.1
What does
sapply(x, function(u)all(is.finite(u)))
return ?
Thanks all for your answers.
#The difference between the 2 following commands might be a puzzle even for intermediate users. (I give explanation below)
cor(x[,4],x[,5])
[1] -0.4352342
cor(x[,4:5])
Error in cor(x[, 4:5]) : missing observations in cov/cor
In addition: Warning message:
NAs introduced by coercion
From: "Martin Maechler" <maechler@stat.math.ethz.ch>
To: "Mayeul KAUFFMANN" <mayeul.kauffmann@tiscali.fr>
Mayeul> #I found the obvious workaround:
Mayeul> COR <- matrix(rep(0, 81),9,9)
Mayeul> for (i in 1:9) for (j in 1:9) {if (i>j) COR[i,j] <- cor (x[,i],x[,j])}
Mayeul> #which works fine, with no warning
Mayeul> #looks like a "cor()" bug.
Martin Maechler wrote:
quite improbably.
If it is wrong, can you say what is wrong and then propose an alternative workaround? (Or should I ask on r-help?)
What does
sapply(x, function(u)all(is.finite(u)))
return ?
sapply(x2, function(u)all(is.finite(u)))
  jntdem smldepnp lrgdepnp contigkb logdstab  majdyds  alliesr  lncaprt     GATT
    TRUE     TRUE     TRUE     TRUE     TRUE     TRUE     TRUE     TRUE     TRUE
_______________________________________________
But I now got the explanation. It is not due to size.
#Tony Plate wrote:
#I would suspect that your dataframe has columns that result in NA's when it
#is coerced to a matrix
That's not yet the explanation, but you are close to it.
All columns are numeric, except 3 that are logical (I thought they would
be coerced to 0 and 1, which they are with cor(x[,4],x[,5]) but not with
cor(x[,4:5]) )
They do not change to NA's or infinite values, they ALL change to TEXT
?as.matrix
'as.matrix' is a generic function. The method for data frames will
convert any non-numeric/complex column into a character vector
using 'format' and so return a character matrix, except that
all-logical data frames will be coerced to a logical matrix.
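A quick way to spot such columns before calling cor() is to inspect the class of each column. This is only a sketch on a hypothetical two-column data frame (the column names are borrowed from the thread, the values are invented). The result of as.matrix() on mixed numeric/logical data is the R 1.9.1 behaviour quoted above and may differ in other R versions, so the last line prints the mode rather than assuming it.

```r
# Hypothetical toy data frame; values are invented for illustration
x <- data.frame(lncaprt  = c(2.88, 2.91, 2.83),
                contigkb = c(TRUE, FALSE, TRUE))

# Class of every column, as a named character vector
col_class <- sapply(x, class)
print(col_class)

# Names of the columns that are not stored as numeric
not_numeric <- names(x)[!sapply(x, is.numeric)]
print(not_numeric)

# What as.matrix() produces in the running R version
# ("character" according to the R 1.9.1 documentation quoted above)
print(mode(as.matrix(x)))
```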
as.matrix(x[1:3,1:9])
  jntdem smldepnp     lrgdepnp    contigkb logdstab   majdyds alliesr
1 "400"  "0.01420874" "0.2156945" "TRUE"   "5.820108" "TRUE"  "TRUE"
2 "400"  "0.01534535" "0.2496879" "TRUE"   "5.820108" "TRUE"  "TRUE"
3 "400"  "0.01585586" "0.2570493" "TRUE"   "5.820108" "TRUE"  "TRUE"
  lncaprt    GATT
1 "2.883204" "1"
2 "2.906521" "1"
3 "2.833357" "1"
?cor says it accepts data.frame. In fact, it does iff they have no (or
only: cor(x[,6:7]) works) logical columns.
Doing cor with a logical (a dummy variable) and a numeric is maybe not as
sensible as doing it with 2 numerics.
But it may still be useful to explore data.
Maybe one may want either to change the documentation of ?cor, or not
rely on as.matrix to convert the data.frame if some columns are logical.
Cheers,
Mayeul
On Thu, 16 Sep 2004, Mayeul KAUFFMANN claimed:
?cor says it accepts data.frame. In fact, it does iff they have no (or
It actually says
x: a numeric vector, matrix or data frame.
     ^^^^^^^
If you want to do the conversions as you say, you should be calling
data.matrix.
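The suggestion above could look roughly like this. It is a sketch on a hypothetical toy data frame (column names borrowed from the thread, values invented), not the poster's actual data: data.matrix() converts logical columns to integer codes, so the result is a numeric matrix that cor() accepts.

```r
# Hypothetical toy data frame with one numeric and two logical columns
x <- data.frame(lncaprt  = c(2.88, 2.91, 2.83, 3.10),
                contigkb = c(TRUE, FALSE, TRUE, FALSE),
                alliesr  = c(TRUE, TRUE, FALSE, FALSE))

# Explicit conversion: logical TRUE/FALSE become 1/0
m <- data.matrix(x)
print(is.numeric(m))

# Correlation matrix computed on the converted data
cx <- cor(m)
print(dim(cx))
```

Doing the conversion explicitly keeps the choice of coding (here 0/1 for the dummy variables) visible in the script rather than relying on what cor() does internally.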
______________________________________________
R-devel@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Brian D. Ripley,                  ripley@stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595
On Thu, 16 Sep 2004, Mayeul KAUFFMANN claimed:
?cor says it accepts data.frame. In fact, it does iff they have no (or
It actually says
x: a numeric vector, matrix or data frame.
     ^^^^^^^
If you want to do the conversions as you say, you should be calling
data.matrix.
@@@@@@@@@@@@@@@@@@@@@@@@
Thanks a lot !!!
When I first read it, I mistranslated it in my mind into a phrase that
would mean
"a numeric vector, a matrix or a data frame." (I'm not a native English
speaker). Sorry for all that stuff....
*But* let's admit that the following two calls are not treated identically:
cor(x[,4],x[,5])
cor(x[,4:5])
in the first case, the non-numeric vector is transformed to a numeric one
in the second case, the (partially) non-numeric dataframe is not
transformed to a numeric one
To be more exact,
the doc should not say
x: a numeric vector, matrix or data frame.
     ^^^^^^^
but
x: a vector that can be coerced to numeric, a numeric matrix or a
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^    ^^^^^^^
numeric data frame.
^^^^^^^
Cheers,
Mayeul
PS:
by the way, if someone changes the doc,
the claim 'The default is equivalent to 'y = x' (but more efficient).'
is inexact as evidenced by the following:
X <- (data.frame(x=rep(1,5),y=1:5))
cor(X,X)
   x  y
x NA NA
y NA  1
Warning message:
The standard deviation is zero in: cor(x, y, na.method, method == "kendall")
cor(X)
   x  y
x  1 NA
y NA  1
Warning message:
The standard deviation is zero in: cor(x, y, na.method, method == "kendall")
We do not in general say things like `can be coerced'. It's taken for granted, and hard to be precise (your phrase is not precise, for there are non-numeric matrices that will be coerced, too). We do expect what is stated as valid input to work, and do encourage users to coerce objects themselves.
On Thu, 16 Sep 2004, Mayeul KAUFFMANN wrote:
*But* let's admit that the two followings are not treated identically:
cor(x[,4],x[,5])
cor(x[,4:5])
Where does it say that would be?