
Incorrect handling of NA's in cor() (PR#6750)

4 messages · Thomas Lumley, msa@biostat.mgh.harvard.edu, Peter Dalgaard

#
Dear Uwe,

You are wrong. First, I read the help file before
submitting the report. For two variables,
use="pairwise.complete.obs" and use="complete.obs" should be
equivalent, shouldn't they? Of course, the results will
differ when we have more than two variables. Second, with the
call you proposed I am also getting an incorrect result:
[1] -0.1428571

The correct result is -0.4, as calculated by
cor.test().

Regards

Marek Ancukiewicz
#
On Fri, 9 Apr 2004 msa@biostat.mgh.harvard.edu wrote:

            
I think it's more complicated than either of you are considering.

For the Pearson correlation everything is straightforward, and
pairwise.complete is the same as complete, which is the same as dropping
the NAs manually.

For the rank correlations the question is when the ranking should be done.
The cor() function ranks the observations and then drops missing values,
whereas the manual approach drops missing values and then ranks.

I'm not convinced that it is obvious which of these is right, though
certainly the help page should document whichever is being done.
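
To make the two orderings concrete, here is a minimal sketch with made-up
numbers (not the data from the original report). It does the ranking by
hand with rank() and a Pearson cor() on the ranks, and it passes
na.last = TRUE explicitly on the assumption that this matches the
"NAs ranked last" behaviour discussed in this thread.

```r
# Hypothetical data, chosen so that the two orderings disagree.
x <- c(1, 5, 3, 4, NA)
y <- c(2, 1, 4, NA, 5)
ok <- complete.cases(x, y)   # TRUE only where both x and y are observed

# (a) Drop incomplete pairs first, then rank what is left.
rho_drop_then_rank <- cor(rank(x[ok]), rank(y[ok]))

# (b) Rank each full vector first (NAs ranked last), then drop the
#     pairs that were incomplete in the original data.
rx <- rank(x, na.last = TRUE)
ry <- rank(y, na.last = TRUE)
rho_rank_then_drop <- cor(rx[ok], ry[ok])

print(c(drop_then_rank = rho_drop_then_rank,
        rank_then_drop = rho_rank_then_drop))
```

In (b) the ranks of the retained x values come from the ordering of all
four observed x values, so they are no longer the ranks 1..3, and the two
estimates differ.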


	-thomas
#
Dear Thomas,

The question becomes: how do we rank missing values? In
version 1.8.1 at least, cor() relies on the default handling of
missing values in rank() [the na.last parameter], that is,
missing values are assigned the highest ranks. However, if
nothing is known about the meaning of NA, what would be the
basis for such an assumption? Assigning NAs the highest,
lowest, or any other ranks requires some additional
information.

It seems that the default handling of missing values should be
to assign them missing ranks: within cor(), rank() should be
called with na.last="keep". However, cor() could have an
additional parameter, such as na.rank, which would allow a
known ranking of missing values to be taken into account and
which would be passed to rank().
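
For reference, a short sketch of what the existing na.last settings of
rank() do, using made-up values. The na.rank argument proposed above is
hypothetical and is not used here.

```r
x <- c(10, NA, 30, 20)  # hypothetical values

r_last  <- rank(x, na.last = TRUE)    # NA gets the highest rank: 1 4 3 2
r_first <- rank(x, na.last = FALSE)   # NA gets the lowest rank:  2 1 4 3
r_keep  <- rank(x, na.last = "keep")  # NA keeps a missing rank:  1 NA 3 2
r_drop  <- rank(x, na.last = NA)      # NA is removed:            1 3 2

print(list(last = r_last, first = r_first, keep = r_keep, drop = r_drop))
```

With na.last="keep" the missing value stays missing after ranking, so any
later NA handling still sees it.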

By the way, if this were possible [and it probably isn't,
because of compatibility with Splus], I would rename the
"na.last" parameter of rank() to "na.rank", with values
such as "last", "first", "remove", and "na". That would seem
easier to remember. Also, perhaps the default value should be
"na".

Regards,

Marek
#
Marek Ancukiewicz <msa@biostat.mgh.harvard.edu> writes:
Yes, and that is what 1.9.0beta is doing (it's not like this issue
hasn't been brought up before, just that the fix didn't quite fix it).
I think what we have now is still buggy, but at least it isn't biasing
rho towards +1 whenever x and y tend to be both missing at the same
time.
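
A small sketch of that bias, with made-up data and under the assumption
that the NAs are ranked last and their ranks are then left in the
calculation (this is an illustration, not the actual cor() code):

```r
# Hypothetical data: x and y are missing in the same two positions.
x <- c(3, 1, 2, NA, NA)
y <- c(1, 3, 2, NA, NA)
ok <- complete.cases(x, y)

# Complete pairs only: the observed values are perfectly discordant.
rho_complete <- cor(rank(x[ok]), rank(y[ok]))

# NAs ranked last in both variables and kept: the jointly missing pairs
# receive the same top ranks (4 and 5) in x and in y.
rho_na_last <- cor(rank(x, na.last = TRUE), rank(y, na.last = TRUE))

print(c(complete = rho_complete, na_ranked_last = rho_na_last))
```

The jointly missing pairs behave like perfectly concordant observations,
which pulls the estimate upwards.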

It's fairly easy to do something more sensible in the "complete.obs"
case, but getting "pairwise.complete.obs" right is tricky. 1.9.0
is in deep code freeze, so I don't think we should change things at
this point, except perhaps to add a note to the help page.