Incorrect handling of NA's in cor() (PR#6750)

Dear Thomas,

The question becomes: how do we rank missing values? In
version 1.8.1 at least, cor() relies on the default handling
of missing values in rank() [the na.last parameter], that is,
missing values are assigned the highest ranks. However, if
nothing is known about the meaning of NA, what would be the
basis for such an assumption? Assigning the NAs the highest
ranks, the lowest ranks, or any other values requires some
additional information.
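
As a minimal illustration of the rank() behaviour under discussion,
the sketch below ranks one small vector under each documented value
of na.last. The vector x is made up for the example; only base R is
used.

```r
# Toy data (an assumption for illustration): one missing value.
x <- c(3, NA, 1, 2)

r_last  <- rank(x)                    # default na.last = TRUE: NA ranked highest
r_first <- rank(x, na.last = FALSE)   # NA ranked lowest
r_keep  <- rank(x, na.last = "keep")  # NA keeps a missing rank
r_drop  <- rank(x, na.last = NA)      # NA removed before ranking

print(r_last)   # 3 4 1 2
print(r_first)  # 4 1 2 3
print(r_keep)   # 3 NA 1 2
print(r_drop)   # 3 1 2
```

Only the "keep" variant avoids placing the NA at a definite position
relative to the observed values.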

It seems that the default handling of missing values should be
to assign them missing ranks: within cor(), rank() should be
called with na.last="keep". However, cor() could take an
additional parameter, such as na.rank, which would be passed
to rank() and would allow a known ranking of missing values
to be taken into account.
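
A rough sketch of what such a parameter might look like follows. The
function spearman_na() and its na.rank argument are hypothetical and
are not part of R; the sketch computes a rank correlation as the
Pearson correlation of ranks and simply forwards na.rank to rank() as
na.last. It assumes na.rank is TRUE, FALSE, or "keep", and with
"keep" the ranks are not recomputed after incomplete pairs are
dropped.

```r
# Hypothetical helper, not part of R: rank correlation in which the
# caller decides how NAs are ranked.
spearman_na <- function(x, y, na.rank = "keep") {
  rx <- rank(x, na.last = na.rank)
  ry <- rank(y, na.last = na.rank)
  # Pairs whose rank is NA (possible only with "keep") are dropped here.
  cor(rx, ry, use = "complete.obs")
}

# Toy data (an assumption for illustration).
x <- c(1, 2, 3, 4, NA)
y <- c(2, 4, 6, 8, 1)

rho_keep <- spearman_na(x, y, na.rank = "keep")  # NA pair ignored
rho_last <- spearman_na(x, y, na.rank = TRUE)    # NA treated as largest x

print(rho_keep)
print(rho_last)
```

On this toy data the first call gives a correlation of 1 and the
second gives 0, so the choice of rank for a single NA can change the
result completely.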

By the way, if this were possible [and it probably isn't,
because of compatibility with Splus], I would rename the
"na.last" parameter of rank() to "na.rank", with values
such as "last", "first", "remove", and "na". That would seem
easier to remember. Also, perhaps the default value should be
"na".

Regards,

Marek
Message-ID: <20040409180817.D6C275E197@biostat.mgh.harvard.edu>
In-Reply-To: <Pine.A41.4.58.0404091032330.61772@homer32.u.washington.edu> (message from Thomas Lumley on Fri, 9 Apr 2004 10:42:59 -0700 (PDT))