Question re: NA, NaNs in R

This isn't quite what you were asking, but might inform your choice.

R doesn't try to maintain the distinction between NA and NaN when
doing calculations, e.g.:
	NA + NaN
	[1] NA
	NaN + NA
	[1] NaN
So for the aggregate package, I didn't attempt to treat them differently.

The aggregate package is available at
http://www.timhesterberg.net/r-packages

Here is the inst/doc/missingValues.txt file from that package:

--------------------------------------------------
Copyright 2012 Google Inc. All Rights Reserved.
Author: Tim Hesterberg <rocket at google.com>
Distributed under GPL 2 or later.

	Handling of missing values and not-a-numbers.

Here I'll note how this package handles missing values.
I do it the way R handles them, rather than the more strict way that S+ does.

First, for terminology,
  NaN = "not-a-number", e.g. the result of 0/0
  NA  = "missing value" or "true missing value", e.g. survey non-response
  xx  = I'll use this for the union of those, or "missing value of any kind".

For background, at the hardware level there is an IEEE standard that
specifies that certain bit patterns are NaN, and specifies that
operations involving an NaN result in another NaN.

That standard doesn't say anything about missing values, which are
important in statistics.

So what R and S+ do is to pick one of the bit patterns and declare
that to be an NA.  In other words, the NA bit pattern is one of
the NaN bit patterns.
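
To make the bit-pattern point concrete, here is a minimal C sketch. It is not R's source: the helper names are hypothetical, and the NA pattern (high word 0x7FF00000, low word 1954) is an assumption based on the low-word check in the R_IsNA code quoted later in this thread.

```c
#include <assert.h>
#include <math.h>
#include <stdint.h>
#include <string.h>

/* Assumed NA encoding: high word 0x7FF00000, low word 1954 (0x7A2). */
#define NA_BITS UINT64_C(0x7FF00000000007A2)

/* Copy a 64-bit pattern into a double (memcpy avoids aliasing problems). */
static void set_bits(double *out, uint64_t bits)
{
    assert(sizeof *out == sizeof bits);
    memcpy(out, &bits, sizeof bits);
}

/* Read the 64-bit pattern stored in a double. */
static uint64_t get_bits(const double *in)
{
    uint64_t bits;
    memcpy(&bits, in, sizeof bits);
    return bits;
}

/* IEEE 754 NaN: exponent field all ones and a nonzero fraction. */
static int is_nan_pattern(uint64_t bits)
{
    const uint64_t exp_mask  = UINT64_C(0x7FF0000000000000);
    const uint64_t frac_mask = UINT64_C(0x000FFFFFFFFFFFFF);
    return (bits & exp_mask) == exp_mask && (bits & frac_mask) != 0;
}

/* Low 32 bits of the pattern, the word that the NA test inspects. */
static uint32_t low_word(uint64_t bits)
{
    return (uint32_t)(bits & UINT64_C(0xFFFFFFFF));
}

/* Ask the C library, which follows the hardware definition of NaN. */
static int hardware_sees_nan(const double *x)
{
    return isnan(*x);
}
```

After set_bits(&na, NA_BITS), both is_nan_pattern(get_bits(&na)) and hardware_sees_nan(&na) report a NaN, while low_word(get_bits(&na)) is 1954.  So the hardware sees only a NaN; the NA meaning lives entirely in the payload.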

At the user level, the reverse seems to hold.
You can assign either NA or NaN to an object.
But:
	is.na(x) returns TRUE for both
	is.nan(x) returns TRUE for NaN and FALSE for NA
Based on that, you'd think that NaN is a subset of NA.
To tell whether something is a true missing value do:
	(is.na(x) & !is.nan(x))

The S+ convention is that any operation involving NA results in an NA;
otherwise any operation involving NaN results in NaN.

The R convention is that any operation involving xx results in an xx;
that is, a missing value of any kind results in another missing value of
any kind.  R considers NA and NaN equivalent for testing purposes:
	all.equal(NA_real_, NaN)
gives TRUE.

Some R functions follow the S+ convention, e.g. the Math2 functions
in src/main/arithmetic.c use this macro:
#define if_NA_Math2_set(y,a,b)				\
	if      (ISNA (a) || ISNA (b)) y = NA_REAL;	\
	else if (ISNAN(a) || ISNAN(b)) y = R_NaN;
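
As a rough illustration of what that macro buys you, here is a self-contained C sketch of the same precedence rule (NA first, then NaN, then the actual computation).  The names make_na, is_na, add2 and math2_checked are hypothetical stand-ins, not R's API, and the NA bit pattern is an assumption.  It also assumes doubles are passed between functions without their NaN bits being altered, as on typical 64-bit platforms.

```c
#include <assert.h>
#include <math.h>
#include <stdint.h>
#include <string.h>

/* Stand-in for NA_REAL; the bit pattern is an assumption of this sketch. */
static double make_na(void)
{
    const uint64_t bits = UINT64_C(0x7FF00000000007A2);
    double d;
    assert(sizeof d == sizeof bits);
    memcpy(&d, &bits, sizeof d);
    return d;
}

/* Stand-in for ISNA: a NaN whose low 32 bits equal 1954. */
static int is_na(double x)
{
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);
    return isnan(x) && (uint32_t)(bits & UINT64_C(0xFFFFFFFF)) == 1954u;
}

/* Example two-argument operation to wrap. */
static double add2(double a, double b)
{
    return a + b;
}

/* S+ convention: NA takes precedence, then NaN, then the computation. */
static double math2_checked(double a, double b, double (*f)(double, double))
{
    if (is_na(a) || is_na(b)) return make_na();
    if (isnan(a) || isnan(b)) return NAN;
    return f(a, b);
}
```

With this wrapper the result for (NA, NaN) and for (NaN, NA) is NA in both cases, regardless of argument order, at the cost of two extra tests per call.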

Other R functions, like the basic arithmetic operations +-/*^,
do not (search for PLUSOP in src/main/arithmetic.c).
They just let the hardware do the calculations.
As a result, you can get odd results like
	is.nan(NA_real_ + NaN)
	[1] FALSE
	is.nan(NaN + NA_real_)
	[1] TRUE
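
The same experiment can be tried in C.  This is only a sketch under assumptions: the helper names are hypothetical, the NA pattern (low word 1954) is assumed, and which payload survives an addition of two NaNs is up to the hardware, so no particular output is claimed here.

```c
#include <assert.h>
#include <math.h>
#include <stdint.h>
#include <string.h>

/* Build a double from a 64-bit pattern. */
static double from_bits(uint64_t bits)
{
    double d;
    assert(sizeof d == sizeof bits);
    memcpy(&d, &bits, sizeof d);
    return d;
}

/* Low-word test in the style of R_IsNA: a NaN whose low 32 bits are 1954. */
static int is_na(double x)
{
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);
    return isnan(x) && (uint32_t)(bits & UINT64_C(0xFFFFFFFF)) == 1954u;
}

/* Add through volatile operands so the sum is computed at run time
   rather than folded by the compiler. */
static double hw_add(double a, double b)
{
    volatile double x = a, y = b;
    return x + y;
}
```

With na = from_bits(UINT64_C(0x7FF00000000007A2)) and nan_val = from_bits(UINT64_C(0x7FF8000000000000)), both hw_add(na, nan_val) and hw_add(nan_val, na) are NaNs, but is_na() of the two sums may differ, depending on which operand's payload the platform keeps.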

The R help files help(is.na) and help(is.nan) say that when a
computation involves both NA and NaN, which of the two is returned is
not guaranteed and may depend on the platform.

It is faster to use the R convention; most operations are just
handled by the hardware, without extra work.

In cases like sum(x, na.rm=TRUE), the help file specifies that both NA
and NaN are removed.
There is one NA but multiple NaNs.

And please re-read 'man memcmp': your cast is wrong.

On 10/02/2014 06:52, Kevin Ushey wrote:
Hi R-devel,

I have a question about the differentiation between NA and NaN values
as implemented in R. In arithmetic.c, we have

int R_IsNA(double x)
{
     if (isnan(x)) {
         ieee_double y;
         y.value = x;
         return (y.word[lw] == 1954);
     }
     return 0;
}

ieee_double is just used for type punning so we can check the final
bits and see if they're equal to 1954; if they are, x is NA, if
they're not, x is NaN (as defined for R_IsNaN).

My question is -- I can see a substantial increase in speed (on my
computer, in certain cases) if I replace this check with

int R_IsNA(double x)
{
     return memcmp(
         (char*)(&x),
         (char*)(&NA_REAL),
         sizeof(double)
     ) == 0;
}

IIUC, there is only one bit pattern used to encode R NA values, so
this should be safe. But I would like to be sure:
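
One way to probe that assumption is to compare the two tests on hand-built bit patterns.  The C sketch below is hypothetical throughout (helper names, and the assumed NA pattern with high word 0x7FF00000 and low word 1954).  It shows that the low-word test accepts any NaN whose low word is 1954, while a byte-wise comparison accepts exactly one pattern.  Note that memcmp takes const void * arguments, so no cast is needed.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Store a 64-bit pattern in a double without going through the FPU. */
static void set_bits(double *out, uint64_t bits)
{
    assert(sizeof *out == sizeof bits);
    memcpy(out, &bits, sizeof bits);
}

/* Byte-wise comparison against a reference NA value. */
static int is_na_memcmp(const double *x, const double *na_ref)
{
    return memcmp(x, na_ref, sizeof *x) == 0;
}

/* Low-word test in the style of the R_IsNA shown above:
   the value must be a NaN and its low 32 bits must equal 1954. */
static int is_na_low_word(const double *x)
{
    const uint64_t exp_mask  = UINT64_C(0x7FF0000000000000);
    const uint64_t frac_mask = UINT64_C(0x000FFFFFFFFFFFFF);
    uint64_t bits;
    int is_nan;

    memcpy(&bits, x, sizeof bits);
    is_nan = (bits & exp_mask) == exp_mask && (bits & frac_mask) != 0;
    return is_nan && (uint32_t)(bits & UINT64_C(0xFFFFFFFF)) == 1954u;
}
```

For example, a pattern such as 0x7FF80000000007A2 (same low word, quiet bit set in the high word) passes is_na_low_word() but fails is_na_memcmp() against 0x7FF00000000007A2.  Whether R ever produces such a value is not settled by this sketch, but if it does, the two tests would disagree.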

Is there any guarantee that the different functions in R would return
NA as identical to the bit pattern defined for NA_REAL, for a given
architecture? Similarly for NaN value(s) and R_NaN?

My guess is that it is possible some functions used internally by R
might encode NaN values differently; ie, setting the lower word to a
value different than 1954 (hence being NaN, but potentially not
identical to R_NaN), or perhaps this is architecture-dependent.
However, NA should be one specific bit pattern (?). And, I wonder if
there is any guarantee that the different functions used in R would
return an NaN value as identical to R_NaN (which appears to be the
'IEEE NaN')?

(interested parties can see + run a simple benchmark from the gist at
https://gist.github.com/kevinushey/8911432)

Thanks,
Kevin

______________________________________________
R-devel at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

--
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595
