Surprising behavior of letters[c(NA, NA)]

3 messages · Radford Neal, Duncan Murdoch, William Dunlap

#
Duncan Murdoch writes:

  The relevant quote is in the Language Definition, talking about
  indices by type of index:

  "Logical. The indexing i should generally have the same length as
  x. If it is shorter, then its elements will be recycled as discussed
  in Section 3.3 [Elementary arithmetic operations], page 14. If it is
  longer, then x is conceptually extended with NAs. The selected values
  of x are those for which i is TRUE."

But this certainly does not justify the actual behaviour.  It says
that, for example, (1:3)[NA] should not be a vector of three NAs, but
rather a vector of length zero - since NONE of the indexes are TRUE.
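
A minimal sketch of the discrepancy, runnable in a plain R session. A bare
NA is a logical constant, so it is recycled to the length of x like any
other short logical index. The names x, actual and by_the_text are
illustrative only.

```r
x <- 1:3

# A bare NA is logical, so the length-1 index is recycled to length 3.
actual <- x[NA]
length(actual)        # 3
all(is.na(actual))    # TRUE

# Reading the quoted text literally ("those for which i is TRUE"),
# no element qualifies; which() gives that reading.
by_the_text <- x[which(NA)]
length(by_the_text)   # 0

# The case from the subject line: a length-2 index recycled over 26 letters.
length(letters[c(NA, NA)])   # 26
```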

The actual behaviour of NA in a logical index makes no sense.  It
makes sense that NA in an integer index produces an NA in the result,
since this NA might correctly express the uncertainty in the value at
this position that follows from the uncertainty in the index (and
hence produce sensible results in subsequent operations).  But NA in a
logical index should lead to a result that is of uncertain length.
However, R has no mechanism for expressing such uncertainty, so it
makes more sense that NA in a logical index should produce an error.
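
To make the length argument concrete, here is a small sketch (the vector
and variable names are made up for illustration). With an integer index,
each index yields exactly one result. With a logical index, the NA stands
for "TRUE or FALSE", and the two possibilities give results of different
lengths.

```r
x <- c(10L, 20L, 30L)

# Integer index: two indices, two results; the NA marks an unknown value.
by_int <- x[c(1L, NA_integer_)]      # 10 NA

# Logical index: R emits one NA for the NA position.
by_lgl <- x[c(TRUE, NA, FALSE)]      # 10 NA

# If the NA were really TRUE or really FALSE, the result would be:
if_true  <- x[c(TRUE, TRUE, FALSE)]  # 10 20
if_false <- x[c(TRUE, FALSE, FALSE)] # 10
c(length(if_true), length(if_false)) # 2 1
```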

   Radford Neal
#
On 18/12/2010 9:12 AM, Radford Neal wrote:
I agree that the behaviour is not particularly obvious, but I'm not so 
sure it should produce an error.  We should get an error when the input 
is likely to be accidental or due to a misconception and the output 
could be accepted and lead to wrong results later.  I think using an NA 
in a logical index is probably due to a misconception (e.g. thinking it 
is an NA_integer_), but the results are so weird that they are unlikely 
to pass unnoticed.

And presumably whoever chose this behaviour back in the ancient past 
thought there was some use in including NA in a logical index, and 
someone out there in the real world has made use of it.

But I wouldn't object if R version 3 gave errors for logical index 
vectors that were the wrong length or that contained NAs.

Duncan Murdoch
#
I'm agnostic at this point about the recycling rules
for logical subscripting, but I've been coming around
to thinking that x[logicalSubscript] should only return
the values of x such that the corresponding value
of logicalSubscript is TRUE.  Values in logicalSubscript
of NA and FALSE should be treated the same: the
corresponding value of x should not be put into the
output subset.  I know this change will break low-level
tests, but I suspect it will make more user-written
code start working than start breaking when logical
NA's are used in subscripts.

I've asked people why they use the idiom
    x[which(condition)]
instead of the simpler
    x[condition]
Some new users don't know the simpler one works, but
more experienced users say they use which() because
it treats NA's the same as FALSE's.
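
A short illustration of the difference between the two idioms, using
made-up data (x, condition, direct and via_which are illustrative names):

```r
x <- c(5, NA, 12, 7)
condition <- x > 6               # FALSE NA TRUE TRUE

# Logical subscript: the NA in the condition yields an NA in the result.
direct <- x[condition]           # NA 12 7

# which() returns only the positions that are TRUE, so the NA is dropped.
via_which <- x[which(condition)] # 12 7
```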

I've heard the same response when I ask about using
subset(dataFrame,condition) instead of dataFrame[condition,].
subset() uses non-standard evaluation rules that
lead to convoluted code involving substitute() and the
like when you want to use it in a general function,
but people use it in part because it treats logical NA's
as though they were FALSE's.
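
The same contrast for data frames, sketched with a small invented data
frame (df, by_bracket and by_subset are illustrative names):

```r
df <- data.frame(id = 1:4, score = c(5, NA, 12, 7))
condition <- df$score > 6              # FALSE NA TRUE TRUE

# Bracket indexing keeps a row for the NA, filled entirely with NAs.
by_bracket <- df[condition, ]
nrow(by_bracket)                       # 3
is.na(by_bracket$id[1])                # TRUE

# subset() treats the NA as FALSE and drops that row.
by_subset <- subset(df, score > 6)
nrow(by_subset)                        # 2
```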

I sometimes use an is.true() function in subscript expressions
   is.true <- function(x) !is.na(x) & x
   vec[is.true(condition)]  
but it seems like a waste of time (and it gets confused
with the isTRUE() function).
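
For reference, a sketch of how the helper above differs from the built-in
isTRUE(): is.true() works element by element, while isTRUE() answers a
single yes/no question about its whole argument. The condition vector and
result names are invented for the example.

```r
is.true <- function(x) !is.na(x) & x   # the helper defined above

condition <- c(TRUE, NA, FALSE, TRUE)

# Element-wise: NA becomes FALSE, so the result is safe to subscript with.
elementwise <- is.true(condition)      # TRUE FALSE FALSE TRUE

# isTRUE() is TRUE only for a single, non-NA TRUE value.
whole <- isTRUE(condition)             # FALSE
single <- isTRUE(TRUE)                 # TRUE
```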

A separate but related point is that using logical NA as a
subscript to a vector with names gives a nonoptimal result:
  > c(one=1,two=2,three=3,four=4)[c(TRUE,NA,FALSE,TRUE)]
   one <NA> four 
     1   NA    4  
Why isn't the second element of the result called "two"?
As it stands we only know that there was an NA in the subscript,
somewhere between the first and fourth element.

An unrelated point concerns the builtin constants NA, NA_integer_,
NA_real_, etc, where all modes of NA's are generally printed as
just NA.  This seems to lead to misleading tests, like
   x[NA] # integer or logical NA?
but when analyzing data I think you rarely use a typed-in NA
in an expression.  The NA's generally come from data you have
read in from an external source (where NA is rarely used to
indicate missing values) and you always have to make sure
the data columns of imported data have the expected types.
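
A brief sketch of why the printed form is ambiguous: the two constants
print identically but have different types, and as subscripts they give
results of different lengths. The variable names are illustrative.

```r
x <- 1:3

print(NA)              # prints NA
print(NA_integer_)     # also prints NA

type_plain <- typeof(NA)           # "logical"
type_int   <- typeof(NA_integer_)  # "integer"

len_lgl <- length(x[NA])           # 3: logical NA is recycled over x
len_int <- length(x[NA_integer_])  # 1: one integer index, one result
```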

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com