Surprising behavior of letters[c(NA, NA)]

I'm agnostic at this point about the recycling rules
for logical subscripting, but I've been coming around
to thinking that x[logicalSubscript] should only return
the values of x such that the corresponding value
of logicalSubscript is TRUE.  Values in logicalSubscript
of NA and FALSE should be treated the same: the
corresponding value of x should not be put into the
output subset.  I know this change would break low-level
tests, but I suspect it would make more user-written
code start working than it would break when logical
NA's are used in subscripts.

I've asked people why they use the idiom
    x[which(condition)]
instead of the simpler
    x[condition]
Some new users don't know the simpler one works, but
more experienced users say they use which() because
it treats NA's the same as FALSE's.
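
A minimal sketch of the difference between the two idioms, using
a made-up vector and condition (the names x and condition are
only for illustration):

```r
x <- c(10, 20, 30, 40)
condition <- c(TRUE, NA, FALSE, TRUE)

# Logical subscript: the NA in the subscript yields an NA in the result.
direct <- x[condition]

# which() returns only the positions that are TRUE, so NA acts like FALSE.
viaWhich <- x[which(condition)]

print(direct)    # 10 NA 40
print(viaWhich)  # 10 40
```

The extra which() call is the price currently paid for dropping
the NA positions.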

I've heard the same response when I ask about using
subset(dataFrame,condition) instead of dataFrame[condition,].
subset() uses non-standard evaluation rules that
lead to convoluted code involving substitute() and the
like when you want to use it in a general function,
but people use it in part because it treats logical NA's
as though they were FALSE's.
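
The same contrast for data frames can be sketched as follows; the
data frame and its column names are invented for the example:

```r
dataFrame <- data.frame(id = 1:4, score = c(5, NA, 2, 9))

# score > 4 is TRUE, NA, FALSE, TRUE.
# Logical row subscript: the NA produces an extra row filled with NA's.
byBracket <- dataFrame[dataFrame$score > 4, ]

# subset() treats the NA in the condition as FALSE and drops that row.
bySubset <- subset(dataFrame, score > 4)

print(nrow(byBracket))  # 3
print(nrow(bySubset))   # 2
```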

I sometimes use an is.true() function in subscript expressions
   is.true <- function(x) !is.na(x) & x
   vec[is.true(condition)]  
but it seems like a waste of time (and it gets confused
with the isTRUE() function).

A separate but related point is that using logical NA as a
subscript to a vector with names gives a nonoptimal result:
  > c(one=1,two=2,three=3,four=4)[c(TRUE,NA,FALSE,TRUE)]
   one <NA> four 
     1   NA    4  
Why isn't the second element of the result called "two"?
As it stands we only know that there was an NA in the subscript,
somewhere between the first and fourth elements.

An unrelated point concerns the builtin constants NA, NA_integer_,
NA_real_, etc, where all modes of NA's are generally printed as
just NA.  This seems to lead to misleading tests, like
   x[NA] # integer or logical NA?
but when analyzing data I think you rarely use a typed-in NA
in an expression.  The NA's generally come from data you have
read in from an external source (where the literal text NA is rarely
used to indicate missing values) and you always have to make sure
the data columns of imported data have the expected types.
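
For reference, a short sketch of how the type of a typed-in NA
changes the result, using the builtin letters vector from the
subject line (the variable names are only for illustration):

```r
# A bare NA is logical, so it is recycled to the length of letters.
logicalNA <- letters[NA]
twoLogicalNA <- letters[c(NA, NA)]

# An integer NA is a positional subscript, so it is not recycled.
integerNA <- letters[NA_integer_]
twoIntegerNA <- letters[c(NA_integer_, NA_integer_)]

print(typeof(NA))            # "logical"
print(length(logicalNA))     # 26
print(length(twoLogicalNA))  # 26
print(length(integerNA))     # 1
print(length(twoIntegerNA))  # 2
```

All four results print as nothing but NA's, so only their lengths
show which kind of NA was used.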

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com