table(exclude = NULL) always includes NA

3 messages · Suharto Anggono Suharto Anggono, Martin Maechler

Sun, Aug 7, 2016 8:32 AM

This is an example from https://stat.ethz.ch/pipermail/r-help/2007-May/132573.html .

With R 2.7.2:

a <- c(1, 1, 2, 2, NA, 3); b <- c(2, 1, 1, 1, 1, 1)
table(a, b, exclude = NULL)

      b
a      1 2
  1    1 1
  2    2 0
  3    1 0
  <NA> 1 0

With R 3.3.1:

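a <- c(1, 1, 2, 2, NA, 3); b <- c(2, 1, 1, 1, 1, 1)
table(a, b, exclude = NULL)

      b
a      1 2 <NA>
  1    1 1    0
  2    2 0    0
  3    1 0    0
  <NA> 1 0    0

table(a, b, exclude = NULL, useNA = "ifany")

      b
a      1 2 <NA>
  1    1 1    0
  2    2 0    0
  3    1 0    0
  <NA> 1 0    0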

For the example, in R 3.3.1, the result of 'table' with exclude = NULL includes NA even if NA is not present. This differs from R 2.7.2, where the result comes from factor(exclude = NULL), which includes NA only if NA is present.

From R 3.3.1 help on 'table', in "Details" section:

'useNA' controls if the table includes counts of 'NA' values: the allowed values correspond to never, only if the count is positive and even for zero counts.  This is overridden by specifying 'exclude = NULL'.

Specifying 'exclude = NULL' overrides 'useNA' to what value? The documentation doesn't say. Looking at the code of function 'table', the value is "always".

For the example, in R 3.3.1, the result like in R 2.7.2 can be obtained with useNA = "ifany" and 'exclude' unspecified.
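A minimal sketch of that last point, using the same 'a' and 'b' as in the example and only the documented 'useNA' argument of 'table'. With useNA = "ifany", an NA entry should appear only in the dimension that actually contains NA:

```r
a <- c(1, 1, 2, 2, NA, 3)
b <- c(2, 1, 1, 1, 1, 1)

# useNA = "ifany": an NA level is added per dimension only if NA occurs there
tab <- table(a, b, useNA = "ifany")

rownames(tab)  # 'a' contains NA, so the last row name is NA
colnames(tab)  # 'b' has no NA, so only "1" and "2"
```

Checking the dimnames rather than the printed table makes the comparison independent of how a particular R version treats exclude = NULL.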


The result of 'summary' of a logical vector is affected. As mentioned in http://stackoverflow.com/questions/26775501/r-dropping-nas-in-logical-column-levels , in the code of function 'summary.default', for logical, table(object, exclude = NULL) is used.

With R 2.7.2:

log <- c(NA, logical(4), NA, !logical(2), NA)
summary(log)

   Mode   FALSE    TRUE    NA's
logical       4       2       3

summary(log[!is.na(log)])

   Mode   FALSE    TRUE
logical       4       2

summary(TRUE)

   Mode    TRUE
logical       1

With R 3.3.1:

log <- c(NA, logical(4), NA, !logical(2), NA)
summary(log)

   Mode   FALSE    TRUE    NA's
logical       4       2       3

summary(log[!is.na(log)])

   Mode   FALSE    TRUE    NA's
logical       4       2       0

summary(TRUE)

   Mode    TRUE    NA's
logical       1       0

In R 3.3.1, "NA's" is always in the result of 'summary' of a logical vector. This is unlike 'summary' of a numeric vector.
On the other hand, in R 3.3.1, FALSE is not in the result of 'summary' of a logical vector that doesn't contain FALSE.

I prefer the result of 'summary' of a logical vector like in R 2.7.2, or, alternatively, the result that always includes all possible values: FALSE, TRUE, NA.

1 day later

Martin Maechler

Tue, Aug 9, 2016 6:35 AM

I agree that this (R 3.3.1 behavior) seems undesirable and looks
wrong, and the old (<= 2.7.2) behavior for  table(a,b,
exclude=NULL) seems desirable to me.

Yes, it should be documented what happens for this case,
(but read on ...)

Yes.  What should we do?
I currently think that we'd want to change the line

     useNA <- if (!missing(exclude) && is.null(exclude)) "always"

to

     useNA <- if (!missing(exclude) && is.null(exclude)) "ifany" # was "always"


which would not even contradict documentation, as indeed you
mentioned above, the exact action here had not been documented.

The change above at least does not break any of the standard R
tests ('make check-all', i.e., including the recommended
packages), which for me confirms that it may be "what is
best"...

----

Thank you for mentioning the important consequence for summary(<logical>).
It can help give insight into what a "probably best" behavior should
be for these cases of table().

Martin Maechler,
ETH Zurich

I tend to agree, and strongly prefer the 'R(<=2.7.2)'-behavior
for table() {and hence summary(<logical>)}.

1 day later

Martin Maechler

Wed, Aug 10, 2016 10:39 AM

Martin Maechler <maechler at stat.math.ethz.ch>
    on Tue, 9 Aug 2016 15:35:41 +0200 writes:

Suharto Anggono Suharto Anggono via R-devel <r-devel at r-project.org>
    on Sun, 7 Aug 2016 15:32:19 +0000 writes:

This is an example from https://stat.ethz.ch/pipermail/r-help/2007-May/132573.html .

With R 2.7.2:

a <- c(1, 1, 2, 2, NA, 3); b <- c(2, 1, 1, 1, 1, 1)
table(a, b, exclude = NULL)

      b
a      1 2
  1    1 1
  2    2 0
  3    1 0
  <NA> 1 0

With R 3.3.1:

a <- c(1, 1, 2, 2, NA, 3); b <- c(2, 1, 1, 1, 1, 1)
table(a, b, exclude = NULL)

      b
a      1 2 <NA>
  1    1 1    0
  2    2 0    0
  3    1 0    0
  <NA> 1 0    0

table(a, b, useNA = "ifany")

table(a, b, exclude = NULL, useNA = "ifany")

      b
a      1 2 <NA>
  1    1 1    0
  2    2 0    0
  3    1 0    0
  <NA> 1 0    0

For the example, in R 3.3.1, the result of 'table' with
exclude = NULL includes NA even if NA is not present. This
differs from R 2.7.2, where the result comes from
factor(exclude = NULL), which includes NA only if NA is present.

I agree that this (R 3.3.1 behavior) seems undesirable and looks
wrong, and the old (<= 2.7.2) behavior for  table(a,b,
exclude=NULL) seems desirable to me.

From R 3.3.1 help on 'table', in "Details" section:

'useNA' controls if the table includes counts of 'NA' values: the allowed values correspond to never, only if the count is positive and even for zero counts.  This is overridden by specifying 'exclude = NULL'.

Specifying 'exclude = NULL' overrides 'useNA' to what value? The documentation doesn't say. Looking at the code of function 'table', the value is "always".

Yes, it should be documented what happens for this case,
(but read on ...)

and it is *not* true that the documentation does not say: since
2013, it has contained

 exclude: levels to remove for all factors in '...'.  If set to 'NULL',
          it implies 'useNA = "always"'.  See 'Details' for its
          interpretation for non-factor arguments.

The last part of my previous message ("which would not even contradict
documentation ...") is wrong, as mentioned earlier.

The change I proposed (replacing "always" by "ifany" in that line)
entails behaviour which looks better to me; however, the change *is*
"against the current documentation", and after experimentation (a
"complete factorial design" of argument settings), I'm not entirely
happy with the result. One reason is that 'exclude = NULL' and (e.g.)
'exclude = c()' are (still) handled differently. Under the usual
interpretation, both should mean
  "do not exclude any factor entries (and levels) from tabulation"
but one of the two changes the default of 'useNA' and the other
does not.  If we want a change anyway (and have to update the doc),
it could be "more logical" to replace that line by

   useNA <- if (missing(useNA) && !missing(exclude) && !(NA %in% exclude)) "always"

notably, replacing 'useNA' *only* if it has not been specified,
which seems much closer to "typically expected" behavior.
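To make the effect of that condition easier to see, here is a rough sketch that isolates just the defaulting rule in a standalone function. The name 'resolve_useNA' is hypothetical and not part of base R, and the 'else match.arg(useNA)' branch is an assumption about how the surrounding code in 'table' falls back to the user's choice:

```r
# Hypothetical helper, not part of base R: mimics only the proposed
# defaulting rule for 'useNA', outside of table() itself.
resolve_useNA <- function(exclude = c(NA, NaN),
                          useNA = c("no", "ifany", "always")) {
  if (missing(useNA) && !missing(exclude) && !(NA %in% exclude)) {
    # 'exclude' was given and does not remove NA, 'useNA' was not given
    "always"
  } else {
    # assumed fallback: whatever the caller asked for (default "no")
    match.arg(useNA)
  }
}

resolve_useNA()                                 # "no"
resolve_useNA(exclude = NULL)                   # "always"
resolve_useNA(exclude = NA)                     # "no"
resolve_useNA(exclude = NULL, useNA = "ifany")  # "ifany"
```

The last call is the interesting one: an explicitly specified 'useNA' is no longer overridden by 'exclude = NULL'.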

The change above at least does not break any of the standard R
tests ('make check-all', i.e., including the recommended
packages), which for me confirms that it may be "what is
best"...

----

Thank you for mentioning the important consequence for summary(<logical>).
It can help give insight into what a "probably best" behavior should
be for these cases of table().

Martin Maechler,
ETH Zurich

The result of 'summary' of a logical vector is affected. As mentioned in http://stackoverflow.com/questions/26775501/r-dropping-nas-in-logical-column-levels , in the code of function 'summary.default', for logical, table(object, exclude = NULL) is used.

With R 2.7.2:

log <- c(NA, logical(4), NA, !logical(2), NA)
summary(log)

   Mode   FALSE    TRUE    NA's
logical       4       2       3

summary(log[!is.na(log)])

   Mode   FALSE    TRUE
logical       4       2

summary(TRUE)

   Mode    TRUE
logical       1

With R 3.3.1:

log <- c(NA, logical(4), NA, !logical(2), NA)
summary(log)

   Mode   FALSE    TRUE    NA's
logical       4       2       3

summary(log[!is.na(log)])

   Mode   FALSE    TRUE    NA's
logical       4       2       0

summary(TRUE)

   Mode    TRUE    NA's
logical       1       0

In R 3.3.1, "NA's" is always in the result of 'summary' of a logical vector. This is unlike 'summary' of a numeric vector.
On the other hand, in R 3.3.1, FALSE is not in the result of 'summary' of a logical vector that doesn't contain FALSE.

I prefer the result of 'summary' of a logical vector like in R 2.7.2, or, alternatively, the result that always includes all possible values: FALSE, TRUE, NA.
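The second alternative (always reporting FALSE, TRUE and NA) could be sketched roughly as follows. 'count_logical' is a hypothetical helper for illustration, not something in base R or in 'summary.default':

```r
# Hypothetical helper (not part of base R): count FALSE, TRUE and NA
# in a logical vector, always reporting all three.
count_logical <- function(x) {
  stopifnot(is.logical(x))
  # Fixing the levels keeps FALSE/TRUE even when absent;
  # useNA = "always" keeps the NA count even when it is zero.
  table(factor(x, levels = c(FALSE, TRUE)), useNA = "always")
}

lgl <- c(NA, logical(4), NA, !logical(2), NA)
count_logical(lgl)   # counts for FALSE, TRUE, <NA>
count_logical(TRUE)  # FALSE and <NA> are reported with count 0
```

Converting to a factor with fixed levels is what keeps the absent FALSE in the result; 'table' on the bare logical vector would drop it.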

I tend to agree, and strongly prefer the 'R(<=2.7.2)'-behavior
for table() {and hence summary(<logical>)}.