
table(exclude = NULL) always includes NA

4 messages · Suharto Anggono Suharto Anggono, Martin Maechler

#
useNA <- if (missing(useNA) && !missing(exclude) && !(NA %in% exclude)) "ifany"

An example where it changes the 'table' result for non-factor input, from https://stat.ethz.ch/pipermail/r-help/2005-April/069053.html :
x <- c(1,2,3,3,NA)
table(as.integer(x), exclude=NaN)

I bring the example up, in case that the change in result is not intended.
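
The rule quoted at the top of this message can be tried out on its own. The following is only a sketch: 'effective_useNA' is a hypothetical helper, not part of base R, and its default for 'exclude' is simplified compared with the real table() signature.

```r
# Hypothetical helper (not in base R): mimics how the quoted line picks
# the effective 'useNA'. The default for 'exclude' is a simplification.
effective_useNA <- function(exclude = c(NA, NaN),
                            useNA = c("no", "ifany", "always")) {
  if (missing(useNA) && !missing(exclude) && !(NA %in% exclude)) {
    "ifany"            # 'exclude' given without NA, 'useNA' not given
  } else {
    match.arg(useNA)   # otherwise the user's choice, or the default "no"
  }
}

effective_useNA()                                  # "no"
effective_useNA(exclude = NULL)                    # "ifany"
effective_useNA(exclude = NaN)                     # "ifany"
effective_useNA(exclude = NA)                      # "no"
effective_useNA(exclude = NULL, useNA = "always")  # "always"
```

The third call is the case of the example above: NaN does not match NA in '%in%', so 'exclude = NaN' switches the default to "ifany".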
--------------------------------------------
On Sat, 13/8/16, Martin Maechler <maechler at stat.math.ethz.ch> wrote:
Subject: Re: [Rd] table(exclude = NULL) always includes NA
 To: "Martin Maechler" <maechler at stat.math.ethz.ch>

@r-project.org
 Date: Saturday, 13 August, 2016, 4:29 AM
>> I stand corrected. The part "If set to 'NULL', it implies
    >> 'useNA="always"'." is even in the documentation in R
    >> 2.8.0. It was my fault not to check carefully.  I wonder,
    >> why "always" was chosen for 'useNA' for exclude = NULL.

    > me too.  "ifany" would seem more logical, and I am
    > considering changing to that as a 2nd step (if the 1st
    > step, below) shows to be feasible.

    >> Why exclude = NULL is so special? What about another
    >> 'exclude' of length zero, like character(0) (not c(),
    >> because c() is NULL)? I thought that, too. But then, I
    >> have no opinion about making it general.

    > As mentioned, I entirely agree with that {and you are
    > right about c() !!}.

    >> It fits my expectation to override 'useNA' only if the
    >> user doesn't explicitly specify 'useNA'.

    >> Thank you for looking into this.

    > you are welcome.  As first step, I plan to commit the
    > change to (*)

    >  useNA <- if (missing(useNA) && !missing(exclude) && !(NA
    > %in% exclude)) "always"

    > as proposed yesterday, and I'll eventually see / be
    > notified about the effect in CRAN space.

and as I'm finding now,  20 minutes too late,   doing step 1
without doing step 2  is not feasible.
It gives many  0 counts for <NA>  e.g. for  exclude = "foo".
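
The difference between the two settings at stake in steps 1 and 2 can be seen by passing 'useNA' explicitly, which sidesteps the default rule under discussion. A small sketch with made-up data:

```r
x <- c("a", "b", "b")                    # no NA present

t_always <- table(x, useNA = "always")   # always reports an <NA> entry
t_ifany  <- table(x, useNA = "ifany")    # reports <NA> only if one occurs

names(t_always)   # "a" "b" NA, the NA entry having count 0
names(t_ifany)    # "a" "b"
```

With "always" as the implied default, every call such as exclude = "foo" would pick up a zero <NA> count of this kind.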



    > --
    > (*) slightly more efficiently, I'll be using match()
    > directly instead of %in%

    >> My points: Could R 2.7.2 behavior of table(<non-factor>,
    >> exclude = NULL) be brought back? But R 3.3.1 behavior is
    >> in R since version 2.8.0, rather long.

    > you are right... but then, the places / cases where the
    > behavior would change back should be quite rare.

    >> If not, I suggest changing summary(<logical>).
    >> --------------------------------------------

    > Thank you for your feedback, Suharto!  Martin

    >> On Thu, 11/8/16, Martin Maechler
>> <maechler at stat.math.ethz.ch> wrote:
>> 
    >> Subject: Re: [Rd] table(exclude = NULL) always includes
    >> NA
    >> 
    >> @r-project.org Cc: "Martin Maechler"
    >> <maechler at stat.math.ethz.ch> Date: Thursday, 11 August,
    >> 2016, 12:39 AM
    >> 
    >> >>>>> Martin Maechler <maechler at stat.math.ethz.ch> >>>>>
    >> on Tue, 9 Aug 2016 15:35:41 +0200 writes:
    >> 
    >> >>>>> Suharto Anggono Suharto Anggono via R-devel
    >> <r-devel at r-project.org> >>>>> on Sun, 7 Aug 2016 15:32:19
    >> +0000 writes:
    >> 
    >> > > This is an example from
    >> https://stat.ethz.ch/pipermail/r-help/2007-May/132573.html
    >> .
    >> > 
    >> > > With R 2.7.2:
    >> > 
    >> > > > a <- c(1, 1, 2, 2, NA, 3); b <- c(2, 1, 1, 1, 1, 1)
    >> > > > table(a, b, exclude = NULL)
    >> > >       b
    >> > > a      1 2
    >> > >   1    1 1
    >> > >   2    2 0
    >> > >   3    1 0
    >> > >   <NA> 1 0
    >> > 
    >> > > With R 3.3.1:
    >> > 
    >> > > > a <- c(1, 1, 2, 2, NA, 3); b <- c(2, 1, 1, 1, 1, 1)
    >> > > > table(a, b, exclude = NULL)
    >> > >       b
    >> > > a      1 2 <NA>
    >> > >   1    1 1    0
    >> > >   2    2 0    0
    >> > >   3    1 0    0
    >> > >   <NA> 1 0    0
    >> > > > table(a, b, useNA = "ifany")
    >> > >       b
    >> > > a      1 2
    >> > >   1    1 1
    >> > >   2    2 0
    >> > >   3    1 0
    >> > >   <NA> 1 0
    >> > > > table(a, b, exclude = NULL, useNA = "ifany")
    >> > >       b
    >> > > a      1 2 <NA>
    >> > >   1    1 1    0
    >> > >   2    2 0    0
    >> > >   3    1 0    0
    >> > >   <NA> 1 0    0
    >> > 
    >> > > For the example, in R 3.3.1, the result of 'table' with
    >> > > exclude = NULL includes NA even if NA is not present. This
    >> > > differs from R 2.7.2, where the result comes from
    >> > > factor(exclude = NULL), which includes NA only if NA is present.
    >> > 
    >> > I agree that this (R 3.3.1 behavior) seems undesirable
    >> > and looks wrong, and the old (<= 2.7.2) behavior for
    >> > table(a,b, exclude=NULL) seems desirable to me.
    >> > 
    >> > 
    >> > > From R 3.3.1 help on 'table', in "Details" section:
    >> > > 'useNA' controls if the table includes counts of 'NA'
    >> values: the allowed values correspond to never, only if
    >> the count is positive and even for zero counts.  This is
    >> overridden by specifying 'exclude = NULL'.
    >> > 
    >> > > Specifying 'exclude = NULL' overrides 'useNA' to what
    >> value? The documentation doesn't say. Looking at the code
    >> of function 'table', the value is "always".
    >> > 
    >> > Yes, it should be documented what happens for this
    >> case, > (but read on ...)
    >> 
    >> and it is *not* true that the documentation does not say,
    >> since 2013, it has contained
    >> 
    >> exclude: levels to remove for all factors in '...'.  If
    >> set to 'NULL', it implies 'useNA = "always"'.  See
    >> 'Details' for its interpretation for non-factor
    >> arguments.
    >> 
    >> 
    >> > > For the example, in R 3.3.1, the result like in R
    >> 2.7.2 can be obtained with useNA = "ifany" and 'exclude'
    >> unspecified.
    >> > 
    >> > Yes.  What should we do?
    >> > I currently think that we'd want to change the line
    >> > 
    >> > useNA <- if (!missing(exclude) && is.null(exclude))
    >> "always"
    >> > 
    >> > to
    >> > 
    >> > useNA <- if (!missing(exclude) && is.null(exclude))
    >> "ifany" # was "always"
    >> > 
    >> > 
    >> > which would not even contradict documentation, as
    >> > indeed you mentioned above, the exact action here had
    >> > not been documented.
    >> 
    >> The last part ("which ..") above is wrong, as mentioned
    >> earlier.
    >> 
    >> The above change entails behaviour which looks better to
    >> me; however, the change *is* "against the current
    >> documentation".  and after experimentation (a "complete
    >> factorial design" of argument settings), I'm not entirely
    >> happy with the result.... and one reason is that 'exclude
    >> = NULL' and (e.g.)  'exclude = c()' are (still) handled
    >> differently: From a usual interpretation, both should mean
    >> "do not exclude any factor entries (and levels) from
    >> tabulation" but one of the two changes the default of
    >> 'useNA' and the other does not.  If we want a change
    >> anyway (and have to update the doc), it could be "more
    >> logical" to replace the line above by
    >> 
    >> useNA <- if (missing(useNA) && !missing(exclude) && !(NA
    >> %in% exclude)) "always"
    >> 
    >> notably, replacing 'useNA' *only* if it has not been
    >> specified, which seems much closer to "typically
    >> expected" behavior..
    >> 
    >> 
    >> 
    >> 
    >> > The change above at least does not break any of the
    >> > standard R tests ('make check-all', i.e., including the
    >> > recommended packages), which for me confirms that it
    >> > may be "what is best"...
    >> > 
    >> > ----
    >> > 
    >> > Thank you for mentioning the important consequence for
    >> > summary(<logical>).  They can help give insight into what
    >> > a "probably best" behavior should be for these cases of
    >> > table().
    >> > 
    >> > Martin Maechler,
    >> > ETH Zurich
    >> > 
    >> > > The result of 'summary' of a logical vector is
    >> affected. As mentioned in
    >> http://stackoverflow.com/questions/26775501/r-dropping-nas-in-logical-column-levels
    >> , in the code of function 'summary.default', for logical,
    >> table(object, exclude = NULL) is used.
    >> > 
    >> > > With R 2.7.2:
    >> > 
    >> > > > log <- c(NA, logical(4), NA, !logical(2), NA)
    >> > > > summary(log)
    >> > >    Mode   FALSE    TRUE    NA's
    >> > > logical       4       2       3
    >> > > > summary(log[!is.na(log)])
    >> > >    Mode   FALSE    TRUE
    >> > > logical       4       2
    >> > > > summary(TRUE)
    >> > >    Mode    TRUE
    >> > > logical       1
    >> > 
    >> > > With R 3.3.1:
    >> > 
    >> > > > log <- c(NA, logical(4), NA, !logical(2), NA)
    >> > > > summary(log)
    >> > >    Mode   FALSE    TRUE    NA's
    >> > > logical       4       2       3
    >> > > > summary(log[!is.na(log)])
    >> > >    Mode   FALSE    TRUE    NA's
    >> > > logical       4       2       0
    >> > > > summary(TRUE)
    >> > >    Mode    TRUE    NA's
    >> > > logical       1       0
    >> > 
    >> > > In R 3.3.1, "NA's" is always in the result of
    >> > > 'summary' of a logical vector. It is unlike 'summary' of
    >> > > a numeric vector.
    >> > > On the other hand, in R 3.3.1, FALSE is not in the result
    >> > > of 'summary' of a logical vector that doesn't contain FALSE.
    >> > 
    >> > > I prefer the result of 'summary' of a logical vector
    >> like in R 2.7.2, or, alternatively, the result that
    >> always includes all possible values: FALSE, TRUE, NA.
    >> > 
    >> > I tend to agree, and strongly prefer the
    >> > 'R(<=2.7.2)'-behavior for table() {and hence
    >> > summary(<logical>)}.

    >> 

    > ______________________________________________
    > R-devel at r-project.org mailing list
    > https://stat.ethz.ch/mailman/listinfo/r-devel
1 day later
#
> useNA <- if (missing(useNA) && !missing(exclude) && !(NA %in% exclude)) "ifany"
    > An example where it changes the 'table' result for non-factor input, from https://stat.ethz.ch/pipermail/r-help/2005-April/069053.html :

    > x <- c(1,2,3,3,NA)
    > table(as.integer(x), exclude=NaN)

    > I bring the example up, in case that the change in result is not intended.

Thanks a lot, Suharto.

To me, the example is convincing that the changes I committed on
Friday (svn rev 71087 & 71088) are a clear improvement:

(As you surely know, but not all the other readers:)
Before the change, the above example gave *different* results
for  'x'  and  'as.integer(x)', the integer case *not* counting the NAs,
whereas with the change in effect, they are the same:
> x <- as.integer(dx <- c(1,2,3,3,NA))
> table(x, exclude=NaN); table(dx, exclude=NaN)
x
   1    2    3 <NA> 
   1    1    2    1 
dx
   1    2    3 <NA> 
   1    1    2    1
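
The comparison can be checked programmatically. This is a sketch that assumes an R version with the change in effect; in older versions the two results are expected to differ. The remark on as.integer(NaN) is a guess at the cause and is not stated in the thread.

```r
dx <- c(1, 2, 3, 3, NA)   # double vector
x  <- as.integer(dx)      # same values as integer

# as.integer(NaN) is NA, which may be why 'exclude = NaN' used to drop
# the NA of an integer vector (an assumption, not confirmed here).
is.na(as.integer(NaN))

# Compare the bare counts; the dimnames labels ("x" vs "dx") differ anyway.
counts_x  <- as.vector(table(x,  exclude = NaN))
counts_dx <- as.vector(table(dx, exclude = NaN))
identical(counts_x, counts_dx)
```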
--
But the change has affected 6-8 (of the 8000+) CRAN packages
which I am investigating now and probably will be in contact with the
package maintainers after that.

Martin
#
>> useNA <- if (missing(useNA) && !missing(exclude) && !(NA %in% exclude)) "ifany"
    >> An example where it changes the 'table' result for non-factor input, from https://stat.ethz.ch/pipermail/r-help/2005-April/069053.html :

    >> x <- c(1,2,3,3,NA)
    >> table(as.integer(x), exclude=NaN)

    >> I bring the example up, in case that the change in result is not intended.

    > Thanks a lot, Suharto.

    > To me, the example is convincing that the changes I committed on
    > Friday (svn rev 71087 & 71088) are a clear improvement:

    > (As you surely know, but not all the other readers:)
    > Before the change, the above example gave *different* results
    > for  'x'  and  'as.integer(x)', the integer case *not* counting the NAs,
    > whereas with the change in effect, they are the same:

    >> x <- as.integer(dx <- c(1,2,3,3,NA))
    >> table(x, exclude=NaN); table(dx, exclude=NaN)
    > x
    > 1    2    3 <NA> 
    > 1    1    2    1 
    > dx
    > 1    2    3 <NA> 
    > 1    1    2    1 
    >> 

    > --
    > But the change has affected 6-8 (of the 8000+) CRAN packages
    > which I am investigating now and probably will be in contact with the
    > package maintainers after that.

There has been another bug in table(), since the time  'useNA'
was introduced, which gives (in released R, R-patched, or R-devel):

  > table(1:3, exclude = 1, useNA = "ifany")

     2    3 <NA> 
     1    1    1 
  >

and that bug now (in R-devel, after my changes) triggers in
cases it did not previously, notably in
 
    table(1:3, exclude = 1)

which now does set 'useNA = "ifany"' and so gives the same silly
result as the one above.

The reason for this bug is that   addNA(..)  is called (in all R
versions mentioned) in this case, but it should not.
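
The mechanism can be illustrated outside table(). This is only a sketch of the effect described above, not the code used inside table():

```r
f <- factor(1:3, exclude = 1)   # the excluded value 1 turns into NA
is.na(f)                        # TRUE FALSE FALSE
levels(f)                       # "2" "3"

# addNA() cannot tell an NA produced by 'exclude' from a genuine NA,
# so it adds an NA level here even with ifany = TRUE.
g <- addNA(f, ifany = TRUE)
levels(g)                       # "2" "3" NA

# Without any NA in the data, ifany = TRUE adds no level.
h <- addNA(factor(2:3), ifany = TRUE)
levels(h)                       # "2" "3"
```

The excluded 1 therefore ends up counted under <NA>, which matches the result shown above.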

I'm currently testing yet another amendment..

Martin
1 day later
#
>>> useNA <- if (missing(useNA) && !missing(exclude) && !(NA %in% exclude)) "ifany"
    >>> An example where it changes the 'table' result for non-factor input, from https://stat.ethz.ch/pipermail/r-help/2005-April/069053.html :

    >>> x <- c(1,2,3,3,NA)
    >>> table(as.integer(x), exclude=NaN)

    >>> I bring the example up, in case that the change in result is not intended.

    >> Thanks a lot, Suharto.

    >> To me, the example is convincing that the changes I committed on
    >> Friday (svn rev 71087 & 71088) are a clear improvement:

    >> (As you surely know, but not all the other readers:)
    >> Before the change, the above example gave *different* results
    >> for  'x'  and  'as.integer(x)', the integer case *not* counting the NAs,
    >> whereas with the change in effect, they are the same:

    >>> x <- as.integer(dx <- c(1,2,3,3,NA))
    >>> table(x, exclude=NaN); table(dx, exclude=NaN)
    >> x
    >> 1    2    3 <NA> 
    >> 1    1    2    1 
    >> dx
    >> 1    2    3 <NA> 
    >> 1    1    2    1 
    >>> 

    >> --
    >> But the change has affected 6-8 (of the 8000+) CRAN packages
    >> which I am investigating now and probably will be in contact with the
    >> package maintainers after that.

    > There has been another bug in table(), since the time  'useNA'
    > was introduced, which gives (in released R, R-patched, or R-devel):

    >> table(1:3, exclude = 1, useNA = "ifany")

    > 2    3 <NA> 
    > 1    1    1 
    >> 

    > and that bug now (in R-devel, after my changes) triggers in
    > cases it did not previously, notably in
 
    > table(1:3, exclude = 1)

    > which now does set 'useNA = "ifany"' and so gives the same silly
    > result as the one above.

    > The reason for this bug is that   addNA(..)  is called (in all R
    > versions mentioned) in this case, but it should not.

    > I'm currently testing yet another amendment..

which was not sufficient... so I had to do *much* more work.

The result is code which functions -- I hope -- uniformly better
than the current code, but unfortunately, code that is much longer.

After all I came to the conclusion that using addNA() was not
good enough [I did not yet consider *changing* addNA() itself,
even though the only place we use it in R's own packages is
inside table()] and so for now have code in table() that does
the equivalent of addNA() *but* does remember if addNA() did add
an NA level or not.
I have also extended the regression tests considerably,
*and*  example(table)  now reverts to giving output identical to
R 3.3.1 (which it no longer did in R-devel (r 71088)).

I'm still investigating the CRAN package fallout (from the above
change 4 days ago) but plan to commit my (unfortunately
somewhat extensive) changes.

Also, I think this will become the first in this year's R-devel

SIGNIFICANT USER-VISIBLE CHANGES:

  * 'table()' has been amended to be more internally consistent
    and become back compatible to R <= 2.7.2 again.
    Consequently, 'table(1:2, exclude=NULL)' no longer contains
    a zero count for '<NA>', but 'useNA = "always"' continues to
    do so.
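
The two calls named in the entry can be compared directly. A sketch, assuming an R version that already includes this change:

```r
# Assumes an R version containing the change described in the NEWS entry.
t_null   <- table(1:2, exclude = NULL)     # no zero count for <NA>
t_always <- table(1:2, useNA = "always")   # keeps the zero <NA> count

names(t_null)     # "1" "2"
names(t_always)   # "1" "2" NA
```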

--
Martin