Coercion of 'exclude' in function 'factor' (was 'droplevels' inappropriate change) - R-devel

Fri, Sep 2, 2016 9:10 AM #

I am basically fine with the change.

How about using just the following?
    if(!is.character(exclude))
        exclude <- as.vector(exclude, typeof(x)) # may result in NA
    x <- as.character(x)

It looks simpler and is, more or less, equivalent.

In factor.Rd, in description of argument 'exclude', "(when \code{x} is a \code{factor} already)" can be removed.


A larger change that, I think, is reasonable is entirely removing the code
    exclude <- as.vector(exclude, typeof(x)) # may result in NA

The explicit coercion of 'exclude' is not necessary. Function 'factor' works without it.

The coercion of 'exclude' may leads to a surprise because it "may result in NA". Example from https://stat.ethz.ch/pipermail/r-help/2005-April/069053.html :
factor(as.integer(c(1,2,3,3,NA)), exclude=NaN)
excludes NA.

As a bonus, without the coercion of 'exclude', 'exclude' can be a factor if 'x' is a factor. This part of an example in https://stat.ethz.ch/pipermail/r-help/2011-April/276274.html works.
cc <- c("x","y","NA")
ff <- factor(cc)
factor(ff,exclude=ff[3])

However, the coercion of 'exclude' has been in function 'factor' in R "forever".
--------------------------------------------

On Wed, 31/8/16, Martin Maechler <maechler at stat.math.ethz.ch> wrote:

Subject: Re: [Rd] 'droplevels' inappropriate change

 Cc: "Martin Maechler" <maechler at stat.math.ethz.ch>
 Date: Wednesday, 31 August, 2016, 2:51 PM

>> In R devel r71157, 'droplevels' documentation, in "Arguments" section, says this about argument 'exclude'.
    >> passed to factor(); factor levels which should be excluded from the result even if present.  Note that this was implicitly NA in R <= 3.3.1 which did drop NA levels even when present in x, contrary to the documentation.  The current default is compatible with x[ , drop=FALSE].

    >> The part
    >> x[ , drop=FALSE]
    >> should be
    >> x[ , drop=TRUE]

[[elided Yahoo spam]]
    > a "typo" by me. .. fixed now.

    >> Saying that 'exclude' is factor levels is not quite true for NA element. NA may be not an original level, but NA in 'exclude' affects the result.

    >> For a factor 'x', factor(x, exclude = exclude) doesn't really work for excluding in general. See, for example, https://stat.ethz.ch/pipermail/r-help/2005-September/079336.html .
    >> factor(factor(c("a","b","c")), exclude="c")

    >> However, this excludes "2":
    >> factor(factor(2:3), exclude=2)

    >> Rather unexpectedly, this excludes NA:
    >> factor(factor(c("a",NA), exclude=NULL), exclude="c")

    >> For a factor 'x', factor(x, exclude = exclude) can only exclude integer-like or NA levels. An explanation is in https://stat.ethz.ch/pipermail/r-help/2011-April/276274.html .

    > Well, Peter Dalgaard (in that R-devel e-mail, a bit more than 5
    > years ago) is confirming the problem there,  and suggesting (as
    > you, right?) that actually   `factor()` is not behaving
    > correctly here.

    > And your persistence is finally getting close to convince me
    > that it is not just droplevels(), but  factor() itself which
    > needs care here.

    > Interestingly, the following patch *does* pass 'make check-all'
    > (after small change in tests/reg-tests-1b.R which is ok),
    > and leads to behavior which is much closer to the documentation,
    > notably for your two examples above would give what one would
    > expect.

    > (( If the R-Hub would support experiments with branches of R-devel 
    > from R-core members,  I could just create such a branch and R Hub
    > would run 'R CMD check <pkg>'  for thousands of CRAN packages
    > and provide a web page with the *differences* in the package
    > check results ... so we could see ... ))

    > I do agree that we should strongly consider such a change.

as nobody has commented, I've been liberal and have taken these
no comments as consent.

Hence I have committed

------------------------------------------------------------------------
r71178 | maechler | 2016-08-31 09:45:40 +0200 (Wed, 31 Aug 2016) | 1 line
Changed paths:
   M /trunk/doc/NEWS.Rd
   M /trunk/src/library/base/R/factor.R
   M /trunk/src/library/base/man/factor.Rd
   M /trunk/tests/reg-tests-1b.R
   M /trunk/tests/reg-tests-1c.R

factor(x, exclude) more "rational" when x or exclude are character
------------------------------------------------------------------------

which apart from documentation, examples, and regression tests
is just the patch below.

Martin Maechler
ETH Zurich



    > --- factor.R    (revision 71157)
    > +++ factor.R    (working copy)
    > @@ -28,8 +28,12 @@
    > levels <- unique(y[ind])
    > }
    > force(ordered) # check if original x is an ordered factor
    > -    exclude <- as.vector(exclude, typeof(x)) # may result in NA
    > -    x <- as.character(x)
    > +    if(!is.character(x)) {
    > +    if(!is.character(exclude))
    > +        exclude <- as.vector(exclude, typeof(x)) # may result in NA
    > +    x <- as.character(x)
    > +    } else
    > +    exclude <- as.vector(exclude, typeof(x)) # may result in NA
    > ## levels could be a long vectors, but match will not handle that.
    > levels <- levels[is.na(match(levels, exclude))]
    > f <- match(x, levels)
 Delete Reply Reply All Forward Apply

Martin Maechler

Tue, Sep 13, 2016 9:33 AM #

> I am basically fine with the change.
    > How about using just the following?
    > if(!is.character(exclude))
    >   exclude <- as.vector(exclude, typeof(x)) # may result in NA
    > x <- as.character(x)

    > It looks simpler and is, more or less, equivalent.

yes, but the current code should be slightly faster..

    > In factor.Rd, in description of argument 'exclude', "(when \code{x} is a \code{factor} already)" can be removed.


    > A larger change that, I think, is reasonable is entirely removing the code
    > exclude <- as.vector(exclude, typeof(x)) # may result in NA

    > The explicit coercion of 'exclude' is not necessary. 
    > Function 'factor' works without it.

    > The coercion of 'exclude' may lead to a surprise because it "may result in NA". 
    > Example from https://stat.ethz.ch/pipermail/r-help/2005-April/069053.html :
    >    factor(as.integer(c(1,2,3,3,NA)), exclude=NaN)
    > excludes NA.

    > As a bonus, without the coercion of 'exclude', 'exclude' can be a factor if 'x' is a factor. This part of an example in https://stat.ethz.ch/pipermail/r-help/2011-April/276274.html works.
    > cc <- c("x","y","NA")
    > ff <- factor(cc)
    > factor(ff,exclude=ff[3])

Yes, two good reasons for a change.  factor() would finally
behave according to the documentation which has been mentioning
that 'exclude' could be a factor: ((Until my R-devel changes of a
few weeks ago, i.e. at least in all recent released versions of R)),
the help page for factor has said

    || If 'exclude' is used it should either be a factor with the same
    || level set as 'x' or a set of codes for the levels to be excluded.

  > However, the coercion of 'exclude' has been in function 'factor' in R "forever".

Indeed: On March 6, 1998, svn rev. 845, when the R source files got a
'.R' appended, and quite a long time before  R 1.0.0,
the factor function was as short as (but using an .Internal(.) !)

function (x, levels = sort(unique(x), na.last = TRUE), labels, exclude = NA, 
	ordered = FALSE) 
{
	if (length(x) == 0) 
		return(character(0))
	exclude <- as.vector(exclude, typeof(x))
	levels <- levels[is.na(match(levels, exclude))]
	x <- .Internal(factor(match(x, levels), length(levels), 
		ordered))
	if (missing(labels)) 
		levels(x) <- levels
	else levels(x) <- labels
	x
}

and already contained that line.

Nevertheless, simplifying factor() by removing that line (or those
2 lines, now!) seems to only have advantages....

I'm not yet committing to anything, but currently would strongly
consider it .. though *after* the
	   '<length-1-array>  OP  <non-array>'
issue has settled a bit.

Martin

    > --------------------------------------------

> On Wed, 31/8/16, Martin Maechler <maechler at stat.math.ethz.ch> wrote:

> Subject: Re: [Rd] 'droplevels' inappropriate change

    > Cc: "Martin Maechler" <maechler at stat.math.ethz.ch>
    > Date: Wednesday, 31 August, 2016, 2:51 PM

>>> In R devel r71157, 'droplevels' documentation, in "Arguments" section, says this about argument 'exclude'.
    >>> passed to factor(); factor levels which should be excluded from the result even if present.  Note that this was implicitly NA in R <= 3.3.1 which did drop NA levels even when present in x, contrary to the documentation.  The current default is compatible with x[ , drop=FALSE].

    >>> The part
    >>> x[ , drop=FALSE]
    >>> should be
    >>> x[ , drop=TRUE]

    > [[elided Yahoo spam]]
    >> a "typo" by me. .. fixed now.

    >>> Saying that 'exclude' is factor levels is not quite true for NA element. NA may be not an original level, but NA in 'exclude' affects the result.

    >>> For a factor 'x', factor(x, exclude = exclude) doesn't really work for excluding in general. See, for example, https://stat.ethz.ch/pipermail/r-help/2005-September/079336.html .
    >>> factor(factor(c("a","b","c")), exclude="c")

    >>> However, this excludes "2":
    >>> factor(factor(2:3), exclude=2)

    >>> Rather unexpectedly, this excludes NA:
    >>> factor(factor(c("a",NA), exclude=NULL), exclude="c")

    >>> For a factor 'x', factor(x, exclude = exclude) can only exclude integer-like or NA levels. An explanation is in https://stat.ethz.ch/pipermail/r-help/2011-April/276274.html .

    >> Well, Peter Dalgaard (in that R-devel e-mail, a bit more than 5
    >> years ago) is confirming the problem there,  and suggesting (as
    >> you, right?) that actually   `factor()` is not behaving
    >> correctly here.

    >> And your persistence is finally getting close to convince me
    >> that it is not just droplevels(), but  factor() itself which
    >> needs care here.

    >> Interestingly, the following patch *does* pass 'make check-all'
    >> (after small change in tests/reg-tests-1b.R which is ok),
    >> and leads to behavior which is much closer to the documentation,
    >> notably for your two examples above would give what one would
    >> expect.

    >> (( If the R-Hub would support experiments with branches of R-devel 
    >> from R-core members,  I could just create such a branch and R Hub
    >> would run 'R CMD check <pkg>'  for thousands of CRAN packages
    >> and provide a web page with the *differences* in the package
    >> check results ... so we could see ... ))

    >> I do agree that we should strongly consider such a change.

    > as nobody has commented, I've been liberal and have taken these
    > no comments as consent.

    > Hence I have committed

    > ------------------------------------------------------------------------
    > r71178 | maechler | 2016-08-31 09:45:40 +0200 (Wed, 31 Aug 2016) | 1 line
    > Changed paths:
    > M /trunk/doc/NEWS.Rd
    > M /trunk/src/library/base/R/factor.R
    > M /trunk/src/library/base/man/factor.Rd
    > M /trunk/tests/reg-tests-1b.R
    > M /trunk/tests/reg-tests-1c.R

    > factor(x, exclude) more "rational" when x or exclude are character
    > ------------------------------------------------------------------------

    > which apart from documentation, examples, and regression tests
    > is just the patch below.

    > Martin Maechler
    > ETH Zurich



    >> --- factor.R    (revision 71157)
    >> +++ factor.R    (working copy)
    >> @@ -28,8 +28,12 @@
    >> levels <- unique(y[ind])
    >> }
    >> force(ordered) # check if original x is an ordered factor
    >> -    exclude <- as.vector(exclude, typeof(x)) # may result in NA
    >> -    x <- as.character(x)
    >> +    if(!is.character(x)) {
    >> +    if(!is.character(exclude))
    >> +        exclude <- as.vector(exclude, typeof(x)) # may result in NA
    >> +    x <- as.character(x)
    >> +    } else
    >> +    exclude <- as.vector(exclude, typeof(x)) # may result in NA
    >> ## levels could be a long vectors, but match will not handle that.
    >> levels <- levels[is.na(match(levels, exclude))]
    >> f <- match(x, levels)
    > Delete Reply Reply All Forward Apply

    > ______________________________________________
    > R-devel at r-project.org mailing list
    > https://stat.ethz.ch/mailman/listinfo/r-devel

Martin Maechler

Fri, Sep 30, 2016 5:51 AM #

>> I am basically fine with the change.
    >> How about using just the following?
    >> if(!is.character(exclude))
    >> exclude <- as.vector(exclude, typeof(x)) # may result in NA
    >> x <- as.character(x)

    >> It looks simpler and is, more or less, equivalent.

    > yes, but the current code should be slightly faster..

    >> In factor.Rd, in description of argument 'exclude', "(when \code{x} is a \code{factor} already)" can be removed.


    >> A larger change that, I think, is reasonable is entirely removing the code
    >> exclude <- as.vector(exclude, typeof(x)) # may result in NA

    >> The explicit coercion of 'exclude' is not necessary. 
    >> Function 'factor' works without it.

    >> The coercion of 'exclude' may lead to a surprise because it "may result in NA". 
    >> Example from https://stat.ethz.ch/pipermail/r-help/2005-April/069053.html :
    >> factor(as.integer(c(1,2,3,3,NA)), exclude=NaN)
    >> excludes NA.

    >> As a bonus, without the coercion of 'exclude', 'exclude' can be a factor if 'x' is a factor. This part of an example in https://stat.ethz.ch/pipermail/r-help/2011-April/276274.html works.
    >> cc <- c("x","y","NA")
    >> ff <- factor(cc)
    >> factor(ff,exclude=ff[3])

    > Yes, two good reasons for a change.  factor() would finally
    > behave according to the documentation which has been mentioning
    > that 'exclude' could be a factor: ((Until my R-devel changes of a
    > few weeks ago, i.e. at least in all recent released versions of R)),
    > the help page for factor has said

    > || If 'exclude' is used it should either be a factor with the same
    > || level set as 'x' or a set of codes for the levels to be excluded.

    >> However, the coercion of 'exclude' has been in function 'factor' in R "forever".

    > Indeed: On March 6, 1998, svn rev. 845, when the R source files got a
    > '.R' appended, and quite a long time before  R 1.0.0,
    > the factor function was as short as (but using an .Internal(.) !)

    > function (x, levels = sort(unique(x), na.last = TRUE), labels, exclude = NA, 
    > 	ordered = FALSE) 
    > {
    > 	if (length(x) == 0) 
    > 		return(character(0))
    > 	exclude <- as.vector(exclude, typeof(x))
    > 	levels <- levels[is.na(match(levels, exclude))]
    > 	x <- .Internal(factor(match(x, levels), length(levels), 
    > 		ordered))
    > 	if (missing(labels)) 
    > 		levels(x) <- levels
    > 	else levels(x) <- labels
    > 	x
    > }

    > and already contained that line.

    > Nevertheless, simplifying factor() by removing that line (or those
    > 2 lines, now!) seems to only have advantages....

    > I'm not yet committing to anything, but currently would strongly
    > consider it .. though *after* the
    > '<length-1-array>  OP  <non-array>'
    > issue has settled a bit.

  (Which it has;  the decision has been to keep it.)

I have now committed Suharto's proposal above, to entirely drop the
	exclude <- as.vector(exclude, typeof(x))
parts
in the factor() function...  which has the two advantages
mentioned above and simplifies the code (and documentation).

------------------------------------------------------------------------
r71424 | maechler | 2016-09-30 14:38:43 +0200 (Fri, 30 Sep 2016) | 1 line
Changed paths:
   M /trunk/doc/NEWS.Rd
   M /trunk/src/library/base/R/factor.R
   M /trunk/src/library/base/man/factor.Rd
   M /trunk/tests/reg-tests-1c.R

simplify factor(), allowing 'exclude= <factor>' as documented in R <= 3.3.x
------------------------------------------------------------------------

I do expect some "reaction" in CRAN/Bioconductor package space,
so the final word has not been spoken on this, but the new code
is more aestethic to me.

Thank you for the suggestion,
Martin 


    >> --------------------------------------------

>> On Wed, 31/8/16, Martin Maechler <maechler at stat.math.ethz.ch> wrote:

>> Subject: Re: [Rd] 'droplevels' inappropriate change

    >> Cc: "Martin Maechler" <maechler at stat.math.ethz.ch>
    >> Date: Wednesday, 31 August, 2016, 2:51 PM

>>>> In R devel r71157, 'droplevels' documentation, in "Arguments" section, says this about argument 'exclude'.
    >>>> passed to factor(); factor levels which should be excluded from the result even if present.  Note that this was implicitly NA in R <= 3.3.1 which did drop NA levels even when present in x, contrary to the documentation.  The current default is compatible with x[ , drop=FALSE].

    >>>> The part
    >>>> x[ , drop=FALSE]
    >>>> should be
    >>>> x[ , drop=TRUE]

    >> [[elided Yahoo spam]]
    >>> a "typo" by me. .. fixed now.

    >>>> Saying that 'exclude' is factor levels is not quite true for NA element. NA may be not an original level, but NA in 'exclude' affects the result.

    >>>> For a factor 'x', factor(x, exclude = exclude) doesn't really work for excluding in general. See, for example, https://stat.ethz.ch/pipermail/r-help/2005-September/079336.html .
    >>>> factor(factor(c("a","b","c")), exclude="c")

    >>>> However, this excludes "2":
    >>>> factor(factor(2:3), exclude=2)

    >>>> Rather unexpectedly, this excludes NA:
    >>>> factor(factor(c("a",NA), exclude=NULL), exclude="c")

    >>>> For a factor 'x', factor(x, exclude = exclude) can only exclude integer-like or NA levels. An explanation is in https://stat.ethz.ch/pipermail/r-help/2011-April/276274.html .

    >>> Well, Peter Dalgaard (in that R-devel e-mail, a bit more than 5
    >>> years ago) is confirming the problem there,  and suggesting (as
    >>> you, right?) that actually   `factor()` is not behaving
    >>> correctly here.

    >>> And your persistence is finally getting close to convince me
    >>> that it is not just droplevels(), but  factor() itself which
    >>> needs care here.

    >>> Interestingly, the following patch *does* pass 'make check-all'
    >>> (after small change in tests/reg-tests-1b.R which is ok),
    >>> and leads to behavior which is much closer to the documentation,
    >>> notably for your two examples above would give what one would
    >>> expect.

    >>> (( If the R-Hub would support experiments with branches of R-devel 
    >>> from R-core members,  I could just create such a branch and R Hub
    >>> would run 'R CMD check <pkg>'  for thousands of CRAN packages
    >>> and provide a web page with the *differences* in the package
    >>> check results ... so we could see ... ))

    >>> I do agree that we should strongly consider such a change.

    >> as nobody has commented, I've been liberal and have taken these
    >> no comments as consent.

    >> Hence I have committed

    >> ------------------------------------------------------------------------
    >> r71178 | maechler | 2016-08-31 09:45:40 +0200 (Wed, 31 Aug 2016) | 1 line
    >> Changed paths:
    >> M /trunk/doc/NEWS.Rd
    >> M /trunk/src/library/base/R/factor.R
    >> M /trunk/src/library/base/man/factor.Rd
    >> M /trunk/tests/reg-tests-1b.R
    >> M /trunk/tests/reg-tests-1c.R

    >> factor(x, exclude) more "rational" when x or exclude are character
    >> ------------------------------------------------------------------------

    >> which apart from documentation, examples, and regression tests
    >> is just the patch below.

    >> Martin Maechler
    >> ETH Zurich



    >>> --- factor.R    (revision 71157)
    >>> +++ factor.R    (working copy)
    >>> @@ -28,8 +28,12 @@
    >>> levels <- unique(y[ind])
    >>> }
    >>> force(ordered) # check if original x is an ordered factor
    >>> -    exclude <- as.vector(exclude, typeof(x)) # may result in NA
    >>> -    x <- as.character(x)
    >>> +    if(!is.character(x)) {
    >>> +    if(!is.character(exclude))
    >>> +        exclude <- as.vector(exclude, typeof(x)) # may result in NA
    >>> +    x <- as.character(x)
    >>> +    } else
    >>> +    exclude <- as.vector(exclude, typeof(x)) # may result in NA
    >>> ## levels could be a long vectors, but match will not handle that.
    >>> levels <- levels[is.na(match(levels, exclude))]
    >>> f <- match(x, levels)
    >> Delete Reply Reply All Forward Apply

    >> ______________________________________________
    >> R-devel at r-project.org mailing list
    >> https://stat.ethz.ch/mailman/listinfo/r-devel

    > ______________________________________________
    > R-devel at r-project.org mailing list
    > https://stat.ethz.ch/mailman/listinfo/r-devel