Message-ID: <CAGxFJbSSy414y1zVNvrRNZZJ=6RLn1-uOt9=emE6KcL3vMkWuw@mail.gmail.com>
Date: 2021-09-17T22:57:27Z
From: Bert Gunter
Subject: Improvement: function cut
In-Reply-To: <db61fc46-dbe3-90fb-de07-3f9825a3daf3@syonic.eu>
Perhaps you and Andrew should take this discussion off list...
Bert Gunter
"The trouble with having an open mind is that people keep coming along
and sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
On Fri, Sep 17, 2021 at 3:45 PM Leonard Mada via R-help
<r-help at r-project.org> wrote:
>
> Why would you want to merge different factors?
>
> It makes no sense on real data. Even if some names are the same, the
> factors are not the same!
>
>
> The only real-data application that springs to mind is censoring (right
> or left, depending on the choice): but here we have both open and closed
> intervals, e.g. to the right (in the same data-set).
>
>
> Leonard
>
>
> On 9/18/2021 1:29 AM, Andrew Simmons wrote:
> > I disagree; I don't really think it's too long or ugly. But if you
> > think it is, you could abbreviate it as 'i', since R partially
> > matches argument names.
> >
> >
> > x <- 0:20
> > breaks1 <- seq.int(0, 16, 4)
> > breaks2 <- seq.int(0, 20, 4)
> > data.frame(
> >     cut(x, breaks1, right = FALSE, i = TRUE),
> >     cut(x, breaks2, right = FALSE, i = TRUE),
> >     check.names = FALSE
> > )
> >
> >
> > I hope this helps.
> >
> > On Fri, Sep 17, 2021 at 6:26 PM Leonard Mada <leo.mada at syonic.eu> wrote:
> >
> > Hello Andrew,
> >
> >
> > But "cut" generates factors. With real data, one usually expects
> > the endpoints of the range to be included as well; the argument
> > "include.lowest" is both ugly and too long.
> >
> > [The test code in the ftable thread contains this error! I have
> > run into this error a couple of times.]
> >
> >
> > The only real situation that I can imagine to be problematic:
> >
> > - if the interval goes to +Inf (or -Inf): I do not know if there
> > would be any effects when including +Inf (or -Inf).
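
A minimal sketch of the infinite-endpoint case, assuming the breaks themselves contain -Inf and +Inf (the vectors `x` and `breaks` are made up for illustration). As far as the comparison logic goes, an infinite break seems to behave like any other endpoint: it is excluded or included according to `right` and `include.lowest`.

```r
x <- c(-Inf, -1, 0, 1, Inf)
breaks <- c(-Inf, 0, Inf)

# Default (right = TRUE): the bins are (-Inf, 0] and (0, Inf],
# so the value -Inf itself lies outside every bin and becomes NA
.bincode(x, breaks)
# NA  1  1  2  2

# include.lowest = TRUE closes the first bin: [-Inf, 0] and (0, Inf]
.bincode(x, breaks, include.lowest = TRUE)
# 1  1  1  2  2
```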
> >
> >
> > Leonard
> >
> >
> > On 9/18/2021 1:14 AM, Andrew Simmons wrote:
> >> While it is not explicitly mentioned anywhere in the
> >> documentation for .bincode, I suspect 'include.lowest = FALSE' is
> >> the default to keep the definitions of the bins consistent. For
> >> example:
> >>
> >>
> >> x <- 0:20
> >> breaks1 <- seq.int(0, 16, 4)
> >> breaks2 <- seq.int(0, 20, 4)
> >> cbind(
> >>     .bincode(x, breaks1, right = FALSE, include.lowest = TRUE),
> >>     .bincode(x, breaks2, right = FALSE, include.lowest = TRUE)
> >> )
> >>
> >>
> >> With 'include.lowest = TRUE' and different end breaks, you can
> >> get inconsistent behaviour: the value 16 falls in the last bin of
> >> the first set of breaks but in a separate bin of the second. While
> >> this probably wouldn't be an issue with 'real' data, it seems like
> >> something you'd want to avoid by default. The definitions of the
> >> bins are
> >>
> >>
> >> [0, 4)
> >> [4, 8)
> >> [8, 12)
> >> [12, 16]
> >>
> >>
> >> and
> >>
> >>
> >> [0, 4)
> >> [4, 8)
> >> [8, 12)
> >> [12, 16)
> >> [16, 20]
> >>
> >>
> >> so you can see where the inconsistent behaviour comes from. You
> >> might be able to get R-core to add an argument 'warn', but probably
> >> not to change the default of 'include.lowest'. I hope this helps.
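
The inconsistency described above can be checked directly. The following is only a sketch: the helper `conflicts` is a hypothetical name, not part of base R, and it returns the values of `x` that receive two different non-NA bin codes under the two sets of breaks.

```r
x <- 0:20
breaks1 <- seq.int(0, 16, 4)
breaks2 <- seq.int(0, 20, 4)

# Hypothetical helper: values of x assigned to two different (non-NA) bins
conflicts <- function(include.lowest) {
  b1 <- .bincode(x, breaks1, right = FALSE, include.lowest = include.lowest)
  b2 <- .bincode(x, breaks2, right = FALSE, include.lowest = include.lowest)
  x[!is.na(b1) & !is.na(b2) & b1 != b2]
}

conflicts(TRUE)   # 16: bin 4 with breaks1, bin 5 with breaks2
conflicts(FALSE)  # integer(0): no value lands in two different bins
```

With `include.lowest = FALSE`, a value is either binned identically by both sets of breaks or is NA under one of them, which appears to be the consistency the default preserves.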
> >>
> >>
> >> On Fri, Sep 17, 2021 at 6:01 PM Leonard Mada <leo.mada at syonic.eu> wrote:
> >>
> >> Thank you Andrew.
> >>
> >>
> >> Is there any reason not to make include.lowest = TRUE the
> >> default?
> >>
> >>
> >> Regarding the NA:
> >>
> >> The user still has to suspect that some values were not
> >> included and run that test.
> >>
> >>
> >> Leonard
> >>
> >>
> >> On 9/18/2021 12:53 AM, Andrew Simmons wrote:
> >>> Regarding your first point: argument 'include.lowest'
> >>> already handles this specific case; see ?.bincode.
> >>>
> >>> Your second point could be helpful, but since both
> >>> 'cut.default' and '.bincode' return NA if a value isn't
> >>> within a bin, you could write something like this on your own.
> >>> It might be worth pitching to R-bugs as a wishlist item.
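
One way such a user-level check might look is sketched below. The name `cut_warn` and the warning text are hypothetical (nothing like this exists in base R); the sketch relies only on the fact stated above that `cut` returns NA for values outside every bin.

```r
# Hypothetical helper (not part of base R): cut() plus a warning for
# values that fall outside every bin.
cut_warn <- function(x, breaks, ...) {
  res <- cut(x, breaks, ...)
  # Values that were not NA on input but are NA on output were not binned
  dropped <- is.na(res) & !is.na(x)
  if (any(dropped)) {
    warning(sum(dropped), " value(s) not included in any interval")
  }
  res
}

x <- 0:20
# Warns, because 20 is excluded from the last bin [16, 20)
res <- cut_warn(x, seq.int(0, 20, 4), right = FALSE)
```

Missing values already present in `x` are deliberately not counted, so the warning only reports values lost to the binning itself.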
> >>>
> >>>
> >>>
> >>> On Fri, Sep 17, 2021, 17:45 Leonard Mada via R-help
> >>> <r-help at r-project.org> wrote:
> >>>
> >>> Hello List members,
> >>>
> >>>
> >>> the following improvements would be useful for function
> >>> cut (and .bincode):
> >>>
> >>>
> >>> 1.) Argument: Include extremes
> >>> extremes = TRUE
> >>> if(right == FALSE) {
> >>>     # also include the right end of the last interval;
> >>> } else {
> >>>     # also include the left end of the first interval;
> >>> }
> >>>
> >>>
> >>> 2.) Argument: warn = TRUE
> >>>
> >>> Warn if any values are not included in the intervals.
> >>>
> >>>
> >>> Motivation:
> >>> - reduce risk of errors when using function cut();
> >>>
> >>>
> >>> Sincerely,
> >>>
> >>>
> >>> Leonard
> >>>
> >>> ______________________________________________
> >>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> >>> https://stat.ethz.ch/mailman/listinfo/r-help
> >>> PLEASE do read the posting guide
> >>> http://www.R-project.org/posting-guide.html
> >>> and provide commented, minimal, self-contained,
> >>> reproducible code.
> >>>
>