Improvement: function cut
On 9/18/21 5:28 AM, Leonard Mada via R-help wrote:
Hello Andrew,

I am adding this information for completeness (so other users can get a better understanding):

If we want to perform a survival analysis, then the intervals should be closed on the right, but we should also include the first time point (as per Intention-to-Treat):

[0, 4] (4, 8] (8, 12] (12, 16]
[0, 4] (4, 8] (8, 12] (12, 16] (16, 20]

So the series is extendible to the right without any errors! But the 1st interval (which is the same in both series) is different from the other intervals: [0, 4], closed on both sides.

I feel that this should have been the default behaviour for cut().
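For readers following along, the binning described above can be reproduced with the existing `include.lowest` argument of `cut()`. This is only a minimal sketch: the variable names (`time`, `breaks_short`, `breaks_long`) are illustrative and do not come from the original message.

```r
# Follow-up times 0..20, binned into right-closed intervals of width 4.
time <- 0:20
breaks_short <- seq(0, 16, by = 4)
breaks_long  <- seq(0, 20, by = 4)

# include.lowest = TRUE closes the first interval on the left,
# so the first time point (0) is kept instead of becoming NA.
bins_short <- cut(time, breaks_short, right = TRUE, include.lowest = TRUE)
bins_long  <- cut(time, breaks_long,  right = TRUE, include.lowest = TRUE)

levels(bins_short)
levels(bins_long)

# Times beyond the last break (17..20 for the short series) are returned as NA.
table(is.na(bins_short))
```

With right-closed intervals, every time point covered by both series gets the same bin in both, which is the "extendible to the right" property mentioned above.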
To Leonard:

If you do not like the behavior of `cut`, then you should "roll your own". It's very unlikely that R Core will modify a base function like cut. You might want to look at Hmisc::cut2. Frank Harrell didn't like that default behavior and thought he could make a better cut, so he just put it in his package. I did like his version better and often used it when I was actively programming. I suspect there is also a tidyverse cut-like function, but I'm not terribly familiar with that fork of R. (It's really not the same language IMHO.)

But it's a waste of time and energy to try to propose modifications of core R functions unless *you* can show that the change is stable across 20,000 packages and will not offend long-time users. The likelihood of that happening for your proposal is vanishingly small in my estimation. You shouldn't ask R Core to do that for you. They are busy fixing real bugs.

If you want to persist despite my negativity, then you should make a complete proposal by submitting a proper diff file that incorporates your tested efforts to the R-devel mailing list.
David

>
> Note:
>
> I was induced to think about a different situation in my previous
> message, as you constructed open intervals on the right, and also
> extended to the right. But survival analysis should be as described in
> this mail and should probably be the default.
>
> Sincerely,
>
> Leonard
>
>
> On 9/18/2021 1:29 AM, Andrew Simmons wrote:
>> I disagree, I don't really think it's too long or ugly, but if you
>> think it is, you could abbreviate it as 'i'.
>>
>> x <- 0:20
>> breaks1 <- seq.int(0, 16, 4)
>> breaks2 <- seq.int(0, 20, 4)
>> data.frame(
>>     cut(x, breaks1, right = FALSE, i = TRUE),
>>     cut(x, breaks2, right = FALSE, i = TRUE),
>>     check.names = FALSE
>> )
>>
>> I hope this helps.
>>
>> On Fri, Sep 17, 2021 at 6:26 PM Leonard Mada <leo.mada at syonic.eu> wrote:
>>
>>     Hello Andrew,
>>
>>     But "cut" generates factors. In most cases with real data one
>>     expects to have also the ends of the interval: the argument
>>     "include.lowest" is both ugly and too long.
>>
>>     [The test-code on the ftable thread contains this error! I have
>>     run through this error a couple of times.]
>>
>>     The only real situation that I can imagine to be problematic:
>>
>>     - if the interval goes to +Inf (or -Inf): I do not know if there
>>     would be any effects when including +Inf (or -Inf).
>>
>>     Leonard
>>
>>
>>     On 9/18/2021 1:14 AM, Andrew Simmons wrote:
>>>     While it is not explicitly mentioned anywhere in the
>>>     documentation for .bincode, I suspect 'include.lowest = FALSE' is
>>>     the default to keep the definitions of the bins consistent. For
>>>     example:
>>>
>>>     x <- 0:20
>>>     breaks1 <- seq.int(0, 16, 4)
>>>     breaks2 <- seq.int(0, 20, 4)
>>>     cbind(
>>>         .bincode(x, breaks1, right = FALSE, include.lowest = TRUE),
>>>         .bincode(x, breaks2, right = FALSE, include.lowest = TRUE)
>>>     )
>>>
>>>     by having 'include.lowest = TRUE' with different ends, you can
>>>     get inconsistent behaviour. While this probably wouldn't be an
>>>     issue with 'real' data, this would seem like something you'd want
>>>     to avoid by default. The definitions of the bins are
>>>
>>>     [0, 4)
>>>     [4, 8)
>>>     [8, 12)
>>>     [12, 16]
>>>
>>>     and
>>>
>>>     [0, 4)
>>>     [4, 8)
>>>     [8, 12)
>>>     [12, 16)
>>>     [16, 20]
>>>
>>>     so you can see where the inconsistent behaviour comes from. You
>>>     might be able to get R-core to add argument 'warn', but probably
>>>     not to change the default of 'include.lowest'. I hope this helps
>>>
>>>
>>>     On Fri, Sep 17, 2021 at 6:01 PM Leonard Mada <leo.mada at syonic.eu> wrote:
>>>
>>>         Thank you Andrew.
>>>
>>>         Is there any reason not to make: include.lowest = TRUE the
>>>         default?
>>>
>>>         Regarding the NA:
>>>
>>>         The user still has to suspect that some values were not
>>>         included and run that test.
>>>
>>>         Leonard
>>>
>>>
>>>         On 9/18/2021 12:53 AM, Andrew Simmons wrote:
>>>>         Regarding your first point, argument 'include.lowest'
>>>>         already handles this specific case, see ?.bincode
>>>>
>>>>         Your second point, maybe it could be helpful, but since both
>>>>         'cut.default' and '.bincode' return NA if a value isn't
>>>>         within a bin, you could make something like this on your own.
>>>>         Might be worth pitching to R-bugs on the wishlist.
>>>>
>>>>
>>>>         On Fri, Sep 17, 2021, 17:45 Leonard Mada via R-help
>>>>         <r-help at r-project.org> wrote:
>>>>
>>>>             Hello List members,
>>>>
>>>>             the following improvements would be useful for function
>>>>             cut (and .bincode):
>>>>
>>>>             1.) Argument: Include extremes
>>>>             extremes = TRUE
>>>>             if(right == FALSE) {
>>>>                 # include also right for last interval;
>>>>             } else {
>>>>                 # include also left for first interval;
>>>>             }
>>>>
>>>>             2.) Argument: warn = TRUE
>>>>
>>>>             Warn if any values are not included in the intervals.
>>>>
>>>>             Motivation:
>>>>             - reduce risk of errors when using function cut();
>>>>
>>>>             Sincerely,
>>>>
>>>>             Leonard
>>>>
>>>>             ______________________________________________
>>>>             R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>>>>             https://stat.ethz.ch/mailman/listinfo/r-help
>>>>             PLEASE do read the posting guide
>>>>             http://www.R-project.org/posting-guide.html
>>>>             and provide commented, minimal, self-contained,
>>>>             reproducible code.
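The "roll your own" advice and the two proposed arguments (include the extremes, warn on dropped values) can be combined in a small wrapper around `cut()`. The sketch below is one possible way to do it, not an R Core or Hmisc implementation; the function name `cut_warn` and its `warn` argument are hypothetical.

```r
# Hypothetical helper (not part of base R): cut() with both extremes
# included and a warning when values fall outside the breaks.
# Note: include.lowest is fixed to TRUE here, so it must not be passed via '...'.
cut_warn <- function(x, breaks, right = TRUE, warn = TRUE, ...) {
  out <- cut(x, breaks, right = right, include.lowest = TRUE, ...)
  # Values that were not NA on input but are NA after binning were dropped.
  n_dropped <- sum(is.na(out) & !is.na(x))
  if (warn && n_dropped > 0L) {
    warning(n_dropped, " value(s) fall outside the breaks and were set to NA",
            call. = FALSE)
  }
  out
}

x <- 0:20
res <- cut_warn(x, seq(0, 16, by = 4))  # warns: 17..20 lie outside [0, 16]
```

Because the wrapper only adds a check on top of `cut()`, existing code that relies on the base defaults is unaffected, which sidesteps the compatibility concern raised in the reply above.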