
Mixed sorting/ordering of strings acknowledging roman numerals?

2 messages · Henrik Bengtsson, David Winsemius

#
Thank you David - it took me a while to get back to this and dig into
it.  It's clever to imitate gtools::mixedorder() as far as possible.
A few comments:

1. It took me a while to understand why you picked 3899 in your
Roman-to-integer table; it's because as.roman(x) is NA for x > 3899.
(BTW, in 'utils', there's utils:::.roman2numeric() which could be
utilized, but it's currently internal.)

2. I think you forgot D=500 and M=1000.

3. There was a typo in your code; I think you meant rank.roman instead
of rank.numeric in one place.

4. The idea behind nonnumeric() is to identify non-numeric substrings
by is.na(as.numeric()).  Unfortunately, for romans that does not work.
Instead, we need to use is.na(numeric(x)) here, i.e.

  nonnumeric <- function(x) {
      suppressWarnings(ifelse(is.na(numeric(x)), toupper(x), NA))
  }

Actually, gtools::mixedorder() could use the same.

5. I reverted your renaming of ".numeric" to ".roman" to minimize any
differences from gtools::mixedorder().
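
To make points 1 and 4 concrete, here is a minimal sketch of the two helpers
applied to a few sample tokens. The token values are made up for illustration;
the helper definitions mirror the ones used in the function in this message.

```r
# Lookup table of canonical roman numerals; as.roman() is defined for 1..3899
romanC <- as.character(as.roman(1:3899))

# Position in the table doubles as the integer value; NA if not a numeral
numeric <- function(x) match(x, romanC)

# Upper-cased token if it is NOT a roman numeral, NA otherwise
nonnumeric <- function(x) ifelse(is.na(numeric(x)), toupper(x), NA)

tokens <- c("chr", "IX", "IIII", "MMXIV")  # illustrative input only
numeric(tokens)     # NA 9 NA 2014
nonnumeric(tokens)  # "CHR" NA "IIII" NA
```

Note that a non-canonical spelling such as "IIII" is not in the table, so it
is treated as an ordinary string rather than as the number 4.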


With the above fixes, we now have:

mixedorderRoman <- function (x)
{
    if (length(x) < 1)
        return(NULL)
    else if (length(x) == 1)
        return(1)
    if (is.numeric(x))
        return(order(x))
    delim = "\\$\\@\\$"
    # NOTE: as.roman(x) is NA for x > 3899
    romanC <- as.character( as.roman(1:3899) )
    numeric <- function(x) {
        suppressWarnings(match(x, romanC))
    }
    nonnumeric <- function(x) {
        suppressWarnings(ifelse(is.na(numeric(x)), toupper(x),
            NA))
    }
    x <- as.character(x)
    which.nas <- which(is.na(x))
    which.blanks <- which(x == "")
    if (length(which.blanks) > 0)
        x[which.blanks] <- -Inf
    if (length(which.nas) > 0)
        x[which.nas] <- Inf
    delimited <- gsub("([IVXCLDM]+)",
        paste(delim, "\\1", delim, sep = ""), x)
    step1 <- strsplit(delimited, delim)
    step1 <- lapply(step1, function(x) x[x > ""])
    step1.numeric <- lapply(step1, numeric)
    step1.character <- lapply(step1, nonnumeric)
    maxelem <- max(sapply(step1, length))
    step1.numeric.t <- lapply(1:maxelem, function(i) sapply(step1.numeric,
        function(x) x[i]))
    step1.character.t <- lapply(1:maxelem, function(i) sapply(step1.character,
        function(x) x[i]))
    rank.numeric <- sapply(step1.numeric.t, rank)
    rank.character <- sapply(step1.character.t, function(x)
        as.numeric(factor(x)))
    rank.numeric[!is.na(rank.character)] <- 0
    rank.character <- t(t(rank.character) + apply(matrix(rank.numeric),
        2, max, na.rm = TRUE))
    rank.overall <- ifelse(is.na(rank.character), rank.numeric,
        rank.character)
    order.frame <- as.data.frame(rank.overall)
    if (length(which.nas) > 0)
        order.frame[which.nas, ] <- Inf
    retval <- do.call("order", order.frame)
    return(retval)
}
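
The tokenizing step of the function above can be tried in isolation. The
sketch below wraps it in a helper called splitRoman(), a name invented here
for illustration (it is not part of gtools), and runs it on two made-up
strings.

```r
# Same delimiter as in the function above: the literal text "$@$", escaped
# so it can serve both as gsub() replacement and as strsplit() pattern
delim <- "\\$\\@\\$"

# Hypothetical helper: surround every run of roman-numeral letters with the
# delimiter, split on it, and drop the empty pieces
splitRoman <- function(x) {
  delimited <- gsub("([IVXCLDM]+)", paste0(delim, "\\1", delim), x)
  lapply(strsplit(delimited, delim), function(p) p[p > ""])
}

res <- splitRoman(c("chr IX", "Chapter IX"))
res[[1]]  # "chr " "IX"
res[[2]]  # "C" "hapter " "IX"
```

The second result shows a limitation of this approach: any upper-case I, V,
X, L, C, D or M inside an ordinary word is split off as well, and a lone "C"
is then read as the numeral 100.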


The difference from gtools::mixedorder() is minimal:

<     romanC <- as.character( as.roman(1:3899) )
21c11
<         suppressWarnings(match(x, romanC))
---
24c14
<         suppressWarnings(ifelse(is.na(numeric(x)), toupper(x),
---
34c24
<     delimited <- gsub("([IVXCLDM]+)",
---
59,62d48

This difference is so small that the above could now be an option to
mixedorder() with minimal overhead added, e.g. mixedorder(y,
type=c("decimal", "roman")).  One could even imagine adding support
for "binary", "octal" and "hexadecimal" (not done).
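
As a rough sketch of how such an option might select the token converter,
assuming a hypothetical helper tokenValue() with a type argument (neither the
name nor the signature is taken from gtools):

```r
# Hypothetical helper: convert tokens to numbers according to 'type'
tokenValue <- function(x, type = c("decimal", "roman")) {
  type <- match.arg(type)  # defaults to "decimal", rejects unknown types
  switch(type,
    decimal = suppressWarnings(as.numeric(x)),
    roman   = match(x, as.character(as.roman(1:3899))))
}

tokenValue("12")            # 12
tokenValue("XII", "roman")  # 12
tokenValue("XII")           # NA, since "XII" is not a decimal number
```

Further types such as "hexadecimal" would only need another branch in the
switch() plus a matching pattern for the splitting step.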

Greg (maintainer of gtools; cc:ed), is this something you would
consider adding to gtools?  I've modified the gtools source code
available on CRAN (that's the only source I found), added package
tests, updated the Rd and verified it passes R CMD check.  If
interested, please find the updates at:

  https://github.com/HenrikBengtsson/gtools/compare/cran:master...master

Thanks

Henrik
On Tue, Aug 26, 2014 at 6:46 PM, David Winsemius <dwinsemius at comcast.net> wrote:
#
On Sep 7, 2014, at 7:40 PM, Henrik Bengtsson wrote:

            
Yes, that was the reason. I didn't think I needed a Roman-to-numeric function because I discovered the roman numbers were actually simple numeric vectors to which a class had been assigned and that it was the class-facilities that did all the work. The standard Ops functions were just acting on numeric vectors.

If one doesn't take care, their "romanity" can be lost:
[1] I    X    C    M    <NA>
[1]    1   10  100 1000   NA
[1] 1111
[1] MCXI
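
One sequence of calls that yields output like the above is sketched below. It
is a reconstruction under the assumption that the class was dropped with
as.integer(), not necessarily the exact commands David ran.

```r
# A "roman" object is an integer vector with a class attribute
r <- as.roman(c(1, 10, 100, 1000, NA))
r                              # prints: I X C M <NA>

# as.integer() strips the class, leaving plain integers
n <- as.integer(r)
n                              # prints: 1 10 100 1000 NA

# Arithmetic on the plain integers returns a plain number ...
total <- sum(n, na.rm = TRUE)
total                          # prints: 1111

# ... which has to be converted back explicitly to regain the class
as.roman(total)                # prints: MCXI
```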
Quite possible. I suspect Greg will have corrected the omission, but if not, this will be helpful to him.
I understood Greg's intention was to wrap this into the mixedorder() and mixedsort() duo.

Best;
David.
David Winsemius
Alameda, CA, USA