Problem with order() and I()

I have found that order() fails in a rather arcane circumstance, as in
this example:

> foo <- I( c('x','\265g') )
> order(foo)
Error in if (xi > xj) 1L else -1L : missing value where TRUE/FALSE needed

> foo <- c('x','\265g')
> order(foo)
[1] 1 2

> sessionInfo()
R version 3.1.1 (2014-07-10)
Platform: x86_64-apple-darwin13.1.0 (64-bit)

locale:
[1] C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

Thanks
-Don

p.s.
Just a little background, irrelevant unless one wonders why I'm using I()
and \265:

If I were writing new code I wouldn't be using I(), since there are better
ways now to achieve the same end (preventing the creation of factors in
data frames), but the scripts that use it are quite old, originally
developed in 2001.

In at least some but perhaps limited contexts, '\265' produces the Greek
letter mu, and that's why I'm using it. And if I remember correctly, 2001
is prior to the current R support for locales and extended character sets.
Using \265 is what I could find at that time to get a mu into my output.
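
A side note on the escape itself, as a minimal sketch: '\265' is an octal escape for the single byte 0xB5, which Latin-1 assigns to the micro sign. The conversion below assumes the byte is meant as Latin-1; the variable names are only illustrative.

```r
# '\265' is octal for the single byte 0xB5 (decimal 181).
x <- "\265g"
bytes <- charToRaw(x)              # raw bytes of the string as stored

# Assumption: the byte is meant as Latin-1, where 0xB5 is the micro sign.
# iconv() re-encodes it as UTF-8 so that it is valid in a UTF-8 locale.
u <- iconv(x, from = "latin1", to = "UTF-8")
utf8_bytes <- charToRaw(u)

print(bytes)
print(utf8_bytes)
```

In a session where the byte has no declared encoding, string comparison has no defined collation for it, which is presumably why the comparisons discussed in this thread come back as NA.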

I came across this while checking some things; it's not actually breaking
my scripts, so I doubt it's due to any recent change.
Don MacQueen

Lawrence Livermore National Laboratory
7000 East Ave., L-627
Livermore, CA 94550
925-423-1062
MacQueen, Don <macqueen1 at llnl.gov>
    on Mon, 8 Sep 2014 16:06:21 +0000 writes:
    > I have found that order() fails in a rather arcane circumstance, as in
    > this example:

    >> foo <- I( c('x','\265g') )
    >> order(foo)
    > Error in if (xi > xj) 1L else -1L : missing value where TRUE/FALSE needed

    >> foo <-c('x','\265g')
    >> order(foo)
    > [1] 1 2

yes, this is not desirable.
order() in such cases calls xtfrm()  {as documented}
and that ends up calling rank() and then the internal  .gt()
where the bug happens because

 > I("x") > I("\xb5g")
 [1] NA

but really I think the change should happen in xtfrm.AsIs(.)
which I think should drop the class also in this case.

More on this, once we have fixed it.

Thank you, Don, very much!

Martin Maechler,
ETH Zurich
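
To make the failure mode concrete, here is a minimal sketch. The helper gt() is a hypothetical stand-in written to match the expression shown in the error message; it is not copied from the R sources. It uses ASCII strings so that the outcome does not depend on the locale.

```r
# Hypothetical stand-in for the internal comparator; the name gt() and its
# body are assumptions inferred from the error message in this thread.
gt <- function(x, i, j) {
  xi <- x[i]
  xj <- x[j]
  if (xi == xj) 0L else if (xi > xj) 1L else -1L
}

# With comparable (ASCII) elements the comparator returns -1, 0 or 1.
res <- gt(I(c("a", "b")), 1L, 2L)   # "a" sorts before "b", so -1L

# If a comparison yields NA, if() cannot branch and signals an error,
# which is the condition the reported message describes.
msg <- tryCatch(if (NA) 1L else -1L,
                error = function(e) conditionMessage(e))

print(res)
print(msg)
```

Any comparator of this shape will fail as soon as one pair of elements compares as NA, regardless of how the remaining elements would sort.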

    > ______________________________________________
    > R-devel at r-project.org mailing list
    > https://stat.ethz.ch/mailman/listinfo/r-devel
peter dalgaard <pdalgd at gmail.com>
    on Tue, 9 Sep 2014 16:36:19 +0200 writes:
    > It's actually a little more complicated. I wrote a note, but it seems
    > to be stuck in the outbox on my home machine (I probably forgot to
    > click Send...).
    > One important aspect is that

    >> "x" < "\265g"
    > [1] NA

    > which makes me wonder if the bug really is in the case that "works".
    > It seems that it is possible to rank() character vectors that contain
    > incomparable elements.

    > -pd

Yes, you are right that it is even more complicated.
In both cases, our Scollate() is involved
(Scollate: the one where we had a discussion about making it part of the
 C-level R API, which would help package authors ...).

After

  ch <- c('x','\265g')
  foo <- I(ch)

Of the four expressions,

  order(ch)
  order(foo)
  ch [1] < ch [2]
  foo[1] < foo[2]

only the first one "works"; the others give NA or an error because of NA.
The first one is also the only one of the four that does not use
do_relop_dflt().

It's not even clear what we'd want (as I think pd also alluded to):
Ideally all of these should work consistently, which, because of
 "<(.,.)" returning NA in both cases,
would mean that order(ch) should also give an error, as order(foo) does
    { an error whose message we should improve in any case! }.
Big Q: Can we afford order(ch) giving an error in such cases?
There is a pretty high chance that this will "break" much user (and
probably even package) code out there.

Still, the other solution, namely order(foo) behaving as
order(ch) now does, would correspond to ">" giving FALSE
instead of NA, so this solution is not OK in my view.

Martin
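
The four expressions above can be tried side by side with a small wrapper. This is only a sketch: report() is a hypothetical helper, and what each entry contains depends on the R version and the locale, so no particular output is claimed here.

```r
ch  <- c("x", "\265g")   # second element starts with the single byte 0xB5
foo <- I(ch)             # same data, plus class "AsIs"

# Hypothetical helper: return the value of an expression, or the error
# message if evaluating it fails. Lazy evaluation means the expression is
# only evaluated inside tryCatch().
report <- function(expr) {
  tryCatch(expr, error = function(e) paste("Error:", conditionMessage(e)))
}

results <- list(
  order_ch  = report(order(ch)),
  order_foo = report(order(foo)),
  lt_ch     = report(ch[1] < ch[2]),
  lt_foo    = report(foo[1] < foo[2])
)
str(results)
```

Comparing the four entries on a given setup shows directly whether order() and "<" agree there.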
[This is the note I alluded to earlier today.]

The oddity is really that it works (for some value of "works") in the unclassed case:

> foo <- I( c('x','\265g') )
> order(foo)
Error in if (xi > xj) 1L else -1L : missing value where TRUE/FALSE needed
> foo[[1]]
[1] "x"
> foo[[2]]
[1] "\xb5g"
> foo[[1]] < foo[[2]]
[1] NA
> foo[[1]] > foo[[2]]
[1] NA

> fee <- c('x','\265g')
> fee[[1]]
[1] "x"
> fee[[2]]
[1] "\xb5g"
> fee[[1]] < fee[[2]]
[1] NA
> fee[[1]] > fee[[2]]
[1] NA
> order(fee)
[1] 2 1

Notice that the unclassed `fee` has exactly the same issue as `foo`: its elements are incomparable.

The thing is that xtfrm.AsIs will use elementwise comparison, whereas xtfrm.default will use rank(), which somehow manages to do something with character vectors for which the sort order is undefined:

> rank(foo)
Error in if (xi > xj) 1L else -1L : missing value where TRUE/FALSE needed
> rank(fee)
[1] 2 1

(Notice that xtfrm calls rank and vice versa, presumably without creating a loop. I gave up on sorting out the logic.)
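
As a minimal sketch of the classed versus unclassed distinction, the lines below use ASCII strings only, so every comparison is well defined and the two routes can be checked against each other. The variable names same_data, classed and agree are illustrative.

```r
# ASCII strings only, so all comparisons are well defined in any locale.
fee <- c("b", "a", "c")
foo <- I(fee)

# The only difference between the two objects is the class attribute.
same_data <- identical(unclass(foo), fee)

# is.object() is TRUE only for the classed version; that is what sends
# order() through xtfrm() for foo.
classed <- c(fee = is.object(fee), foo = is.object(foo))

# With comparable elements both routes give the same ordering and ranks.
agree <- all(order(foo) == order(fee)) && all(rank(foo) == rank(fee))

print(same_data)
print(classed)
print(agree)
```

So the two routes only diverge when some pair of elements is incomparable, as with '\265g' in the C locale.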

Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Email: pd.mes at cbs.dk  Priv: PDalgd at gmail.com
Early on I had been wondering if deprecating I() and the AsIs class would
be a way to get the problem to go away. I imagine (based on no data at
all!) that they are rarely used. If I were writing the same code today, I
would use options(stringsAsFactors=FALSE) instead of sprinkling I() here
and there throughout my scripts.

But I see from the discussions that there's something deeper going on.

Thanks for continuing to cc me; I find it interesting.

-Don
Don MacQueen

Lawrence Livermore National Laboratory
7000 East Ave., L-627
Livermore, CA 94550
925-423-1062
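
For comparison, a minimal sketch of the two idioms mentioned above. It uses the per-call stringsAsFactors argument of data.frame() rather than the global option, and the column name unit and the data are made up for illustration. (Since R 4.0.0 the default is already FALSE, so the argument is redundant there.)

```r
# Legacy idiom: I() keeps the column as character but adds class "AsIs".
df_asis <- data.frame(unit = I(c("x", "ug")))

# Alternative: ask data.frame() directly not to create factors.
df_plain <- data.frame(unit = c("x", "ug"), stringsAsFactors = FALSE)

class(df_asis$unit)    # "AsIs"
class(df_plain$unit)   # "character"
```

The second form leaves a plain character column, so sorting it does not go through the AsIs method at all.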
