Problem with order() and I()

7 messages · MacQueen, Don, Martin Maechler, Peter Dalgaard

#
I have found that order() fails in a rather arcane circumstance, as in
this example:
> foo <- I( c('x','\265g') )
> order(foo)
Error in if (xi > xj) 1L else -1L : missing value where TRUE/FALSE needed

> foo <- c('x','\265g')
> order(foo)
[1] 1 2

> sessionInfo()
R version 3.1.1 (2014-07-10)
Platform: x86_64-apple-darwin13.1.0 (64-bit)

locale:
[1] C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

Thanks
-Don

p.s.
Just a little background, irrelevant unless one wonders why I'm using I()
and \265:

If I were writing new code I wouldn't be using I(), since there are better
ways now to achieve the same end (preventing the creation of factors in
data frames), but the scripts that use it are quite old, originally
developed in 2001.

In at least some but perhaps limited contexts, '\265' produces the Greek
letter mu, and that's why I'm using it. And if I remember correctly, 2001
is prior to the current R support for locales and extended character sets.
Using \265 is what I could find at that time to get a mu into my output.

I came across this while checking some things; it's not actually breaking
my scripts, so I doubt it's due to any recent change.
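
As a side note on the escape itself: '\265' is an octal escape for the single byte 0xB5, which Latin-1 maps to the micro sign. The following is a minimal sketch, assuming the Latin-1 reading is the one the old scripts relied on; the variable names are illustrative only.

```r
# '\265' (octal) and '\xb5' (hex) denote the same single byte, 0xB5.
x <- "\265g"
print(charToRaw(x))        # bytes b5 67

# Assumption: the byte is meant as Latin-1, where 0xB5 is the micro sign.
# Converting to UTF-8 yields a string that is valid in a UTF-8 session.
x_utf8 <- iconv(x, from = "latin1", to = "UTF-8")
print(charToRaw(x_utf8))   # bytes c2 b5 67

# A Unicode escape is a locale-independent way to write the same character.
y <- enc2utf8("\u00b5g")
print(identical(charToRaw(y), charToRaw(x_utf8)))
```

Whether the raw byte or the UTF-8 form displays as a mu depends on the session's locale and output device.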
#
> I have found that order() fails in a rather arcane circumstance, as in
    > this example:

    >> foo <- I( c('x','\265g') )
    >> order(foo)
    > Error in if (xi > xj) 1L else -1L : missing value where TRUE/FALSE needed

    >> foo <-c('x','\265g')
    >> order(foo)
    > [1] 1 2

yes, this is not desirable.
order() in such cases calls xtfrm()  {as documented}
and that ends up calling rank() and then the internal  .gt()
where the bug happens because

 > I("x") > I("\xb5g")
 [1] NA
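
The dispatch path described above can be observed with a toy S3 class. This is only a sketch: the class name `demo_desc` and its method are hypothetical, invented here to show that order() consults xtfrm() when its argument carries a class.

```r
# Hypothetical class whose xtfrm() method reverses the sort order.
xtfrm.demo_desc <- function(x) -unclass(x)
registerS3method("xtfrm", "demo_desc", xtfrm.demo_desc)

x <- structure(c(3, 1, 2), class = "demo_desc")

plain   <- order(unclass(x))  # ordinary ascending order of 3, 1, 2
classed <- order(x)           # sort keys come from xtfrm.demo_desc()

print(plain)    # 2 3 1
print(classed)  # 1 3 2
```

Because the keys come from the method, any comparison problem inside a class's xtfrm() path surfaces in order() as well.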

but really I think the change should happen in xtfrm.AsIs(.)
which I think should drop the class also in this case.
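
From the user's side, that suggestion amounts to dropping the class by hand before sorting. A minimal sketch with plain ASCII strings, chosen so that the result does not depend on locale or R version:

```r
foo <- I(c("b", "a", "c"))

# The AsIs method for xtfrm() lives in base R; this only confirms it exists.
print(is.function(getS3method("xtfrm", "AsIs")))

# unclass() removes the "AsIs" class, so order() takes the default path.
bare <- unclass(foo)
print(class(bare))   # "character"
print(order(bare))   # 2 1 3
```

This says nothing about how the incomparable '\265g' case should behave; it only shows the class being removed.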

More on this, once we have fixed it.

Thank you, Don, very much!

Martin Maechler,
ETH Zurich

    > -- 
    > Don MacQueen

    > Lawrence Livermore National Laboratory
    > 7000 East Ave., L-627
    > Livermore, CA 94550
    > 925-423-1062

    > ______________________________________________
    > R-devel at r-project.org mailing list
    > https://stat.ethz.ch/mailman/listinfo/r-devel
#
It's actually a little more complicated. I wrote a note, but it seems to be stuck in the outbox on my home machine (I probably forgot to click Send...). 

One important aspect is that

> "x" < "\265g"
[1] NA

which makes me wonder if the bug really is in the case that "works". It seems that it is possible to rank() character vectors that contain incomparable elements.

-pd
On 09 Sep 2014, at 16:19 , Martin Maechler <maechler at stat.math.ethz.ch> wrote:

            

  
    
#
You are welcome.

-Don



#
> It's actually a little more complicated. I wrote a note, but it seems to be stuck in the outbox on my home machine (I probably forgot to click Send...). 
    > One important aspect is that

    >> "x" < "\265g"
    > [1] NA

    > which makes me wonder if the bug really is in the case that "works". It seems that it is possible to rank() character vectors that contain incomparable elements.

    > -pd

yes you are right that it is even more complicated.
In both cases, our Scollate() is involved,
(Scollate: the one where we had a discussion about making it part of the C
 level R API, which would help package authors ..)

After

  ch <- c('x','\265g')
  foo <- I(ch)

Of the four expressions,

  order(ch)
  order(foo)
  ch [1] < ch [2]
  foo[1] < foo[2]

only the first one "works"; the others give NA, or an error caused by that NA.
The first one is also the only one of the four that does not use
do_relop_dflt().
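
To see how the four expressions behave on a given installation, they can be run side by side with errors captured. This is a sketch only: the helper name `attempt` is made up here, and no expected output is shown because the results depend on the R version and locale.

```r
ch  <- c("x", "\265g")
foo <- I(ch)

# Evaluate an expression, turning any error into a short message string.
attempt <- function(expr) {
  tryCatch(expr, error = function(e) paste("error:", conditionMessage(e)))
}

results <- list(
  "order(ch)"       = attempt(order(ch)),
  "order(foo)"      = attempt(order(foo)),
  "ch[1] < ch[2]"   = attempt(ch[1] < ch[2]),
  "foo[1] < foo[2]" = attempt(foo[1] < foo[2])
)
print(results)
```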

It's not even clear what we'd want (as I think pd also alluded to):
Ideally all of these should work consistently, which, because of
 "<(.,.)" returning NA in both cases,
would mean that order(ch) should also give an error, as order(foo) does
    {an error whose message we should improve in any case!!}.
Big Q: Can we afford order(ch) giving an error in such cases?
There is a pretty high chance that this will "break" much user (and probably
even package) code out there.

Still, the other solution, namely order(foo) behaving as
order(ch) now does, would correspond to ">" giving FALSE
instead of NA, so this solution is not ok in my view.

Martin

    > -- 
    > Peter Dalgaard, Professor,
    > Center for Statistics, Copenhagen Business School
    > Solbjerg Plads 3, 2000 Frederiksberg, Denmark
    > Phone: (+45)38153501
    > Email: pd.mes at cbs.dk  Priv: PDalgd at gmail.com
#
[This is the note I alluded to earlier today.]
On 08 Sep 2014, at 18:06 , MacQueen, Don <macqueen1 at llnl.gov> wrote:

            
The oddity is really that it works (for some value of "works") in the unclassed case:
Error in if (xi > xj) 1L else -1L : missing value where TRUE/FALSE needed
[1] "x"
[1] "\xb5g"
[1] NA
[1] NA
[1] "x"
[1] "\xb5g"
[1] NA
[1] NA
[1] 2 1

Notice that the unclassed `fee` has exactly the same issue that its elements are incomparable as `foo` does.

The thing is that xtfrm.AsIs will use elementwise comparison, whereas xtfrm.default will use rank(), which somehow manages to do something with character vectors for which the sort order is undefined:
Error in if (xi > xj) 1L else -1L : missing value where TRUE/FALSE needed
[1] 2 1

(Notice that xtfrm calls rank and vice versa, presumably without creating a loop. I gave up on sorting out the logic.)
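
For character input whose elements are comparable, the relationship between xtfrm(), rank() and order() can be checked directly. A small sketch with ASCII strings, chosen so that the collation order is not in question:

```r
x <- c("pear", "apple", "fig")

print(xtfrm(x))   # numeric sort keys for a plain character vector
print(rank(x))    # ranks of the same elements

# Sorting by the xtfrm() keys reproduces order(x).
print(identical(order(xtfrm(x)), order(x)))
```

It does not cover the incomparable case discussed in this thread, where the outcome depends on how the NA comparisons are handled.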

  
    
#
Early on I had been wondering if deprecating I() and the AsIs class would
be a way to get the problem to go away. I imagine (based on no data at
all!) that they are rarely used. If I were writing the same code today, I
would use options(stringsAsFactors=FALSE) instead of sprinkling I() here
and there throughout my scripts.
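
The difference between the two styles can be seen in the column classes. A minimal sketch, using the per-call `stringsAsFactors` argument of data.frame() rather than the global option; the data are made up for illustration.

```r
# Old style: I() protects the strings but tags the column with class "AsIs".
d_old <- data.frame(unit = I(c("mg", "kg")), stringsAsFactors = TRUE)

# Newer style: ask data.frame() not to create factors in the first place.
d_new <- data.frame(unit = c("mg", "kg"), stringsAsFactors = FALSE)

print(class(d_old$unit))  # "AsIs"
print(class(d_new$unit))  # "character"
print(order(d_new$unit))  # 2 1
```

With a plain character column, order() never reaches the AsIs method at all.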

But I see from the discussions that there's something deeper going on.

Thanks for continuing to cc me; I find it interesting.

-Don