Bug in rank with utf8?

Thu, Aug 13, 2015 11:10 PM

On 13/08/2015 15:19, peter dalgaard wrote:

And remember that on some platforms (including yours) ICU is used, so 
LC_COLLATE is not particularly relevant (unless it is 'C').  See 
?Comparisons and ?icuGetCollate.

E.g. on my Yosemite system in en_US.UTF-8

[1] 1.5 1.5

[1] "root"

[1] 2 1

whereas on Fedora 21

[1] 2 1

[1] "root"
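
To see which of these cases applies on a given system, a minimal sketch along the following lines may help. The variable names (uses_icu, icu_collator, lc_collate) are only illustrative, and the values returned will differ between platforms and R builds.

```r
# Illustrative names only; none of these variables appear in the thread.

# Does this build of R report ICU as available for collation?
# isTRUE() guards against an NA on builds that do not report the capability.
uses_icu <- isTRUE(capabilities("ICU"))

# Description of the ICU collator currently in use. The exact string is
# platform dependent and may indicate that ICU is not in use at all.
icu_collator <- icuGetCollate()

# The LC_COLLATE setting, which matters mainly when ICU is not used
# (or when it is "C", in which case ICU is bypassed).
lc_collate <- Sys.getlocale("LC_COLLATE")

print(uses_icu)
print(icu_collator)
print(lc_collate)
```

If ICU is in use, icuSetCollate() (see ?icuSetCollate) is the documented way to tune the collator, rather than LC_COLLATE.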

Collation order will depend on locale settings, and there are quite a few cases where the collation order of two items is not defined.

To add to the confusion, on OSX Mavericks, I see

x <- "\u0663"
y <- 3

x == y

[1] FALSE

rank(c(x, y))

[1] 2 1

[1] "?"

x == y

[1] FALSE

x > y

[1] TRUE

x < y

[1] FALSE

Sys.getlocale()

[1] "en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8"

Sys.getlocale("LC_COLLATE")

[1] "en_US.UTF-8"

Notice the differences from en_US.UTF8 (sans hyphen) on your system....

-pd

On 13 Aug 2015, at 16:01, John McKown <john.archie.mckown at gmail.com> wrote:

2015-08-13 8:39 GMT-05:00 Hadley Wickham <h.wickham at gmail.com>:

x <- "\u0663"
y <- 3

x == y
# FALSE
rank(c(x, y))
# c(1.5, 1.5)

Also interesting, and confusing to me:

x == y

[1] FALSE

x > y

[1] FALSE

x < y

[1] FALSE

With some slight changes:

x <- "\u0663"
y <- "3"
xy <- c(x,y)
rank(xy);

[1] 1.5 1.5

Sys.getlocale();

[1] "LC_CTYPE=en_US.UTF8;LC_NUMERIC=C;LC_TIME=en_US.UTF8;LC_COLLATE=en_US.UTF8;LC_MONETARY=en_US.UTF8;LC_MESSAGES=en_US.UTF8;LC_PAPER=en_US.UTF8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF8;LC_IDENTIFICATION=C"

Sys.setlocale(category="LC_COLLATE", locale="C");

[1] "C"

rank(xy);

[1] 2 1
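
If the aim is a ranking that does not depend on the session's collation settings, one possible approach is to switch LC_COLLATE to "C" only for the duration of the call and restore it afterwards. The helper below is a sketch under that assumption; rank_c_locale is a hypothetical name, not an existing R function.

```r
# Hypothetical helper (not part of base R): rank a vector under the "C"
# collation locale, then restore whatever LC_COLLATE was set before.
rank_c_locale <- function(x, ...) {
  old <- Sys.getlocale("LC_COLLATE")
  # Restore the previous setting even if rank() fails.
  on.exit(Sys.setlocale("LC_COLLATE", old), add = TRUE)
  Sys.setlocale("LC_COLLATE", "C")
  rank(x, ...)
}

x <- "\u0663"   # ARABIC-INDIC DIGIT THREE
y <- "3"
before <- Sys.getlocale("LC_COLLATE")

r <- rank_c_locale(c(x, y))
print(r)   # the thread reports 2 1 under LC_COLLATE = "C"

after <- Sys.getlocale("LC_COLLATE")
```

This changes a process-wide setting for a moment, so it is not a substitute for choosing the collation deliberately; it only makes the result independent of whatever locale the session started in.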

Brian D. Ripley,                  ripley at stats.ox.ac.uk
Emeritus Professor of Applied Statistics, University of Oxford
1 South Parks Road, Oxford OX1 3TG, UK
