Bug in rank with utf8?
On 13/08/2015 15:19, peter dalgaard wrote:
Yes, collation is a strange thing, and ...
And remember that on some platforms (including yours) ICU is used, so LC_COLLATE is not particularly relevant (unless it is 'C'). See ?Comparisons and ?icuGetCollate. E.g. on my Yosemite system in en_US.UTF-8
rank(c(x, y))
[1] 1.5 1.5
icuGetCollate()
[1] "root"
icuSetCollate(locale="ASCII")
rank(c(x, y))
[1] 2 1

whereas on Fedora 21
rank(c(x, y))
[1] 2 1
icuGetCollate()
[1] "root"
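
To see which of these cases applies on a given machine, something like the following may help. It is only a minimal sketch: the helper name collation_info is made up for illustration, and it assumes R >= 3.2.0 so that icuGetCollate() is available. It collects the three settings mentioned above in one place.

```r
# Hypothetical helper (not part of R): gather the settings that
# influence how character vectors are compared and ranked.
collation_info <- function() {
  list(
    icu_available = unname(capabilities("ICU")),  # was this R built with ICU?
    icu_collator  = icuGetCollate(),              # e.g. "root", or a note that ICU is not in use
    lc_collate    = Sys.getlocale("LC_COLLATE")   # mostly relevant when it is "C"
  )
}

str(collation_info())
```

Note that capabilities("ICU") only says whether ICU support was compiled in, not whether it is being used for the current locale, which is why the sketch reports icuGetCollate() as well.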
Collation order will depend on locale settings, and there are quite a few cases where the collation order of two items is not defined. To add to the confusion, on OSX Mavericks, I see
x <- "\u0663"
y <- 3
x == y
[1] FALSE
rank(c(x, y))
[1] 2 1
x
[1] "٣"
x == y
[1] FALSE
x > y
[1] TRUE
x < y
[1] FALSE
Sys.getlocale()
[1] "en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8"
Sys.getlocale("LC_COLLATE")
[1] "en_US.UTF-8"

Notice the differences from en_US.UTF8 (sans hyphen) on your system...

-pd

On 13 Aug 2015, at 16:01, John McKown <john.archie.mckown at gmail.com> wrote:
2015-08-13 8:39 GMT-05:00 Hadley Wickham <h.wickham at gmail.com>:
x <- "\u0663"
y <- 3
x == y          # FALSE
rank(c(x, y))   # c(1.5, 1.5)

Also interesting, and confusing to me:
x == y
[1] FALSE
x > y
[1] FALSE
x < y
[1] FALSE
With some slight changes:
x <- "\u0663"
y <- "3"
xy <- c(x,y)
rank(xy);
[1] 1.5 1.5
Sys.getlocale();
[1] "LC_CTYPE=en_US.UTF8;LC_NUMERIC=C;LC_TIME=en_US.UTF8;LC_COLLATE=en_US.UTF8;LC_MONETARY=en_US.UTF8;LC_MESSAGES=en_US.UTF8;LC_PAPER=en_US.UTF8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF8;LC_IDENTIFICATION=C"
Sys.setlocale(category="LC_COLLATE", locale="C");
[1] "C"
rank(xy);
[1] 2 1
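
The C-locale result above corresponds to plain code-point (byte) ordering: U+0663 sits far above the ASCII digit "3". A minimal sketch of that idea follows; rank_c is a hypothetical name, not an R function, and the sketch assumes Sys.setlocale() is permitted to change LC_COLLATE on the platform in question.

```r
x <- "\u0663"   # ARABIC-INDIC DIGIT THREE
y <- "3"

# Code points do not depend on the locale.
utf8ToInt(x)    # 1635 (0x0663)
utf8ToInt(y)    # 51   (0x33)

# Hypothetical helper: rank a character vector in the C locale,
# restoring the caller's LC_COLLATE setting afterwards.
rank_c <- function(v) {
  old <- Sys.getlocale("LC_COLLATE")
  on.exit(Sys.setlocale("LC_COLLATE", old), add = TRUE)
  Sys.setlocale("LC_COLLATE", "C")
  rank(v)
}

rank_c(c(x, y))
```

This should reproduce the 2 1 ranking shown above regardless of the session's usual locale, at the cost of giving up any language-aware ordering.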
Brian D. Ripley, ripley at stats.ox.ac.uk
Emeritus Professor of Applied Statistics, University of Oxford
1 South Parks Road, Oxford OX1 3TG, UK