
Strange result from sort: sort(c("aa", "ff")) gives "ff" "aa" with R 2.12.1 on Windows 7

2 messages · Søren Højsgaard, Brian Ripley

#
Dear list,

Please consider the following calls to sort:
[1] "a" "f"
[1] "a" "f"
[1] "ff" "aa"
[1] "ff" "aa"
The last two results look strange to me. Is that a bug?

The result seems to come from calls to order:
[1] 1 2
[1] 2 1
[1] 2 1
[1] 1 2
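
The calls themselves did not survive in this archive; only their printed results did. The sketch below is a reconstruction, offered as an assumption rather than the poster's exact code: it is inferred from the subject line and from the sort and order results shown above, and the variable names are hypothetical. The output depends on the LC_COLLATE setting of the session, so no expected output is given.

```r
# Reconstructed inputs (an assumption; the original calls were lost).
single <- c("a", "f")
double <- c("aa", "ff")

# Collation rules in effect for this session; sort() and order() follow them.
Sys.getlocale("LC_COLLATE")

# Each vector is sorted in both input orders, as the four results suggest.
sort(single)
sort(rev(single))
sort(double)
sort(rev(double))

# order() returns the permutation that sort() applies.
order(single)
order(rev(single))
order(double)
order(rev(double))
```

In a locale that collates 'aa' after 'ff', the last two sort calls would both return "ff" "aa", which is the behaviour reported here.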
I get the same results on R 2.12.1, R 2.11.1 and R 2.13.0 on Windows 7. However, on Linux I get the "right answer" (the answer I expected). From the help pages I get the impression that there might be an issue with the locale, but I didn't understand the details.

Can anyone tell me what is going on here, please?

Regards
Søren
R version 2.12.1 Patched (2010-12-27 r53883)
Platform: i386-pc-mingw32/i386 (32-bit)
locale:
[1] LC_COLLATE=Danish_Denmark.1252  LC_CTYPE=Danish_Denmark.1252
[3] LC_MONETARY=Danish_Denmark.1252 LC_NUMERIC=C
[5] LC_TIME=Danish_Denmark.1252
attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base
other attached packages:
[1] SHDtools_1.0

R version 2.12.1 (2010-12-16)
Platform: i686-pc-linux-gnu (32-bit)
locale:
 [1] LC_CTYPE=en_DK.utf8       LC_NUMERIC=C
 [3] LC_TIME=en_DK.utf8        LC_COLLATE=en_DK.utf8
 [5] LC_MONETARY=C             LC_MESSAGES=en_DK.utf8
 [7] LC_PAPER=en_DK.utf8       LC_NAME=C
 [9] LC_ADDRESS=C              LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_DK.utf8 LC_IDENTIFICATION=C
attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base
#
On Mon, 24 Jan 2011, Søren Højsgaard wrote:

It seems that you and your OS disagree about Danish, and I'm in no 
position to know which is correct.  But this is not an R issue: the 
sorting is done by OS services.
I recall that 'aa' used to sort at the end of the alphabet in Danish 
telephone books, so it seems the sort used on Windows thinks so too. 
See ?Comparison for some further details.  What I don't understand is 
that someone resident in Denmark finds this strange ....
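
To illustrate what ?Comparison describes, the following sketch shows that string comparison with < and the default sort() follow the same collating sequence, so they agree with each other in whatever locale is in use. It assumes a recent version of R: the method = "radix" argument to sort() did not exist in the R 2.12 discussed in this thread. In versions that have it, radix sorting of character vectors uses C-locale (byte) order regardless of LC_COLLATE. The variable names are hypothetical.

```r
x <- c("ff", "aa")

# Collating sequence currently in effect.
Sys.getlocale("LC_COLLATE")

# Locale-dependent: may be FALSE where 'aa' collates after 'ff'.
less_than <- "aa" < "ff"

# Locale-dependent: uses the same collation as the comparison above.
by_locale <- sort(x)

# Locale-independent: radix sort of strings uses C-locale ordering.
by_bytes <- sort(x, method = "radix")

print(less_than)
print(by_locale)
print(by_bytes)
```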

I get exactly the same in a Danish locale on Mac OS X, for example:
[1] "ff" "aa"

and also on my Linux box (Fedora 14 with LC_COLLATE=da_DK.utf8)
[1] "ff" "aa"

en_DK is not a Danish locale (it is English in Denmark).  If you want 
an English sort, try an English locale for LC_COLLATE (there may well 
be several, hence 'an').
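
A minimal sketch of trying a different LC_COLLATE for a single sort, assuming only base R. The helper name sort_in_locale is hypothetical, not part of R. Locale names are platform-specific, so the English names in the comments are examples that may or may not exist on a given system; only "C" is available everywhere.

```r
# Sort x under the given LC_COLLATE, then restore the previous setting.
# `sort_in_locale` is a hypothetical helper, not part of base R.
sort_in_locale <- function(x, locale) {
  old <- Sys.getlocale("LC_COLLATE")
  on.exit(Sys.setlocale("LC_COLLATE", old), add = TRUE)
  # Sys.setlocale() returns "" (with a warning) when the request fails.
  new <- suppressWarnings(Sys.setlocale("LC_COLLATE", locale))
  if (identical(new, "")) {
    stop("locale not available on this system: ", locale)
  }
  sort(x)
}

# The C locale compares byte values, so "aa" comes before "ff".
sort_in_locale(c("ff", "aa"), "C")

# English locales, if installed (names are assumptions and vary by OS):
# sort_in_locale(c("ff", "aa"), "en_GB.UTF-8")                  # Linux, macOS
# sort_in_locale(c("ff", "aa"), "English_United Kingdom.1252")  # Windows
```

Restoring the old setting in on.exit() keeps the change local to the call, so the rest of the session continues to use the original collation.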