
sort yields different results on OS X (PR#14163)

5 messages · jeffreys at rand.org, Peter Dalgaard, Brian Ripley +1 more

#
Full_Name: Jeffrey Sullivan
Version: 2.10
OS: Mac
Submission from: (NULL) (130.154.0.250)


Sort produces different results when sorting strings with non-alphanumeric
characters, depending on the operating system:

RHEL 5.2, R 2.10.0
-------------
[1] "en_US.UTF-8"
[1] "<0" "1"  "2"  ">3"

Mac OS 10.5.8, R 2.10.1
-------------------
[1] "en_US.UTF-8"
[1] "<0" ">3" "1"  "2"
#
As the help says

      The sort order for character vectors will depend on the collating
      sequence of the locale in use: see 'Comparison'.

and that ref says

      Collation of
      non-letters (spaces, punctuation signs, hyphens, fractions and so
      on) is even more problematic.

That different OSes use the same name for a locale does not make them 
the same locale.
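
[Illustration] A minimal sketch of one way to get the same order on every
platform: switch LC_COLLATE to the C locale for the duration of the sort, so
that strings are compared by byte value instead of by the platform's
"en_US.UTF-8" tables. The input vector is an assumption; the report shows only
the sorted output.

```r
# Assumed input; the original report does not show the unsorted vector.
v <- c("1", "<0", ">3", "2")

# Remember the current collation locale so it can be restored afterwards.
old <- Sys.getlocale("LC_COLLATE")
print(sort(v))                      # platform-dependent order

# In the C locale, comparison is by byte value: digits precede "<" and ">".
Sys.setlocale("LC_COLLATE", "C")
c_sorted <- sort(v)
Sys.setlocale("LC_COLLATE", old)    # restore the previous setting

print(c_sorted)
# [1] "1"  "2"  "<0" ">3"
```

The C order is rarely what a human reader expects, but it does not depend on
which OS supplied the locale definition.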

Note that R can be compiled to use ICU, which provides a 
well-considered collation suite.  R on Mac OS X uses ICU, as does a 
Linux build if it is available -- so I would say that it is RHEL that 
is out of line here (it makes little sense to have < and > far apart 
in the collation sequence).

Why did you report a documented difference as a bug?
On Mon, 21 Dec 2009, jeffreys at rand.org wrote:

            

  
    
#
Prof Brian Ripley wrote:

            
That's not it:

 > v <- c("1","<0","<3","2")
 > sort(v)
[1] "<0" "1"  "2"  "<3"

The point is rather that "special characters" are ignored during collation.

Apparently, this comes from /usr/share/i18n/locales/iso14651_t1_common 
on Fedora; I wouldn't know how faithful to the ISO standard that is.
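
[Illustration] The effect described above can be imitated in plain R. This is
only a rough emulation, assuming "ignored" means the characters contribute
nothing to the primary comparison; it is not glibc's actual algorithm, and the
`key` helper vector is hypothetical.

```r
v <- c("1", "<0", "<3", "2")

# Hypothetical sort key: drop every character that is not a letter or digit,
# so "<0" is compared as "0" and "<3" as "3".
key <- gsub("[^[:alnum:]]", "", v)

ignoring_specials <- v[order(key)]
print(ignoring_specials)
# [1] "<0" "1"  "2"  "<3"
```

The result matches the RHEL/Fedora output shown above, which is consistent
with the explanation that the punctuation is skipped when the strings are
compared.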
#
On Tue, 22 Dec 2009, Peter Dalgaard wrote:

            
Sometimes ....
ISO 14651 is a version of the Unicode Collation Algorithm 
(http://www.unicode.org/reports/tr10/) which ICU uses.  So other 
people have implemented the same set of rules to give different 
results -- which is quite possible given the number of non-prescribed 
choices that need to be made.

We've seen too many anomalies from glibc to trust it: which is why ICU 
is used if available.
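
[Illustration] A hedged sketch of how one might check whether a given R build
collates with ICU before relying on it. `capabilities("ICU")` is available in
recent R versions and may not exist in R 2.10; the choice of "en_US" as the
ICU locale is an assumption for the example.

```r
v <- c("1", "<0", ">3", "2")

# TRUE only if this build reports ICU support; NA or a missing entry counts
# as "no ICU".
has_icu <- isTRUE(unname(capabilities("ICU")))

if (has_icu) {
  # Changes the collator for the rest of the session; ICU is not used at all
  # when the collation locale is C/POSIX.
  icuSetCollate(locale = "en_US")
  print(sort(v))
} else {
  message("This R build does not use ICU; sort() uses the OS collation.")
}
```

On a build without ICU, calling icuSetCollate() directly only produces a
warning, so the check avoids silently assuming ICU ordering.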
#
On Dec 22, 2009, at 4:18 AM, Prof Brian Ripley wrote:

            
Because it wasn't clear to me from the documentation what sort of  
"problematic" behaviors were covered as documented differences vs  
unexpected behavior. Other OSS projects I have been involved with have  
a "when in doubt, file a bug" policy. If that isn't the case with R, I  
won't do so in the future.

Thank you for the pointer towards ICU. RHEL has some of the ICU  
libraries, but the icuSetCollate function returns a warning that R was  
not built with them. Including a reference to this function in the  
"See Also" for Comparison would make this info a little easier to find.

Thanks for your time,
Jeff