
a question of alphabetical order [follow-up]

Your 'nightmare' does seem specific to Mac OS.  Your example is collated 
correctly in all the es_ES locales on my Linux box, and also in 
es_ES.UTF-8 on Solaris 10.

We have no idea what data you collected to assert 'whatever platform we 
use'.  UTF-8 locales on Mac OS X are the only instance in C where I am 
aware of the use of Unicode code point order (quite a few scripting languages 
do it, though).  If the problem were widespread I would expect it to be 
reported more than it is (and 'ls' output is locale-specific in recent 
versions of Linux, and my IT team did get several help requests about 
that).

Collation is a tricky area, but that does not mean that OS designers are 
in general shy of it.  There was a concerted project, the Unicode 
Collation Algorithm, and several OSes have implementations including 
national 'tailorings'.

What can be done about it?  The obvious answer is to use a reliable OS. 
Alternatively, R is making use of the system's C collation functions and 
those could be replaced.  In current R (>= 2.7.0) this is centralized in 
src/main/utils.c, in the following (non-Windows) code:

# ifdef HAVE_STRCOLL
#  define STRCOLL strcoll
# else
#  define STRCOLL strcmp
# endif

int Scollate(SEXP a, SEXP b)
{
     return STRCOLL(translateChar(a), translateChar(b));
}
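
To see what the system's collation gives outside R, something like the following could be used.  It is only a sketch: collate_strings() and sort_in_locale() are names made up here, and plain C strings stand in for R's SEXP/translateChar() machinery.  It sorts an array with strcoll() after selecting a locale for LC_COLLATE.

```c
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for Scollate(): compares two plain C strings
   with the C library's locale-aware strcoll(). */
int collate_strings(const char *a, const char *b)
{
    return strcoll(a, b);
}

/* qsort() comparator: each element is a pointer to a C string. */
int collate_cmp(const void *pa, const void *pb)
{
    const char *a = *(const char *const *)pa;
    const char *b = *(const char *const *)pb;
    return collate_strings(a, b);
}

/* Sort n strings under the named locale's LC_COLLATE rules.
   Returns 0 on success, -1 if the system does not provide the locale. */
int sort_in_locale(const char **strings, size_t n, const char *locale)
{
    if (setlocale(LC_COLLATE, locale) == NULL)
        return -1;
    qsort(strings, n, sizeof strings[0], collate_cmp);
    return 0;
}
```

Calling sort_in_locale(words, n, "es_ES.UTF-8") and then printing the array shows the order the platform's strcoll produces for that locale; in the "C" locale the result is the same as with strcmp.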

Mac OS X has strcoll (it is a C99 function, so that test is historical), 
and what would be needed would be to replace it by a more functional 
version.  My suspicion is that Mac OS X does have proper collation 
functionality (http://en.wikipedia.org/wiki/Common_Locale_Data_Repository 
appears to claim it uses CLDR data), but that it is not used in the ISO 
C99 part of the OS.  For example, Cocoa seems to have a function 
'localizedCompare'.
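
Because the comparison is centralized, swapping in another implementation amounts to changing one function.  A minimal sketch of that idea, assuming a hypothetical hook (collate_fn, set_collate_fn() and collate() are invented for illustration and are not part of R), might look like this; strcmp is used as the replacement only because it is guaranteed to exist, whereas a real replacement would wrap a collation library.

```c
#include <string.h>

/* Signature shared by strcoll(), strcmp() and any replacement. */
typedef int (*collate_fn)(const char *, const char *);

/* Hypothetical hook: defaults to the C library's strcoll(). */
static collate_fn current_collate = strcoll;

/* Install a replacement comparison; passing NULL restores strcoll(). */
void set_collate_fn(collate_fn fn)
{
    current_collate = fn ? fn : strcoll;
}

/* Single entry point through which all string comparisons would go. */
int collate(const char *a, const char *b)
{
    return current_collate(a, b);
}
```

With this shape, callers keep using collate() and never need to know which implementation is behind it.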


BTW, Ei-ji Nakama has already replaced the broken wctype and wcwidth 
functions in Mac OS: see file src/main/rlocale.c.
On Wed, 16 Apr 2008, [Ricardo Rodriguez] Your XEN ICT Team wrote: