a question of alphabetical order [follow-up]
Your 'nightmare' does seem specific to Mac OS. Your example is collated
correctly in all the es_ES locales on my Linux box, and also in
es_ES.UTF-8 on Solaris 10.
We have no idea what data you collected to assert 'whatever platform we
use'. UTF-8 locales on Mac OS X are the only instance in C that I am
aware of where Unicode code point order is used (quite a few scripting languages
do it, though). If the problem were widespread I would expect it to be
reported more than it is (and 'ls' output is locale-specific in recent
versions of Linux, and my IT team did get several help requests about
that).
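
To make 'code point order' concrete, here is a minimal sketch in C. It assumes the strings are UTF-8 encoded, in which case plain byte comparison with strcmp gives the same order as comparing code points. The names cmp_codepoint and sort_codepoint are made up for this illustration and are not part of R or of any OS.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* qsort comparator: plain byte order. For UTF-8 input this is the same
   as Unicode code point order, and it ignores the locale entirely. */
static int cmp_codepoint(const void *a, const void *b)
{
    const char *const *sa = a;
    const char *const *sb = b;
    return strcmp(*sa, *sb);
}

/* Sort an array of n C strings in place by code point order.
   Hypothetical helper, for illustration only. */
void sort_codepoint(const char **words, size_t n)
{
    assert(words != NULL || n == 0);
    if (n > 0)
        qsort(words, n, sizeof *words, cmp_codepoint);
}
```

Because every accented letter is encoded with a lead byte above 0x7F, a word such as 'ángulo' ends up after 'tren' under this ordering, which is the behaviour complained about.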
Collation is a tricky area, but that does not mean that OS designers are
in general shy of it. There was a concerted project, the Unicode
Collation Algorithm, and several OSes have implementations including
national 'tailorings'.
What can be done about it? The obvious answer is to use a reliable OS.
Alternatively, R is making use of the system's C collation functions and
those could be replaced. In current R (>= 2.7.0) this is centralized in
src/main/utils.c, in the code (not Windows)
# ifdef HAVE_STRCOLL
# define STRCOLL strcoll
# else
# define STRCOLL strcmp
# endif

int Scollate(SEXP a, SEXP b)
{
    return STRCOLL(translateChar(a), translateChar(b));
}
Mac OS X has strcoll (it is a C99 function, so that test is historical),
and what would be needed would be to replace it by a more functional
version. My suspicion is that Mac OS X does have proper collation
functionality (http://en.wikipedia.org/wiki/Common_Locale_Data_Repository
appears to claim it uses CLDR data), but that it is not used in the ISO
C99 part of the OS. For example, Cocoa seems to have a function
'localizedCompare'.
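
As a rough illustration of what 'a more functional version' might do, here is a minimal sketch of a comparator for UTF-8 strings that follows the current Spanish rules described in the quoted message: accented vowels sort with their base letters, and ch and ll are not treated as single letters. It also assumes that ñ sorts directly after n. The names es_weight and es_collate are hypothetical and are not part of R or of Mac OS X. This is not the Unicode Collation Algorithm: it only knows ASCII plus á é í ó ú ü ñ (either case), and a real replacement would presumably draw on CLDR data instead.

```c
#include <assert.h>
#include <string.h>

/* Read one character at *p (ASCII, or a 2-byte UTF-8 sequence with lead
   byte 0xC3), advance *p, and return a primary weight.
   Hypothetical helper, for illustration only. */
static int es_weight(const unsigned char **p)
{
    unsigned c = *(*p)++;
    int base;

    if (c == 0xC3 && **p != 0) {
        /* In this UTF-8 range, upper and lower case differ by bit 0x20. */
        unsigned d = *(*p)++ | 0x20u;
        switch (d) {
        case 0xA1: base = 'a'; break;            /* a acute */
        case 0xA9: base = 'e'; break;            /* e acute */
        case 0xAD: base = 'i'; break;            /* i acute */
        case 0xB3: base = 'o'; break;            /* o acute */
        case 0xBA: case 0xBC: base = 'u'; break; /* u acute, u diaeresis */
        case 0xB1: return 256 + 2 * ('n' - 'a') + 1; /* n tilde: after n */
        default:   return 512 + (int)d;          /* unknown: after letters */
        }
        return 256 + 2 * (base - 'a');
    }
    if (c >= 'A' && c <= 'Z')
        c += 'a' - 'A';                          /* ignore case at this level */
    if (c >= 'a' && c <= 'z')
        return 256 + 2 * ((int)c - 'a');
    /* Other ASCII sorts before letters, other bytes after them. */
    return c < 0x80 ? (int)c : 512 + (int)c;
}

/* Compare two UTF-8 strings: primary weights first, then byte order
   as a tie-break so that distinct strings never compare equal. */
int es_collate(const char *a, const char *b)
{
    const unsigned char *pa = (const unsigned char *)a;
    const unsigned char *pb = (const unsigned char *)b;

    assert(a != NULL && b != NULL);
    while (*pa && *pb) {
        int wa = es_weight(&pa);
        int wb = es_weight(&pb);
        if (wa != wb)
            return wa < wb ? -1 : 1;
    }
    if (*pa) return 1;   /* b is a prefix of a */
    if (*pb) return -1;  /* a is a prefix of b */
    return strcmp(a, b);
}
```

With this comparator 'ángulo' sorts between 'aleta' and 'avión', and 'choco' sorts before 'coche'. Something of this shape could in principle stand in for STRCOLL in the code above, but that is an assumption about how a replacement might be wired in, not a description of R.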
BTW, Ei-ji Nakama has already replaced the broken wctype and wcwidth
functions in Mac OS: see file src/main/rlocale.c.
On Wed, 16 Apr 2008, [Ricardo Rodriguez] Your XEN ICT Team wrote:
Hi,

This issue comes from a thread of the same title, "a question of alphabetical order", initiated yesterday on the r-help at r-project.org list. As it now affects only the Mac environment, I am following Brian Ripley's advice and moving it to this list.

It is now clear that ordering lists/variable values is a kind of nightmare whatever platform we use. As I (and possibly many others!) need to get a right order, or an "as right as possible" order, for lists of strings using non-ASCII characters, namely áéíóú, ÁÉÍÓÚ and ñ,Ñ, we have been considering a number of options.

Hans-Joerg Bibiko proposed a customized function to do the trick. Brian Ripley spoke about es_ES.ISO8859-15 doing almost the right thing for these characters.

Here is what I get working on a MacBook whose environment I describe at the bottom of the message:

http://mire.environmentalchange.net/~webmaster/images/toPlot.png

Here is the code:

    png(file="toPlot.png", pointsize = 14, width = 1000, height = 480, units = "px", bg="#eaedd5")
    Sys.setlocale(category = "LC_ALL", locale = "es_ES.ISO8859-15")
    toPlot <- data.frame(medio=c("avión", "barco", "bicicleta", "ángulo", "choco", "camión", "coche", "tren", "aleta", "luna", "llave"), variable=c(34, 33, 3, 37, 54, 23, 67, 30, 23, 56, 13))
    toPlot<-toPlot[order(toPlot$medio),]
    Sys.setlocale(category = "LC_ALL", locale = "en_GB.UTF-8")
    barplot(toPlot$variable,names.arg=toPlot$medio)
    dev.off()

As you see in the order of labels, accent is not ignored, and ch and ll are considered as single instances. This is no longer the case with Spanish alphabetical order. It changed in 1994. So Hans's solution seems the only one available to get the correct order, at least when working within the environment described below.

In any case, please,

1. Are you aware of any new locale we could try to see if it is already updated?
2. If it doesn't exist, how/where must we go to propose/start creating such a locale?

Here is the environment:
version
               _
platform       i386-apple-darwin9.2.2
arch           i386
os             darwin9.2.2
system         i386, darwin9.2.2
status         beta
major          2
minor          7.0
year           2008
month          04
day            12
svn rev        45280
language       R
version.string R version 2.7.0 beta (2008-04-12 r45280)

sessionInfo()
R version 2.7.0 beta (2008-04-12 r45280)
i386-apple-darwin9.2.2

locale:
en_GB.UTF-8/en_GB.UTF-8/C/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages:
[1] stats graphics grDevices utils datasets methods base

R GUI 1.24-devel (5072)

Thank you so much for your help,

Ricardo

--
Ricardo Rodríguez
Your XEN ICT Team
_______________________________________________ R-SIG-Mac mailing list R-SIG-Mac at stat.math.ethz.ch https://stat.ethz.ch/mailman/listinfo/r-sig-mac
--
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595