R string comparisons may vary with platform (plain text)

8 messages · Stuart Ambler, Duncan Murdoch, Henrik Bengtsson +4 more

#
A colleague's R program behaved differently when I ran it, and we thought
we had traced it, probably, to different results from string comparisons as
below, with different R versions.  However the platforms also differed.  A
friend ran it on a few machines and found that the comparison behavior
didn't correlate with R version, but rather with platform.

I wonder if you've seen this.  If it's not some setting I'm unaware of,
maybe someone should look into it.  Sorry I haven't taken the time to read
the source code myself.

Thanks,
Stuart

R version 3.0.2 (2013-09-25) -- "Frisbee Sailing"
Platform: x86_64-unknown-linux-gnu (64-bit)
Sys.getlocale()
[1] "LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US.UTF-8;LC_MONETARY=en_US.UTF-8;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATION=C"

"-1" > "1"
[1] TRUE

"-1" <"1"
[1] FALSE

"1" < "-1"
[1] TRUE

"1" < "-"
[1] FALSE

Vs.

R version 3.1.1 (2014-07-10) -- "Sock it to Me"
Platform: x86_64-redhat-linux-gnu (64-bit)
Sys.getlocale()
[1] "LC_CTYPE=en_US.utf8;LC_NUMERIC=C;LC_TIME=en_US.utf8;LC_COLLATE=en_US.utf8;LC_MONETARY=en_US.utf8;LC_MESSAGES=en_US.utf8;LC_PAPER=en_US.utf8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.utf8;LC_IDENTIFICATION=C"

"-1" > "1"
[1] FALSE

"-1" <"1"
[1] TRUE

"1" < "-1"
[1] FALSE

"1" < "-"
[1] FALSE
#
On 22/11/2014, 2:59 PM, Stuart Ambler wrote:
Looks like a collation order issue.  See ?Comparison.

Duncan Murdoch
#
You mean where it says that some platforms may not respect the locale (I
assume, though don't know, that en_US.UTF-8 and en_US.utf8 would be the
same)?  But I gather that the general problem has been looked into and is
difficult to solve; thanks.
On 11/22/14, 12:42 PM, "Duncan Murdoch" <murdoch.duncan at gmail.com> wrote:

            
#
On Sat, Nov 22, 2014 at 12:42 PM, Duncan Murdoch
<murdoch.duncan at gmail.com> wrote:
With the oddity that both platforms use what look like similar locales:

LC_COLLATE=en_US.UTF-8
LC_COLLATE=en_US.utf8

/Henrik
#
It's the sort of thing that I've tried to wrap my mind around multiple times and failed, but have a look at

http://stackoverflow.com/questions/19967555/postgres-collation-differences-osx-v-ubuntu

which seems to be essentially the same issue, just for Postgres. If you have the stamina, also look into the Python question that it links to.

As I understand it, there are two potential reasons: Either the two platforms are not using the same collation table for en_US, or at least one of them is not fully implementing the Unicode Collation Algorithm.

In general, collation is a minefield: Some languages have the same letters in different order (e.g. Estonian with Z between S and T); accented characters sort with the unaccented counterpart in some languages but as separate characters in others; some locales sort ABab, others AaBb, yet others aAbB; sometimes punctuation is ignored, sometimes not; sometimes multiple characters count as one, etc.
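
To make the dependence on collation concrete, here is a minimal sketch (not from the original posts) that evaluates the same comparison once under whatever collation the session starts with and once under "C". It assumes the platform allows LC_COLLATE to be changed within a session; the variable names are only illustrative.

```r
# Sketch: compare the same strings under the session's collation and under "C".
old <- Sys.getlocale("LC_COLLATE")   # remember the current collation locale

locale_result <- "-1" < "1"          # depends on the platform's collation rules

Sys.setlocale("LC_COLLATE", "C")     # "C" compares by character code
c_result <- "-1" < "1"               # "-" (0x2D) precedes "1" (0x31), so TRUE

Sys.setlocale("LC_COLLATE", old)     # restore the original setting

print(c(locale = locale_result, C = c_result))
```

Only the "C" result is predictable here; the first value is exactly the one that differed between the two machines reported at the top of the thread.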
#
On 23/11/2014 09:39, peter dalgaard wrote:
And I have seen both with R.  At the very least, check if ICU is being 
used (capabilities("ICU") in current R, maybe not in some of the 
obsolete versions seen in this thread).

As a further possibility, there are choices in the UCA (in R, see 
?icuSetCollate) and ICU can be compiled with different default choices. 
  It is not clear to me what (if any) difference ICU versions make, but 
in R-devel extSoftVersion() reports that.
As ?Comparison has long said.
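
A rough diagnostic along these lines, offered as a sketch rather than a definitive check, could be run on each machine and the outputs compared. The list name `collation_info` is made up for illustration, and the call to extSoftVersion() is guarded because it does not exist in older R versions.

```r
# Sketch: report the pieces that can influence string collation in this session.
collation_info <- list(
  r_version  = R.version.string,
  platform   = R.version$platform,
  lc_collate = Sys.getlocale("LC_COLLATE"),
  icu_used   = unname(capabilities("ICU"))   # TRUE if R was built with ICU
)

# extSoftVersion() is missing from older R versions, so guard the call.
if (exists("extSoftVersion", mode = "function")) {
  collation_info$icu_version <- unname(extSoftVersion()["ICU"])
}

str(collation_info)
```

If two machines agree on all of these and still collate differently, the remaining suspects are the collation tables themselves and any ICU options set via icuSetCollate().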
#
For many scientific applications one is really dealing with ASCII characters and 
LC_COLLATE="C", even if the user is running in non-C locales. What robust 
approaches (if any?) are available to write code that sorts in a 
locale-independent way? The Note in ?Sys.setlocale is not overly optimistic 
about setting the locale within a session.

Martin Morgan
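
One possible approach, sketched here under the assumption that changing LC_COLLATE within the session works on the platform at hand, is to switch to "C" only for the duration of the sort and restore the previous setting afterwards. `with_c_collation()` is a hypothetical helper, not an existing R function.

```r
# Sketch: evaluate an expression with LC_COLLATE temporarily set to "C".
# with_c_collation() is a hypothetical helper, not part of base R.
with_c_collation <- function(expr) {
  old <- Sys.getlocale("LC_COLLATE")
  # Restore the caller's collation even if the expression fails.
  on.exit(Sys.setlocale("LC_COLLATE", old), add = TRUE)
  Sys.setlocale("LC_COLLATE", "C")
  force(expr)   # the lazily passed expression is evaluated here, under "C"
}

x <- c("b", "-1", "B", "1", "a")
with_c_collation(sort(x))
# [1] "-1" "1"  "B"  "a"  "b"
```

The result follows ASCII code order, so it should not vary with the user's locale. More recent R versions than those in this thread also document sort(x, method = "radix") as using C-locale collation for character vectors, which avoids touching the locale at all.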
On 11/23/2014 03:44 AM, Prof Brian Ripley wrote:

  
    
#
The 'stringi' package claims robust cross-platform performance. It exports
much functionality of the ICU library and will attempt to install it when
not present.
The function 'stri_sort' accepts a collation argument that can be defined
with 'stri_opts_collator'.
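
As a small sketch of what that might look like (assuming the 'stringi' package is installed; the option values are chosen purely for illustration), the locale and the treatment of punctuation are stated explicitly instead of being inherited from the platform:

```r
# Sketch: locale-explicit comparison and sorting with stringi (ICU-based).
# Assumes the 'stringi' package is installed; skipped otherwise.
if (requireNamespace("stringi", quietly = TRUE)) {
  x <- c("1", "-1", "b", "a")

  # Default collator options for an explicitly named locale.
  opts_default <- stringi::stri_opts_collator(locale = "en_US")
  # Same locale, but with ICU's "shifted" handling of punctuation.
  opts_shifted <- stringi::stri_opts_collator(locale = "en_US",
                                              alternate_shifted = TRUE)

  print(stringi::stri_sort(x, opts_collator = opts_default))
  print(stringi::stri_cmp_lt("-1", "1", opts_collator = opts_default))
  print(stringi::stri_cmp_lt("-1", "1", opts_collator = opts_shifted))
}
```

Comparing the last two results on different machines would show whether the handling of "-" is what differs, which is the kind of choice Prof Ripley mentions under ?icuSetCollate.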




On Sun, Nov 23, 2014 at 5:15 PM, Martin Morgan <mtmorgan at fredhutch.org>
wrote: