A colleague's R program behaved differently when I ran it, and we thought we traced it probably to different results from string comparisons as below, with different R versions. However the platforms also differed. A friend ran it on a few machines and found that the comparison behavior didn't correlate with R version, but rather with platform. I wonder if you've seen this. If it's not some setting I'm unaware of, maybe someone should look into it. Sorry I haven't taken the time to read the source code myself.

Thanks,
Stuart

R version 3.0.2 (2013-09-25) -- "Frisbee Sailing"
Platform: x86_64-unknown-linux-gnu (64-bit)

Sys.getlocale()
[1] "LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US.UTF-8;LC_MONETARY=en_US.UTF-8;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATION=C"
"-1" > "1"
[1] TRUE
"-1" < "1"
[1] FALSE
"1" < "-1"
[1] TRUE
"1" < "-"
[1] FALSE

Vs.

R version 3.1.1 (2014-07-10) -- "Sock it to Me"
Platform: x86_64-redhat-linux-gnu (64-bit)

Sys.getlocale()
[1] "LC_CTYPE=en_US.utf8;LC_NUMERIC=C;LC_TIME=en_US.utf8;LC_COLLATE=en_US.utf8;LC_MONETARY=en_US.utf8;LC_MESSAGES=en_US.utf8;LC_PAPER=en_US.utf8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.utf8;LC_IDENTIFICATION=C"
"-1" > "1"
[1] FALSE
"-1" < "1"
[1] TRUE
"1" < "-1"
[1] FALSE
"1" < "-"
[1] FALSE
R string comparisons may vary with platform (plain text)
8 messages · Stuart Ambler, Duncan Murdoch, Henrik Bengtsson +4 more
On 22/11/2014, 2:59 PM, Stuart Ambler wrote:
A colleague's R program behaved differently when I ran it, and we thought we traced it probably to different results from string comparisons as below, with different R versions. However the platforms also differed. A friend ran it on a few machines and found that the comparison behavior didn't correlate with R version, but rather with platform. I wonder if you've seen this. If it's not some setting I'm unaware of, maybe someone should look into it. Sorry I haven't taken the time to read the source code myself.
Looks like a collation order issue. See ?Comparison. Duncan Murdoch
______________________________________________ R-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
You mean where it says that some platforms may not respect the locale (I assume, though don't know, that en_US.UTF-8 and en_US.utf8 would be the same)? But I gather that the general problem has been looked into and is difficult to solve; thanks.
On 11/22/14, 12:42 PM, "Duncan Murdoch" <murdoch.duncan at gmail.com> wrote:
Looks like a collation order issue. See ?Comparison. Duncan Murdoch
On Sat, Nov 22, 2014 at 12:42 PM, Duncan Murdoch <murdoch.duncan at gmail.com> wrote:
Looks like a collation order issue. See ?Comparison.
With the oddity that both platforms use what look like similar locales:
LC_COLLATE=en_US.UTF-8
LC_COLLATE=en_US.utf8
/Henrik
On 23 Nov 2014, at 01:05 , Henrik Bengtsson <hb at biostat.ucsf.edu> wrote: On Sat, Nov 22, 2014 at 12:42 PM, Duncan Murdoch <murdoch.duncan at gmail.com> wrote:
Looks like a collation order issue. See ?Comparison.
With the oddity that both platforms use what look like similar locales:
LC_COLLATE=en_US.UTF-8
LC_COLLATE=en_US.utf8
It's the sort of thing that I've tried to wrap my mind around multiple times and failed, but have a look at http://stackoverflow.com/questions/19967555/postgres-collation-differences-osx-v-ubuntu which seems to be essentially the same issue, just for Postgres. If you have the stamina, also look into the python question that it links to.
As I understand it, there are two potential reasons: Either the two platforms are not using the same collation table for en_US, or at least one of them is not fully implementing the Unicode Collation Algorithm.
In general, collation is a minefield: Some languages have the same letters in different order (e.g. Estonian with Z between S and T); accented characters sort with the unaccented counterpart in some languages but as separate characters in others; some locales sort ABab, others AaBb, yet others aAbB; sometimes punctuation is ignored, sometimes not; sometimes multiple characters count as one, etc.
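The "sometimes punctuation is ignored, sometimes not" point can be pictured with a toy model. This is only a sketch under stated assumptions: `byte_less` and `punct_ignoring_less` are hypothetical helpers invented for illustration, the tie-break rule is an assumption, and neither claims to be what glibc or ICU actually does. It also assumes R >= 3.3.0 for `sort(method = "radix")`, which is newer than the versions in this thread.

```r
# Byte-order comparison, as in the C locale ("-" sorts before digits).
byte_less <- function(a, b) {
  s <- sort(c(a, b), method = "radix")  # radix sort uses C-locale order
  a != b && s[1] == a
}

# Toy rule: punctuation is ignored on a first pass; on a tie the
# shorter string sorts first (an assumption made for illustration only).
punct_ignoring_less <- function(a, b) {
  ka <- gsub("[[:punct:]]", "", a)
  kb <- gsub("[[:punct:]]", "", b)
  if (ka != kb) return(byte_less(ka, kb))
  nchar(a) < nchar(b)
}

# The three "<" comparisons from the original report.
pairs <- list(c("-1", "1"), c("1", "-1"), c("1", "-"))
for (p in pairs) {
  cat(sprintf('"%s" < "%s": byte order %s, punctuation ignored %s\n',
              p[1], p[2],
              byte_less(p[1], p[2]),
              punct_ignoring_less(p[1], p[2])))
}
```

Under these two toy rules the three comparisons come out as on the redhat platform (byte order) and on the other Linux platform (punctuation ignored), which is at least consistent with the two platforms applying different collation rules for en_US.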
Peter Dalgaard, Professor, Center for Statistics, Copenhagen Business School Solbjerg Plads 3, 2000 Frederiksberg, Denmark Phone: (+45)38153501 Email: pd.mes at cbs.dk Priv: PDalgd at gmail.com
On 23/11/2014 09:39, peter dalgaard wrote:
It's the sort of thing that I've tried to wrap my mind around multiple times and failed, but have a look at http://stackoverflow.com/questions/19967555/postgres-collation-differences-osx-v-ubuntu which seems to be essentially the same issue, just for Postgres. If you have the stamina, also look into the python question that it links to. As I understand it, there are two potential reasons: Either the two platforms are not using the same collation table for en_US, or at least one of them is not fully implementing the Unicode Collation Algorithm.
And I have seen both with R. At the very least, check if ICU is being used (capabilities("ICU") in current R, maybe not in some of the obsolete versions seen in this thread).
As a further possibility, there are choices in the UCA (in R, see ?icuSetCollate) and ICU can be compiled with different default choices. It is not clear to me what (if any) difference ICU versions make, but in R-devel extSoftVersion() reports that.
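The checks mentioned above can be gathered into one small report to run on each platform being compared. This is a minimal sketch: `collation_report` is a hypothetical helper name, and it assumes an R version recent enough to have both `capabilities("ICU")` and `extSoftVersion()` (the latter was only in R-devel at the time of this thread).

```r
# Collect the settings most likely to explain a collation difference.
collation_report <- function() {
  list(
    lc_collate       = Sys.getlocale("LC_COLLATE"),      # collation locale in use
    icu_available    = unname(capabilities("ICU")),      # was R built with ICU?
    icu_version      = unname(extSoftVersion()["ICU"]),  # "" if ICU is absent
    minus_one_lt_one = "-1" < "1"                        # comparison from the report
  )
}

str(collation_report())
```

Running this on both machines and comparing the output side by side should show whether the difference lies in ICU availability, ICU version, or only in the comparison result itself.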
In general, collation is a minefield: Some languages have the same letters in different order (e.g. Estonian with Z between S and T); accented characters sort with the unaccented counterpart in some languages but as separate characters in others; some locales sort ABab, others AaBb, yet others aAbB; sometimes punctuation is ignored, sometimes not; sometimes multiple characters count as one, etc.
As ?Comparison has long said.
Brian D. Ripley, ripley at stats.ox.ac.uk Emeritus Professor of Applied Statistics, University of Oxford 1 South Parks Road, Oxford OX1 3TG, UK
For many scientific applications one is really dealing with ASCII characters and LC_COLLATE="C", even if the user is running in non-C locales. What robust approaches (if any?) are available to write code that sorts in a locale-independent way? The Note in ?Sys.setlocale is not overly optimistic about setting the locale within a session. Martin Morgan
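One option that became available after this thread is the radix method of `sort()` and `order()`, which is documented to use C-locale ordering for character vectors whatever LC_COLLATE is set to. The sketch below assumes R >= 3.3.0 and purely illustrative input data; it avoids changing the session locale at all.

```r
x <- c("1", "-1", "b", "A", "a", "B")

# Sort in C-locale (byte) order regardless of the session's LC_COLLATE.
sorted <- sort(x, method = "radix")
print(sorted)

# The same ordering as an index, e.g. for reordering rows of a data frame.
idx <- order(x, method = "radix")
stopifnot(identical(x[idx], sorted))
```

This covers sorting and ordering only; the relational operators (`<`, `>`) still follow the collation locale, so code that compares strings directly would need to be rewritten in terms of a radix ordering or a fixed collator.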
On 11/23/2014 03:44 AM, Prof Brian Ripley wrote:
And I have seen both with R. At the very least, check if ICU is being used (capabilities("ICU") in current R, maybe not in some of the obsolete versions seen in this thread).
As a further possibility, there are choices in the UCA (in R, see ?icuSetCollate) and ICU can be compiled with different default choices. It is not clear to me what (if any) difference ICU versions make, but in R-devel extSoftVersion() reports that.
As ?Comparison has long said.
Computational Biology / Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: Arnold Building M1 B861 Phone: (206) 667-2793
The 'stringi' package claims robust cross-platform performance. It exports much functionality of the ICU library and will attempt to install it when not present. The function 'stri_sort' accepts a collation argument that can be defined with 'stri_opts_collator'.
On Sun, Nov 23, 2014 at 5:15 PM, Martin Morgan <mtmorgan at fredhutch.org> wrote:
For many scientific applications one is really dealing with ASCII characters and LC_COLLATE="C", even if the user is running in non-C locales. What robust approaches (if any?) are available to write code that sorts in a locale-independent way? The Note in ?Sys.setlocale is not overly optimistic about setting the locale within a session. Martin Morgan
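The 'stringi' suggestion above might look like the following. This is a sketch, assuming the stringi package is installed; the input vector and the choice of the "en_US" collator locale are illustrative. No expected output is shown because the result depends on the ICU collation rules that stringi is built against.

```r
if (requireNamespace("stringi", quietly = TRUE)) {
  x <- c("1", "-1", "b", "A", "a", "B")

  # Pin the collator explicitly instead of relying on the session locale.
  opts <- stringi::stri_opts_collator(locale = "en_US")
  sorted_icu <- stringi::stri_sort(x, opts_collator = opts)
  print(sorted_icu)

  # Same locale, but with punctuation treated as ignorable ("shifted").
  shifted <- stringi::stri_opts_collator(locale = "en_US",
                                         alternate_shifted = TRUE)
  print(stringi::stri_sort(x, opts_collator = shifted))
}
```

Passing the collator options on every call keeps the ordering tied to the code rather than to whatever LC_COLLATE happens to be on the machine running it.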