why does [A-Z] include 'T' in an Estonian locale?

Just for amusement: Similar messups occur with Danish and its three extra letters:
Sys.setlocale("LC_ALL", "da_DK")
[1] "da_DK/da_DK/da_DK/C/da_DK/en_US.UTF-8"
sort(c(LETTERS,"Æ","Ø","Å"))
 [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R" "S"
[20] "T" "U" "V" "W" "X" "Y" "Z" "Æ" "Ø" "Å"

grepl("[A-Å]", "Ø")
[1] FALSE
grepl("[A-Å]", "Æ")
[1] FALSE
grepl("[A-Æ]", "Å")
[1] TRUE
grepl("[A-Æ]", "Ø")
[1] FALSE
grepl("[A-Ø]", "Å")
[1] TRUE
grepl("[A-Ø]", "Æ")
[1] TRUE

So for character ranges, the order is Å, Æ, Ø (which is how they'd collate in Swedish, except that Swedish uses diacriticals rather than Æ and Ø).

Sys.setlocale("LC_ALL", "sv_SE")
[1] "sv_SE/sv_SE/sv_SE/C/sv_SE/en_US.UTF-8"
sort(c(LETTERS,"Æ","Ø","Å"))
 [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R" "S"
[20] "T" "U" "V" "W" "X" "Y" "Z" "Å" "Æ" "Ø"
sort(c(LETTERS,"Ä","Ö","Å"))
 [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R" "S"
[20] "T" "U" "V" "W" "X" "Y" "Z" "Å" "Ä" "Ö"
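
A minimal sketch of why the six grepl() results above are what one would expect if ranges are compared by Unicode code point rather than by collation order. It assumes the three Danish letters are Å (U+00C5), Æ (U+00C6) and Ø (U+00D8). The helper in_codepoint_range() is hypothetical and only mimics the comparison; it says nothing about how TRE is implemented internally.

```r
# Build the letters from their code points so the script does not depend
# on the encoding of the source file.
AA <- intToUtf8(0xC5)  # U+00C5, A with ring
AE <- intToUtf8(0xC6)  # U+00C6, AE ligature
OE <- intToUtf8(0xD8)  # U+00D8, O with stroke

# Hypothetical helper: is `ch` between `lo` and `hi` by code point?
in_codepoint_range <- function(ch, lo, hi) {
  x <- utf8ToInt(ch)
  utf8ToInt(lo) <= x && x <= utf8ToInt(hi)
}

# Six (range end, candidate) pairs over the three letters.
results <- c(
  in_codepoint_range(OE, "A", AA),
  in_codepoint_range(AE, "A", AA),
  in_codepoint_range(AA, "A", AE),
  in_codepoint_range(OE, "A", AE),
  in_codepoint_range(AA, "A", OE),
  in_codepoint_range(AE, "A", OE)
)
print(results)  # FALSE FALSE TRUE FALSE TRUE TRUE
```

Because the helper never consults the locale, it gives the same answer under da_DK, sv_SE or C.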

On 30 May 2023, at 17:45, Ben Bolker <bbolker at gmail.com> wrote:

 Inspired by this old Stack Overflow question

https://stackoverflow.com/questions/19765610/when-does-locale-affect-rs-regular-expressions

I was wondering why this is TRUE:

Sys.setlocale("LC_ALL", "et_EE")
grepl("[A-Z]", "T")

TRE's documentation at <https://laurikari.net/tre/documentation/regex-syntax/> says that a range "is shorthand for the full range of characters between those two [endpoints] (inclusive) in the collating sequence".

Yet, T is *not* between A and Z in the Estonian collating sequence:

sort(LETTERS)
[1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R" "S"
[20] "Z" "T" "U" "V" "W" "X" "Y"

 I realize that this may be a question about TRE rather than about R *per se* (FWIW the grepl() result is also TRUE with `perl = TRUE`, so the question also applies to PCRE), but I'm wondering if anyone has any insights ...  (and yes, I know that the correct answer is "use [:alpha:] and don't worry about it")

(In contrast, the ICU engine underlying stringi/stringr says "[t]he characters to include are determined by Unicode code point ordering" - see

https://stackoverflow.com/questions/76365426/does-stringrs-regex-engine-translate-a-z-into-abcdefghijklmnopqrstuvwyz/76366163#76366163

for links)
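
On the "use [:alpha:]" advice above, a small base R sketch of the POSIX character classes. Note that the class name has to sit inside a bracket expression, so the pattern is "[[:alpha:]]" rather than "[:alpha:]". The test strings are made up for illustration, and only ASCII inputs are shown, since what counts as alphabetic for non-ASCII characters can depend on the locale.

```r
# Illustrative inputs: an upper-case letter, a lower-case letter,
# a digit and a punctuation character.
x <- c("T", "t", "7", "_")

# Character classes name a set of characters directly, so no range
# endpoints (and no ordering of any kind) are involved.
is_upper <- grepl("[[:upper:]]", x)
is_alpha <- grepl("[[:alpha:]]", x)

print(is_upper)  # TRUE FALSE FALSE FALSE
print(is_alpha)  # TRUE TRUE FALSE FALSE
```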

______________________________________________
R-devel at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Office: A 4.23
Email: pd.mes at cbs.dk  Priv: PDalgd at gmail.com

Yes.
   FWIW I submitted a request for a documentation fix to TRE (to 
document that it actually uses Unicode order, not collation order, to 
define ranges, just like most (but not all) other regex engines ...)

https://github.com/laurikari/tre/issues/88
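
To make the code point ordering concrete, a short base R sketch: the 26 ASCII capitals occupy the contiguous code points 65 to 90, so "T" (84) lies between "A" and "Z" whatever the locale's collating sequence says. The variable names are illustrative only.

```r
# Code points of A..Z, taken from the built-in LETTERS constant.
cps <- utf8ToInt(paste(LETTERS, collapse = ""))
print(cps)

# "T" compared with the endpoints of the range [A-Z] by code point.
t_cp <- utf8ToInt("T")
t_in_range <- utf8ToInt("A") <= t_cp && t_cp <= utf8ToInt("Z")
print(t_cp)        # 84
print(t_in_range)  # TRUE

# For comparison: sort(LETTERS) follows the locale's collation, so its
# result can differ between locales, while the code points above cannot.
print(sort(LETTERS))
```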

Dr. Benjamin Bolker
Professor, Mathematics & Statistics and Biology, McMaster University
Director, School of Computational Science and Engineering
(Acting) Graduate chair, Mathematics & Statistics
 > E-mail is sent at my convenience; I don't expect replies outside of 
working hours.