
why does [A-Z] include 'T' in an Estonian locale?

2 messages · Peter Dalgaard, Ben Bolker

#
Just for amusement: Similar messups occur with Danish and its three extra letters:
[1] "da_DK/da_DK/da_DK/C/da_DK/en_US.UTF-8"
 [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R" "S"
[20] "T" "U" "V" "W" "X" "Y" "Z" "Æ" "Ø" "Å"
[1] FALSE
[1] FALSE
[1] TRUE
[1] FALSE
[1] TRUE
[1] TRUE

So for character ranges, the order is Å, Æ, Ø (which is how they'd collate in Swedish, except that Swedish uses the diacritic letters Ä and Ö rather than Æ and Ø).
[1] "sv_SE/sv_SE/sv_SE/C/sv_SE/en_US.UTF-8"
 [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R" "S"
[20] "T" "U" "V" "W" "X" "Y" "Z" "Å" "Æ" "Ø"
 [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R" "S"
[20] "T" "U" "V" "W" "X" "Y" "Z" "Å" "Ä" "Ö"
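
For anyone who wants to check the range order without switching locales: a minimal sketch, assuming the range order is simply Unicode code point order. The vector names (AE, OSLASH, ARING) are made up for illustration, and the letters are written as \u escapes so the script does not depend on the file encoding.

```r
# The three extra Danish letters, written as Unicode escapes.
# The names are illustrative labels only.
extra <- c(AE = "\u00c6", OSLASH = "\u00d8", ARING = "\u00c5")

# utf8ToInt() returns the Unicode code point of a single character.
code_points <- vapply(extra, utf8ToInt, integer(1))
print(code_points)

# Sorting the integers does not consult the locale's collation rules,
# so the result is the same under da_DK, sv_SE, or C.
by_code_point <- names(sort(code_points))
print(by_code_point)
```

Sorting by code point puts ARING first, then AE, then OSLASH, whereas sort() on the characters themselves follows the collation rules of whatever locale is active.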

  
    
#
Yes.
   FWIW, I submitted a request for a documentation fix to TRE, asking it
to document that it uses Unicode code point order, not collation order,
to define ranges, just like most (but not all) other regex engines.

https://github.com/laurikari/tre/issues/88
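
To make the "Unicode order, not collation order" point concrete, here is a rough sketch of how a range test behaves if it only compares code points. The helper in_code_point_range() is hypothetical and is not part of R or TRE; it only mimics the comparison and does not call the regex engine.

```r
# Hypothetical helper: TRUE if the single character `ch` lies between
# `lo` and `hi` when all three are compared by Unicode code point.
in_code_point_range <- function(ch, lo, hi) {
  cp <- utf8ToInt(ch)
  cp >= utf8ToInt(lo) && cp <= utf8ToInt(hi)
}

# "T" is between "A" and "Z" by code point, whatever the locale's alphabet says.
t_in_a_to_z <- in_code_point_range("T", "A", "Z")

# U+00D8 (O with stroke) is above U+00C5 (A with ring), so it falls outside A-Å.
oslash_in_a_to_aring <- in_code_point_range("\u00d8", "A", "\u00c5")

# U+00C6 (AE) sits between U+00C5 and U+00D8, so it falls inside Å-Ø.
ae_in_aring_to_oslash <- in_code_point_range("\u00c6", "\u00c5", "\u00d8")

print(c(t_in_a_to_z, oslash_in_a_to_aring, ae_in_aring_to_oslash))
```

Under this assumption a locale change cannot move "T" out of [A-Z], because the locale's alphabet order never enters the comparison.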
On 2023-06-16 5:16 a.m., peter dalgaard wrote: