gsub() hex character range problems in R-devel?

Hi Martin,

I'd add a few comments to Brodie's excellent analysis.

- \xhh is allowed and defined in Perl regular expressions, see ?regex 
(this needs perl=TRUE), but to enter it in an R string you need to 
escape the backslash, i.e. write "\\xhh".

- \xhh is not defined by POSIX for extended regular expressions, nor is 
it documented in ?regex for those; TRE supports it, but portable 
programs still should not rely on that

- a literal \xhh in an R string is converted to that single byte by R, 
but I would say users should not use this at all, because the meaning 
of the byte depends on the encoding

- use of \u and \U in an R string is fine, it has well defined semantics 
and the corresponding string will then be flagged UTF-8 in R (so e.g. 
\ua0 is fine to represent the Unicode no-break space)

- see caveats of using character ranges with POSIX extended regular 
expressions in ?regex re encodings, using Perl regular expressions in 
UTF-8 mode is more reliable for those
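
To make the difference between these escapes concrete, here is a small 
sketch (the variable names pat and nbsp are just for illustration, they 
are not from the original report):

```r
# "\\x41" in R source is four characters (\ x 4 1); the escape is only
# interpreted later, by the PCRE engine.
pat <- "\\x41"
nchar(pat)                     # 4
grepl(pat, "A", perl = TRUE)   # TRUE: PCRE reads \x41 as "A"

# "\u00a0" is converted by R itself to a no-break space, and the
# resulting string is flagged UTF-8.
nbsp <- "\u00a0"
Encoding(nbsp)                 # "UTF-8"
charToRaw(nbsp)                # c2 a0 (two bytes in UTF-8)

# A literal "\xa0" is just the single byte a0; which character that is,
# if any, depends on the encoding in use.
charToRaw("\xa0")              # a0
```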

So, a variant of your example might be:

 > gsub("[\\x7f-\\xff]", "", "fo\ua0o", perl=TRUE)
[1] "foo"

(note that the \ua0 ensures that the text is UTF-8, and hence the UTF-8 
mode for regular expressions is used, ?regex has more)

However, I think it is better to formulate regular expressions so that 
they cover all of Unicode, e.g. "only keep ASCII digits, ASCII space, 
ASCII underscore, but remove all other characters".
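
A minimal sketch of that idea, assuming the set of characters to keep is 
exactly the one named above (the input x and the name cleaned are made up 
for illustration):

```r
# Input with ASCII letters, a no-break space (\u00a0) and an accented
# letter (\u00e9); the \u escapes ensure the string is flagged UTF-8.
x <- "a_1 b\u00a02\u00e9"

# Negated class: remove every character that is not an ASCII digit,
# an ASCII space or an ASCII underscore.
cleaned <- gsub("[^0-9 _]", "", x, perl = TRUE)
cleaned   # "_1 2"
```

Written this way the pattern does not need any hex ranges, so it does not 
depend on how code points above 0x7f are represented.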

Best
Tomas
On 1/4/22 8:35 PM, Martin Morgan wrote: