gsub, utf-8 replacements and the C-locale - R-devel

Hadley Wickham · 2011-11-23T23:48:16Z

Hi all,

I'd like to discuss an infelicity/possible bug with gsub. Take the following function:

    f <- function(x) {
      gsub("\u{A0}", " ", gsub(" ", "\u{A0}", x))
    }

As you might expect, in UTF-8 locales it is idempotent:

    Sys.setlocale("LC_ALL", "UTF-8")
    f("x y")
    # [1] "x y"

But in the C locale it is not:

    Sys.setlocale("LC_ALL", "C")
    f("x y")
    # [1] "x\302\240y"

This seems weird to me. (And it caused a bug in a package because I didn't realise some Windows users have a non-UTF-8 locale.) I'm not s

Simon Urbanek

Wed, Nov 23, 2011 4:06 PM

On Nov 23, 2011, at 6:48 PM, Hadley Wickham wrote:

It is if the input is UTF-8, but only then - that is what is causing the asymmetry. Part of the problem is that you cannot declare a 7-bit string as UTF-8 (even though it is valid), so you can't work around it by forcing the encoding.

No, because the whole point of the encoding is to define the content. "\ua0" defines one Unicode character, whereas "\302\240" defines two bytes with unknown meaning. The meaning of UTF-8 encoded strings is still valid in non-UTF-8 locales, and that is the reason why you can work with UTF-8 strings in R irrespective of the locale (a very useful thing).
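To make the distinction concrete, here is a small sketch using only base R (the variable names u, b and a are just for illustration). It compares the raw bytes and the declared encoding of the two literals, and then tries to force a UTF-8 mark onto a 7-bit string. The two literals should carry the same bytes and differ only in their encoding mark; I have not checked this on every platform.

```r
u <- "\u00a0"     # Unicode escape: one character (no-break space)
b <- "\302\240"   # octal escapes: the same two bytes, entered as raw bytes

# Both literals hold the same bytes (c2 a0) ...
identical(charToRaw(u), charToRaw(b))

# ... so any difference is in the declared encoding, not the content.
Encoding(c(u, b))

# A 7-bit (ASCII) string cannot be marked: the assignment is accepted,
# but the mark does not stick.
a <- "x y"
Encoding(a) <- "UTF-8"
Encoding(a)
```

The last call reports "unknown", which is why forcing the encoding on the input is not a workaround here.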

I would suggest handling the special case of 7-bit input and UTF-8 replacement such that it results in UTF-8 output (as opposed to the bytes output which happens now). The relevant code is somewhat convoluted (and more so in R-devel), so I'm not volunteering to do it, though.

Just to make things more clear - the current result (in C locale):

[1] "foo\302\240bar"

Possibly desired result:

[1] "foo<U+00A0>bar"
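Until something like that is handled inside gsub itself, a user-level workaround might look like the sketch below. gsub_utf8 and f2 are hypothetical helpers, not part of base R, and the sketch assumes that any non-ASCII bytes in the result really are UTF-8 (which holds when the only non-ASCII content comes from a UTF-8 replacement or UTF-8 input). It is not a fix for gsub, just a way to re-declare the output.

```r
# Hypothetical wrapper: run gsub, then declare the result as UTF-8.
# Assumption: non-ASCII bytes in the output are valid UTF-8.
gsub_utf8 <- function(pattern, replacement, x, ...) {
  out <- gsub(enc2utf8(pattern), enc2utf8(replacement), enc2utf8(x), ...)
  # Pure-ASCII elements are unaffected by this and stay "unknown".
  Encoding(out) <- "UTF-8"
  out
}

# The round trip from the original post, using the wrapper.
f2 <- function(x) gsub_utf8("\u00a0", " ", gsub_utf8(" ", "\u00a0", x))

f2("x y")
Encoding(gsub_utf8(" ", "\u00a0", "foo bar"))
```

With the intermediate result marked as UTF-8, the second gsub sees a character string rather than unknown bytes, so the round trip should give back "x y" in the C locale as well.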

Cheers,
Simon