gsub, utf-8 replacements and the C-locale

2 messages · Hadley Wickham, Simon Urbanek

#
Hi all,

I'd like to discuss an infelicity/possible bug with gsub.  Take the
following function:

f <- function(x) {
  gsub("\u{A0}", " ", gsub(" ", "\u{A0}", x))
}

As you might expect, in utf-8 locales it is idempotent:

Sys.setlocale("LC_ALL", "UTF-8")
f("x y")
# [1] "x y"

But in the C locale it is not:

Sys.setlocale("LC_ALL", "C")
f("x y")
# [1] "x\302\240y"

This seems weird to me. (And it caused a bug in a package because I
didn't realise some Windows users have a non-utf8 locale.)

I'm not sure what the correct resolution is.  Should the encoding of
the output of gsub be utf-8 if either the input or replacement is
utf-8?  In non-utf-8 locales should the encoding of "\u{A0}" be bytes?

Hadley
#
On Nov 23, 2011, at 6:48 PM, Hadley Wickham wrote:

            
> Should the encoding of the output of gsub be utf-8 if either the
> input or replacement is utf-8?

It is if the input is UTF-8, but only then; that is what causes the asymmetry. Part of the problem is that you cannot declare a 7-bit string as UTF-8 (even though it is valid UTF-8), so you can't work around it by forcing the encoding.

> In non-utf-8 locales should the encoding of "\u{A0}" be bytes?

No, because the whole point of the encoding is to define the content. "\ua0" defines one Unicode character, whereas "\302\240" defines two bytes with unknown meaning. The meaning of UTF-8 encoded strings is still valid in non-UTF-8 locales, and that is the reason why you can work with UTF-8 strings in R irrespective of the locale (a very useful thing).
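
To make the distinction concrete, here is a small sketch using only base R. The variable names are illustrative; the point is that a pure-ASCII string does not keep a UTF-8 declaration, while the same two bytes "\302\240" mean one character once they are marked as UTF-8.

```r
# A 7-bit (ASCII) string cannot carry an encoding mark:
ascii <- "x y"
Encoding(ascii) <- "UTF-8"   # silently ignored for ASCII-only strings
Encoding(ascii)              # "unknown"

# Two raw bytes, written with octal escapes. Once declared as UTF-8
# they denote a single character, U+00A0 (no-break space).
nbsp <- "\302\240"
Encoding(nbsp) <- "UTF-8"
Encoding(nbsp)               # "UTF-8"
nchar(nbsp, type = "bytes")  # 2
utf8ToInt(nbsp)              # 160, i.e. 0xA0
```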

I would suggest handling the special case of 7-bit input and UTF-8 replacement such that it results in UTF-8 output (as opposed to bytes output, which happens now). The relevant code is somewhat convoluted (and more so in R-devel), so I'm not volunteering to do it, though.
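
Until something like that is done inside gsub, a user-level workaround might look like the sketch below. It is not the proposed fix to gsub itself. The helper names (`to_nbsp`, `from_nbsp`, `round_trip`) are hypothetical, and it assumes the replacement bytes written by gsub are valid UTF-8, as in the example above. The idea is to mark the intermediate result as UTF-8 so that the second gsub sees UTF-8 input.

```r
# Hypothetical helpers, not part of base R.
to_nbsp <- function(x) {
  out <- gsub(" ", "\u00A0", enc2utf8(x), fixed = TRUE)
  # Assumption: the bytes are valid UTF-8, so declaring them as such is safe.
  # Elements that are still pure ASCII simply stay unmarked.
  Encoding(out) <- "UTF-8"
  out
}

from_nbsp <- function(x) {
  gsub("\u00A0", " ", enc2utf8(x), fixed = TRUE)
}

round_trip <- function(x) from_nbsp(to_nbsp(x))

round_trip("x y")
```

`fixed = TRUE` is used only because both patterns are literal characters; the regular-expression form from the original post should behave the same way once the input is marked.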

Just to make things more clear - the current result (in C locale):
[1] "foo\302\240bar"

Possibly desired result:
[1] "foo<U+00A0>bar"

Cheers,
Simon