
gsub, utf-8 replacements and the C-locale

Hi all,

I'd like to discuss an infelicity/possible bug with gsub.  Take the
following function:

f <- function(x) {
  gsub("\u{A0}", " ", gsub(" ", "\u{A0}", x))
}

As you might expect, in UTF-8 locales it is a lossless round trip:

Sys.setlocale("LC_ALL", "UTF-8")
f("x y")
# [1] "x y"

But in the C locale it is not:

Sys.setlocale("LC_ALL", "C")
f("x y")
# [1] "x\302\240y"
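
To see where the two calls stop agreeing, it may help to look at the declared encoding and the raw bytes after each step. This is only a rough sketch: inspect() is a made-up helper, not part of base R, and no output is shown because it is likely to depend on the locale and the R version.

```r
# Hypothetical helper: report the declared encoding and raw bytes of a string.
inspect <- function(s) {
  list(encoding = Encoding(s), bytes = charToRaw(s))
}

nbsp  <- "\u00A0"                 # non-breaking space, entered as a Unicode escape
step1 <- gsub(" ", nbsp, "x y")   # inner call of f(): space -> NBSP
step2 <- gsub(nbsp, " ", step1)   # outer call of f(): NBSP -> space

inspect(nbsp)    # the replacement string itself
inspect(step1)   # result of the inner call
inspect(step2)   # result of the outer call
```

Running this once under each of the two Sys.setlocale() settings above should show whether the result of the inner call carries the same encoding mark as the pattern used by the outer call.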

This seems weird to me. (It also caused a bug in a package, because I
didn't realise some Windows users have a non-UTF-8 locale.)
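
One possible stopgap, as opposed to a fix for gsub itself, is to do both substitutions on raw bytes so the locale is not consulted at all. A minimal sketch, assuming the input is ASCII or can be converted to UTF-8 with enc2utf8(); f_bytes is a hypothetical name:

```r
# Hypothetical variant of f(): match literal bytes instead of characters.
# Assumes x is ASCII or can be converted to UTF-8 by enc2utf8().
f_bytes <- function(x) {
  nbsp <- "\u00A0"
  x <- enc2utf8(x)
  tmp <- gsub(" ", nbsp, x, fixed = TRUE, useBytes = TRUE)
  out <- gsub(nbsp, " ", tmp, fixed = TRUE, useBytes = TRUE)
  # With useBytes = TRUE the encoding mark on the result may differ from
  # the input's, so declare it again (ASCII strings stay unmarked).
  Encoding(out) <- "UTF-8"
  out
}

f_bytes("x y")
```

The trade-off is that fixed = TRUE gives up regular expressions, and byte matching is only safe here because every string involved is known to be UTF-8.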

I'm not sure what the correct resolution is.  Should the encoding of
the output of gsub be UTF-8 if either the input or the replacement is
UTF-8?  In non-UTF-8 locales, should the encoding of "\u{A0}" be "bytes"?

Hadley