Windows, format.POSIXct and character encodings
On May 1, 2013, at 5:33 PM, Simon Urbanek wrote:
On May 1, 2013, at 10:06 AM, Hadley Wickham wrote:
Hi all,
In what encoding does format.POSIXct return its output? It doesn't
seem to be utf-8:
Sys.setlocale("LC_ALL", "Japanese_Japan.932")
times <- c("1970-01-01 01:00:00 UTC", "1970-02-02 22:00:00 UTC")
ampm <- format(as.POSIXct(times), format = "%p")
x <- gsub(">", "*", paste(ampm, collapse = "+>"))
y <- "午前+*午後"
identical(x, y)
# [1] TRUE
# But, confusingly, ...
charToRaw(x)
# [1] e5 8d 88 e5 89 8d 2b 2a e5 8d 88 e5 be 8c
charToRaw(y)
# [1] 8c df 91 4f 2b 2a 8c df 8c e3
That's not confusing at all:
Encoding(x)
# [1] "UTF-8"
Encoding(y)
# [1] "unknown"
The first string is in UTF-8; the second is in the native encoding of the current locale (here code page 932).
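To make the point above concrete, here is a minimal sketch that starts from the UTF-8 bytes quoted earlier in the thread, so it does not depend on the session locale. It assumes the iconv implementation behind R accepts the encoding name "SHIFT_JIS" (names vary by platform); the variable names are only illustrative.

```r
# UTF-8 bytes as reported by charToRaw(x) earlier in the thread
utf8_bytes <- as.raw(c(0xe5, 0x8d, 0x88, 0xe5, 0x89, 0x8d, 0x2b, 0x2a,
                       0xe5, 0x8d, 0x88, 0xe5, 0xbe, 0x8c))
x <- rawToChar(utf8_bytes)
Encoding(x) <- "UTF-8"   # mark the bytes as UTF-8

# Re-encode the same text as Shift-JIS and keep the result as raw bytes.
# Assumption: "SHIFT_JIS" is a name this platform's iconv understands.
sjis_bytes <- iconv(x, from = "UTF-8", to = "SHIFT_JIS", toRaw = TRUE)[[1]]

print(utf8_bytes)
print(sjis_bytes)
```

If the conversion is available, sjis_bytes should match the charToRaw(y) output quoted above: the same characters, stored as different bytes.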
# So there's at least a small bug with identical
Nope: ?identical "Character strings are regarded as identical if they are in different marked encodings but would agree when translated to UTF-8."
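The documented behaviour can be checked without a Japanese locale. This is a small sketch using a single accented character marked as UTF-8 and as latin1; the strings are built from raw bytes so the result should not depend on the session locale, and the names a and b are only illustrative.

```r
# "é" stored as UTF-8 (two bytes) and as latin1 (one byte)
a <- rawToChar(as.raw(c(0xc3, 0xa9)))
Encoding(a) <- "UTF-8"
b <- rawToChar(as.raw(0xe9))
Encoding(b) <- "latin1"

# Compared as strings: translated to UTF-8 first, per ?identical
same_string <- identical(a, b)

# Compared as bytes: the underlying storage differs
same_bytes <- identical(charToRaw(a), charToRaw(b))

print(c(same_string = same_string, same_bytes = same_bytes))
```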
# And this causes a problem when you attempt to do
# stuff with the string
gsub("+", "*", x, fixed = T)
# Error in gsub("+", "*", x, fixed = T) :
# invalid multibyte string at '<8c>'
gsub("+", "*", y, fixed = T)
# [1] "午前**午後"
This is where the problem lies - and it has nothing to do with format:
z=enc2utf8("午前+*午後")
gsub("+", "*", z, fixed = T)
# Error in gsub("+", "*", z, fixed = T) :
# invalid multibyte string at '<8c>'
The cause is that fgrep_one() gives mbcslocale higher precedence than use_UTF8, so the fixed-string match is actually done in the MBCS locale encoding and not in UTF-8. Consequently, you'll see this only in multi-byte locales other than UTF-8. On, say, OS X you can reproduce it with
x="午前+*午後"
gsub("+", "*", x, fixed = T)
# Error in gsub("+", "*", x, fixed = T) :
# invalid multibyte string at '<8c>'
Correction: the OS X example above leaves out the switch to a non-UTF-8 multi-byte locale. It should have been
Sys.getlocale()
# [1] "en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8"
x="午前+*午後"
Encoding(x)
# [1] "UTF-8"
Sys.setlocale("LC_ALL", "ja_JP.SJIS")
# [1] "ja_JP.SJIS/ja_JP.SJIS/ja_JP.SJIS/C/ja_JP.SJIS/en_US.UTF-8"
gsub("+", "*", x, fixed = T)
# Error in gsub("+", "*", x, fixed = T) :
# invalid multibyte string at '<8c>'
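As a rough analogy only (this is not what fgrep_one() does internally), the sketch below shows why UTF-8 bytes stop making sense once they are read as Shift-JIS: the same bytes decode cleanly as UTF-8 but are rejected as Shift-JIS. It assumes the platform's iconv knows the name "SHIFT_JIS"; the variable names are illustrative.

```r
# UTF-8 bytes of the string from the thread, left unmarked on purpose
utf8_bytes <- as.raw(c(0xe5, 0x8d, 0x88, 0xe5, 0x89, 0x8d, 0x2b, 0x2a,
                       0xe5, 0x8d, 0x88, 0xe5, 0xbe, 0x8c))
s <- rawToChar(utf8_bytes)

# Read the bytes as what they really are: the conversion succeeds
as_utf8 <- iconv(s, from = "UTF-8", to = "UTF-8")

# Read the same bytes as if they were Shift-JIS: iconv() gives NA when
# the input is not a valid sequence in the 'from' encoding.
# Assumption: "SHIFT_JIS" is a name this platform's iconv understands.
as_sjis <- iconv(s, from = "SHIFT_JIS", to = "UTF-8")

print(c(utf8_ok = !is.na(as_utf8), sjis_ok = !is.na(as_sjis)))
```

The last byte of the string is 0x8c, which Shift-JIS treats as the first half of a two-byte character with nothing after it. That is at least consistent with the '<8c>' in the error message above.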
Cheers,
S
Inverting the precedence would fix this issue, but I'm not sure if it would have unwanted side-effects on MBCS locales ...
Cheers,
Simon
My session info is

R version 3.0.0 (2013-04-03)
Platform: x86_64-w64-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=Japanese_Japan.932  LC_CTYPE=Japanese_Japan.932
[3] LC_MONETARY=Japanese_Japan.932 LC_NUMERIC=C
[5] LC_TIME=Japanese_Japan.932

attached base packages:
[1] stats graphics grDevices utils datasets methods base

loaded via a namespace (and not attached):
[1] tools_3.0.0

Any ideas? Thanks!

Hadley

--
Chief Scientist, RStudio
http://had.co.nz/
______________________________________________ R-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel