[R-pkg-devel] Package Encoding and Literal Strings

Hi Joris,

Thanks for the example. You can simply have Test.R assign the two 
variables and then run

Encoding(utf8StringsPkg1::mathotString)
charToRaw(utf8StringsPkg1::mathotString)
Encoding(utf8StringsPkg1::tao)
charToRaw(utf8StringsPkg1::tao)

I tried on Linux, Windows/UTF-8 (the experimental version) and 
Windows/latin-1 (released version). In all cases, both strings are 
converted to native encoding. The mathotString is converted to latin-1 
fine, because it is representable there. The tao string is not 
representable in latin-1, so when running in a latin-1 locale its bytes 
are replaced by <xx> escapes:

"<e9><99><b6><e5><be><b7><e5><ba><86>"
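
The escapes above can be approximated in any locale with an explicit 
conversion. This is only a sketch: it assumes that iconv() with 
sub = "byte" behaves like the translation R performs internally, and it 
rebuilds the string from the nine bytes shown in the escapes rather than 
loading utf8StringsPkg1. The exact substitution can depend on the 
platform's iconv implementation.

```r
# Build the string from its UTF-8 bytes so the example does not depend on
# the encoding of this file or on the current locale.
tao <- rawToChar(as.raw(c(0xe9, 0x99, 0xb6, 0xe5, 0xbe, 0xb7, 0xe5, 0xba, 0x86)))
Encoding(tao) <- "UTF-8"

Encoding(tao)    # declared encoding of the string
charToRaw(tao)   # underlying bytes

# Explicit conversion to latin-1; sub = "byte" asks iconv() to replace each
# non-convertible byte with <xx> (its hex code).
escaped <- iconv(tao, from = "UTF-8", to = "latin1", sub = "byte")
print(escaped)
```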

Btw, the parse(,encoding="UTF-8") hack works: when you parse the 
modified Test.R file (with the two assignments) and eval the output, 
you get those strings in UTF-8. But when you don't eval and instead 
print the parse tree in Rgui, it will not be printed correctly (again a 
limitation of these hacks; they can only do so much).
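
A minimal sketch of the parse-and-eval route, for illustration only. It 
uses a temporary file as a stand-in for the modified Test.R (the real 
file is not reproduced here) and writes the assignment as raw bytes so 
that the result does not depend on the encoding of this message.

```r
# UTF-8 bytes of the tao string, taken from the escapes shown earlier.
tao_bytes <- as.raw(c(0xe9, 0x99, 0xb6, 0xe5, 0xbe, 0xb7, 0xe5, 0xba, 0x86))

# Hypothetical stand-in for Test.R: a single assignment, written byte by byte.
src <- tempfile(fileext = ".R")
writeBin(c(charToRaw('tao <- "'), tao_bytes, charToRaw('"\n')), src)

# encoding = "UTF-8" marks the parsed string literals as UTF-8;
# it does not re-encode the input.
exprs <- parse(src, encoding = "UTF-8")

# Evaluate the parsed expressions in a fresh environment.
env <- new.env()
for (e in exprs) eval(e, envir = env)
parsed_tao <- get("tao", envir = env)

Encoding(parsed_tao)
charToRaw(parsed_tao)

unlink(src)
```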

When accessing strings from C, you should always be prepared for any 
encoding in a CHARSXP, so when you want UTF-8, use "translateCharUTF8()" 
instead of "CHAR()". That will work fine on representable strings like 
mathotString, and that is conceptually the correct way to access them.

Strings that cannot be represented in the native encoding, like tao, will 
get the escapes, and so cannot be converted back to UTF-8. This is not 
great, but I see it was already the case in 3.6 (so not a recent 
regression) and I don't think it would be worth the time trying to fix 
that - as discussed earlier, only switching to UTF-8 would fix all of 
these translations, not just one. Btw, the example works fine on the 
experimental UTF-8 build on Windows.
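
The difference between representable and non-representable strings can 
be sketched with an explicit round trip through latin-1. The helper 
to_latin1_and_back() and the sample string e_acute are made up for this 
illustration (the actual value of mathotString is not shown in this 
thread), and iconv() is only assumed to approximate R's internal 
translation.

```r
# Hypothetical helper: convert UTF-8 -> latin-1 -> UTF-8, escaping
# non-convertible bytes as <xx> on the way out.
to_latin1_and_back <- function(x) {
  native <- iconv(x, from = "UTF-8", to = "latin1", sub = "byte")
  iconv(native, from = "latin1", to = "UTF-8")
}

# A character that latin-1 can represent (e with acute accent, U+00E9).
e_acute <- rawToChar(as.raw(c(0xc3, 0xa9)))
Encoding(e_acute) <- "UTF-8"

# The tao string, which latin-1 cannot represent.
tao <- rawToChar(as.raw(c(0xe9, 0x99, 0xb6, 0xe5, 0xbe, 0xb7, 0xe5, 0xba, 0x86)))
Encoding(tao) <- "UTF-8"

# Compare bytes before and after the round trip.
representable_ok <- identical(charToRaw(to_latin1_and_back(e_acute)),
                              charToRaw(e_acute))
tao_ok <- identical(charToRaw(to_latin1_and_back(tao)), charToRaw(tao))

representable_ok   # the representable string survives the round trip
tao_ok             # the original bytes of tao are lost
```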

I am sorry there is not a simple fix for non-representable characters.

Best
Tomas
On 12/18/20 1:53 PM, joris at jorisgoosen.nl wrote: