[R-pkg-devel] Package Encoding and Literal Strings

On 12/17/20 5:17 PM, joris at jorisgoosen.nl wrote:
The number of people using too old a version of Windows should be small
by the time this becomes ready for production. Windows 8.1 is still
supported, but there is the free upgrade to Windows 10 (also from the no
longer supported Windows 7), so this should not be a problem for desktop
machines. It will be a problem for servers.

String literals may be turned into the local encoding because that is how
R, packages and external software are written: they need the native
encoding. Hacks come in when such code is given a string that is not in
the local encoding, on the assumption that under some conditions the code
will still work. This includes a part of the parser and a hack to
implement the "encoding" argument of "parse()", which allows parsing
(non-representable) UTF-8 strings when running in a single-byte locale
such as Latin-1 (see ?parse).

UTF-8 is supported in R on Windows in many ways, as documented. As long
as you are using UTF-8 strings representable in the current encoding, so
that they can be converted to the native encoding and back without
problems, you are fine: R will do the conversions as needed. The troubles
come when such a conversion is not possible. In the example of the parser,
without the "encoding=" argument to "parse()", the parser will just work
on any text you give it, even when the text is in UTF-8: it will first
convert the text to the native encoding and then do the parsing, no
hacks involved. When interacting with external software, you'd just tell
R to provide the strings in the encoding needed by that software
(possibly UTF-8, which may involve a conversion), and all would work
fine. The problem is characters not representable in the native encoding.

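As a rough illustration of what "representable" means in the paragraph
above, here is a minimal sketch that checks whether a UTF-8 string
survives a round trip through another encoding. The helper name
roundtrips() is hypothetical (it is not part of R), and Latin-1 is used
only as a stand-in for a single-byte native encoding; iconv() and
enc2utf8() are standard base R functions.

```r
# Hypothetical helper: TRUE when a string can be converted from UTF-8
# to the encoding `to` and back without loss.
roundtrips <- function(x, to = "latin1") {
  x <- enc2utf8(x)                               # make sure the input is UTF-8
  there <- iconv(x, from = "UTF-8", to = to)     # NA if not representable
  back <- iconv(there, from = to, to = "UTF-8")  # NA stays NA
  !is.na(back) & back == x
}

roundtrips("caf\u00e9")      # e-acute exists in Latin-1
roundtrips("\u4e2d\u6587")   # CJK characters do not
```

With a Latin-1 target, the first call should give TRUE and the second
FALSE; strings of the second kind are where the troubles described above
begin.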
You mean the memory representation? For that there would be R Internals
and the sources. Essentially, there are CHARSXP objects, which include an
encoding tag (UTF-8, Latin-1 or native) and the raw bytes. But you would
not access these objects directly; instead, use translateChar() if you
need the strings in the native encoding or translateCharUTF8() if you
need them in UTF-8. This is documented in Writing R Extensions.
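The encoding tag and the raw bytes can also be inspected from the R
level, which may help when putting together a reproducible example. The
following is only a sketch using base R functions (Encoding(),
charToRaw(), iconv(), enc2utf8()); it is an R-level view and not the
C-level translateChar() interface mentioned above.

```r
x <- "caf\u00e9"     # a \u escape yields a string marked as UTF-8
Encoding(x)          # "UTF-8"
charToRaw(x)         # 63 61 66 c3 a9

# Convert to Latin-1: same text, different bytes and a different tag.
y <- iconv(x, from = "UTF-8", to = "latin1")
Encoding(y)          # "latin1"
charToRaw(y)         # 63 61 66 e9

# Converting back to UTF-8 restores the original bytes.
identical(charToRaw(enc2utf8(y)), charToRaw(x))
```

Pure ASCII strings carry no such mark and are reported as "unknown" by
Encoding().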

I think it would be really good if you could provide a complete, minimal
reproducible example of your problem. There may be some
misunderstanding: if you are working only with characters
representable in the current encoding, there should be no problem.

I understand. Also, it may take a bit of time before this becomes
stable.

Best
Tomas