I'm trying to find information about how to use Rprintf with a UTF-8 encoded string, and I'm not sure what the right cross-platform usage is. I found an earlier thread about this (http://r.789695.n4.nabble.com/How-to-print-UTF-8-encoded-strings-from-a-C-routine-to-R-s-output-td4724337.html) but it wasn't very helpful.

If I want to print a UTF-8 string, I can do one of the following:

1) Send native data via Rprintf("%s", translateChar(str));
2) Send UTF-8 data via Rprintf("%s", translateCharUTF8(str));

If Rprintf is sending its output to stdout, then (1) seems like the correct option. If Rprintf is sending to a file connection with encoding set to UTF-8 (for example, after a call to sink(file(..., encoding="UTF-8"))), then (2) is correct.

Is there any way to know the encoding that Rprintf is expecting?

Thanks,
Patrick

--
Patrick Perry
Assistant Professor
New York University
Rprintf expected encoding
2 messages · Patrick Perry, Duncan Murdoch
On 30/06/2017 4:24 PM, Patrick Perry wrote:
> I'm trying to find information about how to use Rprintf with a UTF-8 encoded string, and I'm not sure what the right cross-platform usage is. I found an earlier thread about this (http://r.789695.n4.nabble.com/How-to-print-UTF-8-encoded-strings-from-a-C-routine-to-R-s-output-td4724337.html) but it wasn't very helpful.
>
> If I want to print a UTF-8 string, I can do one of the following:
>
> 1) Send native data via Rprintf("%s", translateChar(str));
> 2) Send UTF-8 data via Rprintf("%s", translateCharUTF8(str));
>
> If Rprintf is sending its output to stdout, then (1) seems like the correct option. If Rprintf is sending to a file connection with encoding set to UTF-8 (for example, after a call to sink(file(..., encoding="UTF-8"))), then (2) is correct.
>
> Is there any way to know the encoding that Rprintf is expecting?
It always expects the native encoding. If the output connection is UTF-8 encoded, it will translate from native to UTF-8 as it writes.

Things will hopefully change in R 3.5.0, since the translation from UTF-8 to native to UTF-8 can lose information (and is inefficient even if not lossy). I think old code should behave as it did in the past, but there will be a way to say that the incoming string is in UTF-8.

Duncan Murdoch
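
To make the information-loss point concrete, here is a minimal standalone sketch of a UTF-8 to native to UTF-8 round trip. It uses the POSIX iconv functions directly and is not taken from R's sources. The helper names convert and roundtrip_is_lossless are made up for illustration, and the "native" encoding is just a parameter (for example "ASCII", which is roughly what a session in the C locale would have).

```c
#include <iconv.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: convert the NUL-terminated string `in` from encoding
   `from` to encoding `to`, writing a NUL-terminated result into `out`.
   Returns 0 on success and -1 if the conversion could not be completed
   (unknown encoding, unrepresentable character, or output buffer too small). */
static int convert(const char *from, const char *to,
                   const char *in, char *out, size_t outsize)
{
    iconv_t cd = iconv_open(to, from);   /* note the order: (tocode, fromcode) */
    if (cd == (iconv_t)-1)
        return -1;

    char *inp = (char *)in;              /* iconv() takes a non-const pointer */
    size_t inleft = strlen(in);
    char *outp = out;
    size_t outleft = outsize - 1;        /* reserve room for the terminator */

    size_t rc = iconv(cd, &inp, &inleft, &outp, &outleft);
    iconv_close(cd);
    if (rc == (size_t)-1)
        return -1;

    *outp = '\0';
    return 0;
}

/* Hypothetical helper: returns 1 if `utf8` survives a trip through
   `native_enc` and back unchanged, 0 otherwise. Some iconv implementations
   fail on unrepresentable characters and others substitute a placeholder;
   the final comparison catches both cases. Inputs are assumed to be short
   (under 256 bytes in either encoding); longer ones are reported as lossy. */
static int roundtrip_is_lossless(const char *utf8, const char *native_enc)
{
    char native[256], back[256];

    if (convert("UTF-8", native_enc, utf8, native, sizeof native) != 0)
        return 0;
    if (convert(native_enc, "UTF-8", native, back, sizeof back) != 0)
        return 0;
    return strcmp(utf8, back) == 0;
}
```

With "ASCII" as the stand-in native encoding, roundtrip_is_lossless("cafe", "ASCII") reports that the string survives, while roundtrip_is_lossless("caf\xc3\xa9", "ASCII") (the UTF-8 bytes for "café") reports a loss. That is the situation in which going through the native encoding on the way to a UTF-8 connection cannot preserve the original text.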