
R 3.5.3 and 3.6.0 alpha Windows bug: UTF-8 characters in code are simplified to wrong ones

On 4/10/19 6:32 PM, Jeroen Ooms wrote:
Rgui is a "Windows Unicode" application (it uses UTF-16LE), but it needs
to convert its input to the native encoding before passing it to R,
which is based on locales. However, R passes that string on to the
parser, and Rgui takes advantage of this: it converts non-representable
characters to their \uxxxx escapes, which the parser understands. With
this trick, Unicode characters can get from Rgui to the parser (though
they are of course still at risk of conversion later, when the program
runs). Rgui escapes only characters that cannot be represented.
Unfortunately, the standard C99 API used for that conversion, as
implemented on Windows, does a best fit, so a character that has a
best-fit substitute is replaced rather than escaped. This could be fixed
in Rgui by calling a special Windows API function instead, but with the
mentioned risk that it would break existing uses that depend on the
current behavior.
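
To illustrate the escaping trick described above, here is a minimal
sketch in Python (not Rgui's actual C code). The function name
escape_unrepresentable and the choice of cp1252 as the "native"
encoding are assumptions made for the example. Python's strict encoder
never does a best fit, so the sketch shows the behavior one would get
without best fit: every character lacking an exact mapping is escaped.

```python
def escape_unrepresentable(text: str, encoding: str = "cp1252") -> str:
    """Replace characters that `encoding` cannot represent exactly
    with \\uxxxx (or \\Uxxxxxxxx) escapes.

    Hypothetical helper for illustration; cp1252 stands in for the
    native encoding of the Windows session.
    """
    out = []
    for ch in text:
        try:
            # Strict encoding: raises when there is no exact mapping,
            # so no best-fit substitution can happen here.
            ch.encode(encoding)
            out.append(ch)
        except UnicodeEncodeError:
            cp = ord(ch)
            # Four hex digits inside the BMP, eight beyond it.
            out.append(f"\\u{cp:04x}" if cp <= 0xFFFF else f"\\U{cp:08x}")
    return "".join(out)


if __name__ == "__main__":
    # U+03B1 (Greek alpha) has no cp1252 mapping; U+00E9 (e acute) does.
    print(escape_unrepresentable('x <- "\u03b1 \u00e9"'))
```

With best fit, a character such as the Greek alpha would instead be
handed to the parser as whatever substitute the conversion picked, and
the escape branch would never be reached for it.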

This is the only place I know of where removing best fit would lead to
a correct representation of UTF-8 characters. Elsewhere you will get NA
or some other escapes, or the code will fail to parse (e.g. "incomplete
string", which one can easily get with source()).

Tomas
