R 3.5.3 and 3.6.0 alpha Windows bug: UTF-8 characters in code are simplified to wrong ones

13 messages · Jeroen Ooms, Yihui Xie, Duncan Murdoch +2 more

#
Hello,

There is a long-standing problem with processing UTF-8 source code in R
on Windows. As Windows does not have a UTF-8 locale and R passes
source code through the OS before executing it, some characters are
"simplified" by the OS before processing, leading to undesirable
changes.

Minimal example:
Let's type "ř" (LATIN SMALL LETTER R WITH CARON) in the RGui console:
> "ř"
[1] "r"

Let's assume the following script:
# file [script.R]
if ("ř" != "\U00159") {
    stop("Problem: Unexpected character conversion.")
} else {
    cat("o.k.\n")
}

Problem:
source("script.R", encoding = "UTF-8")

OK (see https://stackoverflow.com/questions/5031630/how-to-source-r-file-saved-using-utf-8-encoding):
eval(parse("script.R", encoding = "UTF-8"))

Although the script is in UTF-8, the characters are replaced by
"simplified" substitutes uncontrollably (depending on the OS locale). The
same happens when simply entering the code statements in the R console.
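
The comparison above can be reproduced with a self-contained sketch. It writes the script as UTF-8 bytes regardless of the session locale and then runs it both ways. The helper name run_quietly and the temporary file are illustrative and not part of the original report. In a UTF-8 locale both calls are expected to succeed; the source() failure described above should only appear where the native encoding cannot represent "ř".

```r
# Write the test script as UTF-8 bytes, independent of the session locale.
script <- tempfile(fileext = ".R")
writeLines(c(
  'if ("\u0159" != "\\U00159") {',
  '    stop("Problem: Unexpected character conversion.")',
  '} else {',
  '    cat("o.k.\\n")',
  '}'
), script, useBytes = TRUE)

# Hypothetical helper: evaluate an expression, return "ok" or the error text.
run_quietly <- function(expr) {
  tryCatch({ force(expr); "ok" }, error = function(e) conditionMessage(e))
}

# source() re-encodes the input to the native encoding before parsing.
via_source <- run_quietly(source(script, encoding = "UTF-8"))

# parse() only marks the strings as UTF-8; nothing is re-encoded.
via_parse <- run_quietly(eval(parse(script, encoding = "UTF-8")))

print(c(source = via_source, parse = via_parse))
```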

The problem does not occur on operating systems with a UTF-8 locale (macOS, Linux, ...).

Best regards
Tomas Boril
_
platform       x86_64-w64-mingw32
arch           x86_64
os             mingw32
system         x86_64, mingw32
status         alpha
major          3
minor          6.0
year           2019
month          04
day            07
svn rev        76333
language       R
version.string R version 3.6.0 alpha (2019-04-07 r76333)
nickname
[1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United
States.1252;LC_MONETARY=English_United
States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"
#
On 4/10/19 10:22 AM, Tomáš Bořil wrote:
On my system with your example,
Error in eval(ei, envir) : Problem: Unexpected character conversion.
Error in eval(ei, envir) : Problem: Unexpected character conversion..
o.k.

Which is expected, unfortunately. As per documentation of ?source, the 
"encoding" argument tells source() that the input is in UTF-8, so that 
source() can convert it to the native encoding. Again as documented, 
parse() uses its encoding argument to mark the encoding of the strings, 
but it does not re-encode, and the character strings in the parsed 
result will as documented have the encoding mark (UTF-8 in this case).
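
The difference between re-encoding and marking can be inspected directly. The sketch below assumes a throwaway one-line file (not from the original report), parses it with encoding = "UTF-8", and looks at the string constant in the parsed result. Encoding() and utf8ToInt() are base R functions.

```r
# A one-line script containing U+0159, written as raw UTF-8 bytes.
f <- tempfile(fileext = ".R")
writeLines('x <- "\u0159"', f, useBytes = TRUE)

# parse() marks the string constant as UTF-8; it does not re-encode it.
exprs <- parse(f, encoding = "UTF-8")
s <- exprs[[1]][[3]]   # right-hand side of the assignment

print(Encoding(s))     # the encoding mark carried by the string
print(utf8ToInt(s))    # the code point(s) actually stored
```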

Yes. By default, Windows uses "best fit" when translating characters to
the native encoding. This could be changed in principle, but could break 
existing applications that may depend on it, and it won't really help 
because such characters cannot be represented anyway. You can find more 
in ?Encoding, but yes, it is a known problem frequently encountered by 
users and unless Windows starts supporting UTF-8 as native encoding, 
there is no easy fix (a version from Windows 10 Insider preview supports 
it, so maybe that is not completely hopeless). In theory you can 
carefully read the documentation and use only functions that can work 
with UTF-8 without converting to native encoding, but pragmatically, if 
you want to work with UTF-8 files in R, it is best to use a non-Windows 
platform.

Best
Tomas

  
  
#
On Wed, Apr 10, 2019 at 12:19 PM Tomáš Bořil <borilt at gmail.com> wrote:
I think this is a "feature" of win_iconv that is bundled with base R
on Windows (./src/extra/win_iconv). The character from your example is
not part of the latin1 (iso-8859-1) set, however, win_iconv seems to
convert it anyway:
[1] "ř"
[1] "r"

On macOS, iconv tells us this character cannot be represented as latin1:
[1] "ř"
[1] NA
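
The commands behind the two results above are not preserved here; a minimal sketch of such a comparison, using only base R's iconv(), might look as follows. The outcome depends on the platform's iconv implementation: according to the results quoted in this message, the bundled win_iconv yields "r" while macOS yields NA. The sub argument is included only to make an unconvertible character visible instead of NA. It is not something proposed in this thread.

```r
x <- "\u0159"   # LATIN SMALL LETTER R WITH CARON, stored as UTF-8

# Plain conversion: the result depends on the iconv implementation in use.
plain <- iconv(x, from = "UTF-8", to = "latin1")

# With sub = "byte", unconvertible bytes are shown as <xx> instead of NA.
marked <- iconv(x, from = "UTF-8", to = "latin1", sub = "byte")

print(plain)
print(marked)
```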

I'm actually not sure why base-R needs win_iconv (but I'm not an
encoding expert at all). Perhaps we could try to unbundle it and use
the standard libiconv provided by the Rtools toolchain bundle to get
more consistent results.
#
On 4/10/19 1:14 PM, Jeroen Ooms wrote:
win_iconv just calls into Windows API to do the conversion, it is 
technically easy to disable the "best fit" conversion, but I think it 
won't be a good idea. In some cases, perhaps rare, the best fit is good, 
actually including the conversion from "ř" to "r" which makes perfect 
sense. But more importantly, changing the behavior could affect users 
who expect the substitution to happen because it has been happening for 
many years, and it won't help others much.

Tomas
#
Since it is "technically easy" to disable the best fit conversion and
the best fit is rarely good, how about providing an option for
code/package authors to disable it? I'm asking because this is one of
the most painful issues in packages that may need to source() code
containing UTF-8 characters that are not representable in the Windows
native encoding. Examples include knitr/rmarkdown and shiny. Basically
users won't be able to knit documents or run Shiny apps correctly when
the code contains characters that cannot be represented in the native
encoding.

Regards,
Yihui
--
https://yihui.name
On Wed, Apr 10, 2019 at 6:36 AM Tomas Kalibera <tomas.kalibera at gmail.com> wrote:
#
On 10/04/2019 10:29 a.m., Yihui Xie wrote:
Wouldn't things be worse with it disabled than currently?  I'd expect 
the line containing the "ř" to end up as NA instead of converting to "r".

Of course, it would be best to be able to declare source files as UTF-8 
and avoid any conversion at all, but as Tomas said, that's a lot harder.

Duncan Murdoch
#
On Wed, Apr 10, 2019 at 5:45 PM Duncan Murdoch <murdoch.duncan at gmail.com> wrote:
I don't think it would be worse, because in this case R would not
implicitly convert strings to (best fit) latin1 on Windows, but
instead keep the (correct) string in its UTF-8 encoding. The NA only
appears if the user explicitly forces a conversion to latin1, which is
not the problem here I think.

The original problem that I can reproduce in RGui is that if you enter
 "ř" in RGui, R opportunistically converts this to latin1, because it
can. However if you enter text which can definitely not be represented
in latin1, R encodes the string correctly in UTF-8 form.
#
On 10/04/2019 12:32 p.m., Jeroen Ooms wrote:
I think the pathways for text in RGui and text being sourced are 
different.  I agree fixing RGui in that way would make sense, but Yihui 
was talking about source().

Duncan Murdoch
#
On 4/10/19 6:32 PM, Jeroen Ooms wrote:
Rgui is a "Windows Unicode" application (it uses UTF-16LE), but it needs to
convert the input to the native encoding before passing it to R, which is
based on locales. However, that string is passed by R to the parser,
which Rgui takes advantage of: it converts non-representable characters
to their \uxxxx escapes, which are understood by the parser. Using this
trick, Unicode characters can get to the parser from Rgui (but of course
they are then still at risk of conversion later when the program runs).
Rgui only escapes characters that cannot be represented; unfortunately,
the standard C99 API for that, as implemented on Windows, does the best
fit. This could be fixed in Rgui by calling a special Windows API
function, but with the mentioned risk that it would break existing
uses that rely on the existing behavior.
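
The escaping trick described above can be approximated at the R level. The following is a rough sketch and not Rgui's actual implementation: the function name escape_non_ascii is made up for illustration, and it escapes every non-ASCII character instead of only the non-representable ones. It is only useful when those characters occur inside string literals, which is where the parser interprets such escapes.

```r
# Replace every non-ASCII character by a \uxxxx (or \Uxxxxxxxx) escape,
# so that the resulting code is pure ASCII and survives any conversion
# to the native encoding. Hypothetical helper, for illustration only.
escape_non_ascii <- function(code) {
  escape_one <- function(line) {
    cps   <- utf8ToInt(enc2utf8(line))          # code points of the line
    chars <- intToUtf8(cps, multiple = TRUE)    # one string per character
    esc   <- ifelse(cps > 0xFFFF,
                    sprintf("\\U%08x", cps),    # outside the BMP
                    sprintf("\\u%04x", cps))    # inside the BMP
    chars[cps > 127L] <- esc[cps > 127L]
    paste(chars, collapse = "")
  }
  vapply(code, escape_one, character(1), USE.NAMES = FALSE)
}

# Usage: the escaped line is ASCII-only but parses to the same string.
escaped <- escape_non_ascii('x <- "\u0159"')
print(escaped)
```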

This is the only place I know of where removing best fit would lead to 
correct representation of UTF-8 characters. Other places will give NA, 
some other escapes, code will fail to parse (e.g. "incomplete string", 
one can get that easily with source()).

Tomas
#
For me, this would be a perfect solution.

I.e., do not use the "best" fit and leave it to the user's competence:
a) in some functions, UTF-8 works
b) in others -> an error is thrown (e.g., incomplete string, NA, etc.)
=> the user has to change the code with his/her intentional "best fit string
literal substitute" or use another function that can handle UTF-8.

Making R code work correctly only on some platforms, or trying to keep
backward compatibility in the sense of "the code does not do what you want
and the behaviour differs depending on each and every locale, but at least
it does not throw an error", is generally not a good idea - it is dangerous.
Users / coders should know that there is something wrong with their strings
and that some characters are "eaten alive".

Tomas

On Thu, Apr 11, 2019 at 8:26 AM Tomas Kalibera <tomas.kalibera at gmail.com>
wrote:

  
  
#
Or, if this cannot be done easily, please disable the "utf-8" value
of the encoding argument in the source() function in R on Windows.
source(..., encoding = "utf-8")
-> error: "utf-8" does not work right on Windows.
-> (or, at least) warning: "utf-8" is handled by "best fit" on Windows
and some characters in string literals may be automatically changed.

Because, in its current state, the UTF-8 encoding of R source files on
Windows is a fake Unicode, as it can in reality handle only 256 different
ANSI characters.
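
A user-level check can approximate the suggested warning without any change to base R. The sketch below is only an illustration under stated assumptions: lossy_lines and source_utf8_checked are hypothetical names, and the round trip (UTF-8 to native encoding and back) is just one way to detect lines that the native encoding would alter. When a mismatch is found it warns, and then evaluates the file via eval(parse(..., encoding = "UTF-8")), the route reported as working earlier in this thread.

```r
# Return the numbers of the lines whose text changes (or is lost) when
# converted from UTF-8 to the native encoding and back. Hypothetical helper.
lossy_lines <- function(path) {
  lines  <- readLines(path, encoding = "UTF-8", warn = FALSE)
  native <- iconv(lines, from = "UTF-8", to = "")   # "" = native encoding
  back   <- iconv(native, from = "", to = "UTF-8")
  which(is.na(back) | back != lines)
}

# Warn about lossy lines, then evaluate the file without re-encoding it.
# Hypothetical wrapper, not a replacement for source().
source_utf8_checked <- function(path, envir = parent.frame()) {
  bad <- lossy_lines(path)
  if (length(bad) > 0L) {
    warning("Lines not representable in the native encoding: ",
            paste(bad, collapse = ", "), call. = FALSE)
  }
  invisible(eval(parse(path, encoding = "UTF-8"), envir = envir))
}
```

In a UTF-8 locale lossy_lines() should report nothing, so the wrapper then behaves like a plain eval(parse(...)).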

Thanks,
Tomas
On Thu, Apr 11, 2019 at 8:53 AM Tomáš Bořil <borilt at gmail.com> wrote:
#
On 4/11/19 9:10 AM, Tomáš Bořil wrote:
This is not a fair statement. source(,encoding="UTF-8") works as 
documented. It translates from (full) UTF-8 to current native encoding, 
which is documented. I believe the authors who made these design 
decisions over a decade ago, under different circumstances, and 
carefully implemented the code, tested, and documented for you to use 
for free, deserve to be addressed with some respect. It is not their 
responsibility to read the documentation for you, and if you had read 
and understood it, you would not have used source(,encoding="UTF-8") 
with characters not representable in current native encoding on Windows. 
The authors should not be blamed for the fact that the design _today_ does
not seem perfect for _today's_ systems (and how could they have guessed at
that time that Windows would still not support UTF-8 as native encoding today).

Tomas
#
I do not blame anybody and I have huge respect for all authors of
R. Actually, I like R very much and I would like to thank everyone
who contributes to it. I use R regularly in my work (moved from Java,
C# and Matlab), I have created a package rPraat for phonetic analyses
and I think R is a very well designed language which will survive for
decades. I am trying to get new users (my students at a non-technical
university) to use programming for their everyday problems
(statistics, phonetic analyses, text processing) and they enjoy R. I
am really positive about this (it is hard to express emotions in e-mails
without using emoticons in every sentence). And that is why I would
like it to be even better.

I only suggest adding one line of code (metaphorically) to the source()
function in R for Windows to make it even better and to warn all users
who do not read the whole documentation for each function thoroughly and
carefully.

Tomas
On Thu, Apr 11, 2019 at 9:54 AM Tomas Kalibera <tomas.kalibera at gmail.com> wrote: