R 3.5.3 and 3.6.0 alpha Windows bug: UTF-8 characters in code are simplified to wrong ones - R-devel

Wed, Apr 10, 2019 9:13 AM

Yes, again in a script sourced by source(encoding = ...). But also by
typing it directly in R console.

Most of the time, I use RStudio as a front-end. For this experiment, I
also verified it in Rgui. Both front-ends behave in exactly the same
way.

An optional parameter to the source() function that would translate all
UTF-8 characters in string literals to their "\Uxxxx" codes sounds like
a great idea (I hope it would fix 99.9% of the problems I have,
because that is how I work around these problems nowadays), ideally
with the same behaviour on the command line...
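
A minimal sketch of what such a translation could look like when done by hand. The helper name escape_non_ascii() is hypothetical (it is not part of base R or of source()), and it only handles a single string, not a whole script:

```r
# Hypothetical helper (not part of base R): rewrite every non-ASCII
# character of a single string as a \UXXXXXXXX escape.
escape_non_ascii <- function(x) {
  stopifnot(is.character(x), length(x) == 1L, !is.na(x))
  cps <- utf8ToInt(enc2utf8(x))              # Unicode code points
  chars <- intToUtf8(cps, multiple = TRUE)   # one string per code point
  out <- ifelse(cps < 128L, chars, sprintf("\\U%08X", cps))
  paste(out, collapse = "")
}

# The input is written with an escape so that this snippet stays pure ASCII.
escaped <- escape_non_ascii("Bo\u0159il")
cat(escaped, "\n")
```

The escaped text can be pasted into a string literal in a script, which then no longer depends on how the file or the console input is decoded.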

Tomas

On Wed, Apr 10, 2019 at 5:29 PM Tomas Kalibera <tomas.kalibera at gmail.com> wrote:

On 4/10/19 3:02 PM, Tomáš Bořil wrote:

The thing is, I would much prefer that R (on those rare occasions where
an old function does not support anything but ANSI encoding) threw an
error:
"Unicode encoding not supported, please change the string in your
code" instead of silently converting some characters to different ones
without any warning.
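
Until something like that exists in R itself, the requested behaviour can be approximated in user code. The following is only a sketch: strict_iconv() is a hypothetical name, and the round-trip comparison is an assumption about how to detect a lossy conversion, not how R does anything internally.

```r
# Hypothetical helper: convert UTF-8 text to another encoding, but fail
# loudly if any character would be lost or substituted on the way.
strict_iconv <- function(x, to) {
  x <- enc2utf8(x)
  converted <- iconv(x, from = "UTF-8", to = to)
  back <- iconv(converted, from = to, to = "UTF-8")
  # Lossy if the conversion failed (NA) or did not survive the round trip,
  # which also catches a silent "best fit" substitution.
  lossy <- is.na(converted) | is.na(back) | back != x
  if (any(lossy)) {
    stop("cannot represent in ", to, ": ",
         paste(encodeString(x[lossy], quote = '"'), collapse = ", "))
  }
  converted
}

strict_iconv("caf\u00e9", "latin1")   # e-acute exists in latin1, so this succeeds
try(strict_iconv("\u0159", "latin1")) # r with caron does not, so this is an error
```

Because the check compares the round-tripped text with the input, it should raise an error both where iconv() returns NA and where a substitute character is returned.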

In principle it probably could be optional, as Yihui Xie asks on R-devel;
we will discuss that internally. If the Windows "best fit" is a big
problem on its own, this is something that could be done quickly, if
optional. We could turn into errors only those conversions that we
control (inside R code), indeed, but that should be most of them.

I understand that there are some functions which are not
Unicode-compatible yet but according to the Stackoverflow discussion I
cited before, in many cases (90% or more?) everything works right with
Encoding("\U00159") == "UTF-8" (in my scripts, I have not found any
problem with explicit UTF-8 coding yet).

Well there has been a lot of effort invested to make that possible, so
that many internal string functions do not convert unnecessarily into
UTF-8, mostly by Duncan Murdoch, but much more needs to be done and
there is the problem with packages. Of course if you find a concrete R
function that unnecessarily converts (source() is debatable, I know
about it, so some other), you are welcome to report, I or someone can
fix. A common problem is I/O (connections) and there the fix won't be
easy, it would have to be re-designed. The problem is that when we have
something typed "char *" inside R, it needs to be always in native
encoding, any mix would lead to total chaos.

The full solution would however only be fully switching to UTF-8
internally on Windows (and then char * would always mean UTF-8), we have
discussed this many times inside R Core (and many times before I
joined), I am sure it will be discussed again at some point and we are
aware of course of the problem. Please trust us it is hard to do - we
know the code as we (collectively) have written it. People contributing
to SO are users and package developers, not developers of the core. You
can get more correct information from people on R-devel (package
developers and sometimes core developers).

The only problem is that I
cannot simply use enc2utf8("?") - it is converted to "o" before
the function is executed. Instead, I have to explicitly type
"\U00159" throughout my code.

What do you mean it is "converted before"? Under what context? Again a
script sourced by source(encoding=) ?

And, are you using Rgui as front-end?

In my lectures, I have Czech, Russian and English students and it is
also impossible to create a script that works for everyone. In fact, I
know that Czech "ř" can be translated to my native (Czech) encoding. I
have just chosen the example as it is reproducible in English locale.

Originally, I had a problem with the IPA character (phonetic symbol) "œ",
i.e. "\U00153". In a Czech locale, it is translated to "o". In an English
locale, it is not converted - it remains "œ". But if I use "\U00153" in a
Czech locale, nothing is converted and everything works right.

Yes, the \u* sequence I hear is commonly used to represent UTF-8 string
literals in something that is not UTF-8 itself. Note if you have a
package, you can have R source files with UTF-8 encoded literal strings
if you declare Encoding: UTF-8 in the DESCRIPTION file (see Writing R
Extensions for details), even though sometimes people run into
trouble/bugs as well.

You probably know none of these problems exist on Linux nor macOS, where
UTF-8 is the native encoding.

Tomas

Tomas



On Wed, Apr 10, 2019 at 2:37 PM Tomas Kalibera <tomas.kalibera at gmail.com> wrote:

On 4/10/19 2:06 PM, Tomáš Bořil wrote:

Thank you for the explanation, but I just do not understand one thing: why would R need to be rewritten from scratch to work with Unicode internally?

If I call the script with
eval(parse("script.R", encoding = "UTF-8"))
it works perfectly - it looks like R functions already support Unicode. When I type "\U00159", R also has no problem with that.
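
A self-contained sketch of that call, assuming a throwaway script written to a temporary file (the file name and the variable x are made up for illustration):

```r
# Write a one-line script containing a UTF-8 string literal.
# useBytes = TRUE keeps the UTF-8 bytes as they are, whatever the locale.
path <- tempfile(fileext = ".R")
writeLines(enc2utf8('x <- "\u0159"'), path, useBytes = TRUE)

# Parse the file, marking its string literals as UTF-8, then evaluate
# the expressions in a fresh environment.
exprs <- parse(path, encoding = "UTF-8")
env <- new.env()
for (e in exprs) eval(e, envir = env)

# The code point survives: 345 is U+0159 in decimal.
utf8ToInt(enc2utf8(env$x))
unlink(path)
```

Inspecting the code point rather than printing the string avoids depending on what the console can display.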

Well there is support for unicode, but the problem is that at some point translation to native encoding is needed. The parser does not do that, nothing you call in your example script does it, but many other functions do. Note that you can use UTF-8 without problems as long as you only have characters that can be represented also in the current native encoding. So, if you run in a Czech locale, Czech characters in UTF-8 will work fine, just they will sometimes be translated to corresponding Czech characters in your native encoding.

If you want to learn more about encodings in R, look at ?Encoding, Writing R Extensions, etc. In principle, every R object representing a string has a flag saying whether the string is in UTF-8, in latin1, or in the current native encoding. But C structures typed "char *" are almost always in the current native encoding; any mixture would lead to chaos. Most functions operating on strings have to specially handle UTF-8, MBCS encodings, ASCII, etc. All of that would have to be rewritten. Many Windows API calls still use the native encoding version (some can use UTF-16LE via conversion from UTF-8 or other encodings).
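
The per-string flag can be inspected from R code with Encoding(). A small sketch, building the latin1 bytes explicitly so that the snippet itself stays ASCII (the variable names are arbitrary):

```r
# "facile" with c-cedilla, as latin1 bytes: 66 61 e7 69 6c 65
x <- rawToChar(as.raw(c(0x66, 0x61, 0xE7, 0x69, 0x6C, 0x65)))
Encoding(x) <- "latin1"   # declare how the bytes are to be read

y <- enc2utf8(x)          # same text, re-encoded and marked as UTF-8

Encoding(c(x, y, "abc"))  # "latin1" "UTF-8" "unknown" (ASCII is never marked)
x == y                    # TRUE: comparison takes the marks into account
charToRaw(x)              # 66 61 e7 69 6c 65
charToRaw(y)              # 66 61 c3 a7 69 6c 65
```

The two strings compare equal although their bytes differ, which is the bookkeeping that C code working on a bare "char *" does not get for free.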

In principle, it should work to have UTF-8 coded string constants in R programs, and definitely so if you use \uxxxx (see Writing R Extensions for details). But you should always run in a native encoding where these characters can be represented, otherwise it may or may not work, depending on which functions you call.
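
One way to find out in advance whether a session's native encoding can hold the characters a script needs is sketched below. The helper is_native_representable() is hypothetical, and the round trip through iconv() is an assumption about what counts as "representable":

```r
# Is the native encoding of this session UTF-8?
l10n_info()[["UTF-8"]]

# Hypothetical helper: TRUE where a string survives a round trip through
# the native encoding (to = "" means "current native encoding" in iconv()).
is_native_representable <- function(x) {
  x <- enc2utf8(x)
  native <- iconv(x, from = "UTF-8", to = "")
  back <- iconv(native, from = "", to = "UTF-8")
  !is.na(native) & !is.na(back) & back == x
}

# r with caron and the oe ligature; the result depends on the locale.
is_native_representable(c("abc", "\u0159", "\u0153"))
```

In a UTF-8 locale all three should come back TRUE; elsewhere the answer depends on the code page, which is exactly the locale dependence described above.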

Tomas

Thanks,
Tomas

On Wed, Apr 10, 2019 at 1:52 PM Tomas Kalibera <tomas.kalibera at gmail.com> wrote:

On 4/10/19 1:35 PM, Tomáš Bořil wrote:

Which users make their code depend on an automatic conversion that
behaves differently in each European country, but only on Windows?

I meant the "best fit". The same R scripts for the same data sets would
be returning different results, people capture existing behavior without
necessarily knowing about it. Removing the "best fit" would not remove
the translation to native encoding, you would get NA or some escape
sequence/character code number instead of the "best fit" character.  It
would not solve the problem.
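
The alternatives to a "best fit" character can be seen with the sub argument of iconv(). This is only an illustration of the possible outcomes; what each call returns depends on the platform's iconv implementation, so no output is shown:

```r
x <- "Bo\u0159il"

# Default: NA where the character cannot be converted, or a substitute
# character where the platform applies a "best fit".
plain <- iconv(x, from = "UTF-8", to = "latin1")

# sub = "byte": unconvertible bytes are written as <xx> hex codes.
as_bytes <- iconv(x, from = "UTF-8", to = "latin1", sub = "byte")

# sub = "?": unconvertible bytes are replaced by a fixed string.
as_mark <- iconv(x, from = "UTF-8", to = "latin1", sub = "?")

print(list(plain = plain, as_bytes = as_bytes, as_mark = as_mark))
```

In none of these cases does the original character reach the target encoding, which is the point made above: the translation itself is the problem.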

The real problem is that the conversion to native encoding happens. This
question has been discussed many times before, but in short, it would
take probably many 1000s of hours of developer time to rewrite R to use
UTF-8 internally, but convert to UTF16-LE in all Windows API calls. It
will cause changes to documented behavior. What may not be obvious,
there is a problem with package code written in C/C++ that ignores
encoding flags (that is almost all native code in packages). That code
will stop working and there will be no way to test - because the input
data in the contributed examples/tests are ASCII.

If Windows starts supporting UTF-8 as a native encoding, the fix will be
a lot easier (I hope ~100 hours), and without the compatibility problems -
only users who wish to use UTF-8 as the native encoding will be
affected, and things will probably work for them even with poorly
written packages.

Tomas

If someone needs the explicit conversion, they can call the iconv() function.

Many more people using R for text processing are frustrated that they can
code only in ASCII (0-255), even though their code is saved in
Unicode.

Tomas




On Wed, Apr 10, 2019 at 1:26 PM Tomas Kalibera <tomas.kalibera at gmail.com> wrote:

On 4/10/19 1:14 PM, Jeroen Ooms wrote:

On Wed, Apr 10, 2019 at 12:19 PM Tomáš Bořil <borilt at gmail.com> wrote:

Minimalistic example:
Let's type "ř" (LATIN SMALL LETTER R WITH CARON) in RGui console:

"ř"

[1] "r"

Although the script is in UTF-8, the characters are replaced by
"simplified" substitutes uncontrollably (depending on OS locale). The
same goes with simply entering the code statements in R Console.

The problem does not occur on OS with UTF-8 locale (Mac OS, Linux...)

I think this is a "feature" of win_iconv that is bundled with base R
on Windows (./src/extra/win_iconv). The character from your example is
not part of the latin1 (iso-8859-1) set; however, win_iconv seems to
convert it anyway:

x <- "\U00159"
print(x)

[1] "ř"

iconv(x, 'UTF-8', 'iso-8859-1')

[1] "r"

On macOS, iconv tells us this character cannot be represented as latin1:

x <- "\U00159"
print(x)

[1] "ř"

iconv(x, 'UTF-8', 'iso-8859-1')

[1] NA

I'm actually not sure why base-R needs win_iconv (but I'm not an
encoding expert at all). Perhaps we could try to unbundle it and use
the standard libiconv provided by the Rtools toolchain bundle to get
more consistent results.

win_iconv just calls into the Windows API to do the conversion. It is
technically easy to disable the "best fit" conversion, but I think it
won't be a good idea. In some cases, perhaps rare, the best fit is good,
actually including the conversion from "ř" to "r", which makes perfect
sense. But more importantly, changing the behavior could affect users
who expect the substitution to happen because it has been happening for
many years, and it won't help others much.

Tomas

______________________________________________
R-devel at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel