
[R-pkg-devel] Package Encoding and Literal Strings

Thu, Dec 17, 2020 9:22 AM

On 12/17/20 5:17 PM, joris at jorisgoosen.nl wrote:


On Thu, 17 Dec 2020 at 10:46, Tomas Kalibera <tomas.kalibera at gmail.com> wrote:

    On 12/16/20 11:07 PM, joris at jorisgoosen.nl wrote:

    > David,
    >
    > Thanks for the response!
    >
    > So the problem is a bit worse than just setting `encoding="UTF-8"` on
    > functions like readLines.
    > I'll describe our setup a bit:
    > So we run R embedded in a separate executable and through a whole bunch
    > of C(++) magic get that to the main executable that runs the actual
    > interface. All the code that isn't R basically uses UTF-8. This works
    > well and we've made sure that all of our source code is encoded properly,
    > and I've verified that for this particular problem at least my source
    > file is definitely encoded in UTF-8 (I've checked a hexdump).
    >
    > The simplest solution, that we initially took, to get R+Windows to
    > cooperate with everything is to simply set the locale to "C" before
    > starting R. That way R simply assumes UTF-8 is native and everything
    > worked splendidly. Until of course a file needs to be opened in R that
    > contains some non-ASCII characters. I noticed the problem because a
    > Korean user had hangul in his username and that broke everything. This
    > because R was trying to convert to a different locale than Windows was
    > using.

    Setting locale to "C" does not make R assume UTF-8 is the native
    encoding; there is no way to make UTF-8 the current native encoding in
    R on the current builds of R on Windows. This is an old limitation of
    Windows, only recently fixed by Microsoft in recent Windows 10 and
    with the UCRT Windows runtime (see my blog post [1] for more - to make
    R support this we need a new toolchain to build R).

    If you set the locale to C encoding, you are telling R the native
    encoding is C/POSIX (essentially ASCII), not UTF-8. Encoding-sensitive
    operations, including conversions, including those conversions that
    happen without user control e.g. for interacting with Windows, will
    produce incorrect results (garbage) or in the better case errors,
    warnings, omitted, substituted or transliterated characters.

    In principle setting the encoding via locale is dangerous on Windows,
    because Windows has two current encodings, not just one. By setting
    locale you set the one used in the C runtime, but not the other one
    used by the system calls. If all code (in R, packages, external
    libraries) was perfect, this would still work as long as all strings
    used were representable in both encodings. For other strings it won't
    work, and in practice code is not perfect in this regard: it is usually
    written assuming there is one current encoding, which common sense
    dictates should be the case. With the recent UTF-8 support ([1]), one
    can switch both of these to UTF-8.


Well, this is exactly why I want to get rid of the situation. But this 
messes up the output because everything else expects UTF-8 which is 
why I'm looking for some kind of solution.

    > The solution I've now been working on is:
    > I took the source code of R 4.0.3 and changed the backend of "gettext"
    > to add an `encoding="something something"` option. And a bit of extra
    > stuff like `bind_textdomain_codeset` in case I need to tweak the
    > codeset/charset that gettext uses.
    > I think I've got that working properly now and once I solve the problem
    > of the encoding in a pkg I will open a bug report/feature request and
    > I'll add a patch that implements it.

    A number of similar "shortcuts" have been added to R in the past, but
    they make the code more complex, harder to maintain and use, and can't
    realistically solve all of these problems, anyway. Strings will
    eventually be assumed to be in what is the current native encoding by
    the C library, in R, in any external code R uses, or in code R packages
    use. Now that Microsoft finally is supporting UTF-8, the way to get out
    of this is switching to UTF-8. This needs only small changes to R source
    code compared to those "shortcuts" (or to using UTF-16LE). I'd be
    against polluting the code with any more "shortcuts".


I think the addition of "bind_textdomain_codeset" is not strictly
necessary and can be left out, because I think setting an environment
variable as "OUTPUT_CHARSET=UTF-8" gives the same result for us.
The addition of the "encoding" option to the internal "do_gettext" is
just a few lines of code, and I also undid some duplication between
do_gettext and do_ngettext, which should make it easier to maintain.
But all of that is moot if there is no way to keep the literal strings
from sources in UTF-8 anyhow.

Before starting on this I did actually read your blogpost about UTF-8 
several times and it seems like the best way forward. Not to mention 
it would make my life easier and me happier when I can stop worrying 
about Windows/Dos codepages!
Thank you for your work on it indeed!

But my problem with that is that a number of people still use an older
version of Windows and your solution won't work there. Which would
mean that we either drop support for them or they would have to live
with weird-looking translations, or I have to go back to the
suboptimal solution of the "C" locale, which I really do want to avoid
because, as you said, it breaks other stuff in unpredictable ways.

The number of people using too old a version of Windows should be small
when this could become ready for production. Windows 8.1 is still
supported, but there is the free upgrade to Windows 10 (also from the no
longer supported Windows 7), so this should not be a problem for desktop
machines. It will be a problem for servers.

    > The problem I'm stuck with now is simply this:
    > I have an R pkg here that I want to test the translations with and the
    > code is definitely saved as UTF-8, the package has "Encoding: UTF-8" in
    > the DESCRIPTION and it all loads and works. The particular problem I
    > have is that the R code contains literally: `mathotString <- "Mathôt!"`
    > The actual file contains the hexadecimal representation of ô as proper
    > UTF-8: "0xC3 0xB4" but R turns it into: "0xf4".
    > Seemingly on loading the package, because I haven't done anything with
    > it except put it in my debug C function to print its contents as
    > hexadecimals...
    >
    > The only thing I want to achieve here is that when R loads the package
    > it keeps those strings in their original UTF-8 encoding, without
    > converting it to "native" or the strange Unicode code point it seemingly
    > placed in there instead. Because otherwise I cannot get gettext to work
    > fully in UTF-8 mode.
    >
    > Is this already possible in R?

    In principle, working with strings not representable in the current
    encoding is not reliable (and never will be). It can still work in some
    specific cases and uses. Parsing a UTF-8 string literal from a file,
    with correctly declared encoding as documented in WRE, should work at
    least in single-byte encodings. But what happens after that string is
    parsed is another thing. The parsing is based internally on using these
    "shortcuts", that is lying to a part of the parser about the encoding,
    and telling the rest of the parser that it is really something else
    (not native, but UTF-8).


So the reason the string literals are turned into the local encoding 
is because setting the "Encoding" on a package is essentially a hack?

String literals may be turned into local encoding because that is how
R/packages/external software is written - it needs native encoding.
Hacks here come when such code is given a string not in the local
encoding, assuming that under some conditions such code will work. This
includes a part of the parser and a hack to implement the "encoding"
argument of "parse()", which allows parsing (non-representable) UTF-8
strings when running in a single-byte locale such as Latin-1 (see ?parse).

UTF-8 is supported in R on Windows in many ways, as documented. As long 
as you are using UTF-8 strings representable in the current encoding, so 
that they can be converted to native encoding and back without problems, 
you are fine, R will do the conversions as needed. The troubles come 
when such conversion is not possible. In the example of the parser, 
without the "encoding=" argument to "parse()", the parser will just work 
on any text you give to it, even when the text is in UTF-8: it will work 
by first converting to native encoding and then doing the parsing, no 
hacks involved. When interacting with external software, you'd just tell
R to provide the strings in the encoding needed by that external
software, possibly UTF-8, which may require a conversion, but all would
work fine. The problem is characters not representable in the native
encoding.
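
To make "representable in the native encoding" concrete at the byte level, here is a small sketch in Python (not R, and not part of the original exchange). It assumes Latin-1 as a stand-in for a single-byte native encoding; the helper name `round_trips` and the hangul sample string are made up for illustration.

```python
def round_trips(text: str, native: str) -> bool:
    """Return True if text survives UTF-8 -> native -> UTF-8 unchanged."""
    utf8_bytes = text.encode("utf-8")
    try:
        native_bytes = utf8_bytes.decode("utf-8").encode(native)
    except UnicodeEncodeError:
        # At least one character has no representation in the native encoding.
        return False
    return native_bytes.decode(native).encode("utf-8") == utf8_bytes


# The string literal discussed in the thread ("\u00f4" is the letter o with circumflex).
s = "Math\u00f4t!"
print(s.encode("utf-8").hex(" "))    # the letter is the two bytes c3 b4 in UTF-8
print(s.encode("latin-1").hex(" "))  # and the single byte f4 in Latin-1

print(round_trips(s, "latin-1"))              # representable in Latin-1
print(round_trips("\ud55c\uae00", "latin-1"))  # hangul sample: not representable
```

The first string converts in both directions without loss, so conversions to and from the native encoding are harmless; the hangul string has no Latin-1 form at all, which is the case where trouble starts.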

You mean the memory representation? For that there would be R Internals
and the sources; essentially there are CHARSXP objects which include an
encoding tag (UTF-8, Latin-1 or native) and the raw bytes. But you would
not access these objects directly; instead use translateChar() if you
needed the strings in native encoding or translateCharUTF8() if in
UTF-8, and this is documented in Writing R Extensions.
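
As a rough mental model only, the idea of "raw bytes plus an encoding tag, read through a translating accessor" can be sketched in Python. This is not R's C implementation: the class `TaggedString`, the functions `translate_native` and `translate_utf8`, and the choice of cp1252 as the native encoding are all assumptions for illustration, and R's real translateChar() has its own handling of non-representable characters that this sketch does not reproduce.

```python
from dataclasses import dataclass

NATIVE = "cp1252"  # assumed native encoding, for illustration only


@dataclass(frozen=True)
class TaggedString:
    """Hypothetical stand-in for a string stored as raw bytes plus a tag."""
    raw: bytes
    tag: str  # "utf-8", "latin-1" or "native"

    def codec(self) -> str:
        # Resolve the tag to a concrete codec name.
        return NATIVE if self.tag == "native" else self.tag


def translate_utf8(s: TaggedString) -> bytes:
    """Return the bytes of s re-encoded as UTF-8."""
    return s.raw.decode(s.codec()).encode("utf-8")


def translate_native(s: TaggedString) -> bytes:
    """Return the bytes of s re-encoded in the assumed native encoding.

    Raises UnicodeEncodeError if a character is not representable there.
    """
    return s.raw.decode(s.codec()).encode(NATIVE)


literal = TaggedString(b"Math\xc3\xb4t!", "utf-8")
print(translate_utf8(literal).hex(" "))    # bytes as stored
print(translate_native(literal).hex(" "))  # same text, native bytes
```

The point of the model is that the same string yields different bytes depending on which accessor is asked for, so code that inspects bytes needs to be explicit about the encoding it requests.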

I think it would be really good if you could provide a complete, minimal 
reproducible example of your problem. It may be there is some 
misunderstanding, especially if you are working with characters 
representable in the current encoding, there should be no problem.

I understand, also it may take a bit of time before this would become 
stable.

Best
Tomas

Cheers,
Joris


    Best,
    Tomas

    [1]
    https://developer.r-project.org/Blog/public/2020/07/30/windows/utf-8-build-of-r-and-cran-packages/index.html

    >
    > Cheers,
    > Joris

    >
    >
    > On Wed, 16 Dec 2020 at 20:15, David Bosak <dbosak01 at gmail.com> wrote:

    >> Joris:
    >>
    >>
    >>
    >> I've fought with encoding problems on Windows a lot. Here are some
    >> general suggestions.
    >>
    >>     1. Put "@encoding UTF-8" on any Roxygen comments.
    >>     2. Put encoding = "UTF-8" on any functions like writeLines or
    >>     readLines that read/write to a text file.
    >>? ? ?3. This post:
    >> https://kevinushey.github.io/blog/2018/02/21/string-encoding-and-r/
    >>
    >>
    >>
    >> If you have a more specific problem, please describe and we can

    try to

    >> help.
    >>
    >>
    >>
    >> David
    >>
    >>
    >>
    >> *From: *joris at jorisgoosen.nl
    >> *Sent: *Wednesday, December 16, 2020 1:52 PM
    >> *To: *r-package-devel at r-project.org
    >> *Subject: *[R-pkg-devel] Package Encoding and Literal Strings
    >>
    >>
    >>
    >> Hello All,
    >>
    >>
    >>
    >> Some context, I am one of the programmers of a software pkg
    >> (https://jasp-stats.org/) that uses an embedded instance of R to do
    >> statistics. And make that a bit easier for people who are intimidated
    >> by R or like to have something more GUI oriented.
    >>
    >> We have been working on translating the interface but ran into several
    >> problems related to encoding of strings. We prefer to use UTF-8 for
    >> everything and this works wonderfully on unix systems, as is to be
    >> expected.
    >>
    >> Windows however is a different matter. Currently I am working on some
    >> local changes to "do_gettext" and some related internal functions of R
    >> to be able to get UTF-8 encoded output from there.
    >>
    >> But I ran into a bit of a problem and I think this mailing list is
    >> probably the best place to start.
    >>
    >> It seems that if I have an R package that specifies "Encoding: UTF-8"
    >> in DESCRIPTION the literal strings inside the package are converted to
    >> the local codeset/codepage regardless of what I want.
    >>
    >> Is it possible to keep the strings in UTF-8 internally in such a pkg
    >> somehow?
    >>
    >>
    >>
    >> Best regards,
    >>
    >> Joris Goosen
    >>
    >> University of Amsterdam
    >>
    >>
    >>
    >>         [[alternative HTML version deleted]]
    >>
    >> ______________________________________________
    >> R-package-devel at r-project.org mailing list
    >> https://stat.ethz.ch/mailman/listinfo/r-package-devel
    >>
    >>
    >>

