On 12/17/20 6:43 PM, joris at jorisgoosen.nl wrote:
On Thu, 17 Dec 2020 at 18:22, Tomas Kalibera <tomas.kalibera at gmail.com>
wrote:
On 12/17/20 5:17 PM, joris at jorisgoosen.nl wrote:
On Thu, 17 Dec 2020 at 10:46, Tomas Kalibera <tomas.kalibera at gmail.com>
wrote:
On 12/16/20 11:07 PM, joris at jorisgoosen.nl wrote:
David,
Thanks for the response!
So the problem is a bit worse than just setting `encoding="UTF-8"` on
functions like readLines.
I'll describe our setup a bit:
So we run R embedded in a separate executable and through a whole lot of
C(++) magic get that to the main executable that runs the actual
interface.
All the code that isn't R basically uses UTF-8. This works well, and I've
made sure that all of our source code is encoded properly and I've
verified that for this particular problem at least my source file is
definitely encoded in UTF-8 (I've checked a hexdump).
The simplest solution, which we initially took, to get R+Windows to
cooperate with everything is to simply set the locale to "C" before
starting R. That way R simply assumes UTF-8 is native and everything works
splendidly. Until, of course, a file needs to be opened in R that contains
some non-ASCII characters. I noticed the problem because a Korean user had
hangul in his username and that broke everything. This was because R was
trying to convert to a different locale than Windows was using.
Setting locale to "C" does not make R assume UTF-8 is the native
encoding, there is no way to make UTF-8 the current native encoding in R
on the current builds of R on Windows. This is an old limitation of
Windows, only recently fixed by Microsoft in recent Windows 10 and with
UCRT Windows runtime (see my blog post [1] for more - to make R support
this we need a new toolchain to build R).
If you set the locale to the C encoding, you are telling R the native
encoding is C/POSIX (essentially ASCII), not UTF-8. Encoding-sensitive
operations, including conversions (among them conversions that happen
without user control, e.g. for interacting with Windows), will produce
incorrect results (garbage) or, in better cases, errors, warnings, or
omitted, substituted or transliterated characters.
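For illustration (a minimal sketch; the exact result depends on the
platform's iconv, and may be NA or substituted characters):

    Sys.setlocale("LC_CTYPE", "C")
    x <- "Math\u00f4t"        # "Mathôt", stored in R as UTF-8
    iconv(x, "UTF-8", "")     # convert to the current native (C/POSIX)
                              # encoding: fails, ô is not representable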
In principle, setting the encoding via locale is dangerous on Windows,
because Windows has two current encodings, not just one. By setting the
locale you set the one used in the C runtime, but not the other one used
by the system calls. If all code (in R, packages, external libraries)
were perfect, this would still work as long as all strings used were
representable in both encodings. For other strings it won't work, and
the code is not perfect in this regard anyway: it is usually written
assuming there is one current encoding, which common sense dictates
should be the case. With the recent UTF-8 support ([1]), one can switch
both of these to UTF-8.
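For instance, both can be inspected from R (a sketch; the
"system.codepage" entry is an assumption, as it only exists in builds of R
that distinguish the two encodings):

    l10n_info()               # on Windows reports e.g. $codepage (the C
                              # runtime encoding) and, in such builds,
                              # $system.codepage (the system encoding)
    Sys.getlocale("LC_CTYPE")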
Well, this is exactly why I want to get rid of the situation. But this
messes up the output because everything else expects UTF-8 which is why I'm
looking for some kind of solution.
The solution I've now been working on is:
I took the source code of R 4.0.3 and changed the backend of "gettext" to
add an `encoding="something something"` option. And a bit of extra
functionality like `bind_textdomain_codeset` in case I need to tweak the
codeset that gettext uses.
I think I've got that working properly now and once I solve the problem
with the encoding in a pkg I will open a bug report/feature request and
I'll attach a patch that implements it.
A number of similar "shortcuts" have been added to R in the past, but
they make the code more complex, harder to maintain and use, and can't
realistically solve all of these problems anyway. Strings will
eventually be assumed to be in the current native encoding: by the C
library, in R, in any external code R uses, or in code R packages use.
Now that Microsoft is finally supporting UTF-8, the way to get out of
this is switching to UTF-8. This needs only small changes to R source
code compared to those "shortcuts" (or to using UTF-16LE). I'd be
against polluting the code with any more "shortcuts".
I think the addition of "bind_textdomain_codeset" is not strictly
necessary and can be left out, because I think setting an environment
variable like "OUTPUT_CHARSET=UTF-8" gives the same result for us.
The addition of the "encoding" option to the internal "do_gettext" is
just a few lines of code and I also undid some duplication between
do_gettext and do_ngettext, which should make it easier to maintain. But
all of that is moot if there is no way to keep the literal strings from
the sources in UTF-8 anyhow.
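As a sketch of what I mean (assuming GNU gettext, which honors
OUTPUT_CHARSET; the message and domain are just placeholders):

    Sys.setenv(OUTPUT_CHARSET = "UTF-8")   # ask gettext for UTF-8 output
    gettext("some message", domain = "R-base")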
Before starting on this I did actually read your blogpost about UTF-8
several times and it seems like the best way forward. Not to mention it
would make my life easier and me happier when I can stop worrying about
Windows/Dos codepages!
Thank you for your work on it indeed!
But my problem with that is that a number of people still use an older
version of Windows and your solution won't work there. Which would mean
that we either drop support for them or they would have to live with
weird-looking translations. Or I have to go back to the suboptimal
solution of the "C" locale, which I really do want to avoid because, as
you said, it breaks other stuff in unpredictable ways.
The number of people using a too-old version of Windows should be small
by the time this could become ready for production. Windows 8.1 is still
supported, but there is the free upgrade to Windows 10 (also from the no
longer supported Windows 7), so this should not be a problem for desktop
machines. It will be a problem for servers.
Well, I would not expect anyone to use a GUI-heavy application meant for
researchers on a server anyway so that would be fine.
The problem I'm stuck with now is simply this:
I have an R pkg here that I want to test the translations with and the
source is definitely saved as UTF-8, the package has "Encoding: UTF-8" in
the DESCRIPTION and it all loads and works. The particular problem I have
is that the R code contains literally: `mathotString <- "Mathôt!"`
The actual file contains the hexadecimal representation of ô as proper
UTF-8: "0xC3 0xB4", but R turns it into: "0xF4".
Seemingly on loading the package, because I haven't done anything with it
except put it in my debug C function to print its contents as
hexadecimals...
The only thing I want to achieve here is that when R loads the package it
keeps those strings in their original UTF-8 encoding, without converting
to "native" or the strange Unicode code point it seemingly placed in
memory instead. Because otherwise I cannot get gettext to work fully in
UTF-8.
Is this already possible in R?
In principle, working with strings not representable in the current
encoding is not reliable (and never will be). It can still work in some
specific cases and uses. Parsing a UTF-8 string literal from a file,
with correctly declared encoding as documented in WRE, should work at
least in single-byte encodings. But what happens after that string is
parsed is another thing. The parsing is based internally on using these
"shortcuts", that is lying to a part of the parser about the encoding,
and telling the rest of the parser that it is really something else (not
native, but UTF-8).
So the reason the string literals are turned into the local encoding is
because setting the "Encoding" on a package is essentially a hack?
String literals may be turned into the local encoding because that is how
R/packages/external software is written - it needs native encoding. Hacks
here come in when such code is given a string not in the local encoding,
assuming that under some conditions such code will work. This includes a
part of the parser and a hack to implement the argument "encoding" of
"parse()", which makes it possible to parse (non-representable) UTF-8
strings when running in a single-byte locale such as Latin-1 (see ?parse).
So the same `parse` function is used for loading a package?
Parsing for usual packages is done at build time, when they are
serialized ("prepared for lazy loading"). I would have to look for the
details in the code, but either way, if the input is in UTF-8 but the
native encoding is different, either the input has to be converted to the
native encoding for the parser, or that hack is used where part of the
parser is being lied to about the encoding (either via "parse()" or some
other way). If you have a minimal reproducible example, I can help you
find out whether the behavior seen is expected/documented/a bug.
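A minimal sketch of such an example (hypothetical, using your string from
above; to be run in a single-byte non-UTF-8 locale on Windows) might be:

    txt <- 'mathotString <- "Math\u00f4t!"'
    exprs <- parse(text = txt, encoding = "UTF-8", keep.source = FALSE)
    eval(exprs[[1]])
    Encoding(mathotString)    # the declared encoding of the literal
    charToRaw(mathotString)   # the bytes actually stored
    l10n_info()               # the current native encoding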
Because in that case I wonder if the "Encoding" option in "DESCRIPTION" is
handled the same as `encoding=` in parse.
?parse states:
Character strings in the result will have a declared encoding if
encoding is "latin1" or "UTF-8", or if text is supplied with every
element of known encoding in a Latin-1 or UTF-8 locale.
The sentence is a bit hard for me personally to parse but I interpret that
first part to mean that if "encoding" is specified as "UTF-8", all the
character strings in the result will also have that encoding.
Is that a correct interpretation?
Because if so, I do believe I found a problem and I will try to make a
minimal reproducible example.
Please look first at this part of "?parse":
"encoding: encoding to be assumed for input strings. If the value is
'"latin1"' or '"UTF-8"' it is used to mark character strings as known to
be in Latin-1 or UTF-8: it is not used to re-encode the input. To do the
latter, specify the encoding as part of the connection 'con' or _via_
'options(encoding=)': see the example under 'file'. Arguments 'encoding =
"latin1"' and 'encoding = "UTF-8"' are ignored with a warning when running
in a MBCS locale."
Together with the one you cite:
"Character strings in the result will have a declared encoding if
'encoding' is '"latin1"' or '"UTF-8"', or if 'text' is supplied with every
element of known encoding in a Latin-1 or UTF-8 locale."
There are two things: which encoding strings are really encoded in, and
which encoding they are declared to be in. Normally this should always be
the same encoding (UTF-8, Latin-1, or the concrete known native encoding),
but the "encoding=" argument allows one to play with this. Strings
declared to be in "native" encoding are for a while treated as being in an
unknown (single-byte) encoding, and eventually they are declared to be of
the encoding from the "encoding=" argument. This only applies to strings
declared as "native". When strings are declared as UTF-8 or Latin-1, they
must be in that encoding, and are believed to be in it; the "encoding="
argument does not affect those.
So, when your inputs are declared as UTF-8, the "encoding=" hack should
not apply to them. Also note that ASCII strings are never declared to be
UTF-8 or Latin-1; they are always marked as "native" (and ASCII is assumed
to be a subset of all encodings). But your inputs probably are not
declared to be in UTF-8 (note this is "declared" wrt the Encoding() R
function, the encoding flag that character objects in R have), because you
are probably parsing from a file. I'd really need a reproducible example
to be able to explain what you are seeing.
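To illustrate the declaration rules (a sketch; assuming a single-byte
non-UTF-8 locale):

    Encoding("abc")           # "unknown": ASCII is never declared
    x <- "Math\u00f4t"
    Encoding(x)               # "UTF-8": a \u escape forces a UTF-8 literal
    y <- iconv(x, "UTF-8", "latin1")
    Encoding(y)               # "latin1"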
Best
Tomas
The part that is being "lied to" may get confused or
not. It would not when the real native encoding is, say, Latin-1, a common
case in the past for which the hack was created, but it might when it is
a double-byte encoding that conflicts with the text being parsed in
dangerous ways. This is also why this hack only makes sense for string
literals (and comments), and still only to a limit, as the strings may be
misinterpreted later, after parsing.
Well, our case is entirely limited to string literals that are presented
to the user through an all-UTF-8 interface.
So I would assume none of the edge cases would come into play.
Any system paths and things like that would still be in the local encoding.
So a really short summary is: you can only reliably use strings
representable in the current encoding in R, and that encoding cannot be
UTF-8 on Windows in released versions of R. There is an experimental
version, see [1]; if you could experiment with that, see whether it
might work for your applications, and try to find and report bugs
there (e.g. to me directly), that would be useful.
So when I read in certain R documentation that strings can have a "UTF-8"
encoding in R, this is not true?
As in, when I read documentation such as
https://stat.ethz.ch/R-manual/R-devel/library/base/html/Encoding.html it
really seems to indicate to me that UTF-8 is in fact supported in R on
Windows.
My assumption was that R uses `translateChar` internally to make sure a
string is in the right encoding before interfacing with the OS and other
places where this might matter.
UTF-8 is supported in R on Windows in many ways, as documented. As long
as you are using UTF-8 strings representable in the current encoding, so
that they can be converted to native encoding and back without problems,
you are fine, R will do the conversions as needed. The troubles come when
such conversion is not possible. In the example of the parser, without the
"encoding=" argument to "parse()", the parser will just work on any text
you give to it, even when the text is in UTF-8: it will work by first
converting to native encoding and then doing the parsing, no hacks
involved. When interacting with external software, you'd just tell R to
provide the strings in the encoding needed by that external software, so
possibly UTF-8, so possibly convert, but all would work fine. The problem
is with characters not representable in the native encoding.
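For example (a sketch, assuming a Latin-1 native encoding; results differ
in other locales):

    iconv("caf\u00e9", "UTF-8", "latin1")  # "café" is representable: works
    iconv("\ud55c", "UTF-8", "latin1")     # Korean hangul: NA, not
                                           # representable in Latin-1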
Exactly, I want to be able to support Chinese etc. as well while running
in a West-European locale.
This is also what misled me, because I thought it was actually reading it
like that, but the character is part of my local locale so I didn't notice
it. Especially as it was being printed correctly. I only noticed after
printing the literal values.
If you find behavior re encodings in released versions of R that
contradicts the current documentation, please report it with a minimal
reproducible example; such cases should be fixed (even though sometimes
the "fix" would be just changing the documentation; the effort now should
really go into supporting UTF-8 for real). Specifically with
"mathotString", you might try creating an example that does not involve
any package (just calls to parse with encoding options set), only then
gradually adding more of the package loading if that does not reproduce
it. It would be important to know the current encoding (sessionInfo,
l10n_info).
Well, the reason I mailed the mailing list was because I couldn't for the
life of me find any documentation that told me anything in particular about
how literal strings are supposed to be stored in memory. But it just seems
logical to me that if R already supports parsing and loading a package
encoded with UTF-8 and it supports having UTF-8 strings in memory next to
strings in native encoding, the most straightforward way of loading these
literal strings would be in UTF-8.
You mean the memory representation? For that there would be R Internals
and the sources; essentially there are CHARSXP objects which include an
encoding tag (UTF-8, Latin-1 or native) and the raw bytes. But you would
not access these objects directly; instead use translateChar() if you
need strings in native encoding or translateCharUTF8() if in UTF-8,
and this is documented in Writing R Extensions.
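At the R level, rough counterparts of those C functions (a sketch, not
the C API itself) are enc2native() and enc2utf8():

    x <- "Math\u00f4t"
    Encoding(x)          # "UTF-8": the encoding tag on the CHARSXP
    y <- enc2native(x)   # roughly what translateChar() does in C
    z <- enc2utf8(y)     # roughly what translateCharUTF8() does in C
    charToRaw(x)         # the raw bytes stored for the string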
Exactly; because gettext operates in C and the source files for that are
also in UTF-8, the actual memory representation of the string in R needs
to be identical, otherwise it won't work.
I think it would be really good if you could provide a complete, minimal
reproducible example of your problem. It may be that there is some
misunderstanding; especially if you are working with characters
representable in the current encoding, there should be no problem.
It depends on whether I now understand ?parse correctly, i.e. whether the
strings in a package that is parsed with a specified encoding should end
up in that encoding or not, as I wondered above.
I would love to use the new version of R that supports properly
interfacing with Windows 10.
And given that the only other supported version of Windows is 8.1 and
barely anyone uses it, it might be worth dropping support for that.
I just hoped I could find a workable solution without such a step.
I understand, also it may take a bit of time before this would become
stable.
Of course.
Hopefully I can still use my current workaround for the time being and
then switch over to the UTF-8-ready version if it becomes production-ready
at some point.
Cheers,
Joris
Best
On Wed, 16 Dec 2020 at 20:15, David Bosak <dbosak01 at gmail.com> wrote:
Joris:
I've fought with encoding problems on Windows a lot. Here are some
general suggestions.
1. Put "@encoding UTF-8" on any Roxygen comments.
2. Put 'encoding = "UTF-8"' on any functions like writeLines or
readLines that read/write to a text file (see the sketch below).
3. This post:
If you have a more specific problem, please describe and we can try to
help.
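For instance, for suggestion 2 (a minimal sketch; the file names are
hypothetical, and it is the connection's encoding that re-encodes on
write):

    # read a text file, marking the strings as UTF-8
    lines <- readLines("messages.txt", encoding = "UTF-8")
    # write through a connection that encodes the output as UTF-8
    writeLines(lines, con = file("messages-copy.txt", encoding = "UTF-8"))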
David
*From: *joris at jorisgoosen.nl
*Sent: *Wednesday, December 16, 2020 1:52 PM
*To: *r-package-devel at r-project.org
*Subject: *[R-pkg-devel] Package Encoding and Literal Strings
Hello All,
Some context: I am one of the programmers of a software pkg
(https://jasp-stats.org/) that uses an embedded instance of R to do
statistics, and make that a bit easier for people who are intimidated by
it or like to have something more GUI-oriented.
We have been working on translating the interface but ran into several
problems related to encoding of strings. We prefer to use UTF-8 for
everything and this works wonderfully on Unix systems, as is to be
expected. Windows however is a different matter. Currently I am working on
some changes to "do_gettext" and some related internal functions of R to
get UTF-8 encoded output from there.
But I ran into a bit of a problem and I think this mailing list is
the best place to start.
It seems that if I have an R package that specifies "Encoding: UTF-8" in
its DESCRIPTION, the literal strings inside the package are converted to
the local codeset/codepage regardless of what I want.
Is it possible to keep the strings in UTF-8 internally in such a pkg
somehow?
Best regards,
Joris Goosen
University of Amsterdam