[R-pkg-devel] Package Encoding and Literal Strings
Hi Joris, thanks for the example.

You can actually simply have Test.R assign the two variables and then run:

Encoding(utf8StringsPkg1::mathotString)
charToRaw(utf8StringsPkg1::mathotString)
Encoding(utf8StringsPkg1::tao)
charToRaw(utf8StringsPkg1::tao)

I tried on Linux, Windows/UTF-8 (the experimental version) and Windows/latin-1 (the released version). In all cases, both strings are converted to native encoding. The mathotString is converted to latin-1 fine, because it is representable there. The tao string, when running in a latin-1 locale, gets the <xx> escapes: "<e9><99><b6><e5><be><b7><e5><ba><86>".

By the way, the parse(, encoding="UTF-8") hack works: when you parse the modified Test.R file (with the two assignments) and eval the output, you will get those strings in UTF-8. But when you don't eval and instead print the parse tree in Rgui, it will not be printed correctly (again a limitation of these hacks; they can only do so much).

When accessing strings from C, you should always be prepared for any encoding in a CHARSXP, so when you want UTF-8, use translateCharUTF8() instead of CHAR(). That will work fine on representable strings like mathotString, and it is conceptually the correct way to access them. Strings that cannot be represented in the native encoding, like tao, will get the escapes and so cannot be converted back to UTF-8. This is not great, but I see it was already the case in 3.6 (so not a recent regression), and I don't think it would be worth the time trying to fix it - as discussed earlier, only switching to UTF-8 would fix all of these translations, not just this one.

By the way, the example works fine on the experimental UTF-8 build on Windows. I am sorry there is not a simple fix for non-representable characters.

Best
Tomas
On 12/18/20 1:53 PM, joris at jorisgoosen.nl wrote:
Hello Tomas,

I have made a minimal example that demonstrates my problem:
https://github.com/JorisGoosen/utf8StringsPkg

This package is encoded in UTF-8, as is Test.R. There is a little Rcpp function in there I wrote that displays the bytes straight from R's CHAR to be sure no conversion is happening. I would expect that mathotString had "C3 B4" for "ô", but instead it gets "F4", as you can see when you run `utf8StringsPkg::testutf8_in_locale()`.

Cheers,
Joris

On Fri, 18 Dec 2020 at 11:48, Tomas Kalibera <tomas.kalibera at gmail.com> wrote:
On 12/17/20 6:43 PM, joris at jorisgoosen.nl wrote:
On Thu, 17 Dec 2020 at 18:22, Tomas Kalibera <tomas.kalibera at gmail.com> wrote:
On 12/17/20 5:17 PM, joris at jorisgoosen.nl wrote:
On Thu, 17 Dec 2020 at 10:46, Tomas Kalibera <tomas.kalibera at gmail.com> wrote:
On 12/16/20 11:07 PM, joris at jorisgoosen.nl wrote:
> David,
>
> Thanks for the response!
>
> So the problem is a bit worse than just setting `encoding="UTF-8"` on
> functions like readLines. I'll describe our setup a bit: we run R
> embedded in a separate executable and, through a whole bunch of C(++)
> magic, get that to the main executable that runs the actual interface.
> All the code that isn't R basically uses UTF-8. This works well, and
> we've made sure that all of our source code is encoded properly; I've
> verified that, for this particular problem at least, my source file is
> definitely encoded in UTF-8 (I've checked a hexdump).
>
> The simplest solution, which we initially took, to get R+Windows to
> cooperate with everything is to simply set the locale to "C" before
> starting R. That way R simply assumes UTF-8 is native, and everything
> worked splendidly - until, of course, a file needs to be opened in R
> that contains some non-ASCII characters. I noticed the problem because
> a Korean user had hangul in his username and that broke everything,
> because R was trying to convert to a different locale than Windows was
> using.
Setting the locale to "C" does not make R assume UTF-8 is the native encoding; there is no way to make UTF-8 the current native encoding in the current builds of R on Windows. This is an old limitation of Windows, only recently fixed by Microsoft in recent Windows 10 and with the UCRT Windows runtime (see my blog post [1] for more - to make R support this we need a new toolchain to build R).

If you set the locale to the C encoding, you are telling R the native encoding is C/POSIX (essentially ASCII), not UTF-8. Encoding-sensitive operations, including conversions - including those conversions that happen without user control, e.g. for interacting with Windows - will produce incorrect results (garbage) or, in the better case, errors, warnings, or omitted, substituted or transliterated characters.

In principle, setting the encoding via locale is dangerous on Windows, because Windows has two current encodings, not just one. By setting the locale you set the one used by the C runtime, but not the other one used by the system calls. If all code (in R, packages, external libraries) were perfect, this would still work as long as all strings used were representable in both encodings. For other strings it won't work, and code is not perfect in this regard anyway: it is usually written assuming there is one current encoding, which common sense dictates should be the case. With the recent UTF-8 support ([1]), one can switch both of these to UTF-8.
Well, this is exactly the situation I want to get rid of. But it messes up the output, because everything else expects UTF-8, which is why I'm looking for some kind of solution.
> The solution I've now been working on is: I took the source code of
> R 4.0.3 and changed the backend of "gettext" to add an
> `encoding="something something"` option, plus a bit of extra stuff
> like `bind_textdomain_codeset` in case I need to tweak the
> codeset/charset that gettext uses. I think I've got that working
> properly now, and once I solve the problem of the encoding in a pkg I
> will open a bug report/feature request and add a patch that
> implements it.
A number of similar "shortcuts" have been added to R in the past, but they make the code more complex, harder to maintain and use, and can't realistically solve all of these problems anyway. Strings will eventually be assumed to be in the current native encoding - by the C library, by R, by any external code R uses, or by code R packages use. Now that Microsoft is finally supporting UTF-8, the way to get out of this is switching to UTF-8. This needs only small changes to the R source code compared to those "shortcuts" (or to using UTF-16LE). I'd be against polluting the code with any more "shortcuts".
I think the addition of "bind_textdomain_codeset" is not strictly necessary and can be left out, because I think setting an environment variable such as "OUTPUT_CHARSET=UTF-8" gives the same result for us. The addition of the "encoding" option to the internal "do_gettext" is just a few lines of code, and I also undid some duplication between do_gettext and do_ngettext, which should make it easier to maintain. But all of that is moot if there is no way to keep the literal strings from sources in UTF-8 anyhow.

Before starting on this I did actually read your blog post about UTF-8 several times, and it seems like the best way forward. Not to mention it would make my life easier, and me happier, when I can stop worrying about Windows/DOS codepages! Thank you for your work on it indeed!
But my problem with that is that a number of people still use an older version of Windows, and your solution won't work there. That would mean we either drop support for them, or they have to live with weird-looking translations, or I go back to the suboptimal solution of the "C" locale, which I really do want to avoid, because, as you said, it breaks other stuff in unpredictable ways.
The number of people using a too-old version of Windows should be small by the time this could become ready for production. Windows 8.1 is still supported, but there is the free upgrade to Windows 10 (also from the no-longer-supported Windows 7), so this should not be a problem for desktop machines. It will be a problem for servers.
Well, I would not expect anyone to use a GUI-heavy application
meant for researchers on a server anyway so that would be fine.
> The problem I'm stuck with now is simply this: I have an R pkg here
> that I want to test the translations with, and the code is definitely
> saved as UTF-8, the package has "Encoding: UTF-8" in the DESCRIPTION,
> and it all loads and works. The particular problem I have is that the
> R code contains literally: `mathotString <- "Mathôt!"`
> The actual file contains the hexadecimal representation of ô as
> proper UTF-8, "0xC3 0xB4", but R turns it into "0xF4". Seemingly on
> loading the package, because I haven't done anything with it except
> put it in my debug C function to print its contents as hexadecimals...
>
> The only thing I want to achieve here is that when R loads the
> package, it keeps those strings in their original UTF-8 encoding,
> without converting them to "native" or the strange Unicode codepoint
> it seemingly placed in there instead. Because otherwise I cannot get
> gettext to work fully in UTF-8 mode.
>
> Is this already possible in R?
In principle, working with strings not representable in the current encoding is not reliable (and never will be). It can still work in some specific cases and uses. Parsing a UTF-8 string literal from a file, with a correctly declared encoding as documented in WRE, should work at least in single-byte encodings. But what happens after that string is parsed is another thing. The parsing is based internally on using these "shortcuts": lying to a part of the parser about the encoding, and telling the rest of the parser that it is really something else (not native, but UTF-8).
So the reason the string literals are turned into the local encoding is that setting the "Encoding" on a package is essentially a hack?
String literals may be turned into the local encoding because that is how R/packages/external software is written - it needs the native encoding. Hacks come in when such code is given a string not in the local encoding, assuming that under some conditions it will still work. This includes a part of the parser, and a hack implementing the "encoding" argument of "parse()", which allows parsing (non-representable) UTF-8 strings when running in a single-byte locale such as latin-1 (see ?parse).
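The hack just described can be sketched in a few lines (the string is hypothetical; as documented in ?parse, the encoding= argument only has an effect in a single-byte locale such as latin-1):

```r
# Parse UTF-8 source while running in a single-byte (e.g. latin-1) locale.
# encoding = "UTF-8" marks the resulting string literals as UTF-8 rather
# than having them translated to the native encoding.
src <- 'tao <- "\u9676\u5fb7\u5e86"'  # not representable in latin-1
exprs <- parse(text = src, encoding = "UTF-8")
eval(exprs[[1]])
Encoding(tao)   # "UTF-8" when the hack applies
charToRaw(tao)  # the original UTF-8 bytes, untranslated
```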
So the same `parse` function is used for loading a package?

Parsing for usual packages is done at build time, when they are serialized ("prepared for lazy loading"). I would have to look for the details in the code, but either way, if the input is in UTF-8 and the native encoding is different, either the input has to be converted to native encoding for the parser, or we have that hack where part of the parser is lied to about the encoding (via "parse()" or some other way). If you have a minimal reproducible example, I can help you find out whether the behavior seen is expected/documented/a bug.
Because in that case I wonder if the "Encoding" option in "DESCRIPTION" is handled the same as `encoding=` in parse. ?parse states:

> Character strings in the result will have a declared encoding if
> 'encoding' is "latin1" or "UTF-8", or if 'text' is supplied with
> every element of known encoding in a Latin-1 or UTF-8 locale.

The sentence is a bit hard for me personally to parse, but I interpret the first part to mean that if "encoding" is specified as "UTF-8", all the character strings in the result will also have that encoding. Is that a correct interpretation? Because if so, I do believe I found a problem, and I will try to make a minimal reproducible example.
Please look first at this part of "?parse":
"encoding: encoding to be assumed for input strings. If the value is "latin1" or "UTF-8" it is used to mark character strings as known to be in Latin-1 or UTF-8: it is not used to re-encode the input. To do the latter, specify the encoding as part of the connection 'con' or _via_ 'options(encoding=)': see the example under 'file'. Arguments 'encoding = "latin1"' and 'encoding = "UTF-8"' are ignored with a warning when running in a MBCS locale."

Together with the one you cite:

"Character strings in the result will have a declared encoding if 'encoding' is "latin1" or "UTF-8", or if 'text' is supplied with every element of known encoding in a Latin-1 or UTF-8 locale."
There are two things: which encoding strings are really encoded in, and which encoding they are declared to be in. Normally these should always be the same (UTF-8, latin-1, or the concrete known native encoding), but the "encoding=" argument allows playing with this. Strings declared to be in "native" encoding are for a while treated as being of (single-byte) unknown encoding, and eventually they are declared to be of the encoding from the "encoding=" argument. This only applies to strings declared as "native". When strings are declared as UTF-8 or latin-1, they must be in that encoding and are believed to be in it; the "encoding=" argument does not affect those.

So, when your inputs are declared as UTF-8, the "encoding=" hack should not apply to them. Also note that ASCII strings are never declared to be UTF-8 nor latin-1; they are always declared as "native" (and ASCII is assumed to be a subset of all encodings). But your inputs probably are not declared to be in UTF-8 (note this is "declared" with respect to the Encoding() R function, the encoding flag that character objects in R have), because you are probably parsing from a file. I'd really need a reproducible example to be able to explain what you are seeing.
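The declared-versus-actual distinction can be seen directly with base R (a sketch; the comments assume a UTF-8 locale):

```r
# Encoding() reads the per-string declaration (the CHARSXP encoding flag).
x <- "abc"        # pure ASCII: always declared "unknown" (native)
Encoding(x)       # "unknown"
y <- "\u00f4"     # "ô": declared "UTF-8" when parsed in a UTF-8 locale
Encoding(y)
z <- iconv(y, "UTF-8", "latin1")  # re-encode: bytes and flag change together
Encoding(z)       # "latin1"
charToRaw(z)      # f4 - the single latin-1 byte for "ô"
```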
Best
Tomas
The part that is being "lied to" may get confused or not. It would not when the real native encoding is, say, latin-1 - a common case in the past, for which the hack was created - but it might when it is a double-byte encoding that conflicts with the text being parsed in dangerous ways. This is also why this hack only makes sense for string literals (and comments), and still only to a limit, as the strings may be misinterpreted later, after parsing.
Well, our case is entirely limited to string literals that are presented to the user through an all-UTF-8 interface, so I would assume none of the edge cases would come into play. Any system paths and things like that would still be in the local encoding.
So a really short summary is: you can only reliably use strings representable in the current encoding in R, and that encoding cannot be UTF-8 on Windows in released versions of R. There is an experimental version, see [1]; if you could experiment with that, see whether it might work for your applications, and try to find and report bugs there (e.g. to me directly), that would be useful.
So when I read in certain R documentation that strings can have a "UTF-8" encoding in R, this is not true? As in, when I read documentation such as
https://stat.ethz.ch/R-manual/R-devel/library/base/html/Encoding.html
it really seems to indicate to me that UTF-8 is in fact supported in R on Windows. My assumption was that R uses `translateChar` internally to make sure a string is in the right encoding before interfacing with the OS and other places where this might matter.
UTF-8 is supported in R on Windows in many ways, as documented. As long as you are using UTF-8 strings representable in the current encoding, so that they can be converted to native encoding and back without problems, you are fine; R will do the conversions as needed. The trouble comes when such conversion is not possible. In the example of the parser, without the "encoding=" argument to "parse()", the parser will just work on any text you give it, even when the text is in UTF-8: it will work by first converting to native encoding and then doing the parsing, no hacks involved. When interacting with external software, you'd just tell R to provide the strings in the encoding needed by that external software, so possibly UTF-8, so possibly convert, but all would work fine. The problem is characters not representable in the native encoding.
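The representable/non-representable split can be demonstrated with iconv() (a sketch reusing the strings from this thread):

```r
# A character representable in latin-1 survives the round trip...
ok <- iconv("Math\u00f4t", from = "UTF-8", to = "latin1")
iconv(ok, from = "latin1", to = "UTF-8")  # back to "Mathôt" losslessly
# ...but Chinese characters have no latin-1 representation at all:
iconv("\u9676\u5fb7\u5e86", from = "UTF-8", to = "latin1")  # NA
```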
Exactly, I want to be able to support Chinese etc. as well while running in a West European locale. This is also what misled me: I thought it was actually reading it like that, but the character is part of my local locale, so I didn't notice. Especially as it was being printed correctly - I only noticed after printing the literal values.
If you find behavior regarding encodings in released versions of R that contradicts the current documentation, please report it with a minimal reproducible example; such cases should be fixed (even though sometimes the "fix" would be just changing the documentation - the effort really should now go into supporting UTF-8 for real). Specifically with "mathotString", you might try creating an example that does not include any package (just calls to parse with encoding options set), only then gradually adding more of the package loading if that does not reproduce it. It would be important to know the current encoding (sessionInfo, l10n_info).
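For reference, the locale details being asked for come from standard base R calls:

```r
# Include this output in an encoding-related bug report.
sessionInfo()    # R version, OS, locale strings, loaded packages
l10n_info()      # MBCS/UTF-8/Latin-1 flags; on Windows also the codepage
Sys.getlocale()  # the full LC_* settings
```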
Well, the reason I mailed the mailing list was that I couldn't for the life of me find any documentation that told me anything in particular about how literal strings are supposed to be stored in memory. But it just seems logical to me that if R already supports parsing and loading a package encoded in UTF-8, and it supports having UTF-8 strings in memory next to strings in native encoding, the most straightforward way of loading these literal strings would be in UTF-8.
You mean the memory representation? For that there is R Internals and the sources; essentially there are CHARSXP objects, which include an encoding tag (UTF-8, Latin-1 or native) and the raw bytes. But you would not access these objects directly; instead use translateChar() if you need strings in native encoding, or translateCharUTF8() if in UTF-8, and this is documented in Writing R Extensions.
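A minimal sketch of what this looks like from C (the function name is illustrative; it assumes the standard R API headers described in Writing R Extensions):

```c
#include <R.h>
#include <Rinternals.h>

/* .Call entry point: returns its first string element re-tagged as UTF-8.
 * CHAR() would hand back the raw CHARSXP bytes in whatever encoding they
 * happen to be in; translateCharUTF8() converts them to UTF-8 (a no-op
 * for strings already in UTF-8 or plain ASCII). */
SEXP as_utf8(SEXP x)
{
    const char *s = translateCharUTF8(STRING_ELT(x, 0));
    return ScalarString(mkCharCE(s, CE_UTF8));
}
```

As noted above, strings that are not representable in the native encoding may arrive with <xx> escapes, which translateCharUTF8() cannot undo.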
Exactly - because gettext operates in C, and the source files for that are also in UTF-8, the actual memory representation of the string in R needs to be identical, otherwise it won't work.
I think it would be really good if you could provide a complete, minimal reproducible example of your problem. It may be that there is some misunderstanding; especially if you are working with characters representable in the current encoding, there should be no problem.
It depends on whether I now understand ?parse correctly - whether strings in a package parsed with a specified encoding end up in that encoding or not, as I wondered above.
I would love to use the new version of R that supports properly interfacing with Windows 10. And given that the only other supported version of Windows is 8.1, and barely anyone uses it, it might be worth dropping support for it. I just hoped I could find a workable solution without such a step.
I understand, also it may take a bit of time before this
would become stable.
Of course.
Hopefully I can still use my current workaround for the time
being and then switch over to the UTF-8 ready version if it
becomes production-ready at some point.
Cheers,
Joris
Best
Tomas
Cheers,
Joris
Best,
Tomas
[1]
https://developer.r-project.org/Blog/public/2020/07/30/windows/utf-8-build-of-r-and-cran-packages/index.html
>
> Cheers,
> Joris
>
>
> On Wed, 16 Dec 2020 at 20:15, David Bosak <dbosak01 at gmail.com> wrote:
>
>> Joris:
>>
>>
>>
>> I've fought with encoding problems on Windows a lot. Here are some
>> general suggestions.
>>
>> 1. Put `@encoding UTF-8` on any Roxygen comments.
>> 2. Put `encoding = "UTF-8"` on any functions like writeLines or
>> readLines that read/write to a text file.
>> 3. This post:
>>
>>
>>
>>
>> If you have a more specific problem, please describe
and we can try to
>> help.
>>
>>
>>
>> David
>>
>>
>>
>> Sent from Mail for Windows 10
>>
>>
>>
>> *From: *joris at jorisgoosen.nl
>> *Sent: *Wednesday, December 16, 2020 1:52 PM
>> *To: *r-package-devel at r-project.org
>> *Subject: *[R-pkg-devel] Package Encoding and Literal Strings
>>
>>
>>
>> Hello All,
>>
>>
>>
>> Some context: I am one of the programmers of a software pkg
>> (https://jasp-stats.org/) that uses an embedded instance of R to do
>> statistics, and to make that a bit easier for people who are
>> intimidated by R or would like something more GUI-oriented.
>>
>>
>>
>>
>>
>> We have been working on translating the interface but ran into
>> several problems related to the encoding of strings. We prefer to
>> use UTF-8 for everything, and this works wonderfully on Unix
>> systems, as is to be expected.
>>
>>
>>
>> Windows however is a different matter. Currently I am
working on some local
>>
>> changes to "do_gettext" and some related internal
functions of R to be able
>>
>> to get UTF-8 encoded output from there.
>>
>>
>>
>> But I ran into a bit of a problem and I think this
mailinglist is probably
>>
>> the best place to start.
>>
>>
>>
>> It seems that if I have an R package that specifies
"Encoding: UTF-8" in
>>
>> DESCRIPTION the literal strings inside the package
are converted to the
>>
>> local codeset/codepage regardless of what I want.
>>
>>
>>
>> Is it possible to keep the strings in UTF-8
internally in such a pkg
>>
>> somehow?
>>
>>
>>
>> Best regards,
>>
>> Joris Goosen
>>
>> University of Amsterdam
>>
>>
>>
>> [[alternative HTML version deleted]]
>>
>>
>>
>> ______________________________________________
>>
>> R-package-devel at r-project.org mailing list
>>
>> https://stat.ethz.ch/mailman/listinfo/r-package-devel
>>
>>
>>