Printing Chinese characters (UTF-8) on R 3.5.2 - Windows 10

I have a Chinese character in a data frame, but printing it outputs its Unicode code point instead of the character. Concretely, the character is 會 and the code point is U+6703. Following the code, I arrive at the call
base::format.default("會")
which prints

[1] "<U+6703>"

I do not know how widespread this behaviour is, or whether it persists in more recent versions of R.

Is it expected?

Thank you!

Iago

> Is it expected?

If you are running this on Windows in an encoding where the character 
cannot be represented (e.g. non-Chinese locale), then yes, this is 
expected behavior.

On Unix systems where R can run in UTF-8 encoding (Linux, macOS), the 
character will be formatted/displayed properly.

Best
Tomas
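
To see which of the two situations a given session is in, a minimal sketch along these lines may help. It uses only base R; the names `x` and `is_utf8_session` are illustrative, and the character is written as a `\u` escape so that the script stays ASCII however the file is saved.

```r
# The character from the question (U+6703), written as an escape.
x <- "\u6703"

# Non-ASCII strings built from \u escapes are marked as UTF-8 internally.
Encoding(x)

# The LC_CTYPE locale category determines the native encoding of the session.
Sys.getlocale("LC_CTYPE")

# TRUE when the native encoding of this session is UTF-8.
is_utf8_session <- isTRUE(l10n_info()[["UTF-8"]])
is_utf8_session

# As reported above, format() gives back the character when the native
# encoding can represent it, and an escape such as "<U+6703>" otherwise.
format(x)
```

If `is_utf8_session` is FALSE and the locale is not a Chinese one, the escaped output is what the reply above describes as expected.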
______________________________________________
R-devel at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
But if I type
"會"
the output is

[1] "會"

so seemingly it can be represented. Or, am I wrong?

Best
Iago
From: Tomas Kalibera <tomas.kalibera at gmail.com>
Sent: Friday, 13 September 2019 11:24
To: IAGO GINÉ VÁZQUEZ <i.gine at pssjd.org>; r-devel at r-project.org <r-devel at r-project.org>
Subject: Re: [Rd] Printing chinese characters (UTF-8) on R 3.5.2 -windows 10

> If you are running this on Windows in an encoding where the character
> cannot be represented (e.g. non-Chinese locale), then yes, this is
> expected behavior.

In RGui you can print the string, because RGui is a Windows Unicode
application (it uses UTF-16LE and bypasses the C runtime for strings).
But that is just the GUI; R itself (and hence also packages) uses the
current native encoding as defined by the C runtime. RGui will make sure
R gets the string in UTF-8, but as soon as you do anything even slightly
non-trivial, which includes formatting, the string will be converted to
the current native encoding. Some R functions allow you to do certain
things in UTF-8 without conversion to the native encoding, but you'd
have to read the documentation for each function very carefully. For
practical use, you either need to live with the misinterpretation of
some characters, or use Windows in a locale where your characters can
be represented (e.g. a Chinese locale when working with Chinese
strings), or use Linux/macOS. On Linux/macOS the current native encoding
can be UTF-8, so there is no problem. On Windows, with the current
toolchain based on mingw, this is not possible.

Best
Tomas
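
The conversion to the native encoding described above can be observed directly. The following is a sketch under assumptions, not code from the original exchange: `is_representable` is a hypothetical helper name, built on base R's `enc2native()` and `enc2utf8()`, and `writeLines(..., useBytes = TRUE)` is shown as one example of an operation that can leave UTF-8 bytes untouched.

```r
# Hypothetical helper: TRUE for each string that survives a round trip
# through the native encoding unchanged.
is_representable <- function(s) {
  s <- enc2utf8(s)
  round_trip <- enc2utf8(enc2native(s))
  round_trip == s
}

x <- "\u6703"            # the character from the question
is_representable("abc")  # ASCII survives in any native encoding
is_representable(x)      # depends on the locale of the session

# With useBytes = TRUE the marked string is not re-encoded on output,
# so the UTF-8 bytes reach the file as they are.
tmp <- tempfile(fileext = ".txt")
writeLines(enc2utf8(x), tmp, useBytes = TRUE)
bytes <- readBin(tmp, "raw", n = 16L)
bytes  # e6 9c 83 followed by the line terminator
unlink(tmp)
```

Whether a given function converts to the native encoding still has to be checked in its documentation, as noted above; the file-writing step here is only one case where the conversion can be suppressed.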

On Fri, Sep 13, 2019 at 11:53 AM Tomas Kalibera <tomas.kalibera at gmail.com>
wrote:
> On Linux/macOS the current native encoding can be UTF-8, so there is
> no problem. On Windows, with the current toolchain based on mingw,
> this is not possible.

mingw-w64 is capable of processing utf-8 (it can process bytes after all).
Can you explain what you mean here? Would any other compiler on Windows not
suffer from this problem?

The problem is using UTF-8 as the current locale as understood by the C
runtime/C library. By default mingw uses msvcrt, which does not allow
UTF-8 as the current locale (via setlocale()). Recently mingw also
allows building with UCRT, and I hope one day we will be able to use
it, but it is not yet the default, msys2 does not use it yet for its
mingw_ packages, and we also need the external packages. Note that R
(CRAN, and also BIOC) provides binary versions of all packages for
Windows; they need to build them and they need all library dependencies.
All of those would have to be rebuilt with UCRT, which will be a huge
task. Fixing R on its own to support UTF-8 natively on Windows when the
C runtime allows it won't be hard, because R already can do it on Unix,
but the problem is all the dependencies.

Tomas
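
For readers who want to check what their own build reports, here is a small sketch. It rests on an assumption that goes beyond this thread: that newer Windows builds of R expose the C runtime as `R.version$crt`, with the field absent elsewhere. The function name `describe_runtime` is hypothetical.

```r
# Sketch: summarise the platform, the C runtime (if reported) and whether
# the native encoding of this session is UTF-8.
# Assumption: R.version$crt exists only on newer Windows builds of R and
# is NULL on other platforms and on older builds.
describe_runtime <- function() {
  crt <- R.version$crt
  list(
    os_type     = .Platform$OS.type,  # "windows" or "unix"
    c_runtime   = if (is.null(crt)) NA_character_ else crt,
    native_utf8 = isTRUE(l10n_info()[["UTF-8"]])
  )
}

info <- describe_runtime()
str(info)
```

On a build like the one discussed here (mingw with msvcrt), `native_utf8` would be expected to be FALSE on Windows, in line with the explanation above.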

On Fri, Sep 13, 2019 at 1:46 PM Tomas Kalibera <tomas.kalibera at gmail.com>
wrote:
> On 9/13/19 1:33 PM, Ray Donnelly wrote:
>> Would any other compiler on Windows not suffer from this problem?
>
> Fixing R on its own to support UTF-8 natively on Windows when the C
> runtime allows it won't be hard, because R already can do it on Unix,
> but the problem is all the dependencies.

Thanks. We build R for the Anaconda Distribution and are considering our
options around our Windows compilers, including the UCRT (and clang,
possibly from MSYS2, possibly from conda-forge, or a hybrid of some sort if
necessary).