[Bug report] Chinese characters are not handled correctly in Rterm for Windows

2 messages · Tomas Kalibera, Azure

#
Thank you for the report and initial debugging. I am not sure what is 
going wrong; we may have to rely on your help to debug this (I do not 
have a system to reproduce it on). As a workaround for users, I would 
advise using RGui (Rgui.exe).

Does the problem also exist in R-devel?
https://cran.r-project.org/bin/windows/base/rdevel.html

Your example, print("ABC\u4f60\u597dDEF"), is printing two Chinese 
characters, right? The first one is C4E3 in CP936 (4F60 in Unicode) and 
the second one is BAC3 in CP936 (597D in Unicode)? Could you reproduce 
the problem with printing just one of the characters, say 
print("ABC\u4f60DEF") ?
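
To make the expected bytes concrete, here is a minimal sketch in C that 
builds the CP936 byte sequence for these test strings so it can be 
compared with what the debugger shows. It knows only the two mappings 
quoted above (U+4F60 -> C4E3, U+597D -> BAC3); it is not a general 
converter, and the function names are hypothetical, chosen for 
illustration only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Only the two mappings quoted in this thread; NOT a general CP936 table. */
static unsigned known_cp936(unsigned codepoint)
{
    switch (codepoint) {
    case 0x4F60: return 0xC4E3; /* first character of the example */
    case 0x597D: return 0xBAC3; /* second character of the example */
    default:     return 0;      /* not covered by this sketch */
    }
}

/*
 * Encode code points into out. ASCII passes through unchanged; the two
 * known characters become lead byte + trail byte. Returns the number of
 * bytes written, or 0 on an unknown code point or insufficient space.
 */
size_t expected_cp936(const unsigned *cps, size_t n,
                      unsigned char *out, size_t cap)
{
    size_t used = 0;
    assert(cps != NULL && out != NULL);
    for (size_t i = 0; i < n; i++) {
        if (cps[i] < 0x80) {
            if (used + 1 > cap) return 0;
            out[used++] = (unsigned char)cps[i];
        } else {
            unsigned dbcs = known_cp936(cps[i]);
            if (dbcs == 0 || used + 2 > cap) return 0;
            out[used++] = (unsigned char)(dbcs >> 8);   /* lead byte */
            out[used++] = (unsigned char)(dbcs & 0xFF); /* trail byte */
        }
    }
    return used;
}

/* Print bytes as space-separated uppercase hex, followed by a newline. */
void print_hex(FILE *stream, const unsigned char *bytes, size_t n)
{
    for (size_t i = 0; i < n; i++)
        fprintf(stream, "%s%02X", i ? " " : "", bytes[i]);
    fputc('\n', stream);
}
```

For the single-character string, passing the code points 'A', 'B', 'C', 
0x4F60, 'D', 'E', 'F' gives 8 bytes: 41 42 43 C4 E3 44 45 46.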

As a sanity check - does this display the correct characters in RGui? It 
should, and does on my system, as RGui uses Unicode internally. By 
correct I mean the characters shown e.g. here

https://msdn.microsoft.com/en-us/library/cc194923.aspx
https://msdn.microsoft.com/en-us/library/cc194920.aspx

What is the output of "chcp" in the terminal, before you run R.exe? It 
may be different from what Sys.getlocale() gives in R.

If you take the sequence of "fputc" calls you captured in the 
debugger and create a trivial console application that just runs them, 
would the characters display correctly in the same terminal from which 
you run R.exe?
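
A minimal sketch of the replay part of such a test program, in C. The 
byte sequence below is an assumption: it is the CP936 encoding of the 
example string followed by a newline, not the actual sequence captured 
in the debugger, which should be substituted. The names SAMPLE_BYTES, 
SAMPLE_LEN and replay_fputc are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * ASSUMED sequence: "ABC", C4 E3 (U+4F60), BA C3 (U+597D), "DEF", newline.
 * Replace with the bytes actually observed in the debugger.
 */
const unsigned char SAMPLE_BYTES[] = {
    'A', 'B', 'C', 0xC4, 0xE3, 0xBA, 0xC3, 'D', 'E', 'F', '\n'
};
const size_t SAMPLE_LEN = sizeof SAMPLE_BYTES;

/*
 * Write the bytes one at a time with fputc, mirroring the captured calls.
 * Stops at the first failure and returns the number of bytes written.
 */
size_t replay_fputc(FILE *out, const unsigned char *bytes, size_t n)
{
    size_t written = 0;
    assert(out != NULL && bytes != NULL);
    while (written < n && fputc(bytes[written], out) != EOF)
        written++;
    fflush(out); /* make sure nothing is left in the stdio buffer */
    return written;
}
```

Calling replay_fputc(stdout, SAMPLE_BYTES, SAMPLE_LEN) from main and 
running the program in the same terminal, with the same "chcp" setting, 
would show whether byte-at-a-time output through the C runtime displays 
correctly outside of R.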

Thanks
Tomas
On 03/08/2018 06:54 PM, Azure wrote:

  
  
23 days later
#
Hi Tomas,

Sorry for the delayed response. I have tested the problem on the latest R-devel build (2018-04-27 r74651), and it still exists. RGui is always fine with Chinese characters, but some IDEs rely on the CLI version of R (e.g. Visual Studio Code with R plugin).
Yes. The characters are U+4F60 and U+597D, i.e. C4E3 and BAC3 in CP936.
Yes. The console output is pasted in [ https://paste.ubuntu.com/p/TYgZWhdgXK/ ] (to avoid gibberish in e-mail).
The Active Code Page is 936 before and after running Rterm.
Yes.
Yes. I created a Win32 Console Application in VS [ https://paste.ubuntu.com/p/h3NFV6nQvs/ ], and all the characters were displayed correctly with both variants. The WriteConsoleA variant uses the current console code page setting, so it should behave like fputc.

I guess that Rterm uses its own console I/O mechanism, so the second argument of fputc is not the stdout handle. (I tried to read the source but was unable to figure out how it works.) The crash in mbcs_get_next, which was also mentioned in the previous post, may be related to this mechanism.

If you need further information, please let me know.

Thanks,
i at azurefx.name


Tomas Kalibera <tomas.kalibera at gmail.com> 2018/4/5 22:42