Problem with UTF-8 text in the Rcmdr package

6 messages · Jaro.Lajovic, Brian Ripley, John Fox

#
Dear list members,

I've attached some email correspondence with Jaro Lajovic (with his permission), detailing a problem with the Slovenian translation file for the Rcmdr package. 

In brief, while certain UTF-8 characters used in Slovenian used to appear properly in older versions of R, some characters do not display properly in the Rcmdr menus and output window under R 2.7.x. I've confirmed the problem with the current version of the Rcmdr package (1.4-0) and R 2.7.2 under Windows Vista.

I've checked the R docs and NEWS file for changes to R, but wasn't able to turn up anything that seemed relevant. Frankly, however, my understanding of how various character sets are handled is only partial.

Any help would be appreciated.

John

------------------------------
John Fox, Professor
Department of Sociology
McMaster University
Hamilton, Ontario, Canada
web: socserv.mcmaster.ca/jfox


-----Original Message-----
From: Jaro.Lajovic [mailto:Jaro.Lajovic at mf.uni-lj.si] 
Sent: August-26-08 2:57 AM
To: John Fox
Subject: Re: Slovenian Rcmdr .po and .mo - and a problem

Dear John,
As for other translated R packages, I am afraid I am not aware of any. 
However, a quick test using cat with special characters:
cat "??????\n"
reveals that the string prints OK in the R (2.7.1.) console. The command 
line also shows OK in the Rcmdr Script window, but does not display 
right in the Output window. Special chars also fail in the Messages window.

Input (Script window) thus seems not to be affected, while the menu 
system and output do not work properly.

Thank you very much,
Jaro
#
The issue appears to be the Rcmdr output window and menus.  They are done 
using Tcl/Tk, not by R.  So this might be a problem in Tcl/Tk or the fonts 
it uses, or it might be a problem with what Rcmdr passes to the tcltk 
package.

We need the means to reproduce this (as per the posting guide):

- what OSes are affected?  Does this occur in a UTF-8 locale on Linux, for 
example?

- in what locales?

- what versions of Tcl/Tk?  Note that the version shipped with Windows R 
changed between 2.5.1 and 2.7.x.

- Is this anything to do with translations?  I've not looked at how 
translations are done in Rcmdr, but if gettext() is used, the string 
passed to R for output is in the native encoding, so 'UTF-8 characters' is 
incorrect.  It is possible that it is an iconv problem if the translations 
are supplied in UTF-8 and not Latin-2.

There are far too many layers involved here to guess at what is going on.
My guess is that it ought to be possible to give a simple example of a 
string which can be output to the Rcmdr console and will be rendered 
incorrectly (together with a screen shot of how it is rendered).

I think the characters referred to are the Unicode glyphs 's and z with 
caron', \u0161 and \u017E.  It seems that these will only be displayable 
in Rcmdr on Windows in a Latin-2 locale, which I do not have set up on 
Windows (but believe I could get installed).  However, examples using that 
(and the menus) seem to be correct in both sl_SI.iso88592 and sl_SI.utf8 
on Linux, which suggests that this is probably not an R issue but a Tcl/Tk 
one.
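
For anyone wanting to report the details asked for above, a minimal sketch along these lines may help. It assumes nothing beyond base R; the variable name `s` is purely illustrative, and the string is written with Unicode escapes so that the encoding of the script file itself does not matter.

```r
# Report the locale settings relevant to the questions above.
print(Sys.getlocale("LC_CTYPE"))   # current character-type locale
print(l10n_info())                 # does R see a UTF-8 / multibyte locale?

# s and z with caron, written as Unicode escapes.
s <- "\u0161\u017e"
print(Encoding(s))                 # how R has marked the string
print(charToRaw(enc2utf8(s)))      # its UTF-8 bytes
cat(s, "\n")                       # how the console renders it
```

Submitting the same `cat()` line from the Rcmdr Script window and comparing the Output window with the R console should show whether the rendering differs.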
On Fri, 5 Sep 2008, John Fox wrote:

            
Unfortunately, it is not 'detailed', and we do need the details.

  
    
#
Dear Brian,

Thank you for addressing the problem -- I was hoping that you would.
Jaro provides an example in one of his messages in my posting (though it is
slightly in error): If one enters 

cat("??????\n") 

in the Rcmdr Script window, the characters are rendered correctly. Executing
this command (via the Submit button) produces the following in the Output
window:
??????

which actually appears as
??

This is under Windows Vista / R 2.7.2 / Rcmdr 1.4-0.
I've now checked under Mac OS X and Linux Ubuntu, with the following
results:

Under Mac OS X 10.5.4 / R 2.7.2 / Rcmdr 1.4-0 / Tcl/Tk 8.4 

cat("??????\n") appears as cat("?????\n") in *both* the Script window and
the Output window.

Under Ubuntu Linux 8.04 / R 2.7.0 / Rcmdr 1.4-0/ Tcl/Tk 8.5

cat("??????\n") appears *correctly* in *both* the Script window and the
Output window.
I'm afraid that I don't know how to check this short of changing the locale
for my Windows machine. I do observe the problem in Windows when I start
Rgui with language=sl.

Yes, and please see above, but if the problem were with Tcl/Tk, why does
this work in the Script window under Windows and in both Script and Output
under Ubuntu?

Yes, the Rcmdr package uses gettext(). Could Jaro avoid the problem by using
Latin-2 in preference to UTF-8?

Indeed, please see above. I've also attached a screenshot under Windows,
having started R with language=sl.

I'm above my depth with respect to these issues, but I do find it curious
that under Windows the characters appear correctly in the Script window but
not the Output window.

I hope that the additional information in this message will supply at least
some of the necessary details.

Thank you for your help,
 John
-------------- next part --------------
A non-text attachment was scrubbed...
Name: screenshot.pdf
Type: application/pdf
Size: 58260 bytes
Desc: not available
URL: <https://stat.ethz.ch/pipermail/r-devel/attachments/20080907/0f9d413e/attachment.pdf>
#
Dear John,

 >> - Is this anything to do with translations?  I've not looked at how
 >> translations are done in Rcmdr, but if gettext() is used, the string
 >> passed to R for output is in the native encoding, so 'UTF-8
 >> characters' is incorrect.  It is possible that it is an iconv problem
 >> if the translations are supplied in UTF-8 and not Latin-2.
 >
 > Yes, the Rcmdr package uses gettext(). Could Jaro avoid the problem by
 > using Latin-2 in preference to UTF-8?

As mentioned, I am testing this under Windows XP (R 2.7.1).

Preparing the .mo file with the Latin-2 encoding (or Win-1250, for that 
matter) does not make any difference.

However, with the help of my son I have made a test, documented in the 
attached screenshot. It seems that the output routines expect Latin-2, 
but (as for the translation) get the native encoding.

Best regards,
Jaro



John Fox wrote:
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Rcmdr_czs_demo.png
Type: image/png
Size: 20628 bytes
Desc: not available
URL: <https://stat.ethz.ch/pipermail/r-devel/attachments/20080908/5d4b5341/attachment.png>
#
Unless Windows is running in CP1250 (the Slovenian encoding on Windows), 
this is not expected to work.  I believe John tested in CP1252, and it 
just so happens that those characters are in the same place in CP1250 and 
CP1252.

I get something different in CP1250, as pasting into the script window 
also does not work.  But if I use the Unicode escapes, the result is 
rendered correctly in the Output window.

I think Jaro has put his finger on this: Tcl/Tk output thinks it is in 
Latin-2 and not CP1250, and s and z caron have different positions in 
those two character sets.  Here is something I can reproduce easily: with 
XP set to Slovenian:
[1] "ČŠŽčšž"
[1] c8 8a 8e e8 9a 9e

which is correct for CP1250.  Now if I submit 'x' in the Rcmdr script 
window, I get the wrong output in the output window.
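
The difference between the two character sets can be checked directly. The following is only a sketch, assuming a platform whose iconv accepts the names "CP1250" and "ISO-8859-2" (see iconvlist()); the variable names are illustrative. It builds the six letters from Unicode escapes and compares the bytes each encoding assigns to them.

```r
# C, S, Z with caron (upper and lower case), built from Unicode escapes
# and forced to UTF-8 so the 'from' argument below is accurate.
x <- enc2utf8("\u010c\u0160\u017d\u010d\u0161\u017e")

# Raw bytes of the same string in each single-byte encoding.
# The encoding names are an assumption; check iconvlist() on your system.
cp1250 <- iconv(x, from = "UTF-8", to = "CP1250", toRaw = TRUE)[[1]]
latin2 <- iconv(x, from = "UTF-8", to = "ISO-8859-2", toRaw = TRUE)[[1]]

print(cp1250)   # c8 8a 8e e8 9a 9e, matching the bytes shown above
print(latin2)   # c8 a9 ae e8 b9 be

# Positions where the encodings disagree: the S and Z carons (2, 3, 5, 6).
print(which(cp1250 != latin2))
```

C with caron sits at the same position in both encodings, so a component that reads CP1250 bytes as Latin-2 would garble only the s and z carons.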

And I've tracked that down to a bug in iconv (something we take from 
libiconv on Windows): it does think the native encoding is Latin-2, not 
CP1250.  I'll put a workaround in R-devel and R-patched shortly.  That has 
other potential ramifications that will take me longer to investigate, and 
the correct thing may be to fix iconv.
On Sun, 7 Sep 2008, John Fox wrote:

            

  
    
#
Dear Brian,
Yes, that's right: My locale is English (Canada), which uses CP1252.
Thank you very much for tracking this down. 

Recall that there is also apparently a problem under Mac OS X, where the
characters appeared incorrectly in both the Script and Output windows.

Regards,
 John