font encoding issue

3 messages · Denis Chabot, Martin Maechler, Simon Urbanek

#
Hi,

I like to comment my programs, and often I do so in French.
But accented vowels in an R editor window get screwed up in the R 
console, so that

# exemple à suivre

becomes

 > # exemple à suivre

(other French accents become just as ugly).

I suppose this is a font encoding error. Can it be fixed, or is there 
something in R itself which prevents it from even displaying such 
characters?


Sincerely,

Denis Chabot
#
    Denis> Hi, I like to comment my programs, and often I do so
    Denis> in French.  But accented vowels in a R editor window
    Denis> get screwed up in the R console, so that

    Denis> # exemple ? suivre

    Denis> becomes

    >> # exemple ?? suivre

(and this has been changed again by passing through the mail systems)

    Denis> (other french accents become just as ugly).

    Denis> I suppose this is a font encoding error. Can it be
    Denis> fixed, or is there something in R itself which
    Denis> prevents it from even displaying such characters?

No, it's not "R itself", since this works in quite a few other
circumstances in other R consoles: You should be able to use
accents even in strings and plot them, see 
 > example(text)
in R, and even in R object names :

E.g. (in a Linux console):

 > Züri <- "exemple à suivre" # à suivre
 > Züri
 [1] "exemple à suivre"
 >

---------
However, it is (as you write) an encoding issue, and it
probably also depends quite a bit on your so-called
"locale" settings.  To learn more, e.g., see in R

 > apropos("locale")
 [1] "Sys.getlocale"  "Sys.localeconv" "Sys.setlocale"
       
and now, e.g.,
 > ?Sys.getlocale

Here (where I mainly work with Redhat Enterprise Linux) I've
explicitly turned off the Unicode locale (UTF-8 to be specific)
and reverted to "C" (or "POSIX"), by at least setting  'LANG=C'
or 'LANG=POSIX' instead of something like 'LANG=en_US.UTF-8'.

Look at what
  > Sys.getenv("LANG")
tells you, and consider
  > Sys.putenv(LANG = "C")
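
The calls mentioned above can be combined into a short session. This is a minimal sketch, not part of the original message. It assumes a later R version, where `Sys.setenv()` replaced `Sys.putenv()`, and it adds an explicit `Sys.setlocale()` call on the assumption that changing `LANG` alone does not alter the locale already in effect for a running session.

```r
# Inspect the current locale and the LANG environment variable.
current_locale <- Sys.getlocale()
current_lang <- Sys.getenv("LANG")
print(current_locale)
print(current_lang)

# Set LANG for this R process and any child processes it starts.
# Sys.setenv() is the current name; older R versions used Sys.putenv().
Sys.setenv(LANG = "C")

# Assumption: the running session keeps its startup locale, so the
# character-type category is switched explicitly as well.
Sys.setlocale("LC_CTYPE", "C")
ctype <- Sys.getlocale("LC_CTYPE")
print(ctype)
```

To make the setting apply from startup, `LANG` would normally be set in the shell before R is launched.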


Martin Maechler, ETH Zurich
#
On Nov 23, 2004, at 9:17 PM, Denis Chabot wrote:

            
It's a bug and a feature of the R GUI ;).
Internally, R GUI uses UTF-8 encoding for text handling, including the 
editor. The idea was to have a localized GUI with support for any 
language and UTF-8 is the natively supported format in Cocoa. To make 
the mess even bigger, there was a bug in the GUI that converted the 
UTF-8 to a vanilla C string at one point, thus resulting in the wrong 
behavior you spotted.
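
The kind of distortion described here can be reproduced in R itself. The following is only an illustration of the general effect (UTF-8 bytes read back as a single-byte encoding), not the GUI's actual code. Latin-1 is an assumed choice for the single-byte encoding, and the variable names are hypothetical.

```r
# The two bytes that encode "à" (U+00E0) in UTF-8.
utf8_bytes <- as.raw(c(0xc3, 0xa0))

# Treat those bytes as a plain C string and (wrongly) declare it Latin-1.
misread <- rawToChar(utf8_bytes)
Encoding(misread) <- "latin1"

# Converting back to UTF-8 now yields two characters instead of one:
# U+00C3 ("Ã") followed by U+00A0 (a non-breaking space).
garbled <- enc2utf8(misread)
code_points <- utf8ToInt(garbled)
print(code_points)
```

Each byte of the original character is promoted to a character of its own, which is why one accented vowel shows up as two symbols in the console.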

I have now fixed that latter bug, so your comments should appear 
undistorted:

 > # exemple à suivre

If this is all you need, get tonight's nightly build.
However, using UTF-8 in strings in R is not that easy. Even if all you 
want is to retain the UTF-8 contents (i.e. tell R to not worry about 
the encoding and just print back what it gets), the actual problem is 
that R escapes certain characters, regardless of the locale:

 > Sys.getlocale()
[1] "en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/C"
 > "Müll"
[1] "M\303\274ll"

This means that the don't-worry-concept doesn't work. The latest info 
on encodings and UTF-8 I could find was for 1.8.1, but I suspect that 
nothing has changed since: basically R has no UTF-8 support and there 
will be none unless someone with enough time, energy and skill takes up 
the task.
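
The escaped output above can be decoded by hand: `\303` and `\274` are octal escapes for the bytes 0xC3 and 0xBC, which together form the UTF-8 encoding of "ü". A small sketch, assuming a later R version that provides `Encoding()` and `utf8ToInt()` (neither is mentioned in the original message):

```r
# The string exactly as R printed it, using octal escapes.
escaped <- "M\303\274ll"

# Five bytes in total: "M", the two bytes of "ü", and "ll".
bytes <- charToRaw(escaped)
print(bytes)

# Declaring the bytes as UTF-8 recovers four characters,
# the second being U+00FC (decimal 252).
Encoding(escaped) <- "UTF-8"
code_points <- utf8ToInt(escaped)
print(code_points)
```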

The bottom line is that I'll try to fix the GUI so that it will use 
the locale-specific encoding in its internal representation and for 
all communication with R. The drawback will be that users on systems 
with different locales won't be able to use each other's files 
transparently. Still, this should fix things for users of simpler 
encodings (such as Latin1), but for more general support of UTF-8 or 
other multi-byte encodings we will have to wait until there is a 
global solution in R.
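
The drawback for users with different locales comes down to the same text having different bytes in each encoding. A hedged sketch of converting between the two with `iconv()`, which exists in R but is not mentioned in the message above; the sample text and variable names are illustrative only.

```r
# "exemple à" as stored under a Latin-1 locale: "à" is the single byte 0xE0.
latin1_bytes <- as.raw(c(0x65, 0x78, 0x65, 0x6d, 0x70, 0x6c, 0x65, 0x20, 0xe0))
latin1_text <- rawToChar(latin1_bytes)
Encoding(latin1_text) <- "latin1"

# Convert to UTF-8: the accented letter becomes the two bytes 0xC3 0xA0.
utf8_text <- iconv(latin1_text, from = "latin1", to = "UTF-8")
utf8_bytes <- charToRaw(utf8_text)
print(utf8_bytes)
```

A file written under one encoding would need a conversion of this kind before it displays correctly under the other.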

Cheers,
Simon