UTF-8 and .Rd files

15 messages · Göran Broström, Hin-Tak Leung, Peter Dalgaard +4 more

#
I have been converting to utf8 from latin1, and this gives me
problems, some solved, but here is one unsolved: In my .Rd files, I
have included '\encoding{UTF-8}' at the top. Despite this, the HTML
help pages contain 'content="text/html; charset=iso-8859-1"', and my
name is mangled. What can I do about this?
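
One plausible reading of this symptom, sketched in Python rather than with R's own tools: the file holds UTF-8 bytes, but the page header tells the browser to decode them as iso-8859-1, so each two-byte sequence shows up as two wrong characters. The name and variable names below are illustrative only.

```python
# A name as it would be stored in a UTF-8 encoded .Rd file.
name = "Göran Broström"
utf8_bytes = name.encode("utf-8")

# A browser told "charset=iso-8859-1" decodes every byte as one character,
# so the two UTF-8 bytes of each "ö" become two separate characters.
mangled = utf8_bytes.decode("iso-8859-1")
print(mangled)

# Decoding with the encoding that was really used restores the name.
restored = utf8_bytes.decode("utf-8")
print(restored)
```

If the declared charset matched the bytes on disk, the second decode is what the reader would see.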

I'm on Ubuntu (latest), R-2.3.1

Thanks,

Göran
#
On Tue, 27 Jun 2006, Göran Broström wrote:

            
Reproducible example, please!  (I've just tried this and it works for me.)

As described in my talk at UseR 2006, you may well not want to do this if 
you intend to distribute the package.  Your name contains characters that 
are not in the fonts used in UTF-8 in non-European locales, and Windows 
users do not have ready access to UTF-8 viewers (even if they know the 
files are UTF-8).

  
    
#
On 6/27/06, Prof Brian Ripley <ripley at stats.ox.ac.uk> wrote:
Thanks for your answer! So this means that 'latin1' does not cause
problems for non-European locales and Windows users, I take it.

I really only need non-ascii to write the name of the author (me)
correctly. I tried LaTeX code ({\"o}), but that didn't work. Is there
a way around this?

Göran

  
    
#
Göran Broström wrote:
The \"o character in my latin1 (iso 8859-1) man page says it is 0xF6
  F6 - LATIN SMALL LETTER O WITH DIAERESIS
The capital version is
  D6 -  LATIN CAPITAL LETTER O WITH DIAERESIS

In HTML I think you need to write &#xF6; (or &ouml;) or something like 
that for the character to appear?
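
As a small check of the code point and the HTML forms, here is a Python sketch (not part of any R tool). It assumes only the standard `html` module; the variable names are made up for illustration.

```python
import html

# "ö" is code point 0xF6 (246 decimal), and latin1 stores it as that single byte.
code_point = ord("ö")
latin1_byte = "ö".encode("latin-1")

# HTML numeric character references: hexadecimal needs the "x" prefix.
hex_ref = f"&#x{code_point:X};"
dec_ref = f"&#{code_point};"
named_ref = "&ouml;"

# All three references decode to the same character.
decoded = [html.unescape(ref) for ref in (hex_ref, dec_ref, named_ref)]
print(hex_ref, dec_ref, named_ref, decoded)
```

Any of the three references is pure ASCII in the HTML source, so it survives whatever charset the page declares.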

HTH

HTL
#
Hello, Göran:

	  Have you considered the German solution:  "Goeran"?  (e.g., Wuertz 
for Würtz)?

	  Be thankful that you aren't Russian or Greek or Arabic or Chinese, 
etc., for which there may be no standard transliteration into the Latin 
alphabet.

	  Sorry I can't be more helpful.

	  Spencer Graves
p.s.  When I'm with native Spanish speakers who don't know English, I 
pronounce my name very differently, like "Espencer Gra-ve", to match how 
they would pronounce my name when they see it written.  Similarly, I 
once heard a French Canadian talk about his young son, Guillaume.  If 
you ask him in English, "What's your name?" he replies, "Bill".  If you 
ask the same question in French, he replies, "Guillaume".
Hin-Tak Leung wrote:
#
Spencer Graves wrote:
Well, I have to live with that, being of one of the above mentioned 
categories. Where it is important to have my own name in native form
in documents, I keep around a png, a eps with postscript type 1
font embedded, and a pdf from the eps for the odd pdflatex occasions.

It is going to be very hack-ish, but I wonder if it is possible to
utilise the fact that latex comments (%) are not the same as html 
comments (<!-- -->) and vice versa, to make things work.

I seem to recall something in the R-extension manual about being able 
to do \alternatives{latex stuff}{ascii stuff} for alternatives
which are destined to appear in different converted output types.
(Prof Ripley at this point would probably tell me the exact page
number and references...)

Hin-Tak
#
We describe how to use \enc for possible transliterations for exactly this 
purpose in the `Writing R Extensions' manual.
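
A minimal Rd fragment along these lines might look as follows. It is a sketch only: the transliteration chosen is hypothetical, the rest of the help file is omitted, and the file must really be saved in the encoding it declares.

```
% Declare the encoding the file is actually saved in.
\encoding{latin1}
% First argument: the encoded text; second: an ASCII transliteration
% (hypothetical here) for output formats that cannot show the first.
\author{\enc{Göran Broström}{Goran Brostrom}}
```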

In answer to Göran's question, yes latin1 is safer than UTF-8 for HTML 
browsers but neither guarantees that a font used e.g. in a Russian 
locale contains a glyph for ö.
On Tue, 27 Jun 2006, Spencer Graves wrote:

            

  
    
#
On 6/27/06, Spencer Graves <spencer.graves at pdf.com> wrote:
Yes, but really not; I like your p.s. solution better!
Good idea! I call myself "George" in English, "Yuri" in Russian,
"Goran" on Balkan, etc.

Seriously, I thought that unicode and utf8 would make problems like
this disappear, but obviously we may have to wait another 30 years.

Thanks for all the input.

George

  
    
#
"Göran Broström" <goran.brostrom at gmail.com> writes:
Well, I do tend to think that we should just use utf, assuming that
people have the relevant glyphs. If they don't, then they might get
little hollow rectangles but so what? (This entails stamping out the
use of iso-8859-? which I think I have previously pointed out as the
historical mistake. Easier said than done, though, especially since
8859-1, er, -15 managed to get established as a de facto standard
in a couple of key places like HTTP and NNTP.)

Transliterations are really abominable and completely ambiguous, e.g.
oe means o-umlaut in Swedish and German, but o-slash in Danish and
Norwegian, and we already have at least two interpretations of "roer"
where oe represents two distinct vowels...

        piotr
#
I've been following this thread hoping for the definitive answer...
Peter Dalgaard wrote:
....
My problem is that I put an ? in a reference in an Rd file, and now my 
builds fail on some of my systems. I can switch which systems work and 
which are broken, but I can not get it to work on all systems. I have 
spent way too much time trying to figure out what is wrong. So, wrt "so 
what", I need to choose between checking my packages on all the 
different systems I use, or having an ? in the Rd file. I think my 
problem is more complicated than having the relevant glyphs. I suspect 
it has to do with having the same locale on all systems doing NFS 
mounts, or on my cvs server, or something strange like that.

Paul
====================================================================================

La version française suit le texte anglais.

------------------------------------------------------------------------------------

This email may contain privileged and/or confidential inform...{{dropped}}
#
Paul Gilbert <pgilbert at bank-banque-canada.ca> writes:
Just to clarify, one thing is what I feel should be the longer term
strategy, another is what the R build tools can currently do...

Did you follow the advice to declare your input encoding with
\encoding and use \enc to provide a transliteration?
#
Hi, Paul:

	  Earlier in this thread, Göran Broström wrote, "I really only need 
non-ascii to write the name of the author (me) correctly."

	  The standard advice I got from a similar thread some time ago is to 
use the 'vanilla' Latin alphabet for key words, file and function names, 
etc., and restrict the use of other characters to documentation where 
the consequences of problems are not so severe.  I, too, would like to 
see all the accents, Arabic script, Chinese characters, etc., that other 
people want to use.  However, we must work with the world as it is, not 
as we would like it to be (while devoting some time where appropriate to 
making the world better, as everyone who contributes to the R Project 
does).

	  Best Wishes,
	  Spencer Graves
Paul Gilbert wrote:
#
On Wed, 28 Jun 2006, Peter Dalgaard wrote:

            
Unfortunately, they might get nothing visible at all, and they might also 
get something completely wrong (happens with the X11 server on my Windows 
laptop).  This is not an R problem but a question of the quality of 
implementation of UTF-8.  (Given the lack of UTF-8 fonts, I don't see the 
latter changing any time soon.)

My comments (at UseR and to Göran) are intended to make people aware just 
how badly things can go wrong: it is up to the users to decide if 
transliteration is worse than the chance of mangling.
It is necessary to do so.  I use a mixture of UTF-8 and latin1 locales on 
systems sharing a file system, and it all works for me: iconv does the 
charset translations transparently provided it knows what to do.
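
The "provided it knows what to do" condition can be illustrated with a rough Python analogue of a charset conversion. This is not R's iconv; the helper `transcode` is a hypothetical name used only for this sketch.

```python
def transcode(data: bytes, declared: str, target: str) -> bytes:
    """Convert bytes from the declared encoding to the target encoding."""
    return data.decode(declared).encode(target)

# Bytes as a latin1 system would have written them.
latin1_source = "Göran".encode("latin-1")

# With the source encoding declared, conversion to UTF-8 is mechanical.
as_utf8 = transcode(latin1_source, "latin-1", "utf-8")

# Without a correct declaration the same bytes are not valid UTF-8 at all.
try:
    latin1_source.decode("utf-8")
    wrong_guess_failed = False
except UnicodeDecodeError:
    wrong_guess_failed = True

print(as_utf8, wrong_guess_failed)
```

The conversion itself is trivial; what breaks builds across machines is a missing or wrong statement of which encoding the input is in.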
#
Prof Brian Ripley wrote:

            
It has been several months since I did this, but I thought I had 
followed all the instructions.
Ok,   I will try again sometime when I have a bit more time.

Thanks,
Paul
#
[Spencer Graves]
Granted and agreed.  Yet, R already does a little more than 
a few other programming languages in this area, and this is particularly 
sympathetic! :-)  One could hope and wish that R developers, within 
reasonable efforts, continue making R better and even make it take some 
lead in this area.  Not going fanatic about it of course, but at least, 
carefully avoiding any backward move in development, or changes that 
would be unfriendly to internationalisation of R.