UTF-8 and .Rd files

15 messages · Göran Broström, Hin-Tak Leung, Peter Dalgaard +4 more

Göran Broström

Tue, Jun 27, 2006 2:30 AM #

I have been converting to utf8 from latin1, and this gives me
problems, some solved, but here is one unsolved: In my .Rd files, I
have included '\encoding{UTF-8}' at the top. Despite this, the HTML
help pages contain 'content="text/html; charset=iso-8859-1"', and my
name is mangled. What can I do about this?

I'm on Ubuntu (latest), R-2.3.1

Thanks,

Göran
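
The symptom described above is consistent with UTF-8 bytes being displayed under a declared charset of iso-8859-1. The following is a minimal Python sketch of that mismatch, for illustration only: it does not use R or the Rd converters, the variable names are made up, and the actual cause on the poster's system is not established in this thread.

```python
# Assumption: the page bytes are UTF-8, but the reader trusts the
# declared charset=iso-8859-1 and decodes them as Latin-1.
name = "G\u00f6ran Brostr\u00f6m"          # "Göran Broström"

utf8_bytes = name.encode("utf-8")          # bytes as stored in the file
as_shown = utf8_bytes.decode("latin-1")    # text a Latin-1 reader displays

# Each "ö" (one character) is shown as two characters, U+00C3 and U+00B6.
print(len(name), len(as_shown))
print(ascii(as_shown))

# The damage is reversible as long as the bytes themselves are untouched.
recovered = as_shown.encode("latin-1").decode("utf-8")
print(recovered == name)
```

The point of the sketch is that the file itself can be correct while the declared charset is wrong, which would match a page that says iso-8859-1 for an Rd file marked as UTF-8.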

Brian Ripley

Tue, Jun 27, 2006 3:11 AM #

On Tue, 27 Jun 2006, Göran Broström wrote:

Reproducible example, please!  (I've just tried this and it works for me.)

As described in my talk at UseR 2006, you may well not want to do this if 
you intend to distribute the package.  Your name contains characters that 
are not in the fonts used in UTF-8 in non-European locales, and Windows 
users do not have ready access to UTF-8 viewers (even if they know the 
files are UTF-8).

Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595

Göran Broström

Tue, Jun 27, 2006 5:46 AM #

On 6/27/06, Prof Brian Ripley <ripley at stats.ox.ac.uk> wrote:

Thanks for your answer! So this means that 'latin1' does not cause
problems for non-European locales and Windows users, I take it.

I really only need non-ascii to write the name of the author (me)
correctly. I tried LaTeX code ({\"o}), but that didn't work. Is there
a way around this?

Göran

Hin-Tak Leung

Tue, Jun 27, 2006 8:05 AM #

Göran Broström wrote:

My latin1 (ISO 8859-1) man page says the \"o character is 0xF6:
  F6 - LATIN SMALL LETTER O WITH DIAERESIS
The capital version is
  D6 -  LATIN CAPITAL LETTER O WITH DIAERESIS

In HTML I think you need to write &#xF6; (or &#246;) or something for that 
character to appear?

HTH

HTL
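
The byte values and HTML character references mentioned above can be checked with a short Python sketch. It uses only the standard library; the variable names are illustrative. Note that a hexadecimal numeric reference in HTML needs the `x` prefix (`&#xF6;`), while the decimal form is `&#246;`.

```python
import html
import unicodedata

# Byte values from the Latin-1 table quoted above.
small = bytes([0xF6]).decode("latin-1")
capital = bytes([0xD6]).decode("latin-1")

print(unicodedata.name(small))    # LATIN SMALL LETTER O WITH DIAERESIS
print(unicodedata.name(capital))  # LATIN CAPITAL LETTER O WITH DIAERESIS

# In Latin-1 the byte value equals the Unicode code point, so the same
# number can be used in an HTML numeric character reference.
references = ["&#xF6;", "&#246;", "&ouml;"]   # hex, decimal, named
decoded = [html.unescape(ref) for ref in references]
print(all(ch == small for ch in decoded))     # True
```

Because the references are plain ASCII, they display the same way whatever charset the page declares.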

Spencer Graves

Tue, Jun 27, 2006 11:01 AM #

Hello, Göran:

	  Have you considered the German solution:  "Goeran"?  (e.g., Wuertz 
for Würtz)?

	  Be thankful that you aren't Russian or Greek or Arabic or Chinese, 
etc., for which there may be no standard transliteration into the Latin 
alphabet.

	  Sorry I can't be more helpful.

	  Spencer Graves
p.s.  When I'm with native Spanish speakers who don't know English, I 
pronounce my name very differently, like "Espencer Gra-ve", to match how 
they would pronounce my name when they see it written.  Similarly, I 
once heard a French Canadian talk about his young son, Guillaume.  If 
you ask him in English, "What's your name?" he replies, "Bill".  If you 
ask the same question in French, he replies, "Guillaume".

______________________________________________
R-devel at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

Hin-Tak Leung

Tue, Jun 27, 2006 11:22 AM #

Spencer Graves wrote:

Well, I have to live with that, being in one of the above-mentioned 
categories. Where it is important to have my own name in native form
in documents, I keep around a PNG, an EPS with a PostScript Type 1
font embedded, and a PDF made from the EPS for the odd pdflatex occasions.

It is going to be very hack-ish, but I wonder if it is possible to
utilise the fact that latex comments (%) are not the same as html 
comments (<!-- -->) and vice versa, to make things work.

I seem to recall something in the R-extension manual about being able 
to do \alternatives{latex stuff}{ascii stuff} for alternatives
which are destined to appear in different converted output types.
(Prof Ripley at this point would probably tell me the exact page
number and references...)

Hin-Tak


Brian Ripley

Tue, Jun 27, 2006 11:22 AM #

We describe how to use \enc for possible transliterations for exactly this 
purpose in the `Writing R Extensions' manual.

In answer to Göran's question, yes latin1 is safer than UTF-8 for HTML 
browsers but neither is guaranteed to contain a glyph for ö in a font 
used e.g. in a Russian locale.
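
In Rd markup, \enc takes two arguments: the encoded form and an ASCII transliteration (the manual's own example is \enc{Jöreskog}{Joreskog}). The idea of a fallback spelling can be sketched in Python as below. This is only an analogy under stated assumptions: the helper `enc` is hypothetical, it is not how the R tools implement \enc, and it tests whether a target character encoding can represent the name, which is a different question from whether a font has the glyph.

```python
def enc(native: str, ascii_fallback: str, target_encoding: str) -> str:
    """Return the native spelling if the target encoding can represent it,
    otherwise the ASCII transliteration (hypothetical helper)."""
    try:
        native.encode(target_encoding)
        return native
    except UnicodeEncodeError:
        return ascii_fallback


# "G\u00f6ran" is "Göran"; ascii() keeps the printed output pure ASCII.
for target in ("utf-8", "latin-1", "koi8-r", "ascii"):
    chosen = enc("G\u00f6ran", "Goran", target)
    print(target, ascii(chosen))
```

Under these assumptions the native form survives for UTF-8 and Latin-1 targets, and the transliteration is used for KOI8-R (a Cyrillic encoding without ö) and for plain ASCII.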


Göran Broström

Tue, Jun 27, 2006 11:30 AM #

On 6/27/06, Spencer Graves <spencer.graves at pdf.com> wrote:

Yes, but really not; I like your p.s. solution better!

Good idea! I call myself "George" in English, "Yuri" in Russian,
"Goran" in the Balkans, etc.

Seriously, I thought that Unicode and UTF-8 would make problems like
this disappear, but obviously we may have to wait another 30 years.

Thanks for all the input.

George


Peter Dalgaard

Tue, Jun 27, 2006 2:11 PM #

"Göran Broström" <goran.brostrom at gmail.com> writes:

Well, I do tend to think that we should just use utf, assuming that
people have the relevant glyphs. If they don't, then they might get
little hollow rectangles but so what? (This entails stamping out the
use of iso-8859-? which I think I have previously pointed out as the
historical mistake. Easier said than done, though, especially since
8859-1, er, -15 managed to get established as a de facto standard
in a couple of key places like HTTP and NNTP.)

Transliterations are really abominable and completely ambiguous, e.g.
oe means o-umlaut in Swedish and German, but o-slash in Danish and
Norwegian, and we already have at least two interpretations of "roer"
where oe represents two distinct vowels...

        piotr

O__  ---- Peter Dalgaard             Øster Farimagsgade 5, Entr.B
  c/ /'_ --- Dept. of Biostatistics     PO Box 2099, 1014 Cph. K
 (*) \(*) -- University of Copenhagen   Denmark          Ph:  (+45) 35327918
~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk)                  FAX: (+45) 35327907

Paul Gilbert

Wed, Jun 28, 2006 1:39 PM #

I've been following this thread hoping for the definitive answer...

Peter Dalgaard wrote:

....

My problem is that I put an ? in a reference in an Rd file, and now my 
builds fail on some of my systems. I can switch which systems work and 
which are broken, but I can not get it to work on all systems. I have 
spent way too much time trying to figure out what is wrong. So, wrt "so 
what", I need to choose between checking my packages on all the 
different systems I use, or having an ? in the Rd file. I think my 
problem is more complicated than having the relevant glyphs. I suspect 
it has to do with having the same locale on all systems doing NFS 
mounts, or on my cvs server, or something strange like that.

Paul

Peter Dalgaard

Wed, Jun 28, 2006 2:20 PM #

Paul Gilbert <pgilbert at bank-banque-canada.ca> writes:

Just to clarify, one thing is what I feel should be the longer term
strategy, another is what the R build tools can currently do...

Did you follow the advice to declare your input encoding with
\encoding and use \enc to provide a transliteration?

O__  ---- Peter Dalgaard             Øster Farimagsgade 5, Entr.B
  c/ /'_ --- Dept. of Biostatistics     PO Box 2099, 1014 Cph. K
 (*) \(*) -- University of Copenhagen   Denmark          Ph:  (+45) 35327918
~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk)                  FAX: (+45) 35327907

Spencer Graves

Wed, Jun 28, 2006 5:13 PM #

Hi, Paul:

	  Earlier in this thread, Göran Broström wrote, "I really only need 
non-ascii to write the name of the author (me) correctly."

	  The standard advice I got from a similar thread some time ago is to 
use the 'vanilla' Latin alphabet for key words, file and function names, 
etc., and restrict the use of other characters to documentation where 
the consequences of problems are not so severe.  I, too, would like to 
see all the accents, Arabic script, Chinese characters, etc., that other 
people want to use.  However, we must work with the world as it is, not 
as we would like it to be (while devoting some time where appropriate to 
making the world better, as everyone who contributes to the R Project 
does).

	  Best Wishes,
	  Spencer Graves


Brian Ripley

Wed, Jun 28, 2006 11:54 PM #

On Wed, 28 Jun 2006, Peter Dalgaard wrote:

Unfortunately, they might get nothing visible at all, and they might also 
get something completely wrong (happens on my Windows' X11 server on my 
laptop).  This is not an R problem but a question of the quality of 
implementation of UTF-8.  (Given the lack of UTF-8 fonts, I don't see the 
latter changing any time soon.)

My comments (at UseR and to Göran) are intended to make people aware just 
how badly things can go wrong: it is up to the users to decide if 
transliteration is worse than the chance of mangling.

It is necessary to do so (that is, to declare the input encoding with 
\encoding).  I use a mixture of UTF-8 and latin1 locales on systems 
sharing a file system, and it all works for me: iconv does the charset 
translations transparently provided it knows what to do.
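
The remark that conversion works "provided it knows what to do" can be illustrated with a small Python sketch. This is not iconv and not the R build tools; `transcode` is a hypothetical helper that re-encodes bytes from a declared source encoding, and it shows what happens when the declaration is wrong.

```python
def transcode(data: bytes, source: str, target: str) -> bytes:
    """Re-encode bytes from a declared source encoding to a target one."""
    return data.decode(source).encode(target)


latin1_bytes = b"G\xf6ran"                      # "Göran" in Latin-1
utf8_bytes = transcode(latin1_bytes, "latin-1", "utf-8")
print(utf8_bytes)                               # b'G\xc3\xb6ran'

# The round trip is lossless when the declared encodings are right.
print(transcode(utf8_bytes, "utf-8", "latin-1") == latin1_bytes)  # True

# With the wrong declaration the conversion fails instead of guessing:
# the lone byte 0xF6 is not valid UTF-8.
try:
    transcode(latin1_bytes, "utf-8", "latin-1")
    wrong_declaration_failed = False
except UnicodeDecodeError:
    wrong_declaration_failed = True
print(wrong_declaration_failed)                 # True
```

A file without a correct encoding declaration therefore can convert cleanly on one system and fail on another, depending on what each system assumes.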


Paul Gilbert

Thu, Jun 29, 2006 8:09 AM #

Peter Dalgaard wrote:

Did you follow the advice to declare your input encoding with
\encoding and use \enc to provide a transliteration?

It has been several months since I did this, but I thought I had 
followed all the instructions.

Ok,   I will try again sometime when I have a bit more time.

Thanks,
Paul

François Pinard

Thu, Jun 29, 2006 1:33 PM #

[Spencer Graves]

Granted and agreed.  Yet, R already does a little more than 
a few other programming languages in this area, and this is particularly 
sympathetic! :-)  One could hope and wish that R developers, within 
reasonable efforts, continue making R better and even make it take some 
lead in this area.  Not going fanatic about it of course, but at least, 
carefully avoiding any backward move in development, or changes that 
would be unfriendly to internationalisation of R.

François Pinard   http://pinard.progiciels-bpi.ca