Which character coding for source under Windows? - R-help

Fri, Jan 9, 2004 1:12 AM #

I know that R can cope with the different line-ending conventions (carriage
return and/or line feed: the Unix, Windows, or Mac convention), which is very
nice. However, it is not clear to me which character encoding is used:
ASCII, ANSI, other? There is not much difference between ANSI and DOS
encoding, for instance, for the first 128 characters. But they are very
different for the rest.

Best,

Philippe Grosjean

.......................................................<?}))><....
 ) ) ) ) )
( ( ( ( (   Prof. Philippe Grosjean
\  ___   )
 \/ECO\ (   Numerical Ecology of Aquatic Systems
 /\___/  )  Mons-Hainaut University, Pentagone
/ ___  /(   8, Av. du Champ de Mars, 7000 Mons, Belgium
 /NUM\/  )
 \___/\ (   phone: + 32.65.37.34.97, fax: + 32.65.37.33.12
       \ )  email: Philippe.Grosjean at umh.ac.be
 ) ) ) ) )  SciViews project coordinator (http://www.sciviews.org)
( ( ( ( (
...................................................................

Brian Ripley

Fri, Jan 9, 2004 1:55 AM #

Unless you change it, no encoding is used.  That is, characters are just
treated as 8-bit numbers (as they are in all C programs).  Encodings are
only relevant if you want to display a character (or type at a keyboard),
and in general R assumes that you have set your fonts and keyboard to a 
single consistent encoding (which Petr Pikal had not).

You can re-encode on input (see ?connections) and on output where there is
an encoding step (see ?postscript).  So if you have Mac files you can
re-encode them transparently on read.  What you can't do is re-encode
text files on output, mainly because there is no way to mark such files
as encoded.

On Fri, 9 Jan 2004, Philippe Grosjean wrote:

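> There is not much difference between ANSI and DOS encoding, for
> instance, for the first 128 characters.

I don't believe there is a single `DOS' encoding, rather a whole series of
codepages.  And ASCII is a 7-bit encoding.  There are various wide
encodings out there too.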

Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595

Philippe GROSJEAN

Fri, Jan 9, 2004 3:21 AM #

OK, with this information and some experimenting, it appears that the ANSI
encoding is used by default under Windows for source(), sink(), etc.
That is, if I understand correctly:

- source(), which uses parse(file = ), assumes nothing, because it just
reads bytes and the S language uses only characters among the first 128,
which are the same in ANSI or DOS encoding.
- sink() is consistent with this behaviour *under RGUI* and uses ANSI, as
does the default encoding for connections: getOption("encoding") ==
0:255 assumes the same as sink() does.

Now, my problem comes with Rterm, as it is a console program that uses DOS
encoding under Windows. So, with Rterm, there is a "translation" of the ANSI
characters sourced from a text file into DOS characters (for instance, those
in a cat(".....") instruction), and the reverse with sink(). Was this
inconsistent behaviour between Rgui and Rterm decided on purpose for some
reason? Or is it just a consequence of the inconsistency between windowed
programs (Rgui) and command-line programs (Rterm) under Windows?

Anyway, how could I use characters beyond the 128th position in a
character string with source(), sink(), cat(), etc., and get the same
behaviour in Rgui and Rterm? Also, I suppose I would have problems with
such characters under Unix/Linux and MacOS, which would interpret them
differently?

Best,

Philippe Grosjean

Brian Ripley

Fri, Jan 9, 2004 3:46 AM #

As I said, Rterm/Rgui do no encoding.  If you use cat or sink, the exact 
numeric char you used is written out.  Maybe if you *display* it you see 
something different, but I have already explained that.

Unless you do octal/hex dumps on files you will be confused by display 
encodings.
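
The point about hex dumps can be checked from within R itself. The
following is a minimal sketch, assuming a current version of R (the
temporary file and the sample bytes are made up for illustration): it
writes a known byte sequence, including one byte above 127, and reads it
back as raw values, so that no display encoding is involved.

```r
# Write five known bytes to a temporary file in binary mode:
# "c", "a", "f", the single byte 0xE9, and a line feed.
tmp <- tempfile(fileext = ".txt")
con <- file(tmp, open = "wb")
writeBin(as.raw(c(0x63, 0x61, 0x66, 0xE9, 0x0A)), con)
close(con)

# Read the file back as raw bytes (a hex dump), not as characters.
bytes <- readBin(tmp, what = "raw", n = file.info(tmp)$size)
print(bytes)   # [1] 63 61 66 e9 0a
unlink(tmp)
```

The byte 0xE9 is stored unchanged. Whether it is then shown as an accented
letter or as something else depends only on the encoding assumed by the
program that displays it.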

On Fri, 9 Jan 2004, Philippe Grosjean wrote:

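> it appears that the ANSI encoding is used by default under Windows for
> source(), sink(), etc...

No, it is the native encoding.  There is no `ANSI' encoding, but your
machine is probably set up to use WinANSI (not ANSI).

> the S language uses only characters among the first 128 ones, which
> are the same in ANSI or DOS encoding.

Not true: S can use 8-bit characters.

> how could I use characters encoded over the 128th position in a
> character string with source(), sink(), cat(), etc... and get the same
> behaviour between Rgui and Rterm?

You *do* get the same behaviour.  If you do example(text) you get the same
chars in RGui and Rterm, even if

options(pager="console")
help(text)

displays them differently.  That is nothing to do with Rterm, though.

> I suppose I would have problems with such characters in Unix/Linux and
> MacOS, which would interpret them differently?

And if you want to transfer files from Windows to another OS, you have to
tell R on that OS what encoding you used.  That is all.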
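
As an illustration of "telling R what encoding you used", here is a
hedged sketch for current versions of R, using iconv(), which was not part
of R when this thread was written. The sample bytes are hypothetical, and
"latin1" is used on the assumption that the file came from a Western
European Windows setup: for the byte 0xE9, Latin-1 and WinANSI
(Windows-1252) agree.

```r
# Bytes as they might sit in a file written on Windows:
# "caf" followed by 0xE9 (e-acute in Latin-1 and Windows-1252).
win_bytes <- as.raw(c(0x63, 0x61, 0x66, 0xE9))
x <- rawToChar(win_bytes)

# Declare the source encoding explicitly and convert to UTF-8.
utf8 <- iconv(x, from = "latin1", to = "UTF-8")

# The single byte 0xE9 becomes the two-byte UTF-8 sequence c3 a9.
print(charToRaw(utf8))   # [1] 63 61 66 c3 a9
```

A connection opened with file(path, encoding = "latin1") applies the same
kind of conversion while reading, converting to the session's native
encoding rather than to an explicitly named target.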
