Hi, R Devils,
I'm running the current R version in JGR (version 1.5-8). Sys.getlocale(category = "LC_ALL") yields
[1] "LC_COLLATE=German_Germany.1252;LC_CTYPE=German_Germany.1252;LC_MONETARY=German_Germany.1252;LC_NUMERIC=C;LC_TIME=German_Germany.1252"
I want to write some HTML code, enriched with statistical results and with labels encoded in Latin-1, which I pass to a function. One of the labels is also used to build the filename. Although the labels are handled correctly in JGR, they are somehow converted when they are written to the file, and the filename is not constructed as intended either. The function definition is sourced into R without problems. The function is defined like this:
Itemtabelle.head <- function (abt ){
  # n?r z?m T?ST
  zz = file( paste("Itemtabelle/Itemtabelle", abt, ".html"), "wt", encoding = "UTF-8")
  cat(as.character("<html xmlns:o=\"urn:schemas-microsoft-com:office:office\" xmlns:x=\"urn:schemas-microsoft-com:office:excel\" xmlns=\"http://www.w3.org/TR/REC-html40\"> \n"),
      as.character(" <head> \n"),
      .
      .
      .
      as.character(" <td colspan=5 class=xl28 width=727 style=\'width:545pt\'>Gesundheitsindikatoren: "), abt, as.character("</td> \n"),
      as.character(" </tr> "), file = zz)
  close(zz)
  unlink(zz)
}
Setting abt to " Ärzte Innere, Gynäkologie" and calling the function with this argument yields the filename "Itemtabelle ??rzte Innere, Gyn??kologie .html", and the file contains the line
<td colspan=5 class=xl28 width=727 style='width:545pt'>Gesundheitsindikatoren: ????rzte Innere, Gyn????kologie </td>
I tried to solve this with iconv, without success.
The problem is the same in Rgui and Rterm; in Rterm the resulting filename is "Itemtabelle ?rzte Innere, Gyn?kologie .html".
Cheers,
Matthias
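One detail of the function above is independent of the encoding question: paste() joins its arguments with a single space by default, which would account for at least some of the blanks around the label in the reported filename. The following is a minimal sketch of one way to avoid that; the helper name make_filename, the underscore separator and the trimming of the label are assumptions made for illustration and do not come from the original post.

```r
# Default behaviour: paste() joins its arguments with a single space.
with_spaces <- paste("Itemtabelle/Itemtabelle", "X", ".html")

# Hypothetical helper: trim the label and join the pieces without separators.
make_filename <- function(abt, dir = "Itemtabelle") {
  label <- trimws(abt)  # drop leading/trailing blanks from the label
  file.path(dir, paste("Itemtabelle_", label, ".html", sep = ""))
}

without_spaces <- make_filename(" X ")

print(with_spaces)     # "Itemtabelle/Itemtabelle X .html"
print(without_spaces)  # "Itemtabelle/Itemtabelle_X.html"
```

This only addresses the spacing; whether the umlauts in the filename survive still depends on the encoding of abt, as discussed in the replies.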
encoding question again
7 messages · Matthias Wendel, Simon Urbanek, Brian Ripley
Matthias,
you get exactly what you specified - namely UTF-8. If you want your
html file to be latin1, then you have to say so:
zz = file( paste("Itemtabelle/Itemtabelle", abt, ".html"), "wt", encoding = "latin1")
In addition, you're assuming that `abt' is in an encoding that your OS
understands. If it's not, you had better convert it to one.
From your results it seems as if `abt' is also UTF-8 encoded. Since
you didn't tell us where you got it from, you should either fix the
source or use something like iconv(abt,"utf-8","latin1"):
(in UTF-8 locale)
> abt="nür"
> cat(abt,"\n")
nür
> charToRaw(abt)
[1] 6e c3 bc 72
> charToRaw(iconv(abt,"utf-8","latin1"))
[1] 6e fc 72
Cheers,
Simon
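The byte check above can be reproduced independently of the session locale. This is a small sketch rather than part of the original reply: it assumes the string is written with a \u escape (which R stores as UTF-8 and marks accordingly), and the variable names are chosen for illustration only.

```r
# "n\u00fcr" is "nür"; the \u escape yields a UTF-8 string in any locale.
abt <- "n\u00fcr"

utf8_bytes <- charToRaw(abt)                          # ü takes two bytes
latin1 <- iconv(abt, from = "UTF-8", to = "latin1")   # explicit conversion
latin1_bytes <- charToRaw(latin1)                     # ü takes one byte

print(utf8_bytes)    # 6e c3 bc 72
print(latin1_bytes)  # 6e fc 72
```

If charToRaw() on a label shows c3 followed by another high byte where a single umlaut is expected, the label is most likely UTF-8 rather than Latin-1.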
On Dec 27, 2007, at 3:11 PM, Matthias Wendel wrote:
______________________________________________ R-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Hi Simon,
I followed your advice by adding/changing the lines
abt = iconv(abt,"utf-8","latin1")
zz = file( paste("Itemtabelle/Itemtabelle", abt, ".html"), "wt", encoding = "latin1")
but this yielded the same results.
Cheers,
Matthias
-----Original Message-----
From: Simon Urbanek [mailto:simon.urbanek at r-project.org]
Sent: Thursday, 27 December 2007 21:40
To: Matthias Wendel
Cc: r-devel at r-project.org
Subject: Re: [Rd] encoding question again
1 day later
Hello Matthias,
On Dec 27, 2007, at 3:52 PM, Matthias Wendel wrote:
Hi Simon,
I followed your advice by adding/changing the lines
abt = iconv(abt,"utf-8","latin1")
zz = file( paste("Itemtabelle/Itemtabelle", abt, ".html"), "wt", encoding = "latin1")
but this yielded the same results.
I finally have a Windows machine for testing, and for me the filename
is created correctly ...
Still, the locale apparently isn't right: JGR always uses UTF-8, but
the system reports CP1252. That seems to be why the automatic
conversion (file(...,encoding..)) doesn't work. What always works,
however, is explicit conversion:
a=file("foo","wt")
writeLines(iconv(..., "utf-8","latin1"),a)
close(a)
(FWIW: since the recommended encoding for web pages is UTF-8 anyway,
you don't really need it ... ;))
charToRaw is always a good test, because UTF-8 usually needs 2 bytes
for an umlaut and latin1 only one.
Best regards,
Simon
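The explicit conversion described above can be tried end to end as follows. This is a hedged sketch, not the code from the message: the label, the temporary output path, binary mode ("wb") and useBytes = TRUE are assumptions added here so that the converted bytes reach the file unchanged in any locale and on any platform.

```r
label <- "Gyn\u00e4kologie"                 # "Gynäkologie", stored as UTF-8
out <- tempfile(fileext = ".html")          # hypothetical output path

# Convert explicitly, then write the resulting bytes without re-encoding.
line <- iconv(paste0("<td>", label, "</td>"), from = "UTF-8", to = "latin1")
con <- file(out, open = "wb")               # binary mode: no newline translation
writeLines(line, con, useBytes = TRUE)
close(con)

# Inspect what actually landed on disk.
written <- readBin(out, what = "raw", n = file.size(out))
print(written)
```

In the bytes read back, the umlaut should appear as the single byte e4 (Latin-1) rather than the pair c3 a4 (UTF-8), which is the same check charToRaw gives for strings in memory.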
Oops, this was supposed to be a private reply ;) - sorry about the noise. The essence in English: JGR keeps all strings in UTF-8, but the system locale reports CP1252, which prevents automatic conversion (because R doesn't know that everything is UTF-8). Explicit conversion via iconv works as expected (see the example in my previous message).
Cheers,
Simon
On Dec 29, 2007, at 11:11 AM, Simon Urbanek wrote:
On Sat, 29 Dec 2007, Simon Urbanek wrote:
Oops, this was supposed to be a private reply ;) - sorry about the noise. The essence in English: JGR uses all strings in UTF-8 encoding, but the system locale reports CP1252 which impedes automatic conversions (because R doesn't know that everything is UTF-8). Specific conversion via iconv works as expected (see the example below).
On Windows there are no UTF-8 locales, but you can probably get the same effect by marking the strings via Encoding(), as they will be converted to CP1252 (a Latin-1 superset) on output.
A console that is running in a non-native encoding needs to convert everything going to and from R. We've experimented with running R in UTF-8 on Windows, but then you need to convert _everything_ coming in and going out, and (this is the killer) so would every package with C-level I/O. (Tcl/Tk and Perl have gone down that route, and to a large extent left their extensions behind.)
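Marking a string via Encoding(), as suggested above, might look like the sketch below. It assumes the UTF-8 bytes arrive unmarked (simulated here with rawToChar); the variable names are illustrative, and what enc2native() returns depends on the locale of the running session, so no output is shown for it.

```r
# UTF-8 bytes of "Ärzte", arriving without any encoding mark.
raw_utf8 <- as.raw(c(0xc3, 0x84, 0x72, 0x7a, 0x74, 0x65))
x <- rawToChar(raw_utf8)

before <- Encoding(x)    # "unknown": R assumes the native encoding
Encoding(x) <- "UTF-8"   # declare the encoding; the bytes are not changed
after <- Encoding(x)     # "UTF-8"

# With the mark in place, R can translate to the native encoding on request
# (CP1252 in the locale from the original post).
y <- enc2native(x)
```

Note that Encoding<- only sets a declaration and never converts; marking a string that is not actually UTF-8 would produce garbage on output.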
--
Brian D. Ripley, ripley at stats.ox.ac.uk
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel: +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UK Fax: +44 1865 272595
1 day later
Brian,
On Dec 29, 2007, at 12:28 PM, Prof Brian Ripley wrote:
On Windows there are no UTF-8 locales, but you can probably get the same effect by marking the strings via Encoding(), as they will be converted to CP1252 (a Latin-1 superset) on output.
I was thinking about this before, but I don't have a good solution. The problem is that many places may be affected. In particular, all callbacks assume UTF-8, and since in R they are passed as char *, they cannot be flagged. This is unfortunate, because JGR actually facilitates the use of UTF-8 nicely (e.g. you can create Japanese annotated plots regardless of the Windows locale), but it cannot pass that ability to R (except silently and sort of incorrectly). It is, however, surprising how far you can get despite this conflict (basically it works nicely as long as you don't talk to the system). Once we force some conversion on callbacks, we lose that advantage, so I'm still not sure what the best solution is. One semi-fix would be to take care of the latin1 locales and perform all conversions there, because they are so limited that users working in latin1 locales don't expect anything fancy to work anyway :).
A console that is running in a non-native encoding needs to convert everything going to and from R. We've experimented with running R in UTF-8 on Windows, but then you need to convert _everything_ coming in and going out and (and this is the killer) so would every package with C-level I/O. (Tcl/Tk and Perl have gone down that route, and to a large extent left their extensions behind.)
I agree. On the other hand, ideally there should be very little direct I/O in packages, and even if it doesn't work in UTF-8, that won't make a package unusable, just limited. Most projects have adopted UTF-8 or Unicode as the native encoding. I think we are on the right track (strings flagged with a known encoding), and in the end we may end up using, say, UTF-8 internally and converting only for system calls. We may also end up supporting a similar concept (string+encoding) on the "edges" sooner or later: something like WriteConsoleWithEncoding(...), which could flag if possible instead of converting. Given that the embedding API needs some more consolidation, it may be a good time to tackle this as well. I'm hoping to do some cleanup and propose something as part of the new ObjC API for R 2.7 and Mac GUI 2.0, so any input is welcome.
Thanks,
Simon