issue with encoding in R-2.8.1 invalid multibyte character
8 messages · Wijffels, Jan, Peter Dalgaard, Brian Ripley
Well, we don't see what you see, but if ? was hex a7, the message is entirely correct. If you want to enter that, use "\xa7".
On Tue, 30 Dec 2008, Wijffels, Jan wrote:
Hi,

We recently switched from R 2.7.0 to R 2.8.1 but are having problems tracking down this 'invalid multibyte character' encoding issue. Can someone point us to how to solve this?

sessionInfo()
R version 2.8.1 (2008-12-22)
x86_64-unknown-linux-gnu
locale:
LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US.UTF-8;LC_MONETARY=C;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
print("?")
Error: invalid multibyte character in parser at line 1

sessionInfo()
R version 2.7.0 (2008-04-22)
x86_64-unknown-linux-gnu
locale:
LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US.UTF-8;LC_MONETARY=C;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
print("?")
[1] "\xa7"

Thanks,
Jan Wijffels
Statistical Analyst
www.thomascook.be | +32 9 241 1709
Brian D. Ripley, ripley at stats.ox.ac.uk Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/ University of Oxford, Tel: +44 1865 272861 (self) 1 South Parks Road, +44 1865 272866 (PA) Oxford OX1 3TG, UK Fax: +44 1865 272595
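[Illustration, not part of the original thread.] The advice above can be sketched in R roughly as follows. This assumes a reasonably recent R, since validUTF8() did not exist in the 2.8.x series, and the variable names are made up for the example.

```r
# Enter the byte 0xa7 through an escape, so the parser never has to decode
# a raw non-UTF-8 byte coming from the terminal.
x <- "\xa7"
charToRaw(x)        # the string holds the single byte a7
validUTF8(x)        # FALSE: a lone 0xa7 is not a valid UTF-8 sequence

# Mark the string as Latin-1 so R knows how to interpret that byte,
# then translate it to UTF-8.
Encoding(x) <- "latin1"
y <- enc2utf8(x)
validUTF8(y)        # TRUE
```

The escape only fixes how the byte is entered; whether it then displays as a section sign still depends on the string being marked or converted as shown.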
Prof Brian Ripley wrote:
Well, we don't see what you see, but if ? was hex a7, the message is entirely correct. If you want to enter that, use "\xa7".
We see different things. I see a section sign (double s) symbol. From the symptoms, I would suspect that the terminal is set to latin-1 or -15 (both have the section sign at 0xa7) even though the system (and thus R) is utf-8. (Incidentally, said 0xa7 is known as the "paragraph" symbol in Danish legal texts, whereas the paragraph symbol at 0xb6 is largely unknown.)
   O__  ---- Peter Dalgaard             Øster Farimagsgade 5, Entr.B
  c/ /'_ --- Dept. of Biostatistics     PO Box 2099, 1014 Cph. K
 (*) \(*) -- University of Copenhagen   Denmark      Ph:  (+45) 35327918
~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk)              FAX: (+45) 35327907
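[Illustration, not part of the original thread.] The mismatch suspected above can be made concrete with a small R sketch. It assumes the platform's iconv knows the encoding names "latin1" and "ISO-8859-15", which is usual but not guaranteed.

```r
section_sign <- intToUtf8(0xa7)   # U+00A7 as a UTF-8 string

# What a UTF-8 terminal sends for the section sign: two bytes.
charToRaw(section_sign)                                   # c2 a7

# What a Latin-1 or Latin-9 terminal sends: the single byte 0xa7.
charToRaw(iconv(section_sign, "UTF-8", "latin1"))         # a7
charToRaw(iconv(section_sign, "UTF-8", "ISO-8859-15"))    # a7

# The pilcrow mentioned above sits at 0xb6 in Latin-1.
charToRaw(iconv(intToUtf8(0xb6), "UTF-8", "latin1"))      # b6
```

An R session in a UTF-8 locale expects the two-byte form, so a terminal that sends the one-byte form produces input the parser cannot decode.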
On Wed, 31 Dec 2008, Peter Dalgaard wrote:
We see different things.
Right, and my point is that we do not know what he actually sees.
I see a section sign (double s) symbol. From the symptoms, I would suspect that the terminal is set to latin-1 or -15 (both have the section sign at 0xa7) even though the system (and thus R) is utf-8.
I thought of that, but if the system is in UTF-8, so would its keyboard be. Perhaps this is a remote session from a Windows system to a UTF-8 one? (In which case set the remote locale appropriately.) The issue seemed to be about entering Latin characters (-1 or -9, I think: latin-9 is ISO 8859-15), and that is what I tried to answer.
Yes, it was the section sign (double s) symbol that I was trying to print, connecting from a Windows machine with Latin-1 encoding to a UTF-8 Linux machine. I changed the translation setting in my PuTTY SSH client from Latin-1 to UTF-8 and now interactive R works.

My scripts, which I run with Rscript my_script.r, contain quite a few Latin-1 characters. These ran OK in R 2.7.0 but no longer do in R 2.8.1. I presume this is because of a change listed for 2.7.1: 'The parser sometimes accepted invalid quoted strings in a UTF-8 locale'. So for me this means I need to convert the scripts I develop in Latin-1 on Windows to UTF-8 before I upload them to our server.

Thanks for the help.
On Wed, 31 Dec 2008, Wijffels, Jan wrote:
...
So this means for me I need to change the scripts I develop in Latin1 on Windows to UTF-8 before I upload them to our server.
Or, as I suggested earlier, run the R session on the server in Latin1.
% LC_ALL=nl_BE R
(guessing, or use en_US) should do it.
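[Illustration, not part of the original thread.] Whichever locale the session ends up in, it can be checked from inside R. A minimal sketch follows; the values reported depend on the system and on whether the requested locale is actually installed there.

```r
# Character-type locale the session is running in, e.g. "en_US.UTF-8".
Sys.getlocale("LC_CTYPE")

# l10n_info() reports whether R treats the session as multibyte, UTF-8,
# or Latin-1.
info <- l10n_info()
info[["MBCS"]]
info[["UTF-8"]]
info[["Latin-1"]]
```

If the Latin-1 flag is not TRUE after starting R with a changed LC_ALL, the locale name was probably not recognised on that machine.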
Prof Brian Ripley wrote:
Or, as I suggested earlier, run the R session on the server in Latin1.
% LC_ALL=nl_BE R
(guessing, or use en_US) should do it.
Even better :), thanks
Wijffels, Jan wrote:
...
So this means for me I need to change the scripts I develop in Latin1 on Windows to UTF-8 before I upload them to our server.
Or, as I suggested earlier, run the R session on the server in Latin1.
% LC_ALL=nl_BE R
(guessing, or use en_US) should do it.
Even better :), thanks
Otherwise, depending on your workflow, you might find that "iconv" is your friend. (Notice that the above will give you output in Latin1 too, which may be exactly what you need, but then again maybe not.)
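[Illustration, not part of the original thread.] One way the iconv suggestion could look from inside R, offered as a hedged sketch: R's iconv() function re-encodes the lines of a script from Latin-1 to UTF-8. The file names are temporary placeholders and the one-line script is invented for the example.

```r
# Hypothetical stand-in for a script written in Latin-1 on Windows:
# an assignment of the section sign, stored as the single byte 0xa7.
latin1_script <- tempfile(fileext = ".r")
writeBin(c(charToRaw('x <- "'), as.raw(0xa7), charToRaw('"\n')), latin1_script)

# Read the lines as they are, then re-encode them from Latin-1 to UTF-8.
lines_latin1 <- readLines(latin1_script)
lines_utf8 <- iconv(lines_latin1, from = "latin1", to = "UTF-8")

# Write the converted script; useBytes = TRUE writes the UTF-8 bytes unchanged.
utf8_script <- tempfile(fileext = ".r")
writeLines(lines_utf8, utf8_script, useBytes = TRUE)

validUTF8(lines_latin1)   # FALSE: the original line is not valid UTF-8
validUTF8(lines_utf8)     # TRUE: the converted line is
```

Depending on the workflow, source() also accepts an encoding argument, so something like source("my_script.r", encoding = "latin1") may avoid converting the file at all.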