
issue with "strange" characters (locale settings)

2 messages · R.T.A.J.Leenders, Brian Ripley

#
WinXP-x32, R-2.13.0
   Dear list,
   I have a problem that (I think) relates to the interaction between Windows
   and R.
   I am trying to scrape a table with data on the Hawai'ian Islands. This is my
   code:
   library(XML)
   u <- "http://en.wikipedia.org/wiki/Hawaii"
   tables <- readHTMLTable(u)
   Islands <- tables[[5]]
   The output is (first set of columns):
   > Islands
          Island            Nickname                                            Location
1    Hawai????i[7]      The Big Island     19????34??????N 155????30??????W?????? / ??????19.567????N 155.5????W?????? / 19.567; -155.5
2        Maui[8]     The Valley Isle     20????48??????N 156????20??????W?????? / ??????20.8????N 156.333????W?????? / 20.8; -156.333
3 Kaho????olawe[9]     The Target Isle       20????33??????N 156????36??????W?????? / ??????20.55????N 156.6????W?????? / 20.55; -156.6
4   L??na????i[10]  The Pineapple Isle 20????50??????N 156????56??????W?????? / ??????20.833????N 156.933????W?????? / 20.833; -156.933
5  Moloka????i[11]   The Friendly Isle 21????08??????N 157????02??????W?????? / ??????21.133????N 157.033????W?????? / 21.133; -157.033
6     O????ahu[12] The Gathering Place 21????28??????N 157????59??????W?????? / ??????21.467????N 157.983????W?????? / 21.467; -157.983
7    Kaua????i[13]     The Garden Isle     22????05??????N 159????30??????W?????? / ??????22.083????N 159.5????W?????? / 22.083; -159.5
8   Ni????ihau[14]  The Forbidden Isle     21????54??????N 160????10??????W?????? / ??????21.9????N 160.167????W?????? / 21.9; -160.167

   As you can see, there are "weird" characters in there. I have also tried
   readHTMLTable(u, encoding = "UTF-16") and readHTMLTable(u, encoding =
   "UTF-8"), but that didn't help.
   It seems to me that the problem may lie in how R interacts with the
   Windows character set settings.
   sessionInfo() gives
   > sessionInfo()
   R version 2.13.0 (2011-04-13)
   Platform: i386-pc-mingw32/i386 (32-bit)
   locale:
   [1] LC_COLLATE=Dutch_Netherlands.1252  LC_CTYPE=Dutch_Netherlands.1252
   LC_MONETARY=Dutch_Netherlands.1252
   [4] LC_NUMERIC=C                       LC_TIME=Dutch_Netherlands.1252
   attached base packages:
   [1] stats     graphics  grDevices utils     datasets  methods   base
   other attached packages:
   [1] XML_3.2-0.2
   >
   I have also attempted to let R use another locale by entering
   Sys.setlocale("LC_ALL", "en_US.UTF-8"), but this yields the response:
   > Sys.setlocale("LC_ALL", "en_US.UTF-8")
   [1] ""
   Warning message:
   In Sys.setlocale("LC_ALL", "en_US.UTF-8") :
     OS reports request to set locale to "en_US.UTF-8" cannot be honored
   >
   In addition, I have attempted to make the change directly from the windows
   command prompt, using: "chcp 65001" and variations of that, but that didn't
   change anything.
   I have searched the list and the web and have found others reporting
   similar issues, but have not been able to find a solution. It looks like this
   is an issue of how Windows and R interact. Unfortunately, all three
   computers at my disposal have this problem. It occurs both under WinXP-x32
   and under Win7-x86.
   Is there a way to make R override the windows settings or can the issue be
   solved otherwise?
   I have also tried other websites, and the issue occurs every time when there
   is an ??, ??, ??, ??, et cetera in the text-to-be-scraped.
   Thank you,
   Roger
#
Oh, please!

This is about the contributed package XML, not R and not Windows.
Some of us have worked very hard to provide reasonable font support in 
R, including on Windows.  We are given exceedingly little credit, just
the brickbats for things for which we are not responsible.  (We even 
work hard to port XML to Windows for you, again with almost zero 
credit.)

That URL is a page in UTF-8, as its header says.  We have provided 
many ways to work with UTF-8 on Windows, but it seems readHTMLTable() 
is not making use of them.

You need to run iconv() on the strings in your object (which, as its
columns are factors, means the levels).  When you do so, you will
discover that the page contains characters not in your native charset
(I presume, not having your locale).
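
As a rough illustration of the iconv() step, here is a minimal sketch.
It does not fetch the page: the data frame `islands` is a made-up
stand-in for the scraped table, and "CP1252" is assumed as the target
because the sessionInfo() above reports a Dutch_Netherlands.1252 locale.

```r
# Hypothetical stand-in for one column of the scraped table: a factor
# whose levels are UTF-8 strings (U+02BB is the okina, U+0101 is a-macron).
islands <- data.frame(
  Island = factor(c("Hawai\u02BBi", "Maui", "L\u0101na\u02BBi"))
)

# Convert the factor levels (the strings), not the integer codes.
# With the default sub = NA, any level containing a character that has
# no equivalent in the target charset comes back as NA.
converted <- iconv(levels(islands$Island), from = "UTF-8", to = "CP1252")

# Levels that could not be represented in the assumed native charset.
not_representable <- levels(islands$Island)[is.na(converted)]
```

Any level listed in `not_representable` contains at least one character
the assumed native charset cannot hold, which is the situation the
reply anticipates.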

What you can do, in Rgui only, is

for (n in names(Islands)) Encoding(levels(Islands[[n]])) <- "UTF-8"

but likely there are still characters it will not know how to display.
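
A slightly more defensive variant of that loop, as a sketch only: the
helper name `mark_utf8` and the example data frame are hypothetical, and
the code assumes the strings really do hold UTF-8 bytes that R has not
marked as such. It touches factor and character columns and skips the
rest, since levels() of a non-factor column is NULL.

```r
# Hypothetical helper: declare the encoding of factor levels and of
# character columns as UTF-8, leaving all other columns untouched.
# This only sets the encoding mark; the underlying bytes are unchanged.
mark_utf8 <- function(df) {
  for (n in names(df)) {
    if (is.factor(df[[n]])) {
      Encoding(levels(df[[n]])) <- "UTF-8"
    } else if (is.character(df[[n]])) {
      Encoding(df[[n]]) <- "UTF-8"
    }
  }
  df
}

# Made-up example data: build "Hawai<okina>i" from raw UTF-8 bytes so
# that the string starts out with no encoding mark.
okina_bytes <- as.raw(c(0xCA, 0xBB))  # UTF-8 bytes of U+02BB
raw_name <- rawToChar(c(charToRaw("Hawai"), okina_bytes, charToRaw("i")))
islands <- data.frame(Island = factor(c(raw_name, "Maui")), Rank = 1:2)

marked <- mark_utf8(islands)
Encoding(levels(marked$Island))  # inspect the marks after the call
```

Pure-ASCII levels such as "Maui" are never marked, so Encoding()
still reports "unknown" for them; that is expected and harmless.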
On Wed, 4 May 2011, R.T.A.J.Leenders wrote: