Getting htmlParse to work with Hebrew? (on windows)

5 messages · Milan Bouchet-Valat, Lawr Eskin

On Thursday, 21 February 2013 at 18:31 +0400, Lawr Eskin wrote:
And what if you try this:
a2 <- htmlParse(sub("windows-1251", "UTF-8", a))

or this:
a2 <- htmlParse(iconv(a, "windows-1251", "UTF-8"))
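
To make the iconv() suggestion concrete, here is a minimal sketch of what the conversion does to the bytes. The two sample bytes are an assumption for illustration (they are the windows-1251 codes for the Cyrillic letters U+041A and U+0443); they are not taken from the page in question.

```r
# Hypothetical sample: two Cyrillic letters as raw windows-1251 bytes.
raw_cp1251 <- as.raw(c(0xCA, 0xF3))
s <- rawToChar(raw_cp1251)

# Re-encode the byte string from windows-1251 to UTF-8.
s_utf8 <- iconv(s, from = "windows-1251", to = "UTF-8")

# Inspect the result by code point rather than by printing,
# so the check does not depend on the console's locale.
code_points <- utf8ToInt(s_utf8)
print(Encoding(s_utf8))
print(code_points)
```

If the code points come out as 1050 and 1091 (U+041A and U+0443), the conversion itself worked, and any remaining problem lies in how the string is parsed or displayed afterwards.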


Cheers
1 day later
On Thursday, 21 February 2013 at 18:53 +0400, Lawr Eskin wrote:
This procedure works on Linux, but not on Windows:

library(RCurl)
library(XML)
u <- "http://www.cian.ru/cat.php?deal_type=2&obl_id=1&room1=1"
a <- getURL(u, .encoding="UTF-8")
a <- iconv(a, "windows-1251", "UTF-8")
a2 <- htmlParse(sub("windows-1251", "UTF-8", a))
a2

But maybe the problem is more general, and related to conversion between
encodings on Windows. What looks weird to me is that on Windows, I'm not
able to save a character string to a file in UTF-8, despite what ?file
says:
x <- "??? ????? ????????"
Encoding(x)
# UTF-8
cat(x, file = (con <- file("foo", "w", encoding="UTF-8"))); close(con)
x2 <- readLines(con <- file("foo", "r", encoding="UTF-8")); close(con)
Encoding(x2)
# unknown
x2
# [1] "<U+041A><U+0443>..."

I know the problem happens on write because the file cannot be read
correctly on Linux either.
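
A workaround sometimes suggested for this kind of problem, though not confirmed anywhere in this thread, is to bypass the connection's re-encoding altogether: convert the string to UTF-8 explicitly and write its bytes unchanged. The sketch below assumes a sample string built from \u escapes and a temporary file; both are placeholders for illustration.

```r
# Sample Cyrillic text written with \u escapes so the script stays ASCII.
x <- "\u041a\u0443"
path <- tempfile(fileext = ".txt")   # hypothetical output file

# Open in binary mode with no 'encoding' argument, and tell writeLines()
# to emit the UTF-8 bytes as they are instead of translating them
# through the native encoding.
con <- file(path, open = "wb")
writeLines(enc2utf8(x), con, useBytes = TRUE)
close(con)

# Read back, declaring the lines to be UTF-8 rather than re-encoding them.
x2 <- readLines(path, encoding = "UTF-8")
bytes_on_disk <- readBin(path, what = "raw", n = 16)
print(identical(utf8ToInt(x2), utf8ToInt(x)))
```

Checking bytes_on_disk is a locale-independent way to see whether the file really contains UTF-8 or <U+XXXX> escape text.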

This Windows machine runs Windows Server 2008 with the French_France.1252
locale.