Getting htmlParse to work with Hebrew? (on windows)
5 messages · Milan Bouchet-Valat, Lawr Eskin
On Thursday 21 February 2013 at 18:31 +0400, Lawr Eskin wrote:
Hi Milan,
a <- getURL(con, .encoding = "UTF-8")
Encoding(a)
[1] "UTF-8"
a # Here the UTF-8 text looks fine.
htmlParse(a, encoding = "UTF-8") # again the same encoding issue
And what if you try this:
a2 <- htmlParse(sub("windows-1251", "UTF-8", a))
or this:
a2 <- htmlParse(iconv(a, "windows-1251", "UTF-8"))
Cheers
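The two suggestions above can be tried in isolation, without fetching the page. The following is a minimal base-R sketch, not a confirmed fix: the byte vector and the one-line HTML snippet are made-up test inputs (not content from the cian.ru page), and htmlParse() is deliberately left out so that only the conversion and the charset substitution are exercised.

```r
# Bytes of the Russian word "Privet" (U+041F U+0440 U+0438 U+0432 U+0435 U+0442)
# as they appear in windows-1251. Illustrative input only.
cp1251_bytes <- as.raw(c(0xCF, 0xF0, 0xE8, 0xE2, 0xE5, 0xF2))
s <- rawToChar(cp1251_bytes)
Encoding(s)  # "unknown": R cannot tell that these bytes are windows-1251

# Step 1: convert explicitly from the real source encoding to UTF-8.
s_utf8 <- iconv(s, from = "windows-1251", to = "UTF-8")
Encoding(s_utf8)  # "UTF-8"

# Step 2: a made-up page that declares windows-1251 and contains those bytes.
html_1251 <- rawToChar(c(charToRaw('<meta charset="windows-1251"><title>'),
                         cp1251_bytes,
                         charToRaw("</title>")))
html_utf8 <- iconv(html_1251, from = "windows-1251", to = "UTF-8")

# Make the declared charset agree with the encoding the string is now in.
html_utf8 <- sub("windows-1251", "UTF-8", html_utf8, fixed = TRUE)
```

Doing the conversion first and the substitution second keeps the declaration and the actual bytes consistent, which is the assumption behind both suggestions.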
> why didn't getURL() detect and set a's encoding correctly?
I think it is an issue with this particular page, because other sites work fine.
2013/2/21 Milan Bouchet-Valat <nalimilan at club.fr>
On Thursday 21 February 2013 at 16:04 +0400, Lawr Eskin wrote:
> Hi Milan!
>
>
> > Encoding(a)
> [1] "unknown"
Hm, here I get "UTF-8", which is my locale encoding.
I've tried a little more, and I discovered that using
a <- getURL(u, .encoding="UTF-8")
ensures that a is in the correct encoding here. I know this is not your problem, but it might help: check whether Encoding(a) is set to "UTF-8" or not in that case, and whether this fixes things.
I'm not sure how htmlParse() detects the encoding when you pass it a character vector, but it probably uses Encoding(a), since that's the only reliable information; if it is missing, maybe it falls back to what the contents of the file say (maybe even before what the "encoding" argument says), which is windows-1251, and may not be the encoding in which getURL() saved the character vector. The question would then be: why didn't getURL() detect and set a's encoding correctly?
My two cents
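To compare the two pieces of information discussed above (the encoding mark R attaches to the string, and the charset the page itself declares), something like the following base-R sketch may help. declared_charset() is a hypothetical helper written for this illustration, not a function from RCurl or XML, and the sample page is a made-up ASCII string rather than the real download.

```r
# Hypothetical helper: extract the charset named in an HTML string, or NA.
declared_charset <- function(html) {
  m <- regmatches(html,
                  regexpr("charset=[\"']?[A-Za-z0-9_-]+", html,
                          ignore.case = TRUE))
  if (length(m) == 0) return(NA_character_)
  # Drop the leading 'charset=' and an optional opening quote.
  sub("^charset=[\"']?", "", m, ignore.case = TRUE)
}

# Made-up sample header, similar in shape to what such a page might send.
page <- paste0('<html><head><meta http-equiv="Content-Type" ',
               'content="text/html; charset=windows-1251"></head></html>')

Encoding(page)          # "unknown" (pure ASCII strings are never marked)
declared_charset(page)  # "windows-1251"
```

If the declared charset and Encoding(a) disagree for the real page, that would be consistent with the guess that the parser trusts one source while the string is actually stored in the other.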
> 2013/2/21 Milan Bouchet-Valat <nalimilan at club.fr>
> On Thursday 21 February 2013 at 13:16 +0400, Lawr Eskin wrote:
> > Hello dear R-help mailing list.
> >
> > Looks like the same issue in Russian:
> >
> > library(RCurl)
> > library(XML)
> > u = "http://www.cian.ru/cat.php?deal_type=2&obl_id=1&room1=1"
> > a = getURL(u)
> > a # Here - the Russian is fine.
> > a2 <- htmlParse(a)
> > a2 # Here it is a mess...
> >
> > None of these seem to fix it:
> >
> > htmlParse(a, encoding = "windows-1251")
> > htmlParse(a, encoding = "CP1251")
> > htmlParse(a, encoding = "cp1251")
> > htmlParse(a, encoding = "iso8859-5")
> >
> > This is my locale:
> >
> > Sys.getlocale()
> > "LC_COLLATE=Russian_Russia.1251;LC_CTYPE=Russian_Russia.1251;LC_MONETARY=Russian_Russia.1251;LC_NUMERIC=C;LC_TIME=Russian_Russia.1251"
> >
> > Any suggestions?
>
> What does Encoding(a) say?
>
>
> (FWIW, here on Linux even a is not in the correct encoding:
> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
> <html><head>
> <title>???????????? ????????????????? ???????? ????? ????????
> ?? ?? ?????
> ??????? ?? 11430 ???????????????????? ?? ????????? ???? ?????? ??????????
> ? ???????? ????? ????????</title>
> [...])
>
>
> Regards
>
>
> > Thank you very much in advance,
> >
> > Lavrentiy Eskin
> > <http://www.eng.nvg.ru>
> >
> > [[alternative HTML version deleted]]
> >
> > ______________________________________________
> > R-help at r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
1 day later
On Thursday 21 February 2013 at 18:53 +0400, Lawr Eskin wrote:
I had already tried iconv in various attempts: same issue, and the result has encoding = unknown. Now I tried sub: same issue.
This procedure works on Linux, but not on Windows:
library(RCurl)
library(XML)
u <- "http://www.cian.ru/cat.php?deal_type=2&obl_id=1&room1=1"
a <- getURL(u, .encoding="UTF-8")
a <- iconv(a, "windows-1251", "UTF-8")
a2 <- htmlParse(sub("windows-1251", "UTF-8", a))
a2
But maybe the problem is more general, and related to conversion between encodings on Windows. What looks weird to me is that on Windows, I'm not able to save a character string to a file in UTF-8, despite what ?file says:
x <- "??? ????? ????????"
Encoding(x) # UTF-8
cat(x, file = con <- file("foo", "w", encoding="UTF-8")); close(con)
x2 <- readLines(con <- file("foo", "r", encoding="UTF-8")); close(con)
Encoding(x2) # unknown
x2 # [1] "<U+041A><U+0443>..."
I know the problem happens on write because the file cannot be read correctly on Linux either. This Windows machine uses Windows Server 2008 with French_France.1252 locale.
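For comparison, here is a minimal sketch of a different way to write UTF-8 text to a file: pass the bytes through unchanged with writeLines(useBytes = TRUE) instead of relying on the connection's encoding argument. This is an alternative technique, not a diagnosis of the behaviour described above, and it has not been checked on the Windows machine in question. The sample string and the temporary file are assumptions made for the example.

```r
# Sample Cyrillic text, written with \u escapes so the source stays ASCII.
x <- "\u041f\u0440\u0438\u0432\u0435\u0442"
Encoding(x)  # "UTF-8"

f <- tempfile(fileext = ".txt")

# Open in binary mode and write the UTF-8 bytes as they are:
# useBytes = TRUE suppresses re-encoding of marked strings on output.
con <- file(f, open = "wb")
writeLines(enc2utf8(x), con, useBytes = TRUE)
close(con)

# Read the bytes back and declare them to be UTF-8.
x2 <- readLines(f, encoding = "UTF-8")
Encoding(x2)  # "UTF-8"
```

The design choice here is to avoid any conversion through the native locale encoding: the string is forced to UTF-8 once, and neither the write nor the read asks R to translate it again.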