Encoding problem - I fail to read Hebrew text from online - R-help

Tue, Dec 7, 2010 4:30 AM #

An embedded and charset-unspecified text was scrubbed...
Name: not available
URL: <https://stat.ethz.ch/pipermail/r-help/attachments/20101207/82cf83d7/attachment.pl>

Tal Galili

Thu, Dec 9, 2010 9:21 AM #

An embedded and charset-unspecified text was scrubbed...
Name: not available
URL: <https://stat.ethz.ch/pipermail/r-help/attachments/20101209/fa58fe74/attachment.pl>

Matt Shotwell

Thu, Dec 9, 2010 10:38 AM #

Tal, 

It looks like the data you received contains HTML numeric character
references. That is, '&#x5E9;' is just an ASCII HTML representation of a
character's hexadecimal code point. It's not encoded in a special manner.

The trick is to replace each HTML-encoded hex character with its binary
representation, or "decode" the character. I don't know of any R
function that does this, but there are web services, for example:
http://www.hashemian.com/tools/html-url-encode-decode.php

I decoded your file using this service and posted it on my website. You
can see the difference by running:

readLines("http://biostatmatt.com/temp/Hebrew-original", warn=FALSE)

readLines("http://biostatmatt.com/temp/Hebrew-decoded", warn=FALSE)

The second should display the Hebrew characters correctly (it does in my
terminal). The next thing to think about is how to automate this in R
without using the web service... We may need to write an HTMLDecode
function if there isn't one already.

By the way, what's the Hebrew text in English?

Best,
Matt

On Thu, 2010-12-09 at 12:21 -0500, Tal Galili wrote:

Matthew S. Shotwell
Graduate Student 
Division of Biostatistics and Epidemiology
Medical University of South Carolina
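
The HTMLDecode idea in the message above can be sketched in base R. This is only a minimal sketch, not a function from any package: the name html_decode_hex is hypothetical, and it assumes the input contains only hexadecimal references of the form '&#x5E9;' (no decimal references and no named entities such as '&amp;').

```r
# Hypothetical helper (not part of base R or any package): decode
# hexadecimal numeric character references such as "&#x5E9;" into
# UTF-8 characters, using base R only.
html_decode_hex <- function(x) {
  pattern <- "&#[xX][0-9A-Fa-f]+;"
  m <- gregexpr(pattern, x)
  regmatches(x, m) <- lapply(regmatches(x, m), function(refs) {
    # Strip the "&#x" prefix and the ";" suffix, leaving the hex digits.
    hex <- substr(refs, 4L, nchar(refs) - 1L)
    # Hex digits -> integer code point -> one UTF-8 character each.
    vapply(strtoi(hex, base = 16L), intToUtf8, character(1))
  })
  x
}

# Usage on a fragment like the one in the original question:
decoded <- html_decode_hex("<suggestion data=\"&#x5E9;&#x5DC;&#x5D5;&#x5DD;\"/>")
```

Strings without any references pass through unchanged, so the helper can be applied to every line returned by readLines().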

Tal Galili

Thu, Dec 9, 2010 11:27 AM #

An embedded and charset-unspecified text was scrubbed...
Name: not available
URL: <https://stat.ethz.ch/pipermail/r-help/attachments/20101209/b2e9e303/attachment.pl>

Matt Shotwell

Thu, Dec 9, 2010 2:00 PM #

Tal, 

OK, let me clarify my understanding. The original and decoded files are
text, encoded in UTF-8. In the original file, there are HTML 'entities'
that represent Hebrew characters by their Unicode code points. In the
decoded file, the entities are converted to UTF-8 characters. The
question is how to convert these entities within R. It's not the same as
converting between character encodings; otherwise iconv() might offer a
solution.

I'll have a look around to find a solution, and I hope others will too.
My first idea is to check RCurl, XML, and the related utils::URLdecode.
If there really is no existing solution, I think it might be worthwhile
to look at how PHP and Python do it (and maybe borrow some code :) ).

-Matt
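
The distinction drawn above (entities versus character encodings) can be checked with a short base R sketch. It assumes nothing beyond iconv() and utils::URLdecode(); the variable names are illustrative only.

```r
entity <- "&#x5E9;"

# iconv() converts between byte encodings. The entity is plain ASCII,
# so there is nothing for it to convert and the string comes back as-is.
iconv_unchanged <- identical(iconv(entity, from = "UTF-8", to = "latin1"),
                             entity)

# URLdecode() undoes percent-encoding (e.g. "%d7%a9"), not HTML
# entities, so it also leaves the entity untouched.
urldecode_unchanged <- identical(utils::URLdecode(entity), entity)
```

Both checks come out TRUE, which is why neither function helps with this particular problem.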

On Thu, 2010-12-09 at 14:27 -0500, Tal Galili wrote:

Hi Matt,
Thanks for having a look at this.
I just spent some time looking around and couldn't find any R function
to decode decimal HTML code.


Do you (or someone else on the list) know how to program this sort of
thing? (Is there a formula for the translation?)
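
On the question of a formula: the number inside a reference is the character's Unicode code point, written either in hexadecimal ('&#x5E9;') or in decimal ('&#1513;'). A small worked example in base R, using illustrative variable names:

```r
# The same code point written two ways.
hex_digits <- "5E9"    # taken from "&#x5E9;"
dec_digits <- "1513"   # taken from "&#1513;"

cp_hex <- strtoi(hex_digits, base = 16L)  # 5*256 + 14*16 + 9 = 1513
cp_dec <- as.integer(dec_digits)

# Code point -> UTF-8 character, and the bytes that character occupies.
ch    <- intToUtf8(cp_hex)
bytes <- charToRaw(ch)   # d7 a9, the same bytes as "%d7%a9" in the URL
```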




p.s:
For it to work on my end I added the encoding parameter:
readLines("http://biostatmatt.com/temp/Hebrew-decoded", warn=FALSE,
encoding= "UTF-8")
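
A sketch of what the encoding argument does here, assuming a temporary file as a stand-in for the URL (the file and variable names are illustrative): readLines() does not re-encode the input, it only marks the strings it reads as UTF-8.

```r
# Write the UTF-8 bytes of four Hebrew letters to a temporary file,
# standing in for the decoded file on the web.
tmp <- tempfile()
writeBin(charToRaw(intToUtf8(c(0x5E9, 0x5DC, 0x5D5, 0x5DD))), tmp)

# encoding = "UTF-8" marks the input as UTF-8 without converting it.
x <- readLines(tmp, warn = FALSE, encoding = "UTF-8")
marked <- Encoding(x)
unlink(tmp)
```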


p.p.s: The Hebrew word I used means "peace" 


Cheers,
Tal


----------------Contact Details:-------------------------------------------------------
Contact me: Tal.Galili at gmail.com |  972-52-7275845
Read me: www.talgalili.com (Hebrew) | www.biostatistics.co.il (Hebrew) | www.r-statistics.com (English)
----------------------------------------------------------------------------------------------




        On Thu, 2010-12-09 at 12:21 -0500, Tal Galili wrote:
        > I am bumping this question in the hopes that someone might be
        > able to advise.
        > This Hebrew and R business is not as smooth as I had hoped...
        >
        > Thanks,
        > Tal
        >
        > Older message:
        >
        > On Tue, Dec 7, 2010 at 2:30 PM, Tal Galili <tal.galili at gmail.com> wrote:
        > > Hello all,
        > >
        > > # I am trying to read the text in this URL:
        > > u <- "http://google.com/complete/search?output=toolbar&q=%d7%a9%d7%9c%d7%95%d7%9d"
        > > # By using this command:
        > > readLines(u)
        > >
        > > And no matter what variation I tried, I keep getting this output:
        > > [1] "<?xml version=\"1.0\"?><toplevel><CompleteSuggestion><suggestion
        > > data=\"&#x5E9;&#x5DC;&#x5D5;&#x5DD;\"/><   (etc...)
        > >
        > > Instead of this output:
        > > <?xml version="1.0"?><toplevel><CompleteSuggestion><suggestion
        > > data="????"/><num_queries int="16800000"/></CompleteSuggestion>
        > > <CompleteSuggestion><suggestion data="???? ????"/><num_queries
        > > int="232000"/></CompleteSuggestion>
        > > <CompleteSuggestion><suggestion data="???? ?????"/
        > > (etc....)
        > >
        > > I tried:
        > >   readLines(u, encoding= "latin1")
        > >   readLines(u, encoding= "UTF-8")
        > > And also changing Sys.setlocale:
        > >   Sys.setlocale("LC_ALL", "Hebrew") # must be done for Hebrew to work.
        > >   Sys.setlocale("LC_ALL", "English") # must be done for Hebrew to work.
        > >
        > > Are there any more options I could try to get this text properly encoded?
        > >
        > > Thanks!
        > > Tal
        >

        

Matthew S. Shotwell
Graduate Student 
Division of Biostatistics and Epidemiology
Medical University of South Carolina

Tal Galili

Fri, Dec 10, 2010 12:34 PM #

An embedded and charset-unspecified text was scrubbed...
Name: not available
URL: <https://stat.ethz.ch/pipermail/r-help/attachments/20101210/b4217f97/attachment.pl>