Encoding problem - I fail to read Hebrew text from online

6 messages · Tal Galili, Matt Shotwell

#
Tal, 

It looks like the data you received contains HTML numeric character
references written in hex. That is, '&#x5E9;' (which a browser renders
as 'ש') is just an ASCII HTML representation of a character's hex code
point. It's not encoded in a special manner.

The trick is to replace each HTML-encoded hex reference with the
character it represents, that is, to "decode" it. I don't know of any R
function that does this, but there are web services, for example:
http://www.hashemian.com/tools/html-url-encode-decode.php

I decoded your file using this service and posted it on my website. You
can see the difference by running:

readLines("http://biostatmatt.com/temp/Hebrew-original", warn=FALSE)

readLines("http://biostatmatt.com/temp/Hebrew-decoded", warn=FALSE)

The second should display the Hebrew characters correctly (it does in my
terminal). The next thing to think about is how to automate this in R
without using the web service... We may need to write an HTMLDecode
function if there isn't one already.
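
As a rough idea of what such an HTMLDecode function might look like, here
is a minimal base-R sketch. The name html_decode_numeric is hypothetical,
and it assumes the input contains only numeric references of the form
&#xHHHH; or &#DDDD;. Named entities such as &amp; are not handled.

```r
# Hypothetical helper (not an existing R function): decode numeric HTML
# character references in a character vector. Named entities are left as-is.
html_decode_numeric <- function(x) {
  pattern <- "&#([xX][0-9A-Fa-f]+|[0-9]+);"
  m <- gregexpr(pattern, x, perl = TRUE)
  regmatches(x, m) <- lapply(regmatches(x, m), function(ents) {
    vapply(ents, function(e) {
      ref <- substr(e, 3, nchar(e) - 1)        # strip leading "&#" and trailing ";"
      code <- if (grepl("^[xX]", ref)) {
        strtoi(substring(ref, 2), base = 16L)  # hex form, e.g. x5E9
      } else {
        strtoi(ref, base = 10L)                # decimal form, e.g. 1513
      }
      intToUtf8(code)                          # code point -> UTF-8 character
    }, character(1), USE.NAMES = FALSE)
  })
  x
}

decoded <- html_decode_numeric("&#x5E9;&#x5DC;&#1493;&#1501;")
decoded == "\u05e9\u05dc\u05d5\u05dd"   # TRUE
```

Something like html_decode_numeric(readLines(url, warn=FALSE)) could then
replace the round trip through the web service, assuming the file uses
only numeric references.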

By the way, what's the Hebrew text in English?

Best,
Matt
On Thu, 2010-12-09 at 12:21 -0500, Tal Galili wrote:

  
    
#
Tal, 

OK, let me clarify my understanding. The original and decoded files are
both text, encoded in UTF-8. In the original file, there are HTML
'entities' that represent Hebrew characters. In the decoded file, the
entities have been converted to the UTF-8 characters themselves. The
question is how to convert these entities within R. This is not the same
as converting between character encodings; if it were, iconv() might
offer a solution.
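
To illustrate why iconv() does not help, here is a small check using an
example entity (assumed here to be the hex reference for 'ש'). The entity
is plain ASCII, so re-encoding has nothing to change.

```r
# The entity is seven ASCII characters, not one encoded Hebrew letter.
x <- "&#x5E9;"
nchar(x)                   # 7
all(utf8ToInt(x) < 128L)   # TRUE: every code point is ASCII

# Converting between encodings therefore leaves the string untouched.
y <- iconv(x, from = "UTF-8", to = "latin1")
identical(x, y)            # TRUE
```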

I'll have a look around to find a solution, and I hope others will too.
My first idea is to check RCurl, XML, and the related utils::URLdecode.
If there really is no existing solution, I think it might be worthwhile
to look at how PHP and Python do it (and maybe borrow some code :) ).
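
A quick check on utils::URLdecode, for what it's worth: it reverses URL
percent-escapes, which is a different scheme from HTML entities, so on
its own it will probably not do the job. The inputs below are made-up
examples, not taken from the original file.

```r
# URLdecode() turns percent-escapes back into bytes; %D7%A9 is the UTF-8
# byte sequence for U+05E9.
bytes <- charToRaw(utils::URLdecode("%D7%A9"))
bytes

# An HTML numeric reference contains no "%" escapes, so it passes through
# unchanged.
unchanged <- utils::URLdecode("&#x5E9;")
unchanged
```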

-Matt
On Thu, 2010-12-09 at 14:27 -0500, Tal Galili wrote: