how to read a website with Chinese Character

On 13-01-23 8:19 PM, Hui Du wrote:
If you look at the first few lines of x you'll see this:

 > head(x)
[1] "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\"\t\"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd\">"
[2] "<html xmlns=\"http://www.w3.org/1999/xhtml\">"
[3] "<head>"
[4] "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=gb2312\" />"

At the end of line 4 it shows "charset=gb2312".  I didn't think that was 
an encoding, but this seems to do the conversion:

y <- iconv(x, "gb2312", "utf-8")
y

(I don't know if that will display properly on your Windows machine; it 
doesn't work on mine, because I don't have the fonts installed.  But it 
does work on my Mac.)
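
The same idea can be sketched without typing the encoding name by hand: read the charset out of the meta tag, then pass it to iconv(). This is only a rough sketch. The vector x below is a hand-built stand-in for the downloaded page (not the real site), the variable names meta and charset are made up for illustration, and it assumes your platform's iconv recognises the name found in the tag.

```r
# Hypothetical stand-in for the downloaded page: built by hand, not fetched.
gb_bytes <- as.raw(c(0xD6, 0xD0, 0xCE, 0xC4))  # two Chinese characters as GB2312 bytes
x <- c(
  "<head>",
  "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=gb2312\" />",
  paste0("<title>", rawToChar(gb_bytes), "</title>")
)

# Find the first line that declares a charset. useBytes = TRUE keeps grep()
# from complaining about bytes that are not valid in the current locale.
meta <- grep("charset=", x, value = TRUE, ignore.case = TRUE, useBytes = TRUE)[1]

# Extract the encoding name that follows "charset=".
charset <- sub(".*charset=[\"']?([A-Za-z0-9_-]+).*", "\\1", meta,
               ignore.case = TRUE)

# Convert from the declared encoding to UTF-8. iconv() returns NA for any
# element it cannot convert, so check for NAs before relying on the result.
y <- iconv(x, from = charset, to = "UTF-8")
anyNA(y)
```

If the encoding is known before reading, file() and url() also take an encoding argument, so the conversion can happen while the lines are read instead of afterwards.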

Duncan Murdoch