
Chinese characters encoding problem with XML

2 messages · Wind

#
The XML package is a good tool for reading data from the web within R, but I wonder how to get the encoding right.

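library(XML)
url <- 'http://www.szitic.com/docc/jz-lmzq.html'
# Parse the page into an internal document so that XPath queries can be used
xml <- htmlTreeParse(url, useInternalNodes = TRUE)
q <- "//tbody/tr/td"
# Extract the text of every table cell
dat <- unlist(xpathApply(xml, q, xmlValue))
# Four cells per table row
df <- as.data.frame(t(matrix(dat, 4)))
# Row 15 of the first column: one of the Chinese dates
dt <- as.character(df[15, 1])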

The first column of df holds dates in Chinese, and dt is one of those dates.
When I copied the content of dt into this email, it became the following:
[1] "2008&#229;?G????&#13312;&#12544;&#12800;?Z?d\x8825&#230;??????&#13568;&#8704;&#3328;&#2560;&#15872;&#8192;

In R itself, it looks like this:
[1] "2008\345\271\xb412\346\234\x8825\346\227\xa5"

and the color of the numbers differs a little.

My encoding option, locale, and package details:
[1] "native.enc"
[1] "LC_COLLATE=Chinese (Simplified)_People's Republic of China.936;LC_CTYPE=Chinese (Simplified)_People's Republic of China.936;LC_MONETARY=Chinese (Simplified)_People's Republic of China.936;LC_NUMERIC=C;LC_TIME=Chinese (Simplified)_People's Republic of China.936"
Package:              XML
Version:              1.98-1
Date:                 2008/10/17

R version 2.8.0 (2008-10-20)
Windows Vista Basic, Simplified Chinese edition.

There is no problem using Chinese characters in R code.

I wonder how I can get the Chinese characters correctly with XML. Or is there any method that could help me convert the encoding of the characters from UTF-8 to Unicode in R?

Regards,
Wind

------------------
http://windspeedo.spaces.live.com
#
The problem seems to lie in the XML methods.
The parsed document xml is OK. Its head is as follows:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd">
<html><head><meta http-equiv="Content-Type" content="text/html;
charset=gb2312"><title>????</title>

It has the correct charset=gb2312, which matches the content of the web page.
<head><meta http-equiv="Content-Type" content="text/html;
charset=UTF-8"><title>??????</title>

The charset has been changed to UTF-8.
<head><meta http-equiv="Content-Type" content="text/html;
charset=UTF-8"><title>??????</title>

It seems that some methods of XML change the charset to UTF-8 on their
own.
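
For what it is worth, the bytes printed for dt in the first message look like valid UTF-8 that R has simply not been told is UTF-8. Below is a minimal sketch of one possible workaround, using base R only. It rebuilds those bytes by hand so that it does not depend on the web page, and it assumes (this is not confirmed anywhere in the thread) that the strings returned by xmlValue carry no encoding mark.

```r
# Bytes of dt as printed in the first message (octal \345 is 0xe5, and so on).
# Rebuilt from raw bytes so the example does not need the XML package or the URL.
bytes <- as.raw(c(0x32, 0x30, 0x30, 0x38,   # "2008"
                  0xe5, 0xb9, 0xb4,         # U+5E74
                  0x31, 0x32,               # "12"
                  0xe6, 0x9c, 0x88,         # U+6708
                  0x32, 0x35,               # "25"
                  0xe6, 0x97, 0xa5))        # U+65E5

# rawToChar() returns a string without an encoding mark
# (assumed here to resemble what xmlValue returned in the post)
dt <- rawToChar(bytes)

# Declare the encoding; this changes only the mark, not the bytes
Encoding(dt) <- "UTF-8"

# Unicode code points, one integer per character
cp <- utf8ToInt(dt)
```

Once the string is marked, print(dt) should show the date on a console that can display Chinese. If a string in the native encoding is needed instead, something like iconv(dt, from = "UTF-8", to = "GB2312") may work, but the encoding names iconv accepts depend on the platform.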