XML package example code?

I'm interested in parsing an HTML page. I should use XML, right? Could
somebody show me some example code? Is there a tutorial for this
package?

Did you try looking through the help pages for the XML package or browsing
the Omegahat website?

Look at:

  library(XML)
  ?htmlTreeParse

And the relevant web page for documentation and examples is:

  http://www.omegahat.org/RSXML/

-Charlie

-----
Charlie Sharpsteen
Undergraduate
Environmental Resources Engineering
Humboldt State University
View this message in context: http://old.nabble.com/XML-package-example-code--tp26506445p26508065.html
Sent from the R help mailing list archive at Nabble.com.
Cls59 is correct that there is a lot of example code; just look in
?htmlTreeParse and you'll get most of what you need, I think.

Here's some simplified code I use a lot (XPath expressions are used
to select nodes from the parsed page):

# libraries
library(RCurl)
library(XML)

# google url (built with paste0 so no line break ends up inside the string)
my.url <- paste0("http://www.google.co.uk/search?hl=en&client=firefox-a",
                 "&rls=org.mozilla%3Aen-GB%3Aofficial&hs=6Sd&q=google+wave",
                 "&btnG=Search&meta=&aq=f&oq=")

# download page
html <- getURL(my.url)
html.tree <- htmlTreeParse(html, useInternalNodes = TRUE,
                           error = function(...) {})

# the xpath expression is next: <a> elements with an href and class "l"
nodes <- getNodeSet(html.tree, "//a[@href][@class='l']")
# read the href attribute by name rather than taking the first attribute
links <- sapply(nodes, xmlGetAttr, "href")
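
For anyone who wants to try the XPath step without hitting a live site, here is a minimal sketch on an inline page. The HTML string and the example.com links are made up for illustration; htmlParse(), xpathSApply(), xmlGetAttr() and xmlValue() are from the XML package.

```r
library(XML)

# Hypothetical inline page, so the example runs without network access
page <- '<html><body>
  <a class="l" href="http://example.com/one">One</a>
  <a class="other" href="http://example.com/skip">Skip</a>
  <a class="l" href="http://example.com/two">Two</a>
</body></html>'

# Parse the string as HTML into an internal document
doc <- htmlParse(page, asText = TRUE)

# Select <a> elements that have an href and class "l",
# then read the href attribute of each by name
links <- xpathSApply(doc, "//a[@href][@class='l']", xmlGetAttr, "href")

# Same selection, but returning the link text instead
titles <- xpathSApply(doc, "//a[@href][@class='l']", xmlValue)

print(links)
print(titles)
```

The second link is skipped because its class does not match the predicate.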

HTH
Tony

______________________________________________
R-h... at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

http://www.omegahat.org/RSXML/shortIntro.html

I'm trying the example on the above webpage, but I'm not sure why I
got the following error. Could you help take a look?

$ Rscript main.R
library(XML)

download.file('http://www.omegahat.org/RSXML/index.html','index.html')
trying URL 'http://www.omegahat.org/RSXML/index.html'
Content type 'text/html; charset=ISO-8859-1' length 3021 bytes
opened URL
==================================================
downloaded 3021 bytes
doc = xmlInternalTreeParse("index.html")
Opening and ending tag mismatch: dd line 68 and dl
Opening and ending tag mismatch: li line 67 and body
Opening and ending tag mismatch: dt line 66 and html
Premature end of data in tag dd line 64
Premature end of data in tag li line 63
Premature end of data in tag dt line 62
Premature end of data in tag dl line 61
Premature end of data in tag body line 5
Premature end of data in tag html line 1
Error: 1: Opening and ending tag mismatch: dd line 68 and dl
2: Opening and ending tag mismatch: li line 67 and body
3: Opening and ending tag mismatch: dt line 66 and html
4: Premature end of data in tag dd line 64
5: Premature end of data in tag li line 63
6: Premature end of data in tag dt line 62
7: Premature end of data in tag dl line 61
8: Premature end of data in tag body line 5
9: Premature end of data in tag html line 1
Execution halted
It's been a long time since I read the tutorials, but I think the
reason you get those messages is that the HTML is malformed: some
opening tags (such as <dd>) have no corresponding closing tags
(</dd>).

The XML package seems rather good at working with malformed code, and
so I usually just pass an empty function as the error handler to
discard those messages.

library(RCurl)
library(XML)
html <- getURL("http://www.omegahat.org/RSXML/index.html")
html.tree <- htmlTreeParse(html, useInternalNodes = TRUE, error =
function(...){})
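
If you would rather keep the parser messages than throw them away, one option is a handler that records them. This is only a sketch: the inline HTML and the names msgs and collect are made up, and it assumes the handler receives the message text as its first argument (as described under ?xmlStructuredStop). The HTML parser may well report nothing for a snippet this small.

```r
library(XML)

# Hypothetical malformed snippet: <dt> and <dd> are never closed
page <- "<html><body><dl><dt>term<dd>definition</dl></body></html>"

# Collect parser messages instead of discarding them
msgs <- character()
collect <- function(msg, ...) msgs <<- c(msgs, msg)

doc <- htmlTreeParse(page, asText = TRUE, useInternalNodes = TRUE,
                     error = collect)

# The document is still usable despite the missing end tags
dd.text <- xpathSApply(doc, "//dd", xmlValue)
print(dd.text)

# Inspect whatever the parser reported (possibly nothing)
print(msgs)
```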

HTH,
Tony Breyal

Not sure if my code was attached in that last post:

library(RCurl)
library(XML)
html <- getURL("http://www.omegahat.org/RSXML/index.html")
html.tree <- htmlTreeParse(html, useInternalNodes = TRUE, error =
function(...){})

Peng Yu wrote:
I'm trying the example on the above webpage, but I'm not sure why I
got the following error. Could you help take a look?

doc = xmlInternalTreeParse("index.html")
You are trying to parse an HTML document as if it were XML.
But HTML is often not well-formed.  So use htmlParse()
for a more forgiving parser.

Or use the RTidyHTML package (www.omegahat.org/RTidyHTML)
to make the HTML well-formed before passing it to xmlTreeParse()
(aka xmlInternalTreeParse()). That package is an interface to
libtidy.
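
To make the difference concrete, here is a minimal sketch contrasting the two parsers on the same input. The inline HTML is made up for illustration; xmlParse() is used as the strict XML parser and htmlParse() as the forgiving one, both from the XML package. The strict parser may still print its messages before the error is caught.

```r
library(XML)

# Hypothetical HTML with unclosed <li> tags: fine for a browser,
# but not well-formed XML
page <- "<html><body><ul><li>one<li>two</ul></body></html>"

# Strict XML parsing: expected to raise an error, which is caught here
strict <- tryCatch(xmlParse(page, asText = TRUE),
                   error = function(e) e)
xml.failed <- inherits(strict, "error")
print(xml.failed)

# Forgiving HTML parsing of the same text
doc <- htmlParse(page, asText = TRUE)
items <- xpathSApply(doc, "//li", xmlValue)
print(items)
```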

 D.