
Scraping a web page.

Hi Keith

 Of course, it doesn't necessarily matter how you get the job done,
as long as it actually works correctly.  But as a general approach,
using general tools can lead to more correct,
more robust, and more maintainable code.

Since htmlParse() in the XML package can both retrieve and parse the HTML document,
  doc = htmlParse(the.url)

is much more succinct than using curlPerform().
However, if you want to use RCurl, just use

    txt = getURLContent(the.url)

and that replaces

  h = basicTextGatherer()
  curlPerform(url = "http://www.omegahat.org/RCurl", writefunction = h$update)
  h$value()
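If you do fetch the page yourself with RCurl, the text can then be handed
straight to htmlParse() for parsing.  A minimal sketch of the two-step
version (the.url is a placeholder for whatever page you are scraping, and
this of course assumes network access):

```r
library(RCurl)
library(XML)

the.url = "http://www.omegahat.org/RCurl"   # placeholder URL

# Fetch the page as a single character string.
txt = getURLContent(the.url)

# Parse it.  asText = TRUE tells htmlParse() that its first argument
# is the document content itself, not a URL or file name.
doc = htmlParse(txt, asText = TRUE)
```

Splitting the steps like this is mainly useful when you need RCurl's
extra control over the request (headers, cookies, SSL options, etc.).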


Once you have parsed the HTML document, you can find the <a> nodes whose
href attribute starts with /en/Ships via

  hrefs = unlist(getNodeSet(doc, "//a[starts-with(@href, '/en/Ships')]/@href"))


The result is a character vector and you can extract the relevant substrings with
substring() or gsub() or any wrapper of those functions.
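For instance, assuming the hrefs all look like "/en/Ships/<something>",
one way to strip the common prefix with gsub() (a self-contained sketch
using a made-up sample vector, not data from your actual page):

```r
# A made-up sample of the kind of hrefs the XPath query returns.
hrefs = c("/en/Ships/Titanic", "/en/Ships/Lusitania", "/en/Ships/Queen_Mary")

# Drop the leading "/en/Ships/" prefix, keeping the trailing identifier.
ships = gsub("^/en/Ships/", "", hrefs)
ships
# "Titanic"    "Lusitania"  "Queen_Mary"
```

Anchoring the pattern with "^" ensures only the leading prefix is
removed, even if the same substring happened to occur later in an href.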

There are many benefits to parsing the HTML, including not falling foul of
assumptions such as "as far as I can tell the <a> tag is always on its own
line" turning out to be untrue.

    D.
On 5/15/12 4:06 AM, Keith Weintraub wrote: