
scraping with session cookies

Hi,

The key is to use the same curl handle for both the postForm() call
and the request for the data document.

site = u =
"http://www.wateroffice.ec.gc.ca/graph/graph_e.html?mode=text&stn=05ND012&prm1=3&syr=2012&smo=09&sday=15&eyr=2012&emo=09&eday=18"

library(RCurl)
curl = getCurlHandle(cookiefile = "", verbose = TRUE)

postForm(site, disclaimer_action = "I Agree", curl = curl)

Now we have the cookie in the curl handle so we can use that same curl handle
to request the data document:

txt = getURLContent(u, curl = curl)

Now we can use readHTMLTable() on the local document content:

library(XML)
tt = readHTMLTable(txt, asText = TRUE, which = 1, stringsAsFactors = FALSE)
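
Putting the steps together, the whole session-cookie workflow is just a few lines. This is a sketch of the same sequence as one script; the URL and the disclaimer_action field are taken from above, and it assumes the site still presents the disclaimer form in this way:

```r
library(RCurl)
library(XML)

# The same URL serves both the disclaimer form and the data page.
u <- "http://www.wateroffice.ec.gc.ca/graph/graph_e.html?mode=text&stn=05ND012&prm1=3&syr=2012&smo=09&sday=15&eyr=2012&emo=09&eday=18"

# cookiefile = "" enables in-memory cookie handling on this handle.
curl <- getCurlHandle(cookiefile = "", verbose = TRUE)

# Accept the disclaimer; the session cookie is stored in `curl`.
postForm(u, disclaimer_action = "I Agree", curl = curl)

# Reuse the same handle so the cookie is sent with the data request.
txt <- getURLContent(u, curl = curl)

# Parse the first HTML table out of the returned page content.
tt <- readHTMLTable(txt, asText = TRUE, which = 1, stringsAsFactors = FALSE)
```

The only state that matters is the cookie inside the handle, which is why every request goes through the same `curl` object.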



Rather than working out by hand how to post the form, I prefer to read
the form programmatically and generate an R function that does the
submission for me. The RHTMLForms package can do this.

library(RHTMLForms)
forms = getHTMLFormDescription(u, FALSE)
fun = createFunction(forms[[1]])

Then we can use

 fun(.curl = curl)

instead of

  postForm(site, disclaimer_action = "I Agree", curl = curl)

This helps to abstract the details of the form.
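
The two approaches combine naturally: the generated function replaces the hand-written postForm() call, but the cookie still lives in the shared handle. A sketch, assuming the page's first form is the disclaimer form:

```r
library(RCurl)
library(RHTMLForms)

u <- "http://www.wateroffice.ec.gc.ca/graph/graph_e.html?mode=text&stn=05ND012&prm1=3&syr=2012&smo=09&sday=15&eyr=2012&emo=09&eday=18"

# One handle for the whole session, with cookie handling enabled.
curl <- getCurlHandle(cookiefile = "", verbose = TRUE)

# Read the form descriptions from the page and build a submission function
# for the first form (assumed here to be the disclaimer form).
forms <- getHTMLFormDescription(u, FALSE)
fun <- createFunction(forms[[1]])

# Submitting via the generated function stores the session cookie in `curl`;
# the data page can then be fetched with the same handle, as before.
fun(.curl = curl)
txt <- getURLContent(u, curl = curl)
```

If the site later renames the form field, only the generated function changes, not your script.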

  D.
On 9/18/12 5:57 PM, CPV wrote: