
scraping with session cookies

8 messages · CPV, Duncan Temple Lang, Heramb Gadgil

Hi,

The key is to use the same curl handle
for both the postForm() call and for retrieving the data document.

site = u =
"http://www.wateroffice.ec.gc.ca/graph/graph_e.html?mode=text&stn=05ND012&prm1=3&syr=2012&smo=09&sday=15&eyr=2012&emo=09&eday=18"

library(RCurl)
curl = getCurlHandle(cookiefile = "", verbose = TRUE)

postForm(site, disclaimer_action = "I Agree", curl = curl)

Now we have the cookie in the curl handle so we can use that same curl handle
to request the data document:

txt = getURLContent(u, curl = curl)

Now we can use readHTMLTable() on the local document content:

library(XML)
tt = readHTMLTable(txt, asText = TRUE, which = 1, stringsAsFactors = FALSE)



Rather than working out how to post the form by hand, I prefer to read
the form programmatically and generate an R function that does the submission
for me. The RHTMLForms package can do this.

library(RHTMLForms)
forms = getHTMLFormDescription(u, FALSE)
fun = createFunction(forms[[1]])

Then we can use

 fun(.curl = curl)

instead of

  postForm(site, disclaimer_action = "I Agree", curl = curl)

This helps to abstract the details of the form.
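Putting the pieces together, the whole session can be sketched as one script. This is only a sketch, assuming the Water Office site still serves the same disclaimer form at the URL quoted above and that RCurl, XML, and RHTMLForms are all installed; it will not run without network access.

```r
# Sketch of the full session, assuming the site still serves the
# disclaimer form described above and that RCurl, XML, and RHTMLForms
# are installed.
library(RCurl)
library(XML)
library(RHTMLForms)

u = "http://www.wateroffice.ec.gc.ca/graph/graph_e.html?mode=text&stn=05ND012&prm1=3&syr=2012&smo=09&sday=15&eyr=2012&emo=09&eday=18"

# One curl handle for the whole session so the cookie persists
# across requests.
curl = getCurlHandle(cookiefile = "")

# Read the form description from the page and generate a submission
# function for the first form.
forms = getHTMLFormDescription(u, FALSE)
fun = createFunction(forms[[1]])

# Submit the disclaimer form; the session cookie ends up in the handle.
fun(.curl = curl)

# Fetch the data page with the same handle and parse the first table.
txt = getURLContent(u, curl = curl)
tt = readHTMLTable(txt, asText = TRUE, which = 1, stringsAsFactors = FALSE)
```

The one design point worth emphasizing is that every request reuses the same `curl` object; creating a fresh handle for the second request would discard the cookie and return the disclaimer page again.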

  D.
On 9/18/12 5:57 PM, CPV wrote:
You don't need to use getHTMLFormDescription() and createFunction();
you can use the postForm() call directly. However, the getHTMLFormDescription()
approach is more general. You do need the very latest version of the package
to deal with degenerate forms that have no inputs (other than button clicks).

 You can get the latest version of the RHTMLForms package
 from github

      git clone git@github.com:omegahat/RHTMLForms.git

 and that has the fixes for handling the degenerate forms with
 no arguments.
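Once cloned, the package can be installed from the local checkout; a minimal sketch, assuming git and R are on the PATH and you have SSH access to GitHub (the HTTPS URL works too):

```shell
# Clone the repository (requires network access) and install the
# package from the local working copy.
git clone git@github.com:omegahat/RHTMLForms.git
R CMD INSTALL RHTMLForms
```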

   D.
On 9/19/12 7:51 AM, CPV wrote: