scraping with session cookies

Hi ?

The key is that you want to use the same curl handle
for both the postForm() and for getting the data document.

site = u =
"http://www.wateroffice.ec.gc.ca/graph/graph_e.html?mode=text&stn=05ND012&prm1=3&syr=2012&smo=09&sday=15&eyr=2012&emo=09&eday=18"

library(RCurl)
curl = getCurlHandle(cookiefile = "", verbose = TRUE)

postForm(site, disclaimer_action = "I Agree", curl = curl)

Now we have the cookie in the curl handle so we can use that same curl handle
to request the data document:

txt = getURLContent(u, curl = curl)

Now we can use readHTMLTable() on the local document content:

library(XML)
tt = readHTMLTable(txt, asText = TRUE, which = 1, stringsAsFactors = FALSE)
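
The steps above can be collected into a single function. This is only a sketch: the name fetch_disclaimer_table() and its arguments are made up for illustration, and the form field disclaimer_action = "I Agree" is taken from the posts in this thread rather than checked against the live site. It assumes the RCurl and XML packages are installed.

```r
# Sketch only: fetch_disclaimer_table() is a hypothetical helper name.
# Assumes the RCurl and XML packages are installed.
fetch_disclaimer_table <- function(u, which = 1, verbose = FALSE) {
  # cookiefile = "" switches on libcurl's in-memory cookie handling,
  # so cookies set by one request are sent with the next one.
  curl <- RCurl::getCurlHandle(cookiefile = "", verbose = verbose)

  # Accept the disclaimer with this handle; the session cookie stays in it.
  RCurl::postForm(u, disclaimer_action = "I Agree", curl = curl)

  # Request the data page with the same handle, so the cookie is sent.
  txt <- RCurl::getURLContent(u, curl = curl)

  # Parse the HTML already held in memory instead of fetching the URL again.
  XML::readHTMLTable(txt, asText = TRUE, which = which,
                     stringsAsFactors = FALSE)
}
```

It would be called as tt = fetch_disclaimer_table(u). Creating the handle inside the function keeps the cookie local to one call; if several pages are needed in one session, passing in an existing handle would probably be the better design.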

Rather than knowing how to post the form, I like to read
the form programmatically and generate an R function to do the submission
for me. The RHTMLForms package can do this.

library(RHTMLForms)
forms = getHTMLFormDescription(u, FALSE)
fun = createFunction(forms[[1]])

Then we can use

 fun(.curl = curl)

instead of

  postForm(site, disclaimer_action = "I Agree", curl = curl)

This helps to abstract the details of the form.

  D.

On 9/18/12 5:57 PM, CPV wrote:
Hi, I am starting to code in R, and one of the things that I want to do is
scrape some data from the web.
The problem I am having is that I cannot get past the disclaimer
page (which produces a session cookie). I have been able to collect some
ideas and combine them in the code below, but I still don't get past the
disclaimer page.
I am trying to agree to the disclaimer with postForm() and write the cookie
to a file, but I cannot do it successfully.
The webpage cookies are written to the file, but the value is FALSE. So
any ideas of what I should do, or what I am doing wrong?
Thank you for your help,

library(RCurl)
library(XML)

site <- "http://www.wateroffice.ec.gc.ca/graph/graph_e.html?mode=text&stn=05ND012&prm1=3&syr=2012&smo=09&sday=15&eyr=2012&emo=09&eday=18"

postForm(site, disclaimer_action="I Agree")

cf <- "cookies.txt"

no_cookie <- function() {
        curlHandle <- getCurlHandle(cookiefile=cf, cookiejar=cf)
        getURL(site, curl=curlHandle)

        rm(curlHandle)
        gc()
}

if ( file.exists(cf) == TRUE ) {
        file.create(cf)
        no_cookie()
}
allTables <- readHTMLTable(site)
allTables

______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

You don't need to use the  getHTMLFormDescription() and createFunction().
Instead, you can use the postForm() call.  However, getHTMLFormDescription(),
etc. is more general. But you need the very latest version of the package
to deal with degenerate forms that have no inputs (other than button clicks).
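
One way to combine the two approaches is to try the generated form function first and fall back to a plain postForm() call if it fails. This is a sketch under assumptions: accept_disclaimer() is a hypothetical helper name, and the field disclaimer_action = "I Agree" is the one used earlier in this thread.

```r
# Sketch only: accept_disclaimer() is a hypothetical helper name.
# Assumes RCurl is installed; RHTMLForms is used only if it loads and works.
accept_disclaimer <- function(u, curl) {
  used_form <- tryCatch({
    # Read the form description and build a submission function from it.
    forms <- RHTMLForms::getHTMLFormDescription(u, FALSE)
    fun <- RHTMLForms::createFunction(forms[[1]])
    fun(.curl = curl)
    TRUE
  }, error = function(e) {
    # Covers a missing package as well as errors raised by createFunction().
    message("Form-based submission failed: ", conditionMessage(e))
    FALSE
  })

  if (!used_form) {
    # Fallback: post the known field directly with the same curl handle.
    RCurl::postForm(u, disclaimer_action = "I Agree", curl = curl)
  }
  invisible(used_form)
}
```

The function returns TRUE invisibly when the generated function was used and FALSE when it fell back to postForm(). Either way the cookie ends up in the handle passed as curl, so the same handle can then be given to getURLContent().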

 You can get the latest version of the RHTMLForms package
 from github

      git clone git@github.com:omegahat/RHTMLForms.git

 and that has the fixes for handling the degenerate forms with
 no arguments.
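
 After cloning, the package can probably be installed from the local
 source directory from within R. This is a sketch, not a tested recipe:
 install_rhtmlforms_clone() is a hypothetical helper name, and it assumes
 the repository root is itself the package source directory and that the
 packages RHTMLForms depends on are already installed.

```r
# Sketch only: install_rhtmlforms_clone() is a hypothetical helper name.
install_rhtmlforms_clone <- function(path = "RHTMLForms") {
  # Fail early if the clone is not where we expect it.
  if (!dir.exists(path)) {
    stop("No cloned source directory found at: ", path)
  }
  # repos = NULL with type = "source" installs from a local source directory.
  install.packages(path, repos = NULL, type = "source")
}
```

 Calling install_rhtmlforms_clone() from the directory that contains the
 clone would then build and install the package.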

   D.

Thank you for your help, Duncan.

I have been trying what you suggested; however, I am getting an error when
trying to create the function with fun <- createFunction(forms[[1]]).
It says: Error in isHidden | hasDefault :
operations are possible only for numeric, logical or complex types

On Wed, Sep 19, 2012 at 12:15 AM, Duncan Temple Lang <
dtemplelang at ucdavis.edu> wrote:

[...]

On 9/18/12 5:57 PM, CPV wrote:
[...]

