scraping with session cookies
8 messages · CPV, Duncan Temple Lang, Heramb Gadgil
Hi,

The key is to use the same curl handle both for the postForm() call and for fetching the data document.

    site = u = "http://www.wateroffice.ec.gc.ca/graph/graph_e.html?mode=text&stn=05ND012&prm1=3&syr=2012&smo=09&sday=15&eyr=2012&emo=09&eday=18"

    library(RCurl)
    curl = getCurlHandle(cookiefile = "", verbose = TRUE)

    # Pass the handle explicitly so the cookie set by the disclaimer form is kept in it.
    postForm(site, disclaimer_action = "I Agree", curl = curl)

The cookie is now stored in the curl handle, so we can use that same handle to request the data document:

    txt = getURLContent(u, curl = curl)

Now we can use readHTMLTable() on the local document content:

    library(XML)
    tt = readHTMLTable(txt, asText = TRUE, which = 1, stringsAsFactors = FALSE)

Rather than working out by hand how to post the form, I like to read the form programmatically and generate an R function to do the submission for me. The RHTMLForms package can do this.

    library(RHTMLForms)
    forms = getHTMLFormDescription(u, FALSE)
    fun = createFunction(forms[[1]])

Then we can use

    fun(.curl = curl)

instead of

    postForm(site, disclaimer_action = "I Agree", curl = curl)

This helps to abstract the details of the form.

D.
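The steps in the reply above can be collected into one function. This is only a sketch: the name fetch_station_table and its arguments are hypothetical, it assumes the RCurl and XML packages are installed, and it assumes the site still accepts the disclaimer_action field as it did in this thread. It shows the handle being passed to every request so that the session cookie is reused.

```r
# Hypothetical helper (not part of RCurl or XML): accept the disclaimer,
# then fetch and parse the first HTML table from the same URL.
fetch_station_table <- function(url, agree = "I Agree") {
  # One handle for the whole session; cookiefile = "" turns on cookie handling
  # in memory without reading or writing a cookie file.
  curl <- RCurl::getCurlHandle(cookiefile = "")

  # Submit the disclaimer form on this handle so the cookie is stored in it.
  RCurl::postForm(url, disclaimer_action = agree, curl = curl)

  # Request the data page with the same handle, which now carries the cookie.
  txt <- RCurl::getURLContent(url, curl = curl)

  # Parse the returned HTML text rather than re-downloading the page.
  XML::readHTMLTable(txt, asText = TRUE, which = 1, stringsAsFactors = FALSE)
}

# Usage (needs network access and the site behaving as described above):
# tt <- fetch_station_table(u)
```

Calling readHTMLTable() directly on the URL, by contrast, makes a fresh request that does not carry the cookie held in the handle.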
On 9/18/12 5:57 PM, CPV wrote:
Hi,

I am just starting to code in R, and one of the things I want to do is scrape some data from the web. The problem is that I cannot get past the disclaimer page (which sets a session cookie). I have collected some ideas and combined them in the code below, but I still don't get past the disclaimer page. I am trying to agree to the disclaimer with postForm() and write the cookie to a file, but I cannot do it successfully. The webpage cookies are written to the file, but the value is FALSE. Any ideas about what I should do, or what I am doing wrong?

Thank you for your help,

    library(RCurl)
    library(XML)

    site <- "http://www.wateroffice.ec.gc.ca/graph/graph_e.html?mode=text&stn=05ND012&prm1=3&syr=2012&smo=09&sday=15&eyr=2012&emo=09&eday=18"

    postForm(site, disclaimer_action="I Agree")

    cf <- "cookies.txt"

    no_cookie <- function() {
      curlHandle <- getCurlHandle(cookiefile=cf, cookiejar=cf)
      getURL(site, curl=curlHandle)
      rm(curlHandle)
      gc()
    }

    if ( file.exists(cf) == TRUE ) {
      file.create(cf)
      no_cookie()
    }

    allTables <- readHTMLTable(site)
    allTables
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
You don't need to use getHTMLFormDescription() and createFunction(); you can use the postForm() call instead. However, the getHTMLFormDescription() approach is more general. To use it here, you need the very latest version of the RHTMLForms package, which can deal with degenerate forms that have no inputs (other than button clicks).

You can get the latest version of the RHTMLForms package from GitHub:

    git clone git@github.com:omegahat/RHTMLForms.git

That version has the fixes for handling degenerate forms with no arguments.
D.
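As an alternative to cloning by hand, the package can usually be installed from within R. The sketch below is an assumption and not something suggested in the thread: it relies on the remotes package being installed and on the omegahat/RHTMLForms repository still being public and buildable. The wrapper name install_rhtmlforms is hypothetical.

```r
# Hypothetical wrapper: install RHTMLForms straight from its GitHub repository.
install_rhtmlforms <- function(repo = "omegahat/RHTMLForms") {
  # remotes is an assumption here; it is not mentioned in the original thread.
  if (!requireNamespace("remotes", quietly = TRUE)) {
    stop("the 'remotes' package is needed for this sketch")
  }
  remotes::install_github(repo)
}

# Usage (downloads and builds the package, so it needs network access):
# install_rhtmlforms()
```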
On 9/19/12 7:51 AM, CPV wrote:
Thank you for your help, Duncan. I have been trying what you suggested; however, I am getting an error when trying to create the function:

    fun <- createFunction(forms[[1]])

It says:

    Error in isHidden | hasDefault : operations are possible only for numeric, logical or complex types