scraping with session cookies
Hi,

The key is that you want to use the same curl handle for both the postForm() call and for getting the data document.

library(RCurl)

site = u = "http://www.wateroffice.ec.gc.ca/graph/graph_e.html?mode=text&stn=05ND012&prm1=3&syr=2012&smo=09&sday=15&eyr=2012&emo=09&eday=18"

curl = getCurlHandle(cookiefile = "", verbose = TRUE)
postForm(site, disclaimer_action = "I Agree", curl = curl)

(Note the curl = curl argument: without it, postForm() creates its own temporary handle and the session cookie is lost.) Now we have the cookie in the curl handle, so we can use that same handle to request the data document:

txt = getURLContent(u, curl = curl)

Now we can use readHTMLTable() on the local document content:

library(XML)
tt = readHTMLTable(txt, asText = TRUE, which = 1, stringsAsFactors = FALSE)

Rather than working out by hand how to post the form, I like to read the form programmatically and generate an R function that does the submission for me. The RHTMLForms package can do this:

library(RHTMLForms)
forms = getHTMLFormDescription(u, FALSE)
fun = createFunction(forms[[1]])

Then we can use

fun(.curl = curl)

instead of

postForm(site, disclaimer_action = "I Agree", curl = curl)

This helps to abstract the details of the form.

D.
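Putting the steps above together, here is a minimal end-to-end sketch. It assumes the Environment Canada URL above is still live and that the disclaimer form still uses a field named disclaimer_action, and that table 1 is the one of interest; adjust which = 1 as needed for the actual page.

```r
# Sketch: accept the disclaimer, then fetch the data with the same handle.
# Assumes the URL and the disclaimer_action form field from the thread above.
library(RCurl)
library(XML)

u <- "http://www.wateroffice.ec.gc.ca/graph/graph_e.html?mode=text&stn=05ND012&prm1=3&syr=2012&smo=09&sday=15&eyr=2012&emo=09&eday=18"

# cookiefile = "" turns on in-memory cookie handling for this handle
curl <- getCurlHandle(cookiefile = "", followlocation = TRUE)

# Submit the disclaimer form; the session cookie is stored in `curl`
postForm(u, disclaimer_action = "I Agree", curl = curl)

# Reuse the same handle so the cookie is sent with this request
txt <- getURLContent(u, curl = curl)

# Parse the table out of the returned HTML text (not the URL)
tt <- readHTMLTable(txt, asText = TRUE, which = 1, stringsAsFactors = FALSE)
head(tt)
```

The important design point is that the cookie lives in the handle, not in any file: every request that must share the session has to pass the same curl object.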
On 9/18/12 5:57 PM, CPV wrote:
Hi,

I am starting to code in R, and one of the things I want to do is scrape some data from the web. The problem I am having is that I cannot get past the disclaimer page (which produces a session cookie). I have collected some ideas and combined them in the code below, but I still don't get past the disclaimer page. I am trying to accept the disclaimer with postForm() and write the cookie to a file, but I cannot do it successfully. The webpage cookies are written to the file, but the value is FALSE. Any ideas of what I should do, or what I am doing wrong?

Thank you for your help,

library(RCurl)
library(XML)

site <- "http://www.wateroffice.ec.gc.ca/graph/graph_e.html?mode=text&stn=05ND012&prm1=3&syr=2012&smo=09&sday=15&eyr=2012&emo=09&eday=18"

postForm(site, disclaimer_action="I Agree")

cf <- "cookies.txt"

no_cookie <- function() {
    curlHandle <- getCurlHandle(cookiefile=cf, cookiejar=cf)
    getURL(site, curl=curlHandle)
    rm(curlHandle)
    gc()
}

if ( file.exists(cf) == TRUE ) {
    file.create(cf)
    no_cookie()
}

allTables <- readHTMLTable(site)
allTables
______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.