RCurl unable to download a particular web page -- what is so special about this web page?
Thank you Duncan. I remember seeing in your documentation that you have used this 'verbose=TRUE' argument in functions before, when trying to see what is going on. This is good. However, I have not been able to get it to work for me. Does the output appear in R, or do you use some other external window (e.g. an MS-DOS window)?
library(RCurl)
my.url <- 'http://www.nytimes.com/2009/01/07/technology/business-computing/07program.html?_r=2'
getURL(my.url, verbose = TRUE)
[1] ""
I am having a problem with a new web page (http://uk.youtube.com/), but if I can get this verbose output to work, then I think I will be able to google the right action to take based on the information it gives. Many thanks for your time, C.C.
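One way to see the verbose traffic inside R itself, rather than relying on where libcurl writes to stderr (which on Windows may not show up in the Rgui console), is to collect it with RCurl's debugGatherer() and pass its update function as the debugfunction argument. A minimal sketch, using the uk.youtube.com URL mentioned above:

    library(RCurl)

    d <- debugGatherer()                     # collects the header/data traffic
    page <- getURL("http://uk.youtube.com/",
                   verbose = TRUE,
                   debugfunction = d$update) # verbose info goes to d, not stderr
    cat(d$value())                           # print the captured request/response headers

This way the request and response headers end up in an ordinary R character vector, so you can inspect them regardless of which console or GUI you are running under.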
On 26 Jan, 16:12, Duncan Temple Lang <dun... at wald.ucdavis.edu> wrote:
clair.crossup... at googlemail.com wrote:
Dear R-help,
There seems to be a web page I am unable to download using RCurl. I don't understand why it won't download:
library(RCurl)
my.url <- "http://www.nytimes.com/2009/01/07/technology/business-computing/07pro..."
getURL(my.url)
[1] ""
I like the irony that RCurl seems to have difficulties downloading an article about R. Good thing it is just a matter of additional arguments to getURL(), or it would be bad news. The followlocation parameter defaults to FALSE, so

    getURL(my.url, followlocation = TRUE)

gets what you want. The way I found this is

    getURL(my.url, verbose = TRUE)

and take a look at the information being sent from R and received by R from the server. This gives

    * About to connect() to www.nytimes.com port 80 (#0)
    *   Trying 199.239.136.200...
    * connected
    * Connected to www.nytimes.com (199.239.136.200) port 80 (#0)
    > GET /2009/01/07/technology/business-computing/07program.html?_r=2 HTTP/1.1
    Host: www.nytimes.com
    Accept: */*
    < HTTP/1.1 301 Moved Permanently
    < Server: Sun-ONE-Web-Server/6.1
    < Date: Mon, 26 Jan 2009 16:10:51 GMT
    < Content-length: 0
    < Content-type: text/html
    < Location: http://www.nytimes.com/glogin?URI=http://www.nytimes.com/2009/01/07/t...
    <

And the 301 is the critical thing here.

 D.
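The 301 can also be confirmed programmatically, without reading the verbose log by eye, by capturing the response headers with RCurl's basicHeaderGatherer(). A sketch along those lines (the URL is the one from this thread):

    library(RCurl)

    my.url <- "http://www.nytimes.com/2009/01/07/technology/business-computing/07program.html?_r=2"

    h <- basicHeaderGatherer()
    getURL(my.url, headerfunction = h$update)  # body is "" because of the redirect
    h$value()["status"]                        # the HTTP status code, "301" here

    # Once the redirect is identified, follow it:
    page <- getURL(my.url, followlocation = TRUE)

Since libcurl is doing the work, followlocation = TRUE simply tells it to chase the Location: header for you instead of returning the empty 301 response body.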
Other web pages download fine, but this is the first time I have been unable to download a web page using the very nice RCurl package. While I can download the web page using RDCOMClient, I would like to understand why it doesn't work as above, please.
library(RDCOMClient)
my.url <- "http://www.nytimes.com/2009/01/07/technology/business-computing/07pro..."
ie <- COMCreate("InternetExplorer.Application")
txt <- list()
ie$Navigate(my.url)
NULL
while(ie[["Busy"]]) Sys.sleep(1)
txt[[my.url]] <- ie[["document"]][["body"]][["innerText"]]
txt
$`http://www.nytimes.com/2009/01/07/technology/business-computing/07program.html?_r=2`
[1] "Skip to article Try Electronic Edition Log ...
Many thanks for your time, C.C
Windows Vista, running with administrator privileges.
sessionInfo()
R version 2.8.1 (2008-12-22) i386-pc-mingw32
locale:
LC_COLLATE=English_United Kingdom.1252;LC_CTYPE=English_United Kingdom.1252;LC_MONETARY=English_United Kingdom.1252;LC_NUMERIC=C;LC_TIME=English_United Kingdom.1252
attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base
other attached packages: [1] RDCOMClient_0.92-0 RCurl_0.94-0
loaded via a namespace (and not attached): [1] tools_2.8.1
______________________________________________
R-h... at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.