RCurl unable to download a particular web page -- what is so special about this web page?
Thank you. The output I get from that example is below:
d = debugGatherer()
getURL("http://uk.youtube.com",
       debugfunction = d$update, verbose = TRUE)
[1] ""
d$value()
text
"About to connect() to uk.youtube.com port 80 (#0)\n Trying 208.117.236.72... connected\nConnected to uk.youtube.com (208.117.236.72) port 80 (#0)\nConnection #0 to host uk.youtube.com left intact\n"
headerIn
"HTTP/1.1 400 Bad Request\r\nVia: 1.1 PFO-FIREWALL\r\nConnection: Keep-Alive\r\nProxy-Connection: Keep-Alive\r\nTransfer-Encoding: chunked\r\nExpires: Tue, 27 Apr 1971 19:44:06 EST\r\nDate: Tue, 27 Jan 2009 15:31:25 GMT\r\nContent-Type: text/plain\r\nServer: Apache\r\nX-Content-Type-Options: nosniff\r\nCache-Control: no-cache\r\nCneonction: close\r\n\r\n"
headerOut
"GET / HTTP/1.1\r\nHost: uk.youtube.com\r\nAccept: */*\r\n\r\n"
dataIn
"0\r\n\r\n"
dataOut
""
So the critical information from this is the '400 Bad Request'. A
Google search defines this for me as:
The request could not be understood by the server due to malformed
syntax. The client SHOULD NOT repeat the request without
modifications.
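A minimal sketch of how the status code and header fields could be pulled out of the headerIn text, using base R only. The string below is a shortened copy of the response header shown above; the variable names are illustrative and are not part of RCurl.

```r
# Shortened copy of the headerIn text captured by the debug gatherer above.
header_in <- paste0(
  "HTTP/1.1 400 Bad Request\r\n",
  "Via: 1.1 PFO-FIREWALL\r\n",
  "Connection: Keep-Alive\r\n",
  "Content-Type: text/plain\r\n",
  "Server: Apache\r\n",
  "\r\n"
)

# HTTP header lines are separated by CRLF; the first line is the status line.
header_lines <- strsplit(header_in, "\r\n", fixed = TRUE)[[1]]
status_line <- header_lines[1]

# The status code is the second space-separated token of the status line.
status_code <- as.integer(strsplit(status_line, " ", fixed = TRUE)[[1]][2])

# Codes in the 400-499 range mean the request itself was rejected.
is_client_error <- status_code >= 400L && status_code < 500L

# Turn the remaining "Name: value" lines into a named character vector.
field_lines <- header_lines[-1]
field_lines <- field_lines[nzchar(field_lines)]
fields <- sub("^[^:]+:[[:space:]]*", "", field_lines)
names(fields) <- sub(":.*$", "", field_lines)

print(status_code)
print(is_client_error)
print(fields[["Via"]])
```

The Via field shows that the response passed through an intermediary, so it may be worth checking whether the 400 comes from that device rather than from the site itself. That is only a possibility suggested by the header, not a confirmed diagnosis.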
Looking through both sort(listCurlOptions()) and
http://curl.haxx.se/libcurl/c/curl_easy_setopt.htm doesn't really
help me this time (unless I missed something). Any advice?
Thank you for your time,
C.C
P.S. I can get the download to work if I use:
toString(readLines("http://www.uk.youtube.com"))
[1] "<html>, \t<head>, \t\t<title>OpenDNS</title>, \t</head>, ,
\t<body id=\"mainbody\" onLoad=\"testforbanner();\" style=\"margin:
0px;\">, \t\t<script language=\"JavaScript\">, \t\t\tfunction
testforbanner() {, \t\t\t\tvar width;, \t\t\t\tvar height;, \t\t\t
\tvar x = 0;, \t\t\t\tvar isbanner = false;, \t\t\t\tvar bannersizes =
new Array(16), \t\t\t\tbannersizes[0] = [etc]
On 27 Jan, 13:52, Duncan Temple Lang <dun... at wald.ucdavis.edu> wrote:
clair.crossup... at googlemail.com wrote:
Thank you Duncan.
I remember seeing in your documentation that you have used this 'verbose=TRUE' argument in functions before when trying to see what is going on. This is good. However, I have not been able to get it to work for me. Does the output appear in R or do you use some other external window (i.e. MS DOS window?)?
The libcurl code typically defaults to print on the console.
So on the Windows GUI, this will not show up. Using
a shell (MS DOS window or Unix-like shell) should
cause the output to be displayed.
A more general way however is to use the debugfunction
option.
d = debugGatherer()
getURL("http://uk.youtube.com",
       debugfunction = d$update, verbose = TRUE)
When this completes, use
  d$value()
and you have the entire contents that would be displayed on the console.
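As a rough illustration of reading that result: the output earlier in this thread shows d$value() as a named character vector with elements text, headerIn, headerOut, dataIn and dataOut. The sketch below uses a hand-written stand-in (debug_info, a hypothetical name with shortened contents) so that it runs without a network connection.

```r
# Stand-in for d$value(): same element names as in the output shown
# earlier in this thread, with shortened contents.
debug_info <- c(
  text      = "Connected to uk.youtube.com (208.117.236.72) port 80 (#0)\n",
  headerIn  = "HTTP/1.1 400 Bad Request\r\nServer: Apache\r\n\r\n",
  headerOut = "GET / HTTP/1.1\r\nHost: uk.youtube.com\r\nAccept: */*\r\n\r\n",
  dataIn    = "0\r\n\r\n",
  dataOut   = ""
)

# cat() writes the embedded line breaks as real line breaks, so each
# piece reads like the console trace; carriage returns are dropped first.
for (part in c("headerOut", "headerIn")) {
  cat("----", part, "----\n")
  cat(gsub("\r", "", debug_info[[part]], fixed = TRUE))
}
```

With a real gatherer, debug_info would simply be replaced by d$value().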
  D.
library(RCurl)
my.url <- 'http://www.nytimes.com/2009/01/07/technology/business-computing/07pro...'
getURL(my.url, verbose = TRUE)
[1] ""
I am having a problem with a new web page (http://uk.youtube.com/), but if I can get this verbose option to work, then I think I will be able to google the right action to take based on the information it gives.
Many thanks for your time, C.C.
On 26 Jan, 16:12, Duncan Temple Lang <dun... at wald.ucdavis.edu> wrote:
clair.crossup... at googlemail.com wrote:
Dear R-help, There seems to be a web page I am unable to download using RCurl. I don't understand why it won't download:
library(RCurl)
my.url <- "http://www.nytimes.com/2009/01/07/technology/business-computing/07pro..."
getURL(my.url)
[1] ""
I like the irony that RCurl seems to have difficulties downloading an article about R. Good thing it is just a matter of additional arguments to getURL() or it would be bad news.
The followlocation parameter defaults to FALSE, so
  getURL(my.url, followlocation = TRUE)
gets what you want.
The way I found this is
  getURL(my.url, verbose = TRUE)
and take a look at the information being sent from R and received by R from the server.
This gives
* About to connect() to www.nytimes.com port 80 (#0)
*   Trying 199.239.136.200... * connected
* Connected to www.nytimes.com (199.239.136.200) port 80 (#0)
> GET /2009/01/07/technology/business-computing/07program.html?_r=2 HTTP/1.1
Host: www.nytimes.com
Accept: */*

< HTTP/1.1 301 Moved Permanently
< Server: Sun-ONE-Web-Server/6.1
< Date: Mon, 26 Jan 2009 16:10:51 GMT
< Content-length: 0
< Content-type: text/html
< Location: http://www.nytimes.com/glogin?URI=http://www.nytimes.com/2009/01/07/t...
<
And the 301 is the critical thing here.
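To make the role of the 301 concrete, here is a small base R sketch that spots a redirect in a response header and reads the Location field, which is the lookup that followlocation = TRUE asks libcurl to act on automatically. The Location value is a hypothetical placeholder, because the real one is truncated in the trace above, and the variable names are illustrative only.

```r
# Response header resembling the 301 trace above. The Location value is a
# hypothetical placeholder; the real one is truncated in the trace.
header_in <- paste0(
  "HTTP/1.1 301 Moved Permanently\r\n",
  "Content-length: 0\r\n",
  "Content-type: text/html\r\n",
  "Location: http://example.com/moved-here\r\n",
  "\r\n"
)

header_lines <- strsplit(header_in, "\r\n", fixed = TRUE)[[1]]

# Pull the three-digit status code out of the status line.
status_code <- as.integer(
  sub("^HTTP/[0-9.]+ ([0-9]{3}).*$", "\\1", header_lines[1])
)

# These status codes tell the client to look at the Location header.
is_redirect <- status_code %in% c(301L, 302L, 303L, 307L, 308L)

# Header names are case-insensitive, so match "Location" loosely.
location_line <- grep("^location:", header_lines,
                      ignore.case = TRUE, value = TRUE)
redirect_target <- sub("^[^:]+:[[:space:]]*", "", location_line)

print(is_redirect)
print(redirect_target)
```

Without followlocation, the client stops at this response, and since its Content-length is 0 the body comes back as the empty string seen in the original question.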
  D.
Other web pages are OK to download, but this is the first time I have been unable to download a web page using the very nice RCurl package. While I can download the web page using RDCOMClient, I would like to understand why it doesn't work as above, please.
library(RDCOMClient)
my.url <- "http://www.nytimes.com/2009/01/07/technology/business-computing/07pro..."
ie <- COMCreate("InternetExplorer.Application")
txt <- list()
ie$Navigate(my.url)
NULL
while(ie[["Busy"]]) Sys.sleep(1)
txt[[my.url]] <- ie[["document"]][["body"]][["innerText"]]
txt
$`http://www.nytimes.com/2009/01/07/technology/business-computing/07program.html?_r=2`
[1] "Skip to article Try Electronic Edition Log ...
Many thanks for your time,
C.C
Windows Vista, running with administrator privileges.
sessionInfo()
R version 2.8.1 (2008-12-22)
i386-pc-mingw32

locale:
LC_COLLATE=English_United Kingdom.1252;LC_CTYPE=English_United Kingdom.1252;LC_MONETARY=English_United Kingdom.1252;LC_NUMERIC=C;LC_TIME=English_United Kingdom.1252

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] RDCOMClient_0.92-0 RCurl_0.94-0

loaded via a namespace (and not attached):
[1] tools_2.8.1
______________________________________________
R-h... at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.