RCurl unable to download a particular web page -- what is so special about this web page?
Thank you Duncan. I remember seeing in your documentation that you have used this 'verbose=TRUE' argument in functions before, when trying to see what is going on. This is good. However, I have not been able to get it to work for me. Does the output appear in R, or do you use some other external window (e.g. an MS-DOS window)?
library(RCurl)
my.url <- 'http://www.nytimes.com/2009/01/07/technology/business-computing/07program.html?_r=2'
getURL(my.url, verbose = TRUE)
[1] ""
I am having a problem with a new web page (http://uk.youtube.com/), but if I can get this verbose output to work, then I think I will be able to google the right action to take based on the information it gives. Many thanks for your time, C.C.
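One way to see the verbose traffic inside R itself, rather than relying on where libcurl writes to stderr (which on Windows may not show up in the Rgui console), is to collect it with RCurl's debugGatherer() and pass its update function as the debugfunction argument. A minimal sketch, using the uk.youtube.com URL mentioned above:

    library(RCurl)

    d <- debugGatherer()                     # collects the header/data traffic
    page <- getURL("http://uk.youtube.com/",
                   verbose = TRUE,
                   debugfunction = d$update) # verbose info goes to d, not stderr
    cat(d$value())                           # print the captured request/response headers

This way the request and response headers end up in an ordinary R character vector, so you can inspect them regardless of which console or GUI you are running under.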
On 26 Jan, 16:12, Duncan Temple Lang <dun... at wald.ucdavis.edu> wrote:
clair.crossup... at googlemail.com wrote:
Dear R-help,
There seems to be a web page I am unable to download using RCurl. I don't understand why it won't download:
library(RCurl)
my.url <- "http://www.nytimes.com/2009/01/07/technology/business-computing/07pro..."
getURL(my.url)
[1] ""
I like the irony that RCurl seems to have difficulties downloading an article about R. Good thing it is just a matter of additional arguments to getURL(), or it would be bad news. The followlocation parameter defaults to FALSE, so

    getURL(my.url, followlocation = TRUE)

gets what you want. The way I found this is

    getURL(my.url, verbose = TRUE)

and take a look at the information being sent from R and received by R from the server. This gives

    * About to connect() to www.nytimes.com port 80 (#0)
    *   Trying 199.239.136.200...
    * connected
    * Connected to www.nytimes.com (199.239.136.200) port 80 (#0)
    > GET /2009/01/07/technology/business-computing/07program.html?_r=2 HTTP/1.1
    Host: www.nytimes.com
    Accept: */*
    < HTTP/1.1 301 Moved Permanently
    < Server: Sun-ONE-Web-Server/6.1
    < Date: Mon, 26 Jan 2009 16:10:51 GMT
    < Content-length: 0
    < Content-type: text/html
    < Location: http://www.nytimes.com/glogin?URI=http://www.nytimes.com/2009/01/07/t...
    <

And the 301 is the critical thing here.

 D.
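The 301 can also be confirmed programmatically, without reading the verbose log by eye, by capturing the response headers with RCurl's basicHeaderGatherer(). A sketch along those lines (the URL is the one from this thread):

    library(RCurl)

    my.url <- "http://www.nytimes.com/2009/01/07/technology/business-computing/07program.html?_r=2"

    h <- basicHeaderGatherer()
    getURL(my.url, headerfunction = h$update)  # body is "" because of the redirect
    h$value()["status"]                        # the HTTP status code, "301" here

    # Once the redirect is identified, follow it:
    page <- getURL(my.url, followlocation = TRUE)

Since libcurl is doing the work, followlocation = TRUE simply tells it to chase the Location: header for you instead of returning the empty 301 response body.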
Other web pages download fine, but this is the first time I have been unable to download a web page using the very nice RCurl package. While I can download the web page using RDCOMClient, I would like to understand why it doesn't work as above, please.
library(RDCOMClient)
my.url <- "http://www.nytimes.com/2009/01/07/technology/business-computing/07pro..."
ie <- COMCreate("InternetExplorer.Application")
txt <- list()
ie$Navigate(my.url)
NULL
while(ie[["Busy"]]) Sys.sleep(1)
txt[[my.url]] <- ie[["document"]][["body"]][["innerText"]]
txt
$`http://www.nytimes.com/2009/01/07/technology/business-computing/07program.html?_r=2`
[1] "Skip to article Try Electronic Edition Log ...
Many thanks for your time, C.C
Windows Vista, running with administrator privileges.
sessionInfo()
R version 2.8.1 (2008-12-22) i386-pc-mingw32
locale:
LC_COLLATE=English_United Kingdom.1252;LC_CTYPE=English_United Kingdom.1252;LC_MONETARY=English_United Kingdom.1252;LC_NUMERIC=C;LC_TIME=English_United Kingdom.1252
attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base
other attached packages: [1] RDCOMClient_0.92-0 RCurl_0.94-0
loaded via a namespace (and not attached): [1] tools_2.8.1
______________________________________________
R-h... at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.