Web scraping - getURL with delay
2 messages · Kasper Christensen, Jeff Newmiller
Perhaps ?Sys.sleep between scrapes. If this slows things down too much you may be able to parallelize by host site with ?mclapply.
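Jeff's Sys.sleep suggestion could look roughly like this. This is only a sketch: the `fetch` argument is an addition made here (defaulting to RCurl's getURL) so the loop can be tried out without network access; it is not part of the original advice.

```r
# A minimal sketch of the Sys.sleep idea: pause `delay` seconds between
# requests. `fetch` is an injectable download function (an assumption of
# this sketch), defaulting to RCurl::getURL.
polite_get <- function(urls, delay = 5, fetch = RCurl::getURL) {
  results <- character(length(urls))
  for (i in seq_along(urls)) {
    results[i] <- fetch(urls[i])
    if (i < length(urls)) Sys.sleep(delay)  # no pause needed after the last one
  }
  results
}
```

For the 6000-page job below, something like `pages <- polite_get(addresses$Link)` would fetch them one at a time with a pause in between; ?mclapply could then split the work across host sites if that turns out too slow.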
---------------------------------------------------------------------------
Jeff Newmiller
DCN: jdnewmil at dcn.davis.ca.us
Research Engineer (Solar/Batteries/Software/Embedded Controllers)
---------------------------------------------------------------------------
Sent from my phone. Please excuse my brevity.
Kasper Christensen <kasper2304 at gmail.com> wrote:
Hi R people.
I'm currently trying to construct a piece of R code that can read a list
of webpages I have stored in a csv file and save the content of each
webpage into a separate txt file. I want to retrieve a total of 6000
threads posted at a forum, to try to build/train a classifier that can
tell me whether a thread contains valuable information.
*Until now* I have managed to get the following code to work:
> library(foreign)
> library(RCurl)
Loading required package: bitops
> addresses <- read.csv("~/Extract post - forum.csv")
> for (i in addresses) full.text <- getURL(i)
> text.sub <- gsub("<.+?>", "", full.text)
> text <- data.frame(text.sub)
> outpath <- "~/forum - RawData"
> x <- 1:nrow(text)
> for(i in x) {
+   write(as.character(text[i,1]), file = paste(outpath, "/", i, ".txt", sep = ""))
+ }
(I have both Mac OS and Windows)
This piece of code is not my own work, so I send a warm thank you to
Christopher Gandrud and co-authors for providing it.
The problem
The code works like a charm, looking up all the different addresses I
have stored in my csv file. I constructed the csv file as:

Link
"webaddress 1"
"webaddress 2"
"webaddress n"
The problem is that I get empty output files and files saying "Server
overloaded". However, I do also get files that contain the intended
information. The pattern of "bad" and "good" files is different each time
I run the code with the total n, telling me that it is not the code that
is the problem. Needless to say, it is probably my many requests that are
causing the overload, and as I am pretty new in the area I did not expect
this to be a problem. When I realized that it WAS a problem, I tried
reducing the number of requests to 100 at a time, which gave me all text
files containing the info I wanted.
Therefore I am looking for some kind of solution to this problem. My own
best idea would be to build something into the code that makes it send x
requests at a given interval z (5 seconds maybe), until I have retrieved
the total n of webpages in the csv file. If it fails to retrieve a
webpage, it would be nice to sort the "bad" text files into a "redo"
folder which could then be run afterwards.
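The batching-with-redo idea could be sketched roughly as follows. Note the assumptions: the csv column is named "Link" (as in the file above), a "bad" page is detected by an empty body or the text "Server overloaded" (the symptoms described above), and the `fetch` argument is an addition made here so the function can be tried without network access. Instead of a redo folder on disk, this sketch keeps the failed indices in memory and retries them in later passes.

```r
# Sketch: fetch each URL with a pause, retry failures in extra passes.
# Returns the indices that still failed after max_passes (an in-memory
# stand-in for the "redo" folder idea).
scrape_with_redo <- function(urls, outpath, delay = 5, max_passes = 3,
                             fetch = RCurl::getURL) {
  dir.create(outpath, recursive = TRUE, showWarnings = FALSE)
  pending <- seq_along(urls)
  pass <- 0
  while (length(pending) > 0 && pass < max_passes) {
    failed <- integer(0)
    for (i in pending) {
      page <- tryCatch(fetch(urls[i]), error = function(e) "")
      text <- gsub("<.+?>", "", page)   # strip HTML tags, as in the code above
      if (nchar(text) == 0 || grepl("Server overloaded", text)) {
        failed <- c(failed, i)          # queue this one for the next pass
      } else {
        write(text, file = file.path(outpath, paste0(i, ".txt")))
      }
      Sys.sleep(delay)                  # pause between requests
    }
    pending <- failed                   # retry only the failures
    pass <- pass + 1
  }
  pending
}
```

Usage would then be along the lines of `addresses <- read.csv("~/Extract post - forum.csv"); scrape_with_redo(addresses$Link, "~/forum - RawData")`, with `max_passes` capping how often a stubbornly failing page is retried.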
Any type of solution is welcome. As said, I am pretty new to R coding but
I have some coding experience with VBA.
Best
Kasper
______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.