Help with function to web scrape
Hi Augusto,
regarding question #3:
You could use the IUCN Red List API with the RCurl and XML packages.
Here is an example:
> require(RCurl)
> require(XML)
> get_IUCN_status <- function(x) {
+ spec <- tolower(x)
+ spec <- gsub(" ", "-", spec)
+ url <- paste("http://api.iucnredlist.org/go/", spec, sep="")
+ get <- getURL(url, followlocation = TRUE)
+ h <- htmlParse(get)
+ status <- xpathSApply(h, '//div[@id ="red_list_category_code"]',
+ xmlValue)
+ return(status)
+ }
>
> get_IUCN_status("Panthera uncia")
[1] "EN"
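For a long list of species, a single failed request should not abort the whole run. The following is a minimal sketch of one way to retry a call, assuming a hypothetical helper `with_retry` (it is not part of RCurl or XML) and a stand-in function `flaky_lookup` that only simulates a failing connection:

```r
# Hypothetical helper: call fun(...) up to `tries` times, pausing `wait`
# seconds between attempts; return `on_fail` if every attempt errors.
with_retry <- function(fun, ..., tries = 3, wait = 1, on_fail = NA_character_) {
  for (attempt in seq_len(tries)) {
    result <- tryCatch(fun(...), error = function(e) e)
    if (!inherits(result, "error")) {
      return(result)
    }
    message("Attempt ", attempt, " failed: ", conditionMessage(result))
    if (attempt < tries) Sys.sleep(wait)
  }
  on_fail
}

# Stand-in for a network call (an assumption for illustration only):
# it fails on the first two calls and succeeds on the third.
calls <- 0
flaky_lookup <- function(x) {
  calls <<- calls + 1
  if (calls < 3) stop("cannot open the connection")
  toupper(x)
}

res <- with_retry(flaky_lookup, "en", tries = 5, wait = 0)
```

With the function above, the call would be something like `with_retry(get_IUCN_status, "Panthera uncia")`. Inside a loop, a species that still fails after all attempts gets `NA` and the loop carries on with the next one.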
For more resources just type 'webscraping R' in your favourite search
engine.
HTH,
Eduard
On 26/06/12 20:57, Augusto Ribas wrote:
Hello.
I'm having problems with a web-scraping function.
I have a long list like this example:
data<-data.frame(especie=c("Rana pipiens","Rana vaillanti",
"Ctenosaura similis","Bos taurus"),group=c("sapo","sapo","reptil","mamifero"))
Since some species names are out of date, I am trying to write a
function that checks the Catalogue of Life (http://www.catalogueoflife.org/)
and returns the current names.
This has some limitations, such as species whose names have been split,
but it helps as a first check.
So I wrote this function to scrape the data.
It is simple: it searches the site by building a link from the keywords,
then follows the first link in the list of results and extracts the
accepted name and author, returning them as a list.
For example:
sp.check("Rana pipiens")
$sp.aceito
[1] "Lithobates pipiens"
$autor
[1] "Schreber, 1782"
But sometimes the function cannot access the internet and gives an error.
I wrote this function by copying some examples from forums, but I
have some questions:
01) How do I suppress the readLines() warnings?
02) How can I make the function try again when it cannot access the
internet, or just print something like "Can't access internet", so that
when I run something like:
data$check<-NA
for(i in 1:nrow(data)) {
data$check[i]<-sp.check(data$especie[i])
}
the loop doesn't stop?
I tested with a short list, but with 500 or more rows it usually stops
in the middle.
03) Does anyone have an example of how to scrape the status of species
from http://www.iucnredlist.org/, since it does not use the keyword in
the link? Is there a simple tutorial for someone without any background
in programming or computer science?
Well, thanks for your attention.
# sp.check function
sp.check<-function(especie) {
#split species name
especie<-as.character(especie)
gen<-strsplit(especie,"\\ ")[[1]][1]
esp<-strsplit(especie,"\\ ")[[1]][2]
#making first link
link<-paste("http://www.catalogueoflife.org/col/search/all/key/",gen,"+",esp,"/match/1",sep="")
link <- iconv(link, 'latin1', 'UTF-8')
Encoding(link) <- 'bytes'
#reading table of results
pagina <- readLines(url(link))
n.linhas<-which(pagina%in%" <td class=\"field_header_black\">")
#is there any results?
if(length(n.linhas)>0) {
pag.sp<-strsplit(pagina[n.linhas[1]+1],'\\"')[[1]][2]
#second link
link2 <- paste( "http://www.catalogueoflife.org",pag.sp,sep="")
link2 <- iconv(link2, 'latin1', 'UTF-8')
Encoding(link2) <- 'bytes'
link2
#read
pagina2 <- readLines(url(link2))
#get line of interest
linha2<-grep('(accepted name)',pagina2)
sp.final<-pagina2[linha2]
#get species name
corte1<-strsplit(sp.final,'<i>')[[1]][2]
sp.aceito<-strsplit(corte1,'</i>')[[1]][1]
#get author
corte2<-strsplit(sp.final,'\\(')[[1]][2]
autor<-strsplit(corte2,')')[[1]][1]
}else {
sp.aceito<-c("Não encontrado")
autor<-c("Não encontrado")
}
return(list(sp.aceito=sp.aceito,autor=autor))
}
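On the readLines() warnings: the message does not quote them, but two common ones with `readLines(url(link))` are "incomplete final line found" and, later, "closing unused connection", because the connection object is never closed explicitly. A minimal sketch of one way to avoid both, assuming a hypothetical helper `read_page` and a local file:// URL for the demonstration so that no network access is needed:

```r
# Hypothetical helper: read all lines from a URL and always close the
# connection, even if readLines() stops with an error.
read_page <- function(link) {
  con <- url(link)
  on.exit(close(con))
  # warn = FALSE silences the "incomplete final line" warning
  readLines(con, warn = FALSE)
}

# Offline demonstration (assumes a Unix-like temp path): a small file
# written without a trailing newline, read back through a file:// URL.
tmp <- tempfile(fileext = ".html")
cat("<td>Lithobates pipiens</td>", file = tmp)
pagina <- read_page(paste0("file://", tmp))
```

In sp.check, `readLines(url(link))` and `readLines(url(link2))` could then become `read_page(link)` and `read_page(link2)`.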
--
Thanks,
Augusto C. A. Ribas
Site Pessoal: http://augustoribas.heliohost.org
Lattes: http://lattes.cnpq.br/7355685961127056
_______________________________________________
R-sig-ecology mailing list
R-sig-ecology at r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-ecology