regexp problem (was: Re: publication statistics from Web of Science)
Whoops, it seems I could use some help with regular expressions... Consider the following two functions: the first builds a search string, the second retrieves the content from the URL.
makeURLsearch <- function(key, dates=c(NULL, NULL)){
base.search <- "http://scholar.google.co.uk/scholar?"
key.search <- paste("as_q=", key,"&", sep="")
other.search <- "num=10&btnG=Search+Scholar&as_epq=&as_oq=&as_eq=&as_occt=any&as_sauthors=&as_publication=&"
dates.search <- paste("as_ylo=", dates[1], "&as_yhi=", dates[2],
"&as_allsubj=all&hl=en&lr=", sep="")
full.search <- paste(base.search, key.search, other.search,
dates.search, sep="")
return(full.search)
}
makeURLsearch("plasmonics")
makeURLsearch("photonics", c(1980, NULL))
retrieveNumberPublications <- function(url){
x <- readLines(url)
y <- grep('of about',x, value=TRUE)
z <- gsub('of about\\s+</b>','\\1',y[1],perl=TRUE) # this does not do what I wanted
# the bit to retrieve is the number below
# <b>10</b> of about <b>21,900</b> for <b><b>photonics</b>
z
}
retrieveNumberPublications( makeURLsearch("photonics", c(2008, NULL)) )
I can isolate the long string containing the result I want, but I cannot single out the value itself, i.e. the 21,900 in " <b>10</b> of about <b>21,900</b> for <b><b>photonics</b> ". Is there a regexp guru who can help me out? I've never got my head around these, other than in trivial cases.

Many thanks,

baptiste
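On the regexp question above, here is a minimal sketch of one way to pull the number out, assuming the matched line really has the form quoted in the message (a live page may be laid out differently). The pattern in the function above has no parenthesised group, so '\\1' refers to nothing, and it also expects a closing </b> right after "of about" where the text has an opening <b>, so nothing matches and the string comes back unchanged. The sample string y below is copied from the message and is not fetched from the web.

```r
# Sample line copied from the message above; a live page may differ.
y <- "<b>10</b> of about <b>21,900</b> for <b><b>photonics</b>"

# Match the whole line, but capture only the digits and commas inside the
# <b>...</b> that follows "of about". The parentheses define group 1,
# which "\\1" in the replacement refers back to.
z <- sub(".*of about\\s*<b>([0-9,]+)</b>.*", "\\1", y, perl = TRUE)

# Drop the thousands separators before converting to a number.
n <- as.numeric(gsub(",", "", z, fixed = TRUE))
n
```

Because the pattern is wrapped in ".*" on both sides, the whole line is replaced by the captured group alone. If the line does not contain "of about", sub() returns it unchanged and as.numeric() gives NA with a warning, which is a convenient sign that the page layout was not what was assumed.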
On 15 Jan 2009, at 09:45, baptiste auguie wrote:
For the record, I thought I'd share two findings:

First, the Web of Science website does seem to have some sort of API, as discussed here: http://scientific.thomson.com/support/faq/webservices/ It does not seem like a trivial thing to set up, though.

Second, because I could not easily pass the search term in the address, I looked into Google Scholar instead, where a typical search looks like: http://scholar.google.co.uk/scholar?as_q=plasmonics&num=10&btnG=Search+Scholar&as_epq=&as_oq=&as_eq=&as_occt=any&as_sauthors=&as_publication=&as_ylo=&as_yhi=1960&as_allsubj=all&hl=en&lr=

Here it is trivial to create such a string with the desired keyword and dates, and to retrieve the number of results using readLines(url) and grep.

Thanks to Phil Spector for some pointers.

Best wishes,

baptiste
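A sketch of building such a search string when only one of the two years is given. In R, c(1980, NULL) is simply c(1980), so indexing its second element gives NA and paste() would write the text "NA" into the address. The helper name makeScholarURL and its arguments from and to are hypothetical, chosen only for illustration; the query parameters are the ones in the address quoted above.

```r
# Hypothetical helper, not part of the original message.
makeScholarURL <- function(key, from = NA, to = NA) {
  # Turn a missing year into an empty string rather than the text "NA".
  year <- function(x) if (is.na(x)) "" else as.character(x)
  paste0("http://scholar.google.co.uk/scholar?",
         "as_q=", utils::URLencode(key, reserved = TRUE),
         "&num=10&btnG=Search+Scholar",
         "&as_epq=&as_oq=&as_eq=&as_occt=any&as_sauthors=&as_publication=",
         "&as_ylo=", year(from), "&as_yhi=", year(to),
         "&as_allsubj=all&hl=en&lr=")
}

# Only an upper year: as_ylo is left empty, as in the address quoted above.
u <- makeScholarURL("plasmonics", to = 1960)
u
```

Using NA as the "not given" marker keeps both positions explicit, so a lower year without an upper one (or the reverse) cannot shift into the wrong slot.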
_____________________________
Baptiste Auguié
School of Physics
University of Exeter
Stocker Road, Exeter, Devon, EX4 4QL, UK
Phone: +44 1392 264187
http://newton.ex.ac.uk/research/emag