Analyzing Publications from Pubmed via XML
Hi Armin -- See the help page for esearch
    http://www.ncbi.nlm.nih.gov/entrez/query/static/esearch_help.html
especially the 'retmax' key. A couple of other thoughts on this thread...
1) using the full path, e.g.,
    ids <- xpathApply(doc, "/eSearchResult/IdList/Id", xmlValue)
is likely to lead to less grief in the long run, as you'll only select
elements of the node you're interested in, rather than any element,
anywhere in the document, labeled 'Id'
2) From a different post in the thread, things like
On Dec 16, 2007 2:53 PM, David Winsemius <dwinsemius at comcast.net> wrote:
[snip]
get.info<- function(doc){
    df<-cbind(
        Abstract = unlist(xpathApply(doc, "//AbstractText", xmlValue)),
        Journal = unlist(xpathApply(doc, "//Title", xmlValue)),
        Pmid = unlist(xpathApply(doc, "//PMID", xmlValue))
    )
    return(df)
}
will lead to more trouble, because they assume that AbstractText, etc
occur exactly once in each record. It would seem better to extract the
relevant node, and query that, probably defining appropriate
defaults. I started with
xpath_or_na <- function(doc, q) {
    res <- xpathApply(doc, q, xmlValue)
    if (length(res)==1) res[[1]]
    else NA_character_
}
citn <- function(citation){
    Abstract <- xpath_or_na(citation,
                            "/MedlineCitation/Article/Abstract/AbstractText")
    Journal <- xpath_or_na(citation,
                           "/MedlineCitation/Article/Journal/Title")
    Pmid <- xpath_or_na(citation,
                        "/MedlineCitation/PMID")
    c(Abstract=Abstract, Journal=Journal, Pmid=Pmid)
}
medline_q <- "/PubmedArticleSet/PubmedArticle/MedlineCitation"
res <- xpathApply(doc, medline_q, citn)
One would still have to coerce res into a data.frame. It is also worth
thinking about each of the lines in citn -- e.g., Journal clearly only
applies to journals. Eventually one wants to consult the DTD (basically, the
contract spelling out the content) of the document, confirm that the
xpath queries will perform correctly, and verify that the document
actually conforms to its DTD.
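A minimal sketch of the coercion step, under stated assumptions: the toy document below only mimics the element layout implied by the queries above (it is not a real PubMed record), and the names rel_or_na and citn_rel are hypothetical. The queries are written relative to each citation node (./Article/...); whether the absolute paths in citn behave the same way on a node may depend on the version of the XML package, so treat that as an assumption rather than a fact.

```r
library(XML)

# Toy document; layout assumed from the queries above, not a full record.
txt <- '<PubmedArticleSet>
  <PubmedArticle><MedlineCitation>
    <PMID>1</PMID>
    <Article>
      <Journal><Title>Journal A</Title></Journal>
      <Abstract><AbstractText>First abstract.</AbstractText></Abstract>
    </Article>
  </MedlineCitation></PubmedArticle>
  <PubmedArticle><MedlineCitation>
    <PMID>2</PMID>
    <Article>
      <Journal><Title>Journal B</Title></Journal>
    </Article>
  </MedlineCitation></PubmedArticle>
</PubmedArticleSet>'
doc <- xmlParse(txt, asText = TRUE)

# Same idea as xpath_or_na: exactly one match, or NA.
rel_or_na <- function(node, q) {
    res <- xpathApply(node, q, xmlValue)
    if (length(res) == 1) res[[1]] else NA_character_
}

# Queries are relative to the MedlineCitation node passed in.
citn_rel <- function(citation) {
    c(Abstract = rel_or_na(citation, "./Article/Abstract/AbstractText"),
      Journal  = rel_or_na(citation, "./Article/Journal/Title"),
      Pmid     = rel_or_na(citation, "./PMID"))
}

res <- xpathApply(doc, "/PubmedArticleSet/PubmedArticle/MedlineCitation",
                  citn_rel)

# Every element of res is a named character vector of length 3, so
# rbind stacks them into a character matrix, one row per citation.
df <- as.data.frame(do.call(rbind, res), stringsAsFactors = FALSE)
```

The second toy record has no abstract, so its Abstract entry comes out as NA instead of shifting the columns out of alignment, which is the failure mode of the cbind approach quoted earlier.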
Following my own advice, I quickly found that doing things 'more
right' becomes quite complicated, and suddenly became satisfied with
the information I can get out of the 'annotate' package.
Martin
"Armin Goralczyk" <agoralczyk at gmail.com> writes:
On Dec 15, 2007 6:31 PM, David Winsemius <dwinsemius at comcast.net> wrote:
After quite a bit of hacking (in the sense of ineffective chopping with
a dull ax), I finally came up with:
pm.srch<- function (){
    srch.stem<-"http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term="
    query<-readLines(con=file.choose())
    query<-gsub("\\\"","",x=query)
    doc<-xmlTreeParse(paste(srch.stem,query,sep=""),isURL = TRUE,
                      useInternalNodes = TRUE)
    return(sapply(c("//Id"), xpathApply, doc = doc, fun = xmlValue) )
}
pm.srch() #choosing the search-file
//Id
[1,] "18046565"
[2,] "17978930"
[3,] "17975511"
[4,] "17935912"
[5,] "17851940"
[6,] "17765779"
[7,] "17688640"
[8,] "17638782"
[9,] "17627059"
[10,] "17599582"
[11,] "17589729"
[12,] "17585283"
[13,] "17568846"
[14,] "17560665"
[15,] "17547971"
[16,] "17428551"
[17,] "17419899"
[18,] "17419519"
[19,] "17385606"
[20,] "17366752"
I tried the example above, but only the first 20 PMIDs are returned. How can I circumvent this (I guess it's a restriction from PubMed)?
--
Armin Goralczyk, M.D.
--
Universitätsmedizin Göttingen
Abteilung Allgemein- und Viszeralchirurgie
Rudolf-Koch-Str. 40
39099 Göttingen
--
Dept. of General Surgery
University of Göttingen
Göttingen, Germany
--
http://www.chirurgie-goettingen.de
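The 20-ID limit seen here is the esearch default for 'retmax', the key mentioned at the top of this message. A rough sketch of asking for more follows, with assumptions flagged: esearch_url is a hypothetical helper, the search term is made up, and the reply is a hand-written stand-in so that nothing touches the network. The URL stem is the one used in pm.srch.

```r
library(XML)

# Hypothetical helper: build an esearch URL with an explicit retmax.
esearch_url <- function(term, retmax = 100) {
    stem <- "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed"
    paste(stem, "&term=", URLencode(term, reserved = TRUE),
          "&retmax=", retmax, sep = "")
}

# Made-up search term, for illustration only.
url <- esearch_url("liver transplantation", retmax = 500)

# Hand-written stand-in for the server reply (no network access here).
reply <- '<eSearchResult><Count>3</Count><RetMax>3</RetMax>
  <IdList><Id>18046565</Id><Id>17978930</Id><Id>17975511</Id></IdList>
</eSearchResult>'
doc <- xmlParse(reply, asText = TRUE)

# Total number of hits reported by the server, and the IDs actually
# returned, both selected by full path rather than "//Id".
count <- as.integer(xpathSApply(doc, "/eSearchResult/Count", xmlValue))
ids <- unlist(xpathApply(doc, "/eSearchResult/IdList/Id", xmlValue))
```

In real use one would pass url to xmlTreeParse with isURL = TRUE, as in pm.srch, and compare count with length(ids) to see whether retmax was large enough to return every hit.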
Martin Morgan Computational Biology / Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: Arnold Building M2 B169 Phone: (206) 667-2793