Analyzing Publications from Pubmed via XML
David Winsemius wrote:
On 15 Dec 2007, you wrote in gmane.comp.lang.r.general:
If we can assume that the abstract is always the 4th paragraph then we
can try something like this:
library(XML)

# Parse the RSS feed; the URL is pasted from two pieces only to keep lines short
feed.url <- paste(
  "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/erss.cgi?",
  "rss_guid=0_JYbpsax0ZAAPnOd7nFAX-29fXDpTk5t8M4hx9ytT-", sep = "")
doc <- xmlTreeParse(feed.url, isURL = TRUE,
                    useInternalNodes = TRUE, trim = TRUE)

out <- cbind(
  Author = unlist(xpathApply(doc, "//author", xmlValue)),
  # keep only what follows the last colon in each <guid>
  PMID = gsub(".*:", "", unlist(xpathApply(doc, "//guid", xmlValue))),
  Abstract = unlist(xpathApply(doc, "//description",
    function(x) {
      # each <description> holds HTML; parse it and take the 4th <p>
      on.exit(free(doc2))
      doc2 <- htmlTreeParse(xmlValue(x)[[1]], asText = TRUE,
                            useInternalNodes = TRUE, trim = TRUE)
      xpathApply(doc2, "//p[4]", xmlValue)
    }
  )))
free(doc)
substring(out, 1, 25) # display first 25 chars of each field
The last line produces (it may look messed up in this email):
substring(out, 1, 25) # display it
     Author                      PMID       Abstract
[1,] " Goon P, Sonnex C, Jani P" "18046565" "Human papillomaviruses (H"
[2,] " Rad MH, Alizadeh E, Ilkh" "17978930" "Recurrent laryngeal papil"
[3,] " Lee LA, Cheng AJ, Fang T" "17975511" "OBJECTIVES:: Papillomas o"
[4,] " Gerein V, Schmandt S, Ba" "17935912" "BACKGROUND: Human papillo"
[snip]
It looked beautifully regular in my newsreader. It is helpful to see an
example showing the indexed access to nodes. It was also helpful to see the
example of substring for column display. Thank you (for this and all of
your other contributions.)
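
For anyone who wants to try the indexed access without hitting the feed, here is a minimal offline sketch. The HTML string is made up for illustration (it only mimics a description payload with four paragraphs); the calls are the same htmlTreeParse and xpathApply used in the code above.

```r
library(XML)

# Hypothetical stand-in for the HTML inside one RSS <description> element
html <- paste("<html><body>",
              "<p>Journal line</p><p>Title line</p>",
              "<p>Author line</p><p>Abstract text here.</p>",
              "</body></html>", sep = "")

doc2 <- htmlTreeParse(html, asText = TRUE, useInternalNodes = TRUE)

# "//p[4]" selects every <p> that is the 4th <p> child of its parent
abstract <- unlist(xpathApply(doc2, "//p[4]", xmlValue))
free(doc2)

print(abstract)
```

If a description had fewer than four paragraphs, "//p[4]" would match nothing, so the assumption about the abstract's position matters.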
I find upon further browsing that the pmfetch access point is obsolete.
Experimentation with the PubMed eFetch server access point results in fully
xml-tagged results:
e.fetch.doc <- function() {
  fetch.stem <- "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?"
  src.mode <- "db=pubmed&retmode=xml&"
  request <- "id=11045395"
  # return the parsed document directly instead of assigning it to a local
  xmlTreeParse(paste(fetch.stem, src.mode, request, sep = ""),
               isURL = TRUE, useInternalNodes = TRUE)
}
# In the debugging phase I needed to set useInternalNodes = FALSE to see the
# tags. Never did find a way to "print" them when internal.
saveXML(node) will return a string giving the XML content of that node and its children.
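
A minimal sketch of that suggestion, assuming a small made-up XML string in place of the eFetch result (the PMID value and the title are placeholders, not real data):

```r
library(XML)

# Hypothetical record; not an actual eFetch response
xml <- "<PubmedArticle><PMID>12345</PMID><Title>Example</Title></PubmedArticle>"
doc <- xmlTreeParse(xml, asText = TRUE, useInternalNodes = TRUE)

# Pick one internal node and serialize it back to text
node <- getNodeSet(doc, "//PMID")[[1]]
txt <- saveXML(node)
free(doc)

cat(txt, "\n")
```

Passing a node higher up the tree to saveXML should show its nested tags as well, which is probably what is wanted while debugging.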
doc <- e.fetch.doc()
get.info <- function(doc) {
  df <- cbind(
    Abstract = unlist(xpathApply(doc, "//AbstractText", xmlValue)),
    Journal = unlist(xpathApply(doc, "//Title", xmlValue)),
    Pmid = unlist(xpathApply(doc, "//PMID", xmlValue))
  )
  return(df)
}
# this works
substring(get.info(doc), 1, 25)
     Abstract                    Journal                     Pmid
[1,] "We studied the prevalence" "Pediatric nephrology (Ber" "11045395"
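
One caveat with the cbind approach, offered as a hedged sketch rather than a tested fix against the live server: if several records are fetched and one lacks an abstract, the columns would have different lengths and cbind would recycle values. Looping over each article node and filling in NA for missing fields might avoid that. The XML string below is a simplified, made-up stand-in for an eFetch result, and first.or.na is a hypothetical helper name.

```r
library(XML)

# Simplified, invented records; the second one has no <AbstractText>
xml <- "<PubmedArticleSet>
  <PubmedArticle><PMID>1</PMID><Title>Journal A</Title>
    <AbstractText>First abstract.</AbstractText></PubmedArticle>
  <PubmedArticle><PMID>2</PMID><Title>Journal B</Title></PubmedArticle>
</PubmedArticleSet>"
doc <- xmlTreeParse(xml, asText = TRUE, useInternalNodes = TRUE)

# Hypothetical helper: first match of a relative XPath under `node`, else NA
first.or.na <- function(node, path) {
  hits <- xpathSApply(node, path, xmlValue)
  if (length(hits) == 0) NA_character_ else hits[[1]]
}

# One row per article, so fields from the same record stay together
articles <- getNodeSet(doc, "//PubmedArticle")
out <- t(sapply(articles, function(a) c(
  Abstract = first.or.na(a, ".//AbstractText"),
  Journal  = first.or.na(a, ".//Title"),
  Pmid     = first.or.na(a, ".//PMID"))))
free(doc)

print(out)
```

The leading "." in each path keeps the search inside the current article node rather than the whole document.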