Analyzing Publications from Pubmed via XML
If we can assume that the abstract is always the 4th paragraph then we
can try something like this:
library(XML)
doc <- xmlTreeParse("http://eutils.ncbi.nlm.nih.gov/entrez/eutils/erss.cgi?rss_guid=0_JYbpsax0ZAAPnOd7nFAX-29fXDpTk5t8M4hx9ytT-",
                    isURL = TRUE, useInternalNodes = TRUE, trim = TRUE)
out <- cbind(
  Author = unlist(xpathApply(doc, "//author", xmlValue)),
  PMID = gsub(".*:", "", unlist(xpathApply(doc, "//guid", xmlValue))),
  Abstract = unlist(xpathApply(doc, "//description",
    function(x) {
      on.exit(free(doc2))
      doc2 <- htmlTreeParse(xmlValue(x)[[1]], asText = TRUE,
                            useInternalNodes = TRUE, trim = TRUE)
      xpathApply(doc2, "//p[4]", xmlValue)
    }
  )))
free(doc)
substring(out, 1, 25) # display first 25 chars of each field
The last line produces (it may look messed up in this email):
substring(out, 1, 25) # display it
      Author                      PMID       Abstract
 [1,] " Goon P, Sonnex C, Jani P" "18046565" "Human papillomaviruses (H"
 [2,] " Rad MH, Alizadeh E, Ilkh" "17978930" "Recurrent laryngeal papil"
 [3,] " Lee LA, Cheng AJ, Fang T" "17975511" "OBJECTIVES:: Papillomas o"
 [4,] " Gerein V, Schmandt S, Ba" "17935912" "BACKGROUND: Human papillo"
 [5,] " Hopp R, Natarajan N, Lew" "17908862" ""
 [6,] " Preuss SF, Klussmann JP," "17851940" "CONCLUSIONS: The presente"
 [7,] " Mouadeb DA, Belafsky PC"  "17765779" "OBJECTIVES: The 585nm pul"
 [8,] " Thompson L"               "17702311" ""
 [9,] " Schaffer A, Brotherton J" "17688640" ""
[10,] " Stephen JK, Vaught LE, C" "17638782" "OBJECTIVE: To investigate"
[11,] " Shah KV, Westra WH"       "17627059" ""
[12,] " Koufman JA, Rees CJ, Fra" "17599582" "BACKGROUND: Unsedated off"
[13,] " Akst LM, Broadhurst MS, " "17592395" ""
[14,] " Pignatari SS, Liriano RY" "17589729" "Evidence of a relation be"
On Dec 15, 2007 10:13 PM, David Winsemius <dwinsemius at comcast.net> wrote:
David Winsemius <dwinsemius at comcast.net> wrote in news:Xns9A077F740B4A0dNOTwinscomcast at 80.91.229.13:
"Farrel Buchinsky" <fjbuch at gmail.com> wrote in news:bd93cdad0712141216s23071d27n17d87a487ad06950 at mail.gmail.com:
On Dec 13, 2007 11:35 PM, Robert Gentleman <rgentlem at fhcrc.org> wrote:
or just try looking in the annotate package from Bioconductor
Yip. annotate seems to be the most streamlined way to do this. 1) How does one turn the list that is created into a data frame whose column names are along the lines of date, title, journal, authors, etc.?
Gabor's example already did that task.
Actually the object returned by Gabor's method was a list of lists. Here
is one way (probably very inefficient) of getting "doc" into a
data.frame:
colvals <-sapply(c("//title", "//author", "//category"), xpathApply,
doc = doc, fun = xmlValue)
titles=as.vector(unlist(colvals[1])[3:17])
# needed to drop extraneous titles for search name and an NCBI header
#>str(colvals)
#List of 3
# $ //title :List of 17
# ..$ : chr "PubMed: (\"Laryngeal Neoplasm..."
# ..$ : chr "NCBI PubMed"
authors=colvals[[2]]
jrnls=colvals[[3]]
# not sure why, but trying to do it in one step failed:
# cites<-data.frame(titles=as.vector(unlist(colvals[1])[3:17]),
# authors=colvals[[2]],jrnls=colvals[[3]])
# Error in data.frame(titles = as.vector(unlist(colvals[1])[3:17]),
# authors = colvals[[2]], :
# arguments imply differing number of rows: 15, 1
# but the following worked
cites<-data.frame(titles=as.vector(titles))
cites$author<-authors
cites$jrnls<-jrnls
cites
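The one-step failure is probably because colvals[[2]] and colvals[[3]] are lists rather than character vectors: data.frame() treats each element of a list as a separate column, which would give 15 columns of one row each. A minimal sketch of the one-step version, assuming that is the cause; the three-element values below are made-up placeholders, not data from the feed:

```r
# Placeholder values standing in for the feed's titles, authors and journals
titles  <- c("Title A", "Title B", "Title C")
authors <- list("Author A", "Author B", "Author C")    # a list, like xpathApply output
jrnls   <- list("Journal A", "Journal B", "Journal C")

# unlist() flattens each list into a character vector of matching length,
# so data.frame() sees three columns of three rows
cites <- data.frame(
  titles  = titles,
  authors = unlist(authors),
  jrnls   = unlist(jrnls),
  stringsAsFactors = FALSE
)
str(cites)
```

With the real objects, the same idea would be unlist(colvals[[2]]) and unlist(colvals[[3]]) inside the data.frame() call.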
I am still wondering how to extract material that does not have an XML
tag. Each item looks like:
<item>
<title>Gastroesophageal reflux in patients with recurrent laryngeal
papillomatosis.</title>
<link>http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?
tmpl=NoSidebarfile&db=PubMed&cmd=Retrieve&list_uids=17589729
&dopt=Abstract</link>
<description>
<![CDATA[
<table border="0" width="100%"><tr><td align="left"><a
href="http://www.scielo.br/scielo.php?script=sci_arttext&pid=S0034-
72992007000200011&lng=en&nrm=iso&tlng=en"><img
src="http://www.ncbi.nlm.nih.gov/entrez/query/egifs/http:--www.scielo.br-
img-scielo_en.gif" border="0"/></a> </td><td align="right"><a
href="http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?
db=PubMed&cmd=Display&dopt=PubMed_PubMed&from_uid=17589729">
Related Articles</a></td></tr></table>
<p><b>Gastroesophageal reflux in patients with recurrent
laryngeal papillomatosis.</b></p>
<p>Rev Bras Otorrinolaringol (Engl Ed). 2007 Mar-Apr;73(2):210-4
</p>
<p>Authors: Pignatari SS, Liriano RY, Avelino MA, Testa JR,
Fujita R, De Marco EK</p>
<p>Evidence of a relation between gastroesophaeal reflux and
pediatric respiratory disorders increases every year. Many respiratory
symptoms and clinical conditions such as stridor, chronic cough, and
recurrent pneumonia and bronchitis appear to be related to
gastroesophageal reflux. Some studies have also suggested that
gastroesophageal reflux may be associated with recurrent laryngeal
papillomatosis, contributing to its recurrence and severity. AIM: the aim
of this study was to verify the frequency and intensity of
gastroesophageal reflux in children with recurrent laryngeal
papillomatosis. MATERIAL AND METHODS: ten children of both genders, aged
between 3 and 12 years, presenting laryngeal papillomatosis, were
included in this study. The children underwent 24-hour double-probe pH-
metry. RESULTS: fifty percent of the patients had evidence of
gastroesophageal reflux at the distal sphincter; 90% presented reflux at
the proximal sphincter. CONCLUSION: the frequency of proximal
gastroesophageal reflux is significantly increased in patients with
recurrent laryngeal papillomatosis.</p>
<p>PMID: 17589729 [PubMed - in process]</p> ]]>
</description>
<author>Pignatari SS, Liriano RY, Avelino MA, Testa JR, Fujita R, De
Marco EK</author>
<category>Rev Bras Otorrinolaringol (Engl Ed)</category>
<guid isPermaLink="false">PubMed:17589729</guid>
</item>
I would like to access, for instance, the PMID or the abstract within the
<description> element, but I do not think that they are named in the
same way that <author> or <category> are named XML nodes. I suspect that
getting the output in a different format, say as MEDLINE, might produce
output that was tagged more completely.
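One possible way to reach those untagged pieces, sketched under assumptions: the text inside <description> is itself HTML, so it can be parsed as a second document and its <p> elements selected by content. The desc string below is an abbreviated stand-in for the item shown above (author list and abstract cut short), and the names desc, html, paras, pmid and abstract are illustrative only:

```r
library(XML)

# Abbreviated stand-in for the text inside one <description> CDATA section
desc <- paste0(
  "<table><tr><td>Related Articles</td></tr></table>",
  "<p><b>Gastroesophageal reflux in patients with recurrent ",
  "laryngeal papillomatosis.</b></p>",
  "<p>Rev Bras Otorrinolaringol (Engl Ed). 2007 Mar-Apr;73(2):210-4</p>",
  "<p>Authors: Pignatari SS, Liriano RY</p>",
  "<p>Evidence of a relation between gastroesophaeal reflux and ",
  "pediatric respiratory disorders increases every year.</p>",
  "<p>PMID: 17589729 [PubMed - in process]</p>"
)

# Parse the HTML fragment and collect the text of every paragraph
html <- htmlTreeParse(desc, asText = TRUE, useInternalNodes = TRUE)
paras <- xpathSApply(html, "//p", xmlValue)
free(html)

# Select the PMID paragraph by its content rather than its position
pmid_line <- grep("^PMID:", paras, value = TRUE)
pmid <- sub("^PMID: *([0-9]+).*", "\\1", pmid_line)

# The abstract is still taken by position (fourth paragraph), which only
# holds if every item follows the layout shown above
abstract <- paras[4]
```

Matching on the "PMID:" prefix avoids depending on paragraph order for that field; the abstract has no such prefix, so the positional assumption remains for it.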
--
David Winsemius
______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.