Analyzing Publications from Pubmed via XML

If we can assume that the abstract is always the 4th paragraph then we
can try something like this:

library(XML)

# Fetch the RSS feed and parse it into an internal (C-level) document
doc <- xmlTreeParse("http://eutils.ncbi.nlm.nih.gov/entrez/eutils/erss.cgi?rss_guid=0_JYbpsax0ZAAPnOd7nFAX-29fXDpTk5t8M4hx9ytT-",
	isURL = TRUE, useInternalNodes = TRUE, trim = TRUE)

# One row per feed item: author list, PMID and abstract text
out <- cbind(
	Author = unlist(xpathApply(doc, "//author", xmlValue)),
	# keep only what follows the last colon in each guid
	PMID = gsub(".*:", "", unlist(xpathApply(doc, "//guid", xmlValue))),
	# each description holds HTML; parse it and take its 4th paragraph
	Abstract = unlist(xpathApply(doc, "//description",
		function(x) {
			doc2 <- htmlTreeParse(xmlValue(x)[[1]], asText = TRUE,
				useInternalNodes = TRUE, trim = TRUE)
			# register the cleanup only once doc2 exists
			on.exit(free(doc2))
			xpathApply(doc2, "//p[4]", xmlValue)
		}
	)))
free(doc)
substring(out, 1, 25) # display first 25 chars of each field


The last line produces (it may look messed up in this email):
      Author                      PMID       Abstract
 [1,] " Goon P, Sonnex C, Jani P" "18046565" "Human papillomaviruses (H"
 [2,] " Rad MH, Alizadeh E, Ilkh" "17978930" "Recurrent laryngeal papil"
 [3,] " Lee LA, Cheng AJ, Fang T" "17975511" "OBJECTIVES:: Papillomas o"
 [4,] " Gerein V, Schmandt S, Ba" "17935912" "BACKGROUND: Human papillo"
 [5,] " Hopp R, Natarajan N, Lew" "17908862" ""
 [6,] " Preuss SF, Klussmann JP," "17851940" "CONCLUSIONS: The presente"
 [7,] " Mouadeb DA, Belafsky PC"  "17765779" "OBJECTIVES: The 585nm pul"
 [8,] " Thompson L"               "17702311" ""
 [9,] " Schaffer A, Brotherton J" "17688640" ""
[10,] " Stephen JK, Vaught LE, C" "17638782" "OBJECTIVE: To investigate"
[11,] " Shah KV, Westra WH"       "17627059" ""
[12,] " Koufman JA, Rees CJ, Fra" "17599582" "BACKGROUND: Unsedated off"
[13,] " Akst LM, Broadhurst MS, " "17592395" ""
[14,] " Pignatari SS, Liriano RY" "17589729" "Evidence of a relation be"
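
Several rows above have an empty Abstract, so the 4th-paragraph assumption probably does not hold for every item. A related risk: if a description had no 4th paragraph at all, xpathApply would return an empty list, unlist would silently drop it, and the columns would no longer line up. Below is a minimal sketch of one way to guard against that. The helper name nth_paragraph and the two HTML strings are made up for illustration and are not taken from the feed.

```r
library(XML)

# Hypothetical helper (not part of the original code): return the text of
# the n-th <p> in an HTML fragment, or NA if no such paragraph exists, so
# that every item contributes exactly one value.
nth_paragraph <- function(html, n = 4) {
  doc <- htmlTreeParse(html, asText = TRUE, useInternalNodes = TRUE)
  on.exit(free(doc))
  hits <- xpathSApply(doc, sprintf("//p[%d]", n), xmlValue)
  if (length(hits) == 0) NA_character_ else hits[[1]]
}

# Made-up fragments standing in for the contents of a <description> node
with_abstract <- paste0("<p>Journal</p><p>Authors</p>",
                        "<p>Affiliation</p><p>The abstract text.</p>")
without_abstract <- "<p>Journal</p><p>Authors</p>"

a1 <- nth_paragraph(with_abstract)     # text of the 4th paragraph
a2 <- nth_paragraph(without_abstract)  # NA, since there are only two <p>
```

In the code above, the anonymous function passed to xpathApply could call something like nth_paragraph(xmlValue(x)) instead, which would make missing abstracts show up as NA rather than shifting the rows.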
On Dec 15, 2007 10:13 PM, David Winsemius <dwinsemius at comcast.net> wrote:
