Extract just some fields from XML
Duncan, you are a king! Thanks a lot for this cookie. It really helped me. Thanks for the code as well as the detailed explanation at the end.
Hi Gregor.
Here is a function that will collect all of the nodes in the
XML document whose names are in the vector elementNames
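getElements =
function(elementNames)
{
  # nodes collected so far; kept in the closure's environment
  els = list()
  # generic handler, called for each element node as it is created
  startElement = function(node, ...) {
    if(xmlName(node) %in% elementNames)
      els[[length(els) + 1]] <<- node
    node   # return the node so it stays in the tree
  }
  list(startElement = startElement, els = function() els)
}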
So you can use it as
myHandlers = getElements("PubDate")
xmlTreeParse(URL, handlers = myHandlers)
And then
myHandlers$els()
returns a list of the three PubDate elements in the document.
If you wanted both PubDate and PubMedPubDate elements,
you could use
myHandlers = getElements(c("PubDate", "PubMedPubDate"))
[Note that XML is case-sensitive and pubdate won't work.]
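To try this without hitting the network, here is a minimal sketch that runs the same handler over a small inline document. The XML in txt is a made-up stand-in for the PubMed response (an assumption, not the real structure), and the last line shows one possible way to pull a child's text out of each collected node.

```r
library(XML)

# Made-up stand-in for the PubMed response (assumption): two records,
# each with one PubDate element.
txt <- paste0(
  "<PubmedArticleSet>",
  "<PubmedArticle><PubDate><Year>2002</Year><Month>Feb</Month></PubDate></PubmedArticle>",
  "<PubmedArticle><PubDate><Year>2001</Year><Month>Dec</Month></PubDate></PubmedArticle>",
  "</PubmedArticleSet>")

# Same closure as in the message above.
getElements <- function(elementNames) {
  els <- list()
  startElement <- function(node, ...) {
    if (xmlName(node) %in% elementNames)
      els[[length(els) + 1]] <<- node
    node
  }
  list(startElement = startElement, els = function() els)
}

h <- getElements("PubDate")
# asText = TRUE tells the parser that txt holds XML content, not a file name.
invisible(xmlTreeParse(txt, asText = TRUE, handlers = h))

dates <- h$els()
# Text of the Year child of every collected PubDate node.
years <- sapply(dates, function(d) xmlValue(d[["Year"]]))
```

With the real URL you would drop asText and pass isURL = TRUE, as in the original question.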
xmlEventParse() is quite a bit more work, as it is meant for
very low-level parsing, working at the parser level
of opening and closing XML elements.
xmlTreeParse() is a hybrid parser. It works at the higher
level of nodes, but provides an opportunity to process
nodes when they are "created" and before their parent
nodes have been processed. So it works bottom up
(in one of its modes).
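To give a feel for the extra bookkeeping the event-level route needs, here is a rough sketch that collects the Year text inside PubDate elements with xmlEventParse(). The inline document and the helper name pubDateYears are made up for illustration; the handlers have to track by hand which element they are currently inside, which is exactly the work the tree-level handlers spare you.

```r
library(XML)

# Made-up test document (assumption): the Year under Other should be ignored.
txt <- paste0(
  "<Set>",
  "<PubDate><Year>2002</Year></PubDate>",
  "<Other><Year>1999</Year></Other>",
  "<PubDate><Year>2001</Year></PubDate>",
  "</Set>")

# Hypothetical helper: a closure holding the parser state between events.
pubDateYears <- function() {
  inPubDate <- FALSE      # currently inside a PubDate element?
  inYear <- FALSE         # currently inside a Year within a PubDate?
  buf <- character()      # text pieces of the current Year
  years <- character()    # finished results

  list(
    startElement = function(name, attrs, ...) {
      if (name == "PubDate") {
        inPubDate <<- TRUE
      } else if (name == "Year" && inPubDate) {
        inYear <<- TRUE
        buf <<- character()
      }
    },
    # Text may arrive in several pieces, so accumulate it.
    text = function(x, ...) {
      if (inYear) buf <<- c(buf, x)
    },
    endElement = function(name, ...) {
      if (name == "Year" && inYear) {
        years <<- c(years, paste(buf, collapse = ""))
        inYear <<- FALSE
      } else if (name == "PubDate") {
        inPubDate <<- FALSE
      }
    },
    years = function() years
  )
}

h <- pubDateYears()
invisible(xmlEventParse(txt, handlers = h, asText = TRUE))
result <- h$years()
```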
You can also use xmlDOMApply() to iterate over all the
nodes of a parsed XML tree. You give xmlDOMApply() a
function and it can do whatever it wants, including
checking the name of the node to see if you want it
and then storing it somewhere. That's where you'll
need closures (put simply, the "functions within functions" part) again,
as in my example above.
But here is a simple example:
doc = xmlRoot(xmlTreeParse(URL))
xmlDOMApply(doc, function(node, ...)
                   if(xmlName(node) == "PubDate")
                     print(node)
           )
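If you want to keep the matching nodes rather than print them, the closure idea carries over to xmlDOMApply(). What follows is only a sketch: the inline document and the helper name collector are made up for illustration, and it checks nothing beyond the names of the nodes it stores.

```r
library(XML)

# Made-up test document (assumption): PubDate appears at two different places.
txt <- paste0(
  "<Set>",
  "<A><PubDate><Year>2002</Year></PubDate></A>",
  "<B><PubDate><Year>2001</Year></PubDate></B>",
  "</Set>")
doc <- xmlRoot(xmlTreeParse(txt, asText = TRUE))

# Hypothetical helper: visit() stores matching nodes in the enclosing
# environment, found() hands them back afterwards.
collector <- function(wanted) {
  found <- list()
  visit <- function(node, ...) {
    if (inherits(node, "XMLNode") && xmlName(node) %in% wanted)
      found[[length(found) + 1]] <<- node
    node
  }
  list(visit = visit, found = function() found)
}

h <- collector("PubDate")
invisible(xmlDOMApply(doc, h$visit))
hits <- h$found()
```

Because the function only looks at node names, it does not matter how deep in the tree the PubDate elements sit, which is what the original question was after.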
Gorjanc Gregor wrote:
Hello!

I am trying to get specific fields from an XML document and I am totally
puzzled. I hope someone can help me.

# URL
URL<-"http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=11877539,11822933,11871444&retmode=xml&rettype=citation"
# download a XML file
tmp <- xmlTreeParse(URL, isURL = TRUE)
tmp <- xmlRoot(tmp)

Now I want to extract only node 'pubdate' and its children, but I don't
know how to do that unless I try to dig into the structure of the XML file.
The problem is that structure can differ and then hardcoded set of list
indices i.e. tmp[[i]][[j]]... doesn't help me.

I've read xmlEventParse but I don't understand handlers part up to the
point that I could get anything usable from it. Here is something not
very usable ;)

PubDate <- function(x, ...) {
  print(x)
}
xmlEventParse(URL, isURL = TRUE, handlers=list(PubDate=PubDate),
              addContext = FALSE)

Thanks in advance!

Lep pozdrav / With regards,
Gregor Gorjanc
----------------------------------------------------------------------
University of Ljubljana
Biotechnical Faculty URI: http://www.bfro.uni-lj.si/MR/ggorjan
Zootechnical Department mail: gregor.gorjanc <at> bfro.uni-lj.si
Groblje 3 tel: +386 (0)1 72 17 861
SI-1230 Domzale fax: +386 (0)1 72 17 888
Slovenia, Europe
----------------------------------------------------------------------
"One must learn by doing the thing; for though you think you know it,
you have no certainty until you try." Sophocles ~ 450 B.C.
______________________________________________
R-help at stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
--
Duncan Temple Lang duncan at wald.ucdavis.edu
Department of Statistics work: (530) 752-4782
371 Kerr Hall fax: (530) 752-7099
One Shields Ave.
University of California at Davis
Davis, CA 95616, USA
--
Lep pozdrav / With regards,
Gregor Gorjanc
----------------------------------------------------------------------
University of Ljubljana
Biotechnical Faculty URI: http://www.bfro.uni-lj.si/MR/ggorjan
Zootechnical Department mail: gregor.gorjanc <at> bfro.uni-lj.si
Groblje 3 tel: +386 (0)1 72 17 861
SI-1230 Domzale fax: +386 (0)1 72 17 888
Slovenia, Europe
----------------------------------------------------------------------
"One must learn by doing the thing; for though you think you know it,
you have no certainty until you try." Sophocles ~ 450 B.C.
----------------------------------------------------------------------