-----Original Message-----
From: g.rudge at bham.ac.uk
Sent: Thu, 16 Apr 2015 17:57:44 +0000
To: r-help at r-project.org
Subject: [R] Extracting xml data to data frames
Hi Rgonauts,
I am trying to parse some xml files of transport data in the
TransXChange format (in this case bus routing information) and obtain
some data.frames for onward processing for a GIS-related task. Ideally I
need them as .csv files.
Each file contains up to 8 tables of information about transport
operators and routes. I have attached an example that contains all 8.
In fact I have several hundred similar files that will need processing,
so once I've solved this I will need to loop through a batch of them.
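The batch step could be sketched roughly as below. This is only an illustration under stated assumptions: the helper parse_one(), the <Route> and <Description> element names, and the two tiny demo files are all hypothetical stand-ins, not taken from the real files.

```r
library(XML)

# Hypothetical helper: parse one file and return one row per <Route> node.
# The element and attribute names here are assumptions for illustration.
parse_one <- function(path) {
  doc <- xmlParse(path)
  nodes <- getNodeSet(doc, "//Route")
  data.frame(
    file        = basename(path),
    id          = sapply(nodes, xmlGetAttr, "id"),
    description = sapply(nodes, function(n) xmlValue(n[["Description"]])),
    stringsAsFactors = FALSE
  )
}

# Two made-up files in a temporary folder stand in for the real batch.
demo_dir <- file.path(tempdir(), "txc_demo")
dir.create(demo_dir, showWarnings = FALSE)
writeLines('<root><Routes><Route id="R1"><Description>One</Description></Route></Routes></root>',
           file.path(demo_dir, "a.xml"))
writeLines('<root><Routes><Route id="R2"><Description>Two</Description></Route></Routes></root>',
           file.path(demo_dir, "b.xml"))

# Find every .xml file, parse each one, and stack the results.
files <- list.files(demo_dir, pattern = "\\.xml$", full.names = TRUE)
all_routes <- do.call(rbind, lapply(files, parse_one))
```

Keeping the per-file logic in one function means the loop itself stays a single lapply() call, whatever the extraction ends up looking like.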
I'm new to handling xml data and to the XML package, so I don't really
know what I'm doing; this is my first stab at using the package.
So far the workflow goes something like this.
#load the package and get the file
library(XML)
doc=xmlTreeParse("cen_18-23-D-y11-2.xml")
top=xmlRoot(doc)
#look at the names
names(top)
#pick one of them to use, in this case the fourth one, 'routes', a table
#of information about this particular bus route. Using some code from
#another forum post, I can get a data.frame with the info I need in it.
#OK I need to do some reshaping but I can handle that later
fr4<-(top[[4]])
fr4
xmlSApply(fr4,function(x) xmlSApply(x,xmlValue))
df<-as.data.frame(xmlSApply(fr4,function(x) xmlSApply(x,xmlValue)))
df
#this works but when I try it with another table, say the fifth one, which
#captures information about the parts of the journey between stops, it
#falls over.
fr5<-(top[[5]])
fr5
xmlSApply(fr5,function(x) xmlSApply(x,xmlValue))
df<-as.data.frame(xmlSApply(fr5,function(x) xmlSApply(x,xmlValue)))
df
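One plausible cause, offered only as a guess, is that the records in the fifth table do not all have the same number of child elements. The made-up fragment below (element names are assumptions, not from the real file) shows how that alone makes the nested xmlSApply() return a ragged list, which as.data.frame() then refuses to convert.

```r
library(XML)

# Made-up fragment: the second record lacks a <RunTime> child.
txt <- '<Sections>
  <Section><From>A</From><To>B</To><RunTime>PT2M</RunTime></Section>
  <Section><From>B</From><To>C</To></Section>
</Sections>'

top_demo <- xmlRoot(xmlTreeParse(txt, asText = TRUE))

# Same nested call as above; with unequal record lengths sapply cannot
# simplify to a matrix, so the result stays a list.
res <- xmlSApply(top_demo, function(x) xmlSApply(x, xmlValue))
is.list(res)

# Converting the ragged list fails, so catch the error instead of stopping.
converted <- tryCatch({ as.data.frame(res); TRUE },
                      error = function(e) FALSE)
```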
Now I guess there is an irregularity in the xml causing this. I gather
from other posts that I should use XPath to interrogate this section of
the data. I've tried adapting some of the commands I've seen in
solutions to irregular xml problems on other forums, but have not got
what I want. I'm not really up on xml, but I assume the
<JourneySectionPattern id=****> part of the file is what is causing the
problem. It looks like there should be a field called JourneyPattern ID
(only, I guess, without the space) with the ID code as the actual field
contents.
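A minimal sketch of the XPath route follows, assuming a simplified, made-up fragment. The <Link> children and their fields are hypothetical, and get_field() is an illustrative helper, not part of the XML package. The id attribute is read with xmlGetAttr(), and missing child elements become NA so every row has the same columns.

```r
library(XML)

# Made-up fragment; only the JourneySectionPattern tag name comes from the
# description above, everything beneath it is assumed.
txt <- '<root>
  <JourneySectionPattern id="JPS_1">
    <Link id="L1"><From>A</From><To>B</To><RunTime>PT2M</RunTime></Link>
    <Link id="L2"><From>B</From><To>C</To></Link>
  </JourneySectionPattern>
</root>'

# XPath needs an internal document, so use xmlParse rather than xmlTreeParse.
doc5 <- xmlParse(txt, asText = TRUE)
links <- getNodeSet(doc5, "//JourneySectionPattern/Link")

# Hypothetical helper: value of a named child, or NA if the child is absent.
get_field <- function(node, name) {
  val <- xpathSApply(node, paste0("./", name), xmlValue)
  if (length(val) == 0) NA_character_ else val[[1]]
}

df5 <- data.frame(
  section_id = sapply(links, function(n) xmlGetAttr(xmlParent(n), "id")),
  link_id    = sapply(links, xmlGetAttr, "id"),
  from       = sapply(links, get_field, "From"),
  to         = sapply(links, get_field, "To"),
  run_time   = sapply(links, get_field, "RunTime"),
  stringsAsFactors = FALSE
)
```

If the real files declare a default namespace on the root element, these paths would probably match nothing until a prefix is supplied through the namespaces argument of getNodeSet().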
So my question is, is there a way to parse this table correctly and
output the resulting df as a csv?
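For the csv part, a sketch with a small made-up data.frame (the column names and output path are placeholders) would be:

```r
# Made-up data standing in for the parsed table.
routes <- data.frame(id = c("R1", "R2"),
                     description = c("One", "Two"),
                     stringsAsFactors = FALSE)

# row.names = FALSE stops R writing an extra column of row numbers.
out_file <- file.path(tempdir(), "routes.csv")
write.csv(routes, out_file, row.names = FALSE)

# Read it back to check the round trip.
back <- read.csv(out_file, stringsAsFactors = FALSE)
```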
All help gratefully received. BTW the link to the searchable r-help
archives seems to be broken.
GavinR