Need help extracting info from XML file using XML package

9 messages · Don MacQueen, David Winsemius, Romain Francois +2 more

#
I have an XML file that has within it the coordinates of some 
polygons that I would like to extract and use in R. The polygons are 
nested rather deeply. For example, I found by trial and error that I 
can extract the coordinates of one of them using functions from the 
XML package:

   doc <- xmlInternalTreeParse('doc.kml')
   docroot <- xmlRoot(doc)
   pgon <- xmlValue(docroot[[52]][[3]][[7]][[3]][[3]][[1]][[1]])

but this is hardly general!

I'm hoping there is some relatively straightforward way to use 
functions from the XML package to recursively descend the structure 
and return the text strings representing the polygons into, say, a 
list with as many elements as there are polygons. I've been looking 
at several XML documentation files downloaded from 
http://www.omegahat.org/RSXML/ , but since my understanding of XML is 
weak at best, I'm having trouble.  I can deal with converting the 
text strings to an R object suitable for plotting etc.


Here's a look at the structure of this file

graphics[5]% grep Polygon doc.kml
         <Polygon id="15342">
         </Polygon>
         <Polygon id="1073">
         </Polygon>
         <Polygon id="16508">
         </Polygon>
         <Polygon id="18665">
         </Polygon>
         <Polygon id="32903">
         </Polygon>
         <Polygon id="5232">
         </Polygon>

And each of the <Polygon> </Polygon> pairs has <coordinates> as per 
this example:


	<Polygon id="15342">
		<outerBoundaryIs>
			<LinearRing id="11467">
				<coordinates>
-23.679835352296,30.263840290388,5.000000000000001
-23.68138782285701,30.264740875186,5.000000000000001
    [snip]
-23.679835352296,30.263840290388,5.000000000000001
-23.679835352296,30.263840290388,5.000000000000001 </coordinates>
			</LinearRing>
		</outerBoundaryIs>
	</Polygon>


Thanks!
-Don


p.s.
There is a lot of other stuff in this file, i.e., some points, and 
attributes of the points such as color, as well as a legend 
describing what the polygons mean, but I can get by without all that 
stuff, at least for now.

Note also that readOGR() would in principle work, but the underlying 
OGR libraries have some limitations that this file exceeds. Per info 
at http://www.gdal.org/ogr/drv_kml.html.
#
A bit over a year ago I got useful advice from Gabor Grothendieck and  
Duncan Temple Lang in this thread:

http://finzi.psych.upenn.edu/R/Rhelp02/archive/117140.html

If the coordinates are nested deeply, then it is probably safer to search  
for a specific tag or tags just above them. You probably  
want to search for the "LinearRing" tag and then store the coordinates  
along with its "id". Perhaps some of my mistakes can be avoided as you  
work on your methods.
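
A minimal sketch of that approach with the XML package (the file name 
doc.kml follows the original post; this is untested against the actual 
file, and if the KML declares a default namespace the XPath will need 
a namespace prefix):

   library(XML)

   doc <- xmlInternalTreeParse('doc.kml')

   # find every LinearRing node, however deeply it is nested
   rings <- getNodeSet(doc, '//LinearRing')

   # keep each ring's id attribute together with its coordinates text
   coords <- lapply(rings, function(ring)
      list(id = xmlGetAttr(ring, 'id'),
           coordinates = xmlValue(ring[['coordinates']])))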
#
Don MacQueen wrote:
try

    lapply(
       xpathSApply(doc, '//Polygon',
          xpathSApply, '//coordinates', function(node)
              strsplit(xmlValue(node), split=',|\\s+')),
       as.numeric)

which should find all polygon nodes, extract the coordinates node for
each polygon separately, split the coordinates string by comma and
convert to a numeric vector, and then return a list of such vectors, one
vector per polygon.

i've tried it on some dummy data made up from your example below.  the
xpath patterns may need to be adjusted, depending on the actual
structure of your xml file, as may the strsplit pattern.

vQ
#
Hi,

You also might want to check R4X:

# install.packages("R4X", repos="http://R-Forge.R-project.org")
require( "R4X" )
x <- xml("http://code.google.com/apis/kml/documentation/KML_Samples.kml")
coords <- x["////Polygon///coordinates/#" ]
data <- sapply( strsplit( coords, "(,|\\s+)" ), as.numeric )

Romain
#
Romain Francois wrote:
With a bit more formatting :

# install.packages("R4X", repos="http://R-Forge.R-project.org")
require( "R4X" )
x <- xml("http://code.google.com/apis/kml/documentation/KML_Samples.kml")
coords <- x["////Polygon///coordinates/#" ]
data <- lapply( strsplit( coords, "(,|\\s+)" ), function(.){
  out <- matrix( as.numeric(.), ncol = 3, byrow = TRUE )
  colnames( out ) <- c("longitude", "latitude", "altitude" )
  out
})
names( data ) <- x["//Placemark/name/#" ]

Romain
#
Wacek Kusnierczyk wrote:
Just for the record, the XPath expression in the
second xpathSApply would need to be
    ".//coordinates"
to start searching from the previously matched Polygon node.
Otherwise, the search starts from the top of the document again.

However, it would seem that

   xpathSApply(doc, "//Polygon//coordinates",
                 function(node) strsplit(.....))

would be more direct, i.e. fetch the coordinates nodes in a single
XPath expression.
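
Put together, a minimal sketch of that direct form (file name doc.kml 
as in the original post; the split pattern and the assumption of three 
numbers per coordinate may need adjusting, and this is untested against 
the actual file):

   library(XML)

   doc <- xmlInternalTreeParse('doc.kml')

   # one XPath expression fetches every coordinates node under any
   # Polygon; each is converted to a three-column numeric matrix
   polys <- xpathApply(doc, '//Polygon//coordinates', function(node) {
      vals <- as.numeric(strsplit(xmlValue(node), ',|\\s+')[[1]])
      vals <- vals[!is.na(vals)]   # drop empty tokens from leading whitespace
      matrix(vals, ncol = 3, byrow = TRUE,
             dimnames = list(NULL, c('longitude', 'latitude', 'altitude')))
   })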

   D.
#
Duncan Temple Lang wrote:
not really:  the xpath pattern '//coordinates' does say 'find all
coordinates nodes searching from the root', but the root here is not the
original root of the whole document, but each polygon node in turn. 

try:

    root = xmlInternalTreeParse('
        <root>
            <foo>
                <bar>1</bar>
            </foo>
            <foo>
                <bar>2</bar>
            </foo>
        </root>')

    xpathApply(root, '//foo', xpathSApply, '//bar', xmlValue)
    # equals list("1", "2"), not list(c("1", "2"), c("1", "2"))

this is not equivalent to

    xpathApply(root, '//foo', function(foo) xpathSApply(root, '//bar',
xmlValue))

but to

    xpathApply(root, '//foo', function(foo) xpathSApply(foo, '//bar',
xmlValue))


as the author of the XML package, you should know ;)
yes, in this case it would;  i was not sure about the concrete schema. 
i copied the code from my solution to some other problem, where a polygon
would have multiple coordinates nodes which would have to be merged in
some way for each polygon separately -- your solution would return the
content of each coordinates node separately, irrespective of whether
it is unique within the polygon (which may well be the case here, and
thus your solution is undeniably more elegant).


vQ
#
Wacek Kusnierczyk wrote:
Just for the record, and to avoid confusion for anyone reading the 
archives in the future: the behaviour displayed above is from an old
version of the XML package (mid 2008). Subsequent versions yield
the second result, as //bar works from the root of the document,
while .//bar searches from the foo node down its sub-tree.

The reason for this is that, having used XPath to get a node, e.g. foo, 
we often want to go back up the XML tree from that current node, e.g.
   ../
   ./ancestor::foo
and so on.
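
A small illustration of that upward navigation (a made-up two-level
document; the node and attribute names are for illustration only, and
this assumes a version of the XML package in which relative XPath from
a matched node behaves as described above):

   library(XML)

   doc <- xmlInternalTreeParse(
      "<root><foo id='a'><bar>1</bar></foo><foo id='b'><bar>2</bar></foo></root>",
      asText = TRUE)

   # match the inner bar nodes, then climb back up to each enclosing foo
   sapply(getNodeSet(doc, '//bar'), function(bar) {
      parent <- getNodeSet(bar, 'ancestor::foo')[[1]]
      xmlGetAttr(parent, 'id')
   })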

    D.
#
Duncan Temple Lang wrote:
just for the record:  this is an excellent example of where the
behaviour (implementation + interface) of a package changes to make it
better, despite the trillions of lines of old code that get broken by
the change.  thanks, duncan.

vQ