Example for parsing XML file?

10 messages · Wacek Kusnierczyk, kulwinder banipal, Richard Cotton +3 more

#
Hi,

I am trying to parse XML files and read them into R as a data frame,
but have been unable to find examples which I could apply
successfully.

I'm afraid I don't know much about XML, which makes this all the more
difficult.  If someone could point me in the right direction to a
resource (preferably with an example or two), it would be greatly
appreciated.

Here is a snippet from one of the XML files that I am looking to read,
and I am aiming to be able to get it into a data frame with columns N,
T, A, B, C as in the 2nd level of the hierarchy.

  <?xml version="1.0" encoding="utf-8" ?>
- <C S="UnitA" D="1/3/2007" C="24745" F="24648">
  <T N="1" T="9:30:13 AM" A="30.05" B="29.85" C="30.05" />
  <T N="2" T="9:31:05 AM" A="29.89" B="29.78" C="30.05" />
  <T N="3" T="9:31:05 AM" A="29.9" B="29.86" C="29.87" />
  <T N="4" T="9:31:05 AM" A="29.86" B="29.86" C="29.87" />
  <T N="5" T="9:31:05 AM" A="29.89" B="29.86" C="29.87" />
  <T N="6" T="9:31:06 AM" A="29.89" B="29.85" C="29.86" />
  <T N="7" T="9:31:06 AM" A="29.89" B="29.85" C="29.86" />
  <T N="8" T="9:31:06 AM" A="29.89" B="29.85" C="29.86" />
</C>

Thanks for any help or direction anyone can provide.

As a point of reference, I am using R 2.8.1 and have loaded the XML package.
#
Brigid Mooney wrote:
There might be a simpler approach, but this seems to do:

    library(XML)

    input = xmlParse(
'<?xml version="1.0" encoding="utf-8" ?>
  <C S="UnitA" D="1/3/2007" C="24745" F="24648">
  <T N="1" T="9:30:13 AM" A="30.05" B="29.85" C="30.05" />
  <T N="2" T="9:31:05 AM" A="29.89" B="29.78" C="30.05" />
  <T N="3" T="9:31:05 AM" A="29.9" B="29.86" C="29.87" />
  <T N="4" T="9:31:05 AM" A="29.86" B="29.86" C="29.87" />
  <T N="5" T="9:31:05 AM" A="29.89" B="29.86" C="29.87" />
  <T N="6" T="9:31:06 AM" A="29.89" B="29.85" C="29.86" />
  <T N="7" T="9:31:06 AM" A="29.89" B="29.85" C="29.86" />
  <T N="8" T="9:31:06 AM" A="29.89" B="29.85" C="29.86" />
</C>')

    (output = data.frame(t(xpathSApply(input, '//T', xpathSApply, '@*'))))
    #   N          T     A     B     C
    # 1 1 9:30:13 AM 30.05 29.85 30.05
    # 2 2 9:31:05 AM 29.89 29.78 30.05
    # 3 3 9:31:05 AM  29.9 29.86 29.87
    # 4 4 9:31:05 AM 29.86 29.86 29.87
    # 5 5 9:31:05 AM 29.89 29.86 29.87
    # 6 6 9:31:06 AM 29.89 29.85 29.86
    # 7 7 9:31:06 AM 29.89 29.85 29.86
    # 8 8 9:31:06 AM 29.89 29.85 29.86

    output$N
    # [1] 1 2 3 4 5 6 7 8
    # Levels: 1 2 3 4 5 6 7 8

You may need to convert the columns afterwards; as the Levels line
above shows, they come back as factors.
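As a rough sketch of that conversion step: the data frame below is a hypothetical three-row stand-in for `output` (built by hand so the example runs without the XML package), and the choice of integer/numeric/character targets is an assumption about what the poster wants.

```r
# Hypothetical stand-in for the `output` data frame above, with every
# column stored as a factor.
output <- data.frame(
  N = c("1", "2", "3"),
  T = c("9:30:13 AM", "9:31:05 AM", "9:31:05 AM"),
  A = c("30.05", "29.89", "29.9"),
  B = c("29.85", "29.78", "29.86"),
  C = c("30.05", "30.05", "29.87"),
  stringsAsFactors = TRUE
)

# Go through as.character() first: as.numeric() applied directly to a
# factor returns the internal level codes, not the printed values.
output$N <- as.integer(as.character(output$N))
for (col in c("A", "B", "C")) {
  output[[col]] <- as.numeric(as.character(output[[col]]))
}

# Keep T as plain text here; turning "9:30:13 AM" into a time value
# depends on locale-specific AM/PM parsing, so it is left out.
output$T <- as.character(output$T)

str(output)
```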

vQ
#
Hi Brigid.

Here are a few commands that should do what you want:

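library(XML)

# Parse the file into an XML document
bri = xmlParse("myDataFile.xml")

# One column per <T> child, transposed to one row each; [, -1] drops "N"
tmp =  t(xmlSApply(xmlRoot(bri), xmlAttrs))[, -1]
# Plain integer row names instead of the duplicated "T"
dd = as.data.frame(tmp, stringsAsFactors = FALSE,
                     row.names = 1:nrow(tmp))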

And then you can convert the columns to whatever types you want
using regular R commands.

The basic idea is that for each of the child nodes of C,
i.e. the <T>'s, we want the character vector of attributes
which we can get with xmlAttrs().

Then we stack them together into a matrix, drop the "N"
and then convert the result to a data frame, avoiding
duplicate row names which are all "T".
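The three steps just described can be spelled out one at a time. This is only a sketch: it parses a shortened two-row copy of the posted XML from a string rather than from "myDataFile.xml", and the filter on node names is an added precaution, not something the commands above do.

```r
library(XML)

# Two rows from the original post, parsed from a string so the example
# is self-contained.
txt <- '<C S="UnitA" D="1/3/2007" C="24745" F="24648">
  <T N="1" T="9:30:13 AM" A="30.05" B="29.85" C="30.05" />
  <T N="2" T="9:31:05 AM" A="29.89" B="29.78" C="30.05" />
</C>'
doc <- xmlParse(txt, asText = TRUE)

# Step 1: one named character vector of attributes per <T> child.
# The name filter guards against whitespace text nodes being kept.
kids  <- xmlChildren(xmlRoot(doc))
kids  <- kids[names(kids) == "T"]
attrs <- lapply(kids, xmlAttrs)

# Step 2: stack the vectors into a character matrix, one row per <T>.
m <- do.call(rbind, attrs)

# Step 3: drop "N" and discard the duplicated "T" row names.
m <- m[, -1, drop = FALSE]
rownames(m) <- NULL
dd <- as.data.frame(m, stringsAsFactors = FALSE)

dd
```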

(BTW, make certain the '-' on the second line is not in the XML content.
  I assume that came from bringing the text into mail.)
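If the '-' really is in the text (for example after copying from a browser's XML view), something along these lines might clean it up before parsing. It is a sketch under the assumption that no line of the real data legitimately starts with "- "; the sample lines are shortened from the original post.

```r
library(XML)

# Lines as they might look after being copied from a browser's XML view.
raw_lines <- c(
  '  <?xml version="1.0" encoding="utf-8" ?>',
  '- <C S="UnitA" D="1/3/2007" C="24745" F="24648">',
  '  <T N="1" T="9:30:13 AM" A="30.05" B="29.85" C="30.05" />',
  '  <T N="2" T="9:31:05 AM" A="29.89" B="29.78" C="30.05" />',
  '</C>'
)

# Strip leading whitespace and an optional "- " marker from each line.
# Removing the leading whitespace also matters because the XML
# declaration must be the very first thing in the document.
clean <- sub("^[[:space:]]*(-[[:space:]]+)?", "", raw_lines)

doc <- xmlParse(paste(clean, collapse = "\n"), asText = TRUE)
xmlName(xmlRoot(doc))
```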

HTH
   D.
Brigid Mooney wrote:
#
Hi Kulwinder

There seem to be several points of confusion here.
You appear to have added new files to the installation of the XML
package, i.e. norel.xsd and LogCallSummary.bin, into exampleData.
They are not part of the regular XML package installation.

Because these are not part of the XML installation, we have no
idea what they contain.  The binary dump does not show us
the real content, just the sequence of bytes.

xmlParse() can read a regular XML file or a gzip compressed
file. But it cannot make sense of arbitrarily formatted
binary files.
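One way to find out early whether some input is XML at all is to wrap the parse in tryCatch(). The helper name parse_or_null is hypothetical (it is not part of the XML package), and the inputs are passed as text only to keep the sketch self-contained.

```r
library(XML)

# Hypothetical helper: returns the parsed document, or NULL if the
# input cannot be parsed as XML.
parse_or_null <- function(x, ...) {
  tryCatch(xmlParse(x, ...), error = function(e) NULL)
}

ok  <- parse_or_null("<tag><subtag>value</subtag></tag>", asText = TRUE)
bad <- parse_or_null("this is not XML", asText = TRUE)

is.null(ok)
is.null(bad)
```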

So if you want help on your task, you might try to explain
where you started, and at a higher/more abstract level than
binary files.

So please give a reproducible example that we might be able to emulate.

  D.
kulwinder banipal wrote:
...
00000f0: 0000 0001 0000 2325 0099 0100 0200 0000  ......#%........
0000100: 0200 0023 2600 9901 0002 0000 0003 0000  ...#&...........
0000110: 2327 0099 0100 0200 0000 0400 0023 2800  #'...........#(.
0000120: 9901 0002 0000 0005 0102 0008 0100 0066  ...............f
0000130: 6600 0055 5533 0000 0000 3400 0000 0a35  f..UU3....4....5
0000140: 0000 0014 3600 0000 1e37 0000 0028 3800  ....6....7...(8.
0000150: 0000 3239 0000 003c 3a00 0000 463b 0000  ..29...<:...F;..
0000160: 0050 3c00 0000 5a00 0088 8800 0077 7744  .P<...Z......wwD
0000170: 0000 0000 4500 0000 0a46 0000 0014 4700  ....E....F....G.
0000180: 0000 1e48 0000 0028 4900 0000 324a 0000  ...H...(I...2J..
0000190: 003c 4b00 0000 464c 0000 0050 4d00 0000  .<K...FL...PM...
00001a0: 5a02 2207 7766 6604 0500 0000 1100 0088  Z.".wff.........
00001b0: 8800 0000 0106 0000 0011 0000 8889 0000  ................
00001c0: 0011 0700 0000 1100 0088 8a00 0000 2108  ..............!.
00001d0: 0000 0011 0000 888b 0000 0031 0405 0000  ...........1..
...
00001e0: 0022 0000 0044 0000 0001 0600 0000 2200  ."...D........".
00001f0: 0000 4500 0000 1107 0000 0022 0000 0046  ..E........"...F
0000200: 0000 0021 0800 0000 2200 0000 4700 0000  ...!...."...G...
0000210: 3106 0000 0001 0002 0003 0004 0005 0200  1...............
0000220: 0033 4400 0055 6609 0101 0202 0303 0404  .3D..Uf.........
0000230: 0505 0606 0707 0808 0909 0405 0000 0011  ................
0000240: 0000 0044 0000 0022 0000 0088 0500 0000  ...D..."........
0000250: 1200 0000 4500 0000 2300 0000 8905 0000  ....E...#.......
0000260: 0013 0000 0046 0000 0024 0000 008a 0500  .....F...$......
0000270: 0000 1400 0000 4700 0000 2500 0000 8bfa  ......G...%.....
0000280: ae
#

Um, this isn't an XML file. An XML file should look something like this:

<?xml version="1.0" encoding="utf-8" ?>
<tag>
   <subtag>value</subtag>
</tag>

The Wikipedia entry on XML gives a reasonable intro to the format.
http://en.wikipedia.org/wiki/Xml
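For what it's worth, a file shaped like the small example above can be read with the XML package roughly as follows. The sketch parses the text from a string; unlike the attribute-based data earlier in the thread, the value here sits in the element's text, so xmlValue() is used rather than xmlAttrs().

```r
library(XML)

doc <- xmlParse('<?xml version="1.0" encoding="utf-8" ?>
<tag>
   <subtag>value</subtag>
</tag>', asText = TRUE)

# Name of the top-level element.
root_name <- xmlName(xmlRoot(doc))

# Text content of every <subtag> element in the document.
vals <- xpathSApply(doc, "//subtag", xmlValue)

root_name
vals
```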

Regards,
Richie.

Mathematical Sciences Unit
HSL


#
Thanks!  That helps a lot!

A quick follow-up question - I can't really tell what part of the
commands tells it to only look at the child nodes of <C>.  Is there any
way to also access the fields that are in the <C> hierarchy?  (i.e. the
S, D, C, and F)

I wouldn't necessarily want those repeated thousands of times in the
data frame, but C and F are useful reference points as they are
actually row numbers where specific events occurred.

Thanks again for all the help!
-Brigid



On Wed, May 20, 2009 at 5:16 PM, Duncan Temple Lang
<duncan at wald.ucdavis.edu> wrote:
#
Brigid Mooney wrote:
xmlRoot(bri) gives us the C node.

xmlSApply(node, f) is short-hand for
   sapply(xmlChildren(node), f)
so that is where we loop over the children.

 > Is there any way to also access the fields that are in the <C> hierarchy?

xmlAttrs(xmlRoot(bri))
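A short sketch of how those root attributes might then be used as reference values rather than repeated in every row. The XML is a one-row copy of the posted data parsed from a string, and the month/day/year reading of D is an assumption, since the post does not say which order is meant.

```r
library(XML)

txt <- '<C S="UnitA" D="1/3/2007" C="24745" F="24648">
  <T N="1" T="9:30:13 AM" A="30.05" B="29.85" C="30.05" />
</C>'
root <- xmlRoot(xmlParse(txt, asText = TRUE))

# Attributes of the <C> node itself, as a named character vector.
meta <- xmlAttrs(root)

# C and F are described as row numbers, so convert them to integers.
# as.integer() drops the names, so they are put back afterwards.
event_rows <- as.integer(meta[c("C", "F")])
names(event_rows) <- c("C", "F")

# Assumption: D is month/day/year.
session_date <- as.Date(meta[["D"]], format = "%m/%d/%Y")

meta[["S"]]
event_rows
session_date
```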
3 days later