Hi,
I am trying to parse XML files and read them into R as a data frame, but have been unable to find examples which I could apply successfully. I'm afraid I don't know much about XML, which makes this all the more difficult. If someone could point me in the right direction to a resource (preferably with an example or two), it would be greatly appreciated.
Here is a snippet from one of the XML files that I am looking to read, and I am aiming to be able to get it into a data frame with columns N, T, A, B, C as in the 2nd level of the hierarchy.
<?xml version="1.0" encoding="utf-8" ?>
- <C S="UnitA" D="1/3/2007" C="24745" F="24648">
<T N="1" T="9:30:13 AM" A="30.05" B="29.85" C="30.05" />
<T N="2" T="9:31:05 AM" A="29.89" B="29.78" C="30.05" />
<T N="3" T="9:31:05 AM" A="29.9" B="29.86" C="29.87" />
<T N="4" T="9:31:05 AM" A="29.86" B="29.86" C="29.87" />
<T N="5" T="9:31:05 AM" A="29.89" B="29.86" C="29.87" />
<T N="6" T="9:31:06 AM" A="29.89" B="29.85" C="29.86" />
<T N="7" T="9:31:06 AM" A="29.89" B="29.85" C="29.86" />
<T N="8" T="9:31:06 AM" A="29.89" B="29.85" C="29.86" />
</C>
Thanks for any help or direction anyone can provide. As a point of reference, I am using R 2.8.1 and have loaded the XML package.
Example for parsing XML file?
10 messages · Wacek Kusnierczyk, kulwinder banipal, Richard Cotton +3 more
Brigid Mooney wrote:
Hi, I am trying to parse XML files and read them into R as a data frame, but have been unable to find examples which I could apply successfully. I'm afraid I don't know much about XML, which makes this all the more difficult. If someone could point me in the right direction to a resource (preferably with an example or two), it would be greatly appreciated. Here is a snippet from one of the XML files that I am looking to read, and I am aiming to be able to get it into a data frame with columns N, T, A, B, C as in the 2nd level of the hierarchy.
There might be a simpler approach, but this seems to do:
library(XML)
input = xmlParse(
'<?xml version="1.0" encoding="utf-8" ?>
<C S="UnitA" D="1/3/2007" C="24745" F="24648">
<T N="1" T="9:30:13 AM" A="30.05" B="29.85" C="30.05" />
<T N="2" T="9:31:05 AM" A="29.89" B="29.78" C="30.05" />
<T N="3" T="9:31:05 AM" A="29.9" B="29.86" C="29.87" />
<T N="4" T="9:31:05 AM" A="29.86" B="29.86" C="29.87" />
<T N="5" T="9:31:05 AM" A="29.89" B="29.86" C="29.87" />
<T N="6" T="9:31:06 AM" A="29.89" B="29.85" C="29.86" />
<T N="7" T="9:31:06 AM" A="29.89" B="29.85" C="29.86" />
<T N="8" T="9:31:06 AM" A="29.89" B="29.85" C="29.86" />
</C>')
(output = data.frame(t(xpathSApply(input, '//T', xpathSApply, '@*'))))
# N T A B C
# 1 1 9:30:13 AM 30.05 29.85 30.05
# 2 2 9:31:05 AM 29.89 29.78 30.05
# 3 3 9:31:05 AM 29.9 29.86 29.87
# 4 4 9:31:05 AM 29.86 29.86 29.87
# 5 5 9:31:05 AM 29.89 29.86 29.87
# 6 6 9:31:06 AM 29.89 29.85 29.86
# 7 7 9:31:06 AM 29.89 29.85 29.86
# 8 8 9:31:06 AM 29.89 29.85 29.86
output$N
# [1] 1 2 3 4 5 6 7 8
# Levels: 1 2 3 4 5 6 7 8
You may need to reformat the columns.
vQ
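As a rough sketch of the column reformatting mentioned above: the "Levels" line printed for output$N suggests the columns came out as factors, so converting them needs a detour through as.character(). The small data frame below is a hand-made stand-in for the parsed result (an assumption for illustration, not the actual output object), and stringsAsFactors = TRUE is set explicitly to mimic that behaviour.

```r
# Stand-in for the parsed result: every column is a factor, as the
# "Levels" line for output$N suggests (stringsAsFactors = TRUE mimics that).
output = data.frame(
  N = c("1", "2", "3"),
  T = c("9:30:13 AM", "9:31:05 AM", "9:31:05 AM"),
  A = c("30.05", "29.89", "29.9"),
  stringsAsFactors = TRUE
)

# as.numeric() applied directly to a factor returns the internal level
# codes, not the values, so convert to character first.
output$N = as.integer(as.character(output$N))
output$A = as.numeric(as.character(output$A))
output$T = as.character(output$T)

str(output)
```

The same pattern would apply to the B and C columns of the real data.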
Hi Brigid.
Here are a few commands that should do what you want:
bri = xmlParse("myDataFile.xml")
tmp = t(xmlSApply(xmlRoot(bri), xmlAttrs))[, -1]
dd = as.data.frame(tmp, stringsAsFactors = FALSE,
row.names = 1:nrow(tmp))
And then you can convert the columns to whatever types you want
using regular R commands.
The basic idea is that for each of the child nodes of C,
i.e. the <T>'s, we want the character vector of attributes
which we can get with xmlAttrs().
Then we stack them together into a matrix, drop the "N"
and then convert the result to a data frame, avoiding
duplicate row names which are all "T".
(BTW, make certain the '-' on the second line is not in the XML content.
I assume that came from bringing the text into mail.)
HTH
D.
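For anyone wanting to try these steps without a file on disk, here is a minimal sketch that runs them on a shortened copy of the sample data passed as text. It is a variation rather than the exact commands above: the [, -1] is left out so that the N column asked for in the original post is kept, and the type conversions at the end are one possible choice. It assumes the XML package is installed.

```r
library(XML)  # assumes the XML package is installed

# Shortened copy of the sample document from the original post
xml_text = '<C S="UnitA" D="1/3/2007" C="24745" F="24648">
<T N="1" T="9:30:13 AM" A="30.05" B="29.85" C="30.05" />
<T N="2" T="9:31:05 AM" A="29.89" B="29.78" C="30.05" />
<T N="3" T="9:31:05 AM" A="29.9" B="29.86" C="29.87" />
</C>'

doc = xmlParse(xml_text, asText = TRUE)

# One column of attributes per <T> child; transpose so each <T> is a row
tmp = t(xmlSApply(xmlRoot(doc), xmlAttrs))

# Keep every column (including N) and replace the duplicated "T" row names
dd = as.data.frame(tmp, stringsAsFactors = FALSE,
                   row.names = 1:nrow(tmp))

# Convert the character columns using regular R commands
dd$N = as.integer(dd$N)
dd[c("A", "B", "C")] = lapply(dd[c("A", "B", "C")], as.numeric)

str(dd)
```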
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
An embedded and charset-unspecified text was scrubbed... Name: not available URL: <https://stat.ethz.ch/pipermail/r-help/attachments/20090520/1def9ec0/attachment-0001.pl>
Hi Kulwinder
There seem to be several points of confusion here.
You appear to have added new files to the installation of the XML package, i.e. norel.xsd and LogCallSummary.bin in exampleData. They are not part of the regular XML package installation, so we have no idea what they contain. The binary dump does not show us the real content, just the sequence of bytes.
xmlParse() can read a regular XML file or a gzip-compressed file, but it cannot make sense of arbitrarily formatted binary files.
So if you want help with your task, you might try to explain where you started, and at a higher/more abstract level than binary files. Please give a reproducible example that we might be able to emulate.
D.
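One quick way to see whether a file could be XML at all, before handing it to xmlParse(), is to peek at its first bytes. The sketch below is only an illustration: looks_like_xml() is a hypothetical helper, not a function from the XML package, and it only covers plain single-byte/UTF-8 text and gzip files (UTF-16 input, for instance, would be misreported).

```r
# Hypothetical helper (not part of the XML package): peek at the first
# bytes of a file and report whether it could plausibly be XML.
looks_like_xml = function(path, n = 64L) {
  codes = as.integer(readBin(path, what = "raw", n = n))
  # gzip-compressed files start with the magic bytes 1f 8b
  if (length(codes) >= 2L && all(codes[1:2] == c(0x1f, 0x8b)))
    return(TRUE)
  # Skip a UTF-8 byte-order mark if present
  if (length(codes) >= 3L && all(codes[1:3] == c(0xef, 0xbb, 0xbf)))
    codes = codes[-(1:3)]
  # Ignore whitespace (tab, LF, CR, space); the first remaining byte
  # of an XML document should be "<" (code 60)
  rest = codes[!(codes %in% c(9L, 10L, 13L, 32L))]
  length(rest) > 0L && rest[1] == 60L
}

# Usage with two temporary files
xml_file = tempfile(fileext = ".xml")
writeLines('<?xml version="1.0"?><tag/>', xml_file)

bin_file = tempfile(fileext = ".bin")
# First four bytes of the hex dump posted in this thread
writeBin(as.raw(c(0x02, 0x81, 0x00, 0x01)), bin_file)

looks_like_xml(xml_file)
looks_like_xml(bin_file)
```

A file that fails this check would be expected to trigger the "Start tag expected, '<' not found" error reported in the thread.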
kulwinder banipal wrote:
Hi,
I am trying to parse XML file ( binary hex) but get an error. Code I am using is:
xsd = xmlTreeParse(system.file("exampleData", "norel.xsd", package = "XML"), isSchema = TRUE)
doc = xmlInternalTreeParse(system.file("exampleData", "LogCallSummary.bin", package = "XML"))
Start tag expected, '<' not found
xmlParse command results in same error as well:
f = system.file("exampleData", "LogCallSummary.bin", package = "XML")
> doc = xmlParse(f)
Start tag expected, '<' not found
I am at beginner level with XML and will appreciate any help with this error or general guidance.
Thanks
Kulwinder Banipal
file is:
0000000: 0281 0001 0201 0098 c1d5 c000 0000 0000 ................
0000010: 000a c0a8 db35 0055 6000 00af 0001 0001 .....5.U`.......
0000020: 5f00 2200 4530 0000 4411 2233 4455 0f08 _.".E0..D."3DU..
0000030: 0123 4567 8901 2340 0000 04d2 0000 0000 .#Eg..#@........
0000040: 0000 0000 0002 0100 0001 0003 0303 0000 ................
0000050: 0000 0000 0100 0000 6400 0000 0100 0000 ........d.......
0000060: 6401 0300 0900 00fe fe00 012f 0001 1111 d........../....
0000070: 0101 0001 1111 0000 0001 0000 2200 0033 ............"..3
0000080: 3306 0000 3333 0022 0000 1100 0000 0000 3...33."........
0000090: 0033 3400 2300 0011 0000 0001 0000 3335 .34.#.........35
00000a0: 0024 0000 1100 0000 0200 0033 3600 2500 .$.........36.%.
00000b0: 0011 0000 0003 0000 3337 0026 0000 1100 ........37.&....
00000c0: 0000 0400 0033 3800 2700 0011 0000 0005 .....38.'.......
00000d0: 5504 7700 8800 0044 4406 0000 2323 0099 U.w....DD...##..
00000e0: 0100 0200 0000 0000 0023 2400 9901 0002 .........#$.....
00000f0: 0000 0001 0000 2325 0099 0100 0200 0000 ......#%........
0000100: 0200 0023 2600 9901 0002 0000 0003 0000 ...#&...........
0000110: 2327 0099 0100 0200 0000 0400 0023 2800 #'...........#(.
0000120: 9901 0002 0000 0005 0102 0008 0100 0066 ...............f
0000130: 6600 0055 5533 0000 0000 3400 0000 0a35 f..UU3....4....5
0000140: 0000 0014 3600 0000 1e37 0000 0028 3800 ....6....7...(8.
0000150: 0000 3239 0000 003c 3a00 0000 463b 0000 ..29...<:...F;..
0000160: 0050 3c00 0000 5a00 0088 8800 0077 7744 .P<...Z......wwD
0000170: 0000 0000 4500 0000 0a46 0000 0014 4700 ....E....F....G.
0000180: 0000 1e48 0000 0028 4900 0000 324a 0000 ...H...(I...2J..
0000190: 003c 4b00 0000 464c 0000 0050 4d00 0000 .<K...FL...PM...
00001a0: 5a02 2207 7766 6604 0500 0000 1100 0088 Z.".wff.........
00001b0: 8800 0000 0106 0000 0011 0000 8889 0000 ................
00001c0: 0011 0700 0000 1100 0088 8a00 0000 2108 ..............!.
00001d0: 0000 0011 0000 888b 0000 0031 0405 0000 ...........1....
00001e0: 0022 0000 0044 0000 0001 0600 0000 2200 ."...D........".
00001f0: 0000 4500 0000 1107 0000 0022 0000 0046 ..E........"...F
0000200: 0000 0021 0800 0000 2200 0000 4700 0000 ...!...."...G...
0000210: 3106 0000 0001 0002 0003 0004 0005 0200 1...............
0000220: 0033 4400 0055 6609 0101 0202 0303 0404 .3D..Uf.........
0000230: 0505 0606 0707 0808 0909 0405 0000 0011 ................
0000240: 0000 0044 0000 0022 0000 0088 0500 0000 ...D..."........
0000250: 1200 0000 4500 0000 2300 0000 8905 0000 ....E...#.......
0000260: 0013 0000 0046 0000 0024 0000 008a 0500 .....F...$......
0000270: 0000 1400 0000 4700 0000 2500 0000 8bfa ......G...%.....
0000280: ae
Um, this isn't an XML file. An XML file should look something like this:
<?xml version="1.0" encoding="utf-8" ?>
<tag>
<subtag>value</subtag>
</tag>
The Wikipedia entry on XML gives a reasonable intro to the format.
http://en.wikipedia.org/wiki/Xml
Regards,
Richie.
Mathematical Sciences Unit
HSL
Thanks! That helps a lot!
A quick follow-up question - I can't really tell what part of the commands tells it to only look at the child nodes of <C>. Is there any way to also access the fields that are in the <C> hierarchy (i.e. the S, D, C, and F)? I wouldn't necessarily want those repeated thousands of times in the data frame, but C and F are useful reference points as they are actually row numbers where specific events occurred.
Thanks again for all the help!
-Brigid
On Wed, May 20, 2009 at 5:16 PM, Duncan Temple Lang <duncan at wald.ucdavis.edu> wrote:
Hi Brigid.
Here are a few commands that should do what you want:
bri = xmlParse("myDataFile.xml")
tmp = t(xmlSApply(xmlRoot(bri), xmlAttrs))[, -1]
dd = as.data.frame(tmp, stringsAsFactors = FALSE,
row.names = 1:nrow(tmp))
And then you can convert the columns to whatever types you want
using regular R commands.
The basic idea is that for each of the child nodes of C,
i.e. the <T>'s, we want the character vector of attributes
which we can get with xmlAttrs().
Then we stack them together into a matrix, drop the "N"
and then convert the result to a data frame, avoiding
duplicate row names which are all "T".
(BTW, make certain the '-' on the second line is not in the XML content.
I assume that came from bringing the text into mail.)
HTH
D.
Brigid Mooney wrote:
Thanks! That helps a lot! A quick follow-up question - I can't really tell what part of the commands tell it to only look at the child nodes of <C>.
xmlRoot(bri) gives us the C node. xmlSApply(node, f) is shorthand for sapply(xmlChildren(node), f), so that is where we loop over the children.
> Is there any way to also access the fields that are in the <C> hierarchy? (i.e. the S, D, C, and F)
xmlAttrs(xmlRoot(bri))
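To make the two answers above concrete, here is a small sketch on a shortened copy of the sample data (passed as text, so it is an assumption-laden stand-in for the real file). It compares xmlSApply() with the sapply(xmlChildren()) form and pulls the C and F values out of the root node once, instead of repeating them on every row. It assumes the XML package is installed; the variable names are only illustrative.

```r
library(XML)  # assumes the XML package is installed

xml_text = '<C S="UnitA" D="1/3/2007" C="24745" F="24648">
<T N="1" T="9:30:13 AM" A="30.05" B="29.85" C="30.05" />
<T N="2" T="9:31:05 AM" A="29.89" B="29.78" C="30.05" />
</C>'

root = xmlRoot(xmlParse(xml_text, asText = TRUE))

# Two ways of looping over the children of the root node
a1 = xmlSApply(root, xmlAttrs)
a2 = sapply(xmlChildren(root), xmlAttrs)
all(a1 == a2)

# Attributes of <C> itself: a named character vector with S, D, C and F
header = xmlAttrs(root)
names(header)

# Pull single values out, converting the row-number fields to integers
event_rows = c(C = xmlGetAttr(root, "C", converter = as.integer),
               F = xmlGetAttr(root, "F", converter = as.integer))
event_rows
```

Keeping event_rows as a separate small vector next to the data frame avoids repeating the values thousands of times.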
3 days later
An embedded and charset-unspecified text was scrubbed... Name: not available URL: <https://stat.ethz.ch/pipermail/r-help/attachments/20090524/e18802a6/attachment-0001.pl>
An embedded and charset-unspecified text was scrubbed... Name: not available URL: <https://stat.ethz.ch/pipermail/r-help/attachments/20090524/3551d4fa/attachment-0001.pl>