Creating a Data Frame from an XML
On Jan 22, 2013, at 3:11 PM, Adam Gabbert wrote:
Hello,
I'm attempting to read information from an XML into a data frame in R using
the "XML" package. I am unable to get the data into a data frame as I would
like. I have some sample code below.
*XML Code:*
Header...
Data I want in a data frame:
<data>
<row BRAND="GMC" NUM="1" YEAR="1999" VALUE="10000" />
<row BRAND="FORD" NUM="1" YEAR="2000" VALUE="12000" />
<row BRAND="GMC" NUM="1" YEAR="2001" VALUE="12500" />
<row BRAND="FORD" NUM="1" YEAR="2002" VALUE="13000" />
<row BRAND="GMC" NUM="1" YEAR="2003" VALUE="14000" />
<row BRAND="FORD" NUM="1" YEAR="2004" VALUE="17000" />
<row BRAND="GMC" NUM="1" YEAR="2005" VALUE="15000" />
<row BRAND="GMC" NUM="1" YEAR="1967" VALUE="PRICLESS" />
<row BRAND="FORD" NUM="1" YEAR="2007" VALUE="17500" />
<row BRAND="GMC" NUM="1" YEAR="2008" VALUE="22000" />
</data>
*R Code:*
library(XML)
doc <- xmlInternalTreeParse("Sample2.xml")
top <- xmlRoot (doc)
xmlName (top)
names (top)
art <- top [["row"]]
art
*Output:*
<row BRAND="GMC" NUM="1" YEAR="1999" VALUE="10000"/>
This is where I am having difficulties. I am unable to "access" additional rows (e.g. <row BRAND="GMC" NUM="1" YEAR="1967" VALUE="PRICLESS" />), and I am unable to access the individual entries to actually create the data frame. The data frame I would like is as follows:

BRAND NUM YEAR VALUE
GMC   1   1999 10000
FORD  2   2000 12000
GMC   1   2001 12500
etc.

Any help or suggestions would be appreciated. Conversely, my eventual goal would be to take a data frame and write it into an XML file in the previously shown format.
Hi,

You are so close! You have a number of nodes with the name 'row'. The "[[" function selects just one item from a list, and when several items share that name it returns just the first. So you really want to use the "[" function instead, which returns all of the matching nodes, and then select by position using "[[".

library(XML)
s <- c(" <data>",
  " <row BRAND=\"GMC\" NUM=\"1\" YEAR=\"1999\" VALUE=\"10000\" />",
  " <row BRAND=\"FORD\" NUM=\"1\" YEAR=\"2000\" VALUE=\"12000\" />",
  " <row BRAND=\"GMC\" NUM=\"1\" YEAR=\"2001\" VALUE=\"12500\" />",
  " <row BRAND=\"FORD\" NUM=\"1\" YEAR=\"2002\" VALUE=\"13000\" />",
  " <row BRAND=\"GMC\" NUM=\"1\" YEAR=\"2003\" VALUE=\"14000\" />",
  " <row BRAND=\"FORD\" NUM=\"1\" YEAR=\"2004\" VALUE=\"17000\" />",
  " <row BRAND=\"GMC\" NUM=\"1\" YEAR=\"2005\" VALUE=\"15000\" />",
  " <row BRAND=\"GMC\" NUM=\"1\" YEAR=\"1967\" VALUE=\"PRICLESS\" />",
  " <row BRAND=\"FORD\" NUM=\"1\" YEAR=\"2007\" VALUE=\"17500\" />",
  " <row BRAND=\"GMC\" NUM=\"1\" YEAR=\"2008\" VALUE=\"22000\" />",
  " </data>")
x <- xmlRoot(xmlTreeParse(s, asText = TRUE, useInternalNodes = TRUE))
x["row"][[1]]
<row BRAND="GMC" NUM="1" YEAR="1999" VALUE="10000"/>
x["row"][[2]]
<row BRAND="FORD" NUM="1" YEAR="2000" VALUE="12000"/>

Your rows are set up so the attributes have the values you want - use xmlAttrs to retrieve them.
xmlAttrs(x["row"][[2]])
  BRAND     NUM    YEAR   VALUE
 "FORD"     "1"  "2000" "12000"

You can use lapply to iterate through each row and apply the xmlAttrs function. You'll end up with a list of character vectors.
y <- lapply(x["row"], xmlAttrs)
str(y)
List of 10
 $ row: Named chr [1:4] "GMC" "1" "1999" "10000"
  ..- attr(*, "names")= chr [1:4] "BRAND" "NUM" "YEAR" "VALUE"
 $ row: Named chr [1:4] "FORD" "1" "2000" "12000"
  ..- attr(*, "names")= chr [1:4] "BRAND" "NUM" "YEAR" "VALUE"
 $ row: Named chr [1:4] "GMC" "1" "2001" "12500"
  ..- attr(*, "names")= chr [1:4] "BRAND" "NUM" "YEAR" "VALUE"
 . . .

Next, make a character matrix using do.call and rbind ...
m <- do.call(rbind, y)
str(m)
 chr [1:10, 1:4] "GMC" "FORD" "GMC" "FORD" "GMC" "FORD" "GMC" "GMC" "FORD" ...
 - attr(*, "dimnames")=List of 2
  ..$ : chr [1:10] "row" "row" "row" "row" ...
  ..$ : chr [1:4] "BRAND" "NUM" "YEAR" "VALUE"

And then on to a data.frame...
d <- as.data.frame(m)
str(d)
'data.frame': 10 obs. of 4 variables:
 $ BRAND: chr "GMC" "FORD" "GMC" "FORD" ...
 $ NUM  : chr "1" "1" "1" "1" ...
 $ YEAR : chr "1999" "2000" "2001" "2002" ...
 $ VALUE: chr "10000" "12000" "12500" "13000" ...

Cheers,
Ben
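A follow-up note on the result above: every column comes out as character, because XML attributes are always text. If numeric columns are wanted, something along these lines might work. This is only a sketch: the small data frame is a stand-in for d so the snippet runs on its own, and the VALUE_RAW column name is made up for illustration.

```r
# Stand-in for the data frame `d` built above: every column is character.
d <- data.frame(
  BRAND = c("GMC", "FORD", "GMC"),
  NUM   = c("1", "1", "1"),
  YEAR  = c("1999", "2000", "1967"),
  VALUE = c("10000", "12000", "PRICLESS"),
  stringsAsFactors = FALSE
)

# NUM and YEAR hold only digits here, so a direct integer conversion is safe.
d$NUM  <- as.integer(d$NUM)
d$YEAR <- as.integer(d$YEAR)

# VALUE mixes numbers with free text ("PRICLESS"). as.numeric() turns the
# non-numeric entries into NA and warns; suppressWarnings() hides that
# expected warning. The original text is kept in VALUE_RAW (hypothetical name).
d$VALUE_RAW <- d$VALUE
d$VALUE     <- suppressWarnings(as.numeric(d$VALUE))

str(d)
```

Whether the non-numeric VALUE entries should become NA or stay as text depends on what the data frame is used for, so treat the NA conversion as one option rather than the answer.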
Thank you
AG
Ben Tupper
Bigelow Laboratory for Ocean Sciences
180 McKown Point Rd. P.O. Box 475
West Boothbay Harbor, Maine 04575-0475
http://www.bigelow.org
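The second half of the original question, going from a data frame back to XML in the same <data>/<row> layout, was not covered in the reply. One possible approach with the XML package is sketched below, using newXMLNode to build the tree and saveXML to serialise it. The two-row data frame and the "out.xml" file name are placeholders, not something from the thread.

```r
library(XML)

# Hypothetical input: a data frame shaped like the one built from the XML.
d <- data.frame(
  BRAND = c("GMC", "FORD"),
  NUM   = c("1", "1"),
  YEAR  = c("1999", "2000"),
  VALUE = c("10000", "12000"),
  stringsAsFactors = FALSE
)

# Create the <data> root, then add one <row> child per data frame row,
# storing each column of that row as an attribute.
root <- newXMLNode("data")
for (i in seq_len(nrow(d))) {
  # Named character vector: names are the column names, values the cell text.
  attrs <- vapply(d[i, ], as.character, character(1))
  newXMLNode("row", attrs = attrs, parent = root)
}

# Print the serialised tree. saveXML(root, file = "out.xml") would write it
# to disk instead ("out.xml" is a placeholder file name).
cat(saveXML(root), "\n")
```

A plain loop is used here to keep the row-to-node mapping easy to follow; for a large data frame an lapply over the row indices would do the same job.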