Creating a Data Frame from an XML

An embedded and charset-unspecified text was scrubbed...
Name: not available
URL: <https://stat.ethz.ch/pipermail/r-help/attachments/20130122/1850a84d/attachment.pl>

Hello,

I'm attempting to read information from an XML into a data frame in R using
the "XML" package. I am unable to get the data into a data frame as I would
like.  I have some sample code below.

*XML Code:*

Header...

Data I want in a data frame:

  <data>
 <row BRAND="GMC" NUM="1" YEAR="1999" VALUE="10000" />
 <row BRAND="FORD" NUM="1" YEAR="2000" VALUE="12000" />
 <row BRAND="GMC" NUM="1" YEAR="2001" VALUE="12500" />
 <row BRAND="FORD" NUM="1" YEAR="2002" VALUE="13000" />
 <row BRAND="GMC" NUM="1" YEAR="2003" VALUE="14000" />
 <row BRAND="FORD" NUM="1" YEAR="2004" VALUE="17000" />
 <row BRAND="GMC" NUM="1" YEAR="2005" VALUE="15000" />
 <row BRAND="GMC" NUM="1" YEAR="1967" VALUE="PRICLESS" />
 <row BRAND="FORD" NUM="1" YEAR="2007" VALUE="17500" />
 <row BRAND="GMC" NUM="1" YEAR="2008" VALUE="22000" />
 </data>

*R Code:*

doc< -xmlInternalTreeParse ("Sample2.xml")
top <- xmlRoot (doc)
xmlName (top)
names (top)
art <- top [["row"]]
art
**
*Output:*

art<row BRAND="GMC" NUM="1" YEAR="1999" VALUE="10000"/>

This is where I am having difficulties.  I am unable to "access" additional
rows; ( i.e.  <row BRAND="GMC" NUM="1" YEAR="1967" VALUE="PRICLESS" /> )

and I am unable to access the individual entries to actually create the
data frame.  The data frame I would like is as follows:

BRAND    NUM    YEAR    VALUE
GMC        1          1999      10000
FORD       2          2000      12000
GMC        1          2001       12500
   etc........

Any help or suggestions would be appreciated.  Conversly, my eventual goal
would be to take a data frame and write it into an XML in the previously
shown format.

Hi,

You are so close!

You have a number of nodes with the name 'row'.  The "[[" function selects just one item from a list, and when there's a number that have that name it returns just the first.  So you really want to use the "[" function instead and then select by order index using "[["

library(XML)
s <- c("  <data>", " <row BRAND=\"GMC\" NUM=\"1\" YEAR=\"1999\" VALUE=\"10000\" />", 
" <row BRAND=\"FORD\" NUM=\"1\" YEAR=\"2000\" VALUE=\"12000\" />", 
" <row BRAND=\"GMC\" NUM=\"1\" YEAR=\"2001\" VALUE=\"12500\" />", 
" <row BRAND=\"FORD\" NUM=\"1\" YEAR=\"2002\" VALUE=\"13000\" />", 
" <row BRAND=\"GMC\" NUM=\"1\" YEAR=\"2003\" VALUE=\"14000\" />", 
" <row BRAND=\"FORD\" NUM=\"1\" YEAR=\"2004\" VALUE=\"17000\" />", 
" <row BRAND=\"GMC\" NUM=\"1\" YEAR=\"2005\" VALUE=\"15000\" />", 
" <row BRAND=\"GMC\" NUM=\"1\" YEAR=\"1967\" VALUE=\"PRICLESS\" />", 
" <row BRAND=\"FORD\" NUM=\"1\" YEAR=\"2007\" VALUE=\"17500\" />", 
" <row BRAND=\"GMC\" NUM=\"1\" YEAR=\"2008\" VALUE=\"22000\" />", 
" </data>")
x <- xmlRoot(xmlTreeParse(s, asText = TRUE, useInternalNodes = TRUE))
x["row"][[1]]
<row BRAND="GMC" NUM="1" YEAR="1999" VALUE="10000"/>
x["row"][[2]]
<row BRAND="FORD" NUM="1" YEAR="2000" VALUE="12000"/> 

Your rows are set up so the attributes have the values you want - use xmlAttrs to retrieve them.
xmlAttrs(x["row"][[2]])
BRAND     NUM    YEAR   VALUE 
 "FORD"     "1"  "2000" "12000" 

You can use lapply to iterate through each row and apply the xmlAttrs function.  You'll end up with a list if character vectors.
y <- lapply(x["row"], xmlAttrs)
str(y)
List of 10
 $ row: Named chr [1:4] "GMC" "1" "1999" "10000"
  ..- attr(*, "names")= chr [1:4] "BRAND" "NUM" "YEAR" "VALUE"
 $ row: Named chr [1:4] "FORD" "1" "2000" "12000"
  ..- attr(*, "names")= chr [1:4] "BRAND" "NUM" "YEAR" "VALUE"
 $ row: Named chr [1:4] "GMC" "1" "2001" "12500"
  ..- attr(*, "names")= chr [1:4] "BRAND" "NUM" "YEAR" "VALUE"
	.
	.
	.

Next make a character matrix using do.call and rbind ...
m <- do.call(rbind, y)
str(m)
chr [1:10, 1:4] "GMC" "FORD" "GMC" "FORD" "GMC" "FORD" "GMC" "GMC" "FORD" ...
 - attr(*, "dimnames")=List of 2
  ..$ : chr [1:10] "row" "row" "row" "row" ...
  ..$ : chr [1:4] "BRAND" "NUM" "YEAR" "VALUE"

And then on to a data.frame...
d <- as.data.frame(m)
str(d)
'data.frame':	10 obs. of  4 variables:
 $ BRAND: chr  "GMC" "FORD" "GMC" "FORD" ...
 $ NUM  : chr  "1" "1" "1" "1" ...
 $ YEAR : chr  "1999" "2000" "2001" "2002" ...
 $ VALUE: chr  "10000" "12000" "12500" "13000" ...

Cheers,
Ben
Thank you

AG

	[[alternative HTML version deleted]]

______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Ben Tupper
Bigelow Laboratory for Ocean Sciences
180 McKown Point Rd. P.O. Box 475
West Boothbay Harbor, Maine   04575-0475 
http://www.bigelow.org
Hello,

I'm attempting to read information from an XML into a data frame in R using
the "XML" package. I am unable to get the data into a data frame as I would
like.  I have some sample code below.

*XML Code:*

Header...

Data I want in a data frame:

   <data>
  <row BRAND="GMC" NUM="1" YEAR="1999" VALUE="10000" />
  <row BRAND="FORD" NUM="1" YEAR="2000" VALUE="12000" />
  <row BRAND="GMC" NUM="1" YEAR="2001" VALUE="12500" />
  <row BRAND="FORD" NUM="1" YEAR="2002" VALUE="13000" />
  <row BRAND="GMC" NUM="1" YEAR="2003" VALUE="14000" />
  <row BRAND="FORD" NUM="1" YEAR="2004" VALUE="17000" />
  <row BRAND="GMC" NUM="1" YEAR="2005" VALUE="15000" />
  <row BRAND="GMC" NUM="1" YEAR="1967" VALUE="PRICLESS" />
  <row BRAND="FORD" NUM="1" YEAR="2007" VALUE="17500" />
  <row BRAND="GMC" NUM="1" YEAR="2008" VALUE="22000" />
  </data>

*R Code:*

doc< -xmlInternalTreeParse ("Sample2.xml")
top <- xmlRoot (doc)
xmlName (top)
names (top)
art <- top [["row"]]
art
**
This will get a data frame of character columns
as.data.frame(t(xpathSApply(doc, "//row", xmlAttrs)), stringsAsFactors = FALSE)
BRAND NUM YEAR    VALUE
1    GMC   1 1999    10000
2   FORD   1 2000    12000
3    GMC   1 2001    12500
4   FORD   1 2002    13000
5    GMC   1 2003    14000
6   FORD   1 2004    17000
7    GMC   1 2005    15000
8    GMC   1 1967 PRICLESS
9   FORD   1 2007    17500
10   GMC   1 2008    22000

--
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com
Hi,

May be this also helps:
s <- c("? <data>", " <row BRAND=\"GMC\" NUM=\"1\" YEAR=\"1999\" VALUE=\"10000\" />", 
" <row BRAND=\"FORD\" NUM=\"1\" YEAR=\"2000\" VALUE=\"12000\" />", 
" <row BRAND=\"GMC\" NUM=\"1\" YEAR=\"2001\" VALUE=\"12500\" />", 
" <row BRAND=\"FORD\" NUM=\"1\" YEAR=\"2002\" VALUE=\"13000\" />", 
" <row BRAND=\"GMC\" NUM=\"1\" YEAR=\"2003\" VALUE=\"14000\" />", 
" <row BRAND=\"FORD\" NUM=\"1\" YEAR=\"2004\" VALUE=\"17000\" />", 
" <row BRAND=\"GMC\" NUM=\"1\" YEAR=\"2005\" VALUE=\"15000\" />", 
" <row BRAND=\"GMC\" NUM=\"1\" YEAR=\"1967\" VALUE=\"PRICLESS\" />", 
" <row BRAND=\"FORD\" NUM=\"1\" YEAR=\"2007\" VALUE=\"17500\" />", 
" <row BRAND=\"GMC\" NUM=\"1\" YEAR=\"2008\" VALUE=\"22000\" />", 
" </data>")

Lines1<-gsub("^\\s+| \\s+$","",gsub("[^0-9A-Z]"," ",s))
dat1<-read.table(text=Lines1[Lines1!=""],sep="",header=F,stringsAsFactors=F)
dat1New<-dat1[,seq(2,ncol(dat1),by=2)]
colnames(dat1New)<- unlist(unique(dat1[,seq(1,ncol(dat1),by=2)]))

str(dat1New)
#'data.frame':??? 10 obs. of? 4 variables:
# $ BRAND: chr? "GMC" "FORD" "GMC" "FORD" ...
# $ NUM? : int? 1 1 1 1 1 1 1 1 1 1
# $ YEAR : int? 1999 2000 2001 2002 2003 2004 2005 1967 2007 2008
# $ VALUE: chr? "10000" "12000" "12500" "13000" ...

#or

Lines2<-gsub(" <.*>","",gsub("^.*=\"(.*)\"\\s+.*=\"(.*)\"\\s+.*=\"(.*)\"\\s+.*=\"(.*)\".*","\\1 \\2 \\3 \\4",s))
dat2<-read.table(text=Lines2[Lines2!=""&Lines2!=" "],sep="",header=FALSE,stringsAsFactors=FALSE)
?colnames(dat2)<- unlist(unique(dat1[,seq(1,ncol(dat1),by=2)]))
?

?str(dat2)
'data.frame':??? 10 obs. of? 4 variables:
# $ BRAND: chr? "GMC" "FORD" "GMC" "FORD" ...
# $ NUM? : int? 1 1 1 1 1 1 1 1 1 1
# $ YEAR : int? 1999 2000 2001 2002 2003 2004 2005 1967 2007 2008
# $ VALUE: chr? "10000" "12000" "12500" "13000" ...

head(dat2,3)
#? BRAND NUM YEAR VALUE
#1?? GMC?? 1 1999 10000
#2? FORD?? 1 2000 12000
#3?? GMC?? 1 2001 12500

A.K.

----- Original Message -----
From: Ben Tupper <btupper at bigelow.org>
To: Adam Gabbert <adamjgabbert at gmail.com>
Cc: r-help at r-project.org
Sent: Tuesday, January 22, 2013 10:13 PM
Subject: Re: [R] Creating a Data Frame from an XML

Hello,

I'm attempting to read information from an XML into a data frame in R using
the "XML" package. I am unable to get the data into a data frame as I would
like.? I have some sample code below.

*XML Code:*

Header...

Data I want in a data frame:

?  <data>
? <row BRAND="GMC" NUM="1" YEAR="1999" VALUE="10000" />
? <row BRAND="FORD" NUM="1" YEAR="2000" VALUE="12000" />
? <row BRAND="GMC" NUM="1" YEAR="2001" VALUE="12500" />
? <row BRAND="FORD" NUM="1" YEAR="2002" VALUE="13000" />
? <row BRAND="GMC" NUM="1" YEAR="2003" VALUE="14000" />
? <row BRAND="FORD" NUM="1" YEAR="2004" VALUE="17000" />
? <row BRAND="GMC" NUM="1" YEAR="2005" VALUE="15000" />
? <row BRAND="GMC" NUM="1" YEAR="1967" VALUE="PRICLESS" />
? <row BRAND="FORD" NUM="1" YEAR="2007" VALUE="17500" />
? <row BRAND="GMC" NUM="1" YEAR="2008" VALUE="22000" />
? </data>

*R Code:*

doc< -xmlInternalTreeParse ("Sample2.xml")
top <- xmlRoot (doc)
xmlName (top)
names (top)
art <- top [["row"]]
art
**
*Output:*

art<row BRAND="GMC" NUM="1" YEAR="1999" VALUE="10000"/>

This is where I am having difficulties.? I am unable to "access" additional
rows; ( i.e.? <row BRAND="GMC" NUM="1" YEAR="1967" VALUE="PRICLESS" /> )

and I am unable to access the individual entries to actually create the
data frame.? The data frame I would like is as follows:

BRAND? ? NUM? ? YEAR? ? VALUE
GMC? ? ? ? 1? ? ? ? ? 1999? ? ? 10000
FORD? ? ?  2? ? ? ? ? 2000? ? ? 12000
GMC? ? ? ? 1? ? ? ? ? 2001? ? ?  12500
? ? etc........

Any help or suggestions would be appreciated.? Conversly, my eventual goal
would be to take a data frame and write it into an XML in the previously
shown format.

Hi,

You are so close!

You have a number of nodes with the name 'row'.? The "[[" function selects just one item from a list, and when there's a number that have that name it returns just the first.? So you really want to use the "[" function instead and then select by order index using "[["

library(XML)
s <- c("? <data>", " <row BRAND=\"GMC\" NUM=\"1\" YEAR=\"1999\" VALUE=\"10000\" />", 
" <row BRAND=\"FORD\" NUM=\"1\" YEAR=\"2000\" VALUE=\"12000\" />", 
" <row BRAND=\"GMC\" NUM=\"1\" YEAR=\"2001\" VALUE=\"12500\" />", 
" <row BRAND=\"FORD\" NUM=\"1\" YEAR=\"2002\" VALUE=\"13000\" />", 
" <row BRAND=\"GMC\" NUM=\"1\" YEAR=\"2003\" VALUE=\"14000\" />", 
" <row BRAND=\"FORD\" NUM=\"1\" YEAR=\"2004\" VALUE=\"17000\" />", 
" <row BRAND=\"GMC\" NUM=\"1\" YEAR=\"2005\" VALUE=\"15000\" />", 
" <row BRAND=\"GMC\" NUM=\"1\" YEAR=\"1967\" VALUE=\"PRICLESS\" />", 
" <row BRAND=\"FORD\" NUM=\"1\" YEAR=\"2007\" VALUE=\"17500\" />", 
" <row BRAND=\"GMC\" NUM=\"1\" YEAR=\"2008\" VALUE=\"22000\" />", 
" </data>")
x <- xmlRoot(xmlTreeParse(s, asText = TRUE, useInternalNodes = TRUE))
x["row"][[1]]
<row BRAND="GMC" NUM="1" YEAR="1999" VALUE="10000"/>
x["row"][[2]]
<row BRAND="FORD" NUM="1" YEAR="2000" VALUE="12000"/> 

Your rows are set up so the attributes have the values you want - use xmlAttrs to retrieve them.
xmlAttrs(x["row"][[2]])
? BRAND? ?  NUM? ? YEAR?  VALUE 
"FORD"? ?  "1"? "2000" "12000" 

You can use lapply to iterate through each row and apply the xmlAttrs function.? You'll end up with a list if character vectors.
y <- lapply(x["row"], xmlAttrs)
str(y)
List of 10
$ row: Named chr [1:4] "GMC" "1" "1999" "10000"
? ..- attr(*, "names")= chr [1:4] "BRAND" "NUM" "YEAR" "VALUE"
$ row: Named chr [1:4] "FORD" "1" "2000" "12000"
? ..- attr(*, "names")= chr [1:4] "BRAND" "NUM" "YEAR" "VALUE"
$ row: Named chr [1:4] "GMC" "1" "2001" "12500"
? ..- attr(*, "names")= chr [1:4] "BRAND" "NUM" "YEAR" "VALUE"
??? .
??? .
??? .

Next make a character matrix using do.call and rbind ...
m <- do.call(rbind, y)
str(m)
chr [1:10, 1:4] "GMC" "FORD" "GMC" "FORD" "GMC" "FORD" "GMC" "GMC" "FORD" ...
- attr(*, "dimnames")=List of 2
? ..$ : chr [1:10] "row" "row" "row" "row" ...
? ..$ : chr [1:4] "BRAND" "NUM" "YEAR" "VALUE"

And then on to a data.frame...
d <- as.data.frame(m)
str(d)
'data.frame':??? 10 obs. of? 4 variables:
$ BRAND: chr? "GMC" "FORD" "GMC" "FORD" ...
$ NUM? : chr? "1" "1" "1" "1" ...
$ YEAR : chr? "1999" "2000" "2001" "2002" ...
$ VALUE: chr? "10000" "12000" "12500" "13000" ...

Cheers,
Ben
Thank you

AG

??? [[alternative HTML version deleted]]

______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Ben Tupper
Bigelow Laboratory for Ocean Sciences
180 McKown Point Rd. P.O. Box 475
West Boothbay Harbor, Maine?  04575-0475 
http://www.bigelow.org

______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.