Skip to content

Creating a Data Frame from an XML

4 messages · Adam Gabbert, Ben Tupper, Gabor Grothendieck +1 more

#
On Jan 22, 2013, at 3:11 PM, Adam Gabbert wrote:

            
Hi,

You are so close!

You have a number of nodes with the name 'row'.  The "[[" function selects just one item from a list, and when there's a number that have that name it returns just the first.  So you really want to use the "[" function instead and then select by order index using "[["

library(XML)
" <row BRAND=\"FORD\" NUM=\"1\" YEAR=\"2000\" VALUE=\"12000\" />", 
" <row BRAND=\"GMC\" NUM=\"1\" YEAR=\"2001\" VALUE=\"12500\" />", 
" <row BRAND=\"FORD\" NUM=\"1\" YEAR=\"2002\" VALUE=\"13000\" />", 
" <row BRAND=\"GMC\" NUM=\"1\" YEAR=\"2003\" VALUE=\"14000\" />", 
" <row BRAND=\"FORD\" NUM=\"1\" YEAR=\"2004\" VALUE=\"17000\" />", 
" <row BRAND=\"GMC\" NUM=\"1\" YEAR=\"2005\" VALUE=\"15000\" />", 
" <row BRAND=\"GMC\" NUM=\"1\" YEAR=\"1967\" VALUE=\"PRICLESS\" />", 
" <row BRAND=\"FORD\" NUM=\"1\" YEAR=\"2007\" VALUE=\"17500\" />", 
" <row BRAND=\"GMC\" NUM=\"1\" YEAR=\"2008\" VALUE=\"22000\" />", 
" </data>")
<row BRAND="GMC" NUM="1" YEAR="1999" VALUE="10000"/>
<row BRAND="FORD" NUM="1" YEAR="2000" VALUE="12000"/> 

Your rows are set up so the attributes have the values you want - use xmlAttrs to retrieve them.
BRAND     NUM    YEAR   VALUE 
 "FORD"     "1"  "2000" "12000" 


You can use lapply to iterate through each row and apply the xmlAttrs function.  You'll end up with a list if character vectors.
List of 10
 $ row: Named chr [1:4] "GMC" "1" "1999" "10000"
  ..- attr(*, "names")= chr [1:4] "BRAND" "NUM" "YEAR" "VALUE"
 $ row: Named chr [1:4] "FORD" "1" "2000" "12000"
  ..- attr(*, "names")= chr [1:4] "BRAND" "NUM" "YEAR" "VALUE"
 $ row: Named chr [1:4] "GMC" "1" "2001" "12500"
  ..- attr(*, "names")= chr [1:4] "BRAND" "NUM" "YEAR" "VALUE"
	.
	.
	.

Next make a character matrix using do.call and rbind ...
chr [1:10, 1:4] "GMC" "FORD" "GMC" "FORD" "GMC" "FORD" "GMC" "GMC" "FORD" ...
 - attr(*, "dimnames")=List of 2
  ..$ : chr [1:10] "row" "row" "row" "row" ...
  ..$ : chr [1:4] "BRAND" "NUM" "YEAR" "VALUE"

And then on to a data.frame...
'data.frame':	10 obs. of  4 variables:
 $ BRAND: chr  "GMC" "FORD" "GMC" "FORD" ...
 $ NUM  : chr  "1" "1" "1" "1" ...
 $ YEAR : chr  "1999" "2000" "2001" "2002" ...
 $ VALUE: chr  "10000" "12000" "12500" "13000" ...

Cheers,
Ben
Ben Tupper
Bigelow Laboratory for Ocean Sciences
180 McKown Point Rd. P.O. Box 475
West Boothbay Harbor, Maine   04575-0475 
http://www.bigelow.org
#
On Tue, Jan 22, 2013 at 3:11 PM, Adam Gabbert <adamjgabbert at gmail.com> wrote:
This will get a data frame of character columns
BRAND NUM YEAR    VALUE
1    GMC   1 1999    10000
2   FORD   1 2000    12000
3    GMC   1 2001    12500
4   FORD   1 2002    13000
5    GMC   1 2003    14000
6   FORD   1 2004    17000
7    GMC   1 2005    15000
8    GMC   1 1967 PRICLESS
9   FORD   1 2007    17500
10   GMC   1 2008    22000


--
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com
#
Hi,

May be this also helps:
s <- c("? <data>", " <row BRAND=\"GMC\" NUM=\"1\" YEAR=\"1999\" VALUE=\"10000\" />", 
" <row BRAND=\"FORD\" NUM=\"1\" YEAR=\"2000\" VALUE=\"12000\" />", 
" <row BRAND=\"GMC\" NUM=\"1\" YEAR=\"2001\" VALUE=\"12500\" />", 
" <row BRAND=\"FORD\" NUM=\"1\" YEAR=\"2002\" VALUE=\"13000\" />", 
" <row BRAND=\"GMC\" NUM=\"1\" YEAR=\"2003\" VALUE=\"14000\" />", 
" <row BRAND=\"FORD\" NUM=\"1\" YEAR=\"2004\" VALUE=\"17000\" />", 
" <row BRAND=\"GMC\" NUM=\"1\" YEAR=\"2005\" VALUE=\"15000\" />", 
" <row BRAND=\"GMC\" NUM=\"1\" YEAR=\"1967\" VALUE=\"PRICLESS\" />", 
" <row BRAND=\"FORD\" NUM=\"1\" YEAR=\"2007\" VALUE=\"17500\" />", 
" <row BRAND=\"GMC\" NUM=\"1\" YEAR=\"2008\" VALUE=\"22000\" />", 
" </data>")


Lines1<-gsub("^\\s+| \\s+$","",gsub("[^0-9A-Z]"," ",s))
dat1<-read.table(text=Lines1[Lines1!=""],sep="",header=F,stringsAsFactors=F)
dat1New<-dat1[,seq(2,ncol(dat1),by=2)]
colnames(dat1New)<- unlist(unique(dat1[,seq(1,ncol(dat1),by=2)]))


str(dat1New)
#'data.frame':??? 10 obs. of? 4 variables:
# $ BRAND: chr? "GMC" "FORD" "GMC" "FORD" ...
# $ NUM? : int? 1 1 1 1 1 1 1 1 1 1
# $ YEAR : int? 1999 2000 2001 2002 2003 2004 2005 1967 2007 2008
# $ VALUE: chr? "10000" "12000" "12500" "13000" ...

#or

Lines2<-gsub(" <.*>","",gsub("^.*=\"(.*)\"\\s+.*=\"(.*)\"\\s+.*=\"(.*)\"\\s+.*=\"(.*)\".*","\\1 \\2 \\3 \\4",s))
dat2<-read.table(text=Lines2[Lines2!=""&Lines2!=" "],sep="",header=FALSE,stringsAsFactors=FALSE)
?colnames(dat2)<- unlist(unique(dat1[,seq(1,ncol(dat1),by=2)]))
?

?str(dat2)
'data.frame':??? 10 obs. of? 4 variables:
# $ BRAND: chr? "GMC" "FORD" "GMC" "FORD" ...
# $ NUM? : int? 1 1 1 1 1 1 1 1 1 1
# $ YEAR : int? 1999 2000 2001 2002 2003 2004 2005 1967 2007 2008
# $ VALUE: chr? "10000" "12000" "12500" "13000" ...


head(dat2,3)
#? BRAND NUM YEAR VALUE
#1?? GMC?? 1 1999 10000
#2? FORD?? 1 2000 12000
#3?? GMC?? 1 2001 12500


A.K.


----- Original Message -----
From: Ben Tupper <btupper at bigelow.org>
To: Adam Gabbert <adamjgabbert at gmail.com>
Cc: r-help at r-project.org
Sent: Tuesday, January 22, 2013 10:13 PM
Subject: Re: [R] Creating a Data Frame from an XML
On Jan 22, 2013, at 3:11 PM, Adam Gabbert wrote:

            
Hi,

You are so close!

You have a number of nodes with the name 'row'.? The "[[" function selects just one item from a list, and when there's a number that have that name it returns just the first.? So you really want to use the "[" function instead and then select by order index using "[["

library(XML)
" <row BRAND=\"FORD\" NUM=\"1\" YEAR=\"2000\" VALUE=\"12000\" />", 
" <row BRAND=\"GMC\" NUM=\"1\" YEAR=\"2001\" VALUE=\"12500\" />", 
" <row BRAND=\"FORD\" NUM=\"1\" YEAR=\"2002\" VALUE=\"13000\" />", 
" <row BRAND=\"GMC\" NUM=\"1\" YEAR=\"2003\" VALUE=\"14000\" />", 
" <row BRAND=\"FORD\" NUM=\"1\" YEAR=\"2004\" VALUE=\"17000\" />", 
" <row BRAND=\"GMC\" NUM=\"1\" YEAR=\"2005\" VALUE=\"15000\" />", 
" <row BRAND=\"GMC\" NUM=\"1\" YEAR=\"1967\" VALUE=\"PRICLESS\" />", 
" <row BRAND=\"FORD\" NUM=\"1\" YEAR=\"2007\" VALUE=\"17500\" />", 
" <row BRAND=\"GMC\" NUM=\"1\" YEAR=\"2008\" VALUE=\"22000\" />", 
" </data>")
<row BRAND="GMC" NUM="1" YEAR="1999" VALUE="10000"/>
<row BRAND="FORD" NUM="1" YEAR="2000" VALUE="12000"/> 

Your rows are set up so the attributes have the values you want - use xmlAttrs to retrieve them.
? BRAND? ?  NUM? ? YEAR?  VALUE 
"FORD"? ?  "1"? "2000" "12000" 


You can use lapply to iterate through each row and apply the xmlAttrs function.? You'll end up with a list if character vectors.
List of 10
$ row: Named chr [1:4] "GMC" "1" "1999" "10000"
? ..- attr(*, "names")= chr [1:4] "BRAND" "NUM" "YEAR" "VALUE"
$ row: Named chr [1:4] "FORD" "1" "2000" "12000"
? ..- attr(*, "names")= chr [1:4] "BRAND" "NUM" "YEAR" "VALUE"
$ row: Named chr [1:4] "GMC" "1" "2001" "12500"
? ..- attr(*, "names")= chr [1:4] "BRAND" "NUM" "YEAR" "VALUE"
??? .
??? .
??? .

Next make a character matrix using do.call and rbind ...
chr [1:10, 1:4] "GMC" "FORD" "GMC" "FORD" "GMC" "FORD" "GMC" "GMC" "FORD" ...
- attr(*, "dimnames")=List of 2
? ..$ : chr [1:10] "row" "row" "row" "row" ...
? ..$ : chr [1:4] "BRAND" "NUM" "YEAR" "VALUE"

And then on to a data.frame...
'data.frame':??? 10 obs. of? 4 variables:
$ BRAND: chr? "GMC" "FORD" "GMC" "FORD" ...
$ NUM? : chr? "1" "1" "1" "1" ...
$ YEAR : chr? "1999" "2000" "2001" "2002" ...
$ VALUE: chr? "10000" "12000" "12500" "13000" ...

Cheers,
Ben
Ben Tupper
Bigelow Laboratory for Ocean Sciences
180 McKown Point Rd. P.O. Box 475
West Boothbay Harbor, Maine?  04575-0475 
http://www.bigelow.org

______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.