Hi,
On Apr 16, 2013, at 2:49 PM, santiago gil wrote:
2013/4/14 santiago gil <sg.ccnr at gmail.com>:
Hello all,
I have a problem with the way attributes are dealt with in the
function xmlToList(), and I haven't been able to figure it out for
days now.
I have not used xmlToList(), but what you try below works if you specify useInternalNodes = TRUE in your call to xmlTreeParse(). That setting resolves many issues with XML.
I have also found it best to write a relatively generic getter-style function. In the example below I have written a function called getPortAttr(); it returns an attribute of the child node you name. I used your example as the defaults: "service" is the child to query and "name" is the attribute to retrieve from that child. It's a heck of a lot easier to write a function than to build longish parse strings with lots of [[this]][[and]][[that]] stuff, and it is reusable to boot.
Cheers,
Ben
library(XML)
mydoc <- '<host starttime="1365204834" endtime="1365205860">
<status state="up" reason="echo-reply" reason_ttl="127"/>
<address addr="XXX.XXX.XXX.XXX" addrtype="ipv4"/>
<ports>
<port protocol="tcp" portid="135">
<state state="open" reason="syn-ack" reason_ttl="127"/>
<service name="msrpc" product="Microsoft Windows RPC" ostype="Windows" method="probed" conf="10">
<cpe>cpe:/o:microsoft:windows</cpe>
</service>
</port>
<port protocol="tcp" portid="139">
<state state="open" reason="syn-ack" reason_ttl="127"/>
<service name="netbios-ssn" method="probed" conf="10"/>
</port>
</ports>
<times srtt="647" rttvar="71" to="100000"/>
</host>'
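# Parse into internal (C-level) nodes instead of an R-level tree
mytree <- xmlTreeParse(mydoc, useInternalNodes = TRUE)
myroot <- xmlRoot(mytree)
# Single brackets with a name keep every <port> child of <ports>
myports <- myroot[["ports"]]["port"]
# Return attribute `attr` of the child named `child` under node x
getPortAttr <- function(x, child = "service", attr = "name") {
  kid <- x[[child]]
  xmlAttrs(kid)[[attr]]
}
portNames <- sapply(myports, getPortAttr)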
#> portNames
# port port
# "msrpc" "netbios-ssn"
portReason <- sapply(myports, getPortAttr, child = "state", attr = "reason")
#> portReason
# port port
# "syn-ack" "syn-ack"
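For what it's worth, the same lookups can probably be done with XPath once the document is parsed into internal nodes. This is only a sketch, not part of the code above: it uses a trimmed copy of the example document, and the variable names (doc, svcNames, stateReasons) are made up for illustration. xmlParse(), xpathSApply() and xmlGetAttr() are functions from the XML package.

```r
library(XML)

# Trimmed copy of the example document (illustrative only)
txt <- '<host><ports>
<port protocol="tcp" portid="135">
<state state="open" reason="syn-ack" reason_ttl="127"/>
<service name="msrpc" method="probed" conf="10"><cpe>cpe:/o:microsoft:windows</cpe></service>
</port>
<port protocol="tcp" portid="139">
<state state="open" reason="syn-ack" reason_ttl="127"/>
<service name="netbios-ssn" method="probed" conf="10"/>
</port>
</ports></host>'

# xmlParse() always returns internal nodes, which XPath queries require
doc <- xmlParse(txt, asText = TRUE)

# Apply xmlGetAttr() to every node matched by the XPath expression
svcNames     <- xpathSApply(doc, "//port/service", xmlGetAttr, "name")
stateReasons <- xpathSApply(doc, "//port/state", xmlGetAttr, "reason")
```

The XPath expression matches a service node whether or not it has children, so both ports are treated the same way.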
Say I have a document (produced by nmap) like this:
mydoc <- '<host starttime="1365204834" endtime="1365205860"><status state="up" reason="echo-reply" reason_ttl="127"/>
<address addr="XXX.XXX.XXX.XXX" addrtype="ipv4"/>
<ports><port protocol="tcp" portid="135"><state state="open"
reason="syn-ack" reason_ttl="127"/><service name="msrpc"
product="Microsoft Windows RPC" ostype="Windows" method="probed"
conf="10"><cpe>cpe:/o:microsoft:windows</cpe></service></port>
<port protocol="tcp" portid="139"><state state="open"
reason="syn-ack" reason_ttl="127"/><service name="netbios-ssn"
method="probed" conf="10"/></port>
</ports>
<times srtt="647" rttvar="71" to="100000"/>
</host>'
I want to store this as a list of lists, so I do:
mytree <- xmlTreeParse(mydoc)
myroot <- xmlRoot(mytree)
mylist <- xmlToList(myroot)
Now my problem is that when I want to fetch the attributes of the
services running on each port, the behavior is not consistent:
mylist[["ports"]][[1]][["service"]]$.attrs["name"]
mylist[["ports"]][[2]][["service"]]$.attrs["name"]
Error in trash_list[["ports"]][[2]][["service"]]$.attrs :
$ operator is invalid for atomic vectors
I understand that the way they are defined in the document is not the
same, but I think there should still be consistent behavior. I've
tried many combinations of parameters for xmlTreeParse() but nothing
has helped. I can't find a way to call up the name of the service
consistently, regardless of whether the node has children or not. Any
tips?
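The error message suggests what is going on: a service node with children seems to come back as a list carrying an ".attrs" element, while a service node without children comes back as a plain named character vector of its attributes. Assuming those are the only two shapes, a small helper could cover both. The sketch below uses hand-built stand-ins for the two shapes rather than real xmlToList() output, and getListAttr() is a hypothetical name, not a function from the XML package.

```r
# Hypothetical helper: read one attribute from an element of an
# xmlToList() result, whichever of the two assumed shapes it has.
getListAttr <- function(node, attr) {
  # List shape: attributes are under ".attrs"; vector shape: node is the attributes
  attrs <- if (is.list(node)) node[[".attrs"]] else node
  if (is.null(attrs) || !(attr %in% names(attrs))) return(NA_character_)
  unname(attrs[[attr]])
}

# Hand-built stand-ins for the two shapes (assumed, not real parser output)
with_children <- list(
  cpe    = "cpe:/o:microsoft:windows",
  .attrs = c(name = "msrpc", method = "probed", conf = "10")
)
without_children <- c(name = "netbios-ssn", method = "probed", conf = "10")

getListAttr(with_children, "name")
getListAttr(without_children, "name")
```

If the assumption holds, something like getListAttr(mylist[["ports"]][[2]][["service"]], "name") would avoid the "$ operator is invalid for atomic vectors" error, and a missing attribute would give NA instead of an error.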
All the best,
S.G.
--
-------------------------------------------------------------------------------
http://barabasilab.neu.edu/people/gil/