Dear All,

I have a question regarding best practice in setting up an XML parser within R. Because my files are larger than 100 MB and I'm only interested in some of the values, I think a SAX-like parser using xmlEventParse() will be the best solution.

Unfortunately the values I'm looking for, to construct some higher-level "mass spectrum", are distributed over different lines, such as <spectrum id="2">, <mzArrayBinary>, <intenArrayBinary> and <... name="MassToChargeRatio" value="445.598999"/> (as one can see in the XML snippet).

I know the mechanism of using event handlers, as shown in the examples. But what I'm looking for is how to use the "path information" mentioned for the "addContext" parameter of xmlEventParse(). Could somebody share an example using "addContext = TRUE" and point me to the variables I can use if I implement the "..." parameter within my handlers?

Do I have to implement a "status machine" using some variables within my handlers, or would one prefer to use the "state" parameter of xmlEventParse()?

I would appreciate any assistance very much!

Jan
SAX Parser best practise
3 messages · Jan Hummel, Seth Falcon, Duncan Temple Lang
Hi Jan,
On 20 Sep 2005, Hummel at mpimp-golm.mpg.de wrote:
I have a question regarding best practice in setting up an XML parser within R. [snip] value="445.598999"/> (as one can see in the XML snippet)
I missed the xml snip, but I think I get the gist of your question.
I know the mechanism of using event handlers, as shown in the examples. But what I'm looking for is how to use the "path information" mentioned for the "addContext" parameter of xmlEventParse(). Could somebody share an example using "addContext = TRUE" and point me to the variables I can use if I implement the "..." parameter within my handlers? Do I have to implement a "status machine" using some variables within my handlers, or would one prefer to use the "state" parameter of xmlEventParse()?
I'm not familiar with the addContext arg and don't know whether or not that provides another solution to your problem. I do know that you can do what you want by writing "state machine" code. I played a little with using the state arg for this purpose, but ran into some problems (sorry, no details in my memory banks). There is an example of the state approach in Bioconductor's AnnBuilder package. See R/GO.R. It isn't the prettiest or best example, but maybe it will help get you going. The general approach is to use '<<-' to reach up a level and set the state variables from inside the tag handlers. HTH, + seth
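A minimal sketch of the '<<-' approach described above, not the code from AnnBuilder. The handlers are closures that share two variables in the enclosing function. The names makeSpectrumHandlers, currentId and mz are made up for illustration, and the element name "param" is only a placeholder for the tag that is elided ("<...") in the question, so the handler matches on the name/value attributes instead of the tag name.

```r
# Handlers share state through the enclosing function's environment.
makeSpectrumHandlers <- function() {
  currentId <- NA_character_  # id of the <spectrum> currently open, if any
  mz <- list()                # collected values, keyed by spectrum id

  startElement <- function(name, attrs, ...) {
    if (name == "spectrum") {
      currentId <<- attrs[["id"]]   # assumes every <spectrum> carries an id
    } else if (!is.na(currentId) &&
               identical(unname(attrs["name"]), "MassToChargeRatio")) {
      # Append the value to whatever was already stored for this spectrum.
      mz[[currentId]] <<- c(mz[[currentId]], as.numeric(attrs[["value"]]))
    }
  }
  endElement <- function(name, ...) {
    # Leaving the spectrum: reset the state variable.
    if (name == "spectrum") currentId <<- NA_character_
  }
  list(handlers = list(startElement = startElement, endElement = endElement),
       result = function() mz)
}

# Hypothetical input; "spectra" and "param" are placeholder element names.
doc <- '<spectra><spectrum id="2"><mzArrayBinary>
  <param name="MassToChargeRatio" value="445.598999"/>
</mzArrayBinary></spectrum></spectra>'

h <- makeSpectrumHandlers()
if (requireNamespace("XML", quietly = TRUE)) {
  XML::xmlEventParse(doc, handlers = h$handlers, asText = TRUE)
  print(h$result())
}
```

For a real file, pass the file name instead of doc and drop asText = TRUE. The accessor result() is kept outside the handlers list so that only the callbacks are handed to the parser.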
Jan Hummel wrote:
Dear All, I have a question regarding best practice in setting up an XML parser within R. Because my files are larger than 100 MB and I'm only interested in some of the values, I think a SAX-like parser using xmlEventParse() will be the best solution. Unfortunately the values I'm looking for, to construct some higher-level "mass spectrum", are distributed over different lines, such as <spectrum id="2">, <mzArrayBinary>, <intenArrayBinary> and <... name="MassToChargeRatio" value="445.598999"/> (as one can see in the XML snippet). I know the mechanism of using event handlers, as shown in the examples. But what I'm looking for is how to use the "path information" mentioned for the "addContext" parameter of xmlEventParse(). Could somebody share an example using "addContext = TRUE" and point me to the variables I can use if I implement the "..." parameter within my handlers?
The addContext argument was an attempt to provide contextual information, but it is not obvious how to do this efficiently. And of course efficiency is the name of the game with the SAX model. If we wanted to know path information for the node, we would have to build this and that would slow things down. There are no nodes in the SAX world as we don't build the tree in any way. So addContext currently doesn't do much. It is there as a hook that we can use if we want in the future. But you can do anything you need in the R code.
Do I have to implement a "status machine" using some variables within my handlers, or would one prefer to use the "state" parameter of xmlEventParse()?
As Seth mentioned in his reply, you can use state in your R handler functions to determine where you are. You can maintain a "stack" to track the exact path of the current "node": push the element name in the startElement() handler and pop it in the endElement() handler. The difference between maintaining state via environments/local persistent scope (using <<- as in Seth's mail) and using the state argument is more a matter of personal preference in R. The state argument was added for S-Plus since it does not support environments. Using the state argument might save an epsilon amount of time, but the difference is hopefully negligible. BTW, do you have a schema for the XML document you are working on?
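The stack idea can be sketched roughly as follows, again with closures and '<<-' rather than the state argument. This is only an illustration under assumptions: makePathHandlers, targetPath, hits and depth are invented names, and the root element "spectra" is a guess, since the question does not show the document's root.

```r
# Track the path of open elements with a stack kept in the closure.
makePathHandlers <- function(targetPath) {
  stack <- character(0)   # names of the currently open elements
  hits <- list()          # attributes of elements found at targetPath

  startElement <- function(name, attrs, ...) {
    stack <<- c(stack, name)                      # push on start
    if (paste(stack, collapse = "/") == targetPath) {
      # as.list() turns a NULL attrs into an empty list, so it is still stored.
      hits[[length(hits) + 1L]] <<- as.list(attrs)
    }
  }
  endElement <- function(name, ...) {
    stack <<- stack[-length(stack)]               # pop on end
  }
  list(handlers = list(startElement = startElement, endElement = endElement),
       hits = function() hits,
       depth = function() length(stack))
}

# Hypothetical input; the root element name is an assumption.
doc <- '<spectra><spectrum id="2"><mzArrayBinary/></spectrum></spectra>'

p <- makePathHandlers("spectra/spectrum")
if (requireNamespace("XML", quietly = TRUE)) {
  XML::xmlEventParse(doc, handlers = p$handlers, asText = TRUE)
  str(p$hits())
}
```

Pasting the whole stack on every start tag is the simple version; comparing only the last one or two entries of the stack would be cheaper for large files.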
I would appreciate any assistance very much! Jan