SAX Parser best practice

3 messages · Jan Hummel, Seth Falcon, Duncan Temple Lang

#
Dear All,

I have a question regarding best practice in setting up an XML parser
within R.
Because my files are larger than 100 MB and I'm only interested in a few
values, I think a SAX-like parser using xmlEventParse() will be the best
solution.
Unfortunately, the values I need to construct a higher-level "mass
spectrum" are spread over different lines, such as <spectrum id="2">,
<mzArrayBinary>, <intenArrayBinary> and <... name="MassToChargeRatio"
value="445.598999"/> (as one can see in the XML snippet).

I know the mechanism of using event handlers, as shown in the examples.
What I'm looking for is a way to use the "path information" mentioned
for the "addContext" parameter of xmlEventParse(). Could somebody share
an example using "addContext = TRUE" and point me to the variables I can
use if I implement the "..." parameter in my handlers?

Do I have to implement a "state machine" using some variables within my
handlers, or would one prefer to use the "state" parameter of
xmlEventParse()?

I would appreciate any assistance very much!
	Jan
#
Hi Jan,
On 20 Sep 2005, Hummel at mpimp-golm.mpg.de wrote:
I missed the XML snippet, but I think I get the gist of your question.
I'm not familiar with the addContext arg and don't know whether it
provides another solution to your problem.

I do know that you can do what you want by writing "state machine"
code.  I played a little with using the state arg for this purpose,
but ran into some problems (sorry, I don't remember the details).

There is an example of the state approach in Bioconductor's AnnBuilder
package.  See R/GO.R.  It isn't the prettiest or best example, but
maybe it will help get you going.

The general approach is to use '<<-' to reach up a level and set the
state variables from inside the tag handlers.
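
A minimal sketch of the '<<-' approach, not taken from AnnBuilder. It
assumes the XML package is installed. The <run> and <param> element names
and the makeHandlers() helper are hypothetical, invented for illustration.
Only <spectrum id=...> and the name="MassToChargeRatio" attribute come
from Jan's description.

```r
library(XML)

# Hypothetical sample document (the <run> and <param> names are assumptions).
doc <- '<run>
  <spectrum id="1"><param name="MassToChargeRatio" value="445.598999"/></spectrum>
  <spectrum id="2"><param name="MassToChargeRatio" value="512.25"/></spectrum>
</run>'

makeHandlers <- function() {
  # State variables live in this enclosing environment; the handlers
  # update them with '<<-'.
  currentId <- NA_character_   # id of the <spectrum> we are inside, if any
  ids <- character()
  mz  <- numeric()

  handlers <- list(
    startElement = function(name, attrs, ...) {
      if (name == "spectrum") {
        currentId <<- attrs[["id"]]
      } else if ("name" %in% names(attrs) &&
                 attrs[["name"]] == "MassToChargeRatio") {
        # Pair the value with the spectrum id remembered earlier.
        ids <<- c(ids, currentId)
        mz  <<- c(mz, as.numeric(attrs[["value"]]))
      }
    },
    endElement = function(name, ...) {
      if (name == "spectrum") currentId <<- NA_character_
    }
  )

  list(handlers = handlers,
       result = function() data.frame(id = ids, mz = mz,
                                      stringsAsFactors = FALSE))
}

h <- makeHandlers()
invisible(xmlEventParse(doc, handlers = h$handlers, asText = TRUE))
res <- h$result()
```

Each call to makeHandlers() creates a fresh environment, so nothing
leaks into the global workspace and the parser can be run repeatedly.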

HTH,

+ seth
#
Jan Hummel wrote:
The addContext argument was an attempt to provide contextual information,
but it is not obvious how to do this efficiently. And of course
efficiency is the name of the game with the SAX model.
To provide path information for a node, the parser would have to
build it, and that would slow things down. There are no nodes
in the SAX world, as we don't build the tree in any way.
So addContext currently doesn't do much. It is there
as a hook that we can use in the future if we want.
But you can do anything you need in the R code.
As Seth mentioned in his reply, you can use state in your R handler
functions to determine where you are.  You can maintain a "stack"
to determine the exact path of the current "node": push the element
name in the startElement() handler and pop it in the endElement() handler.
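
A minimal sketch of such a stack, assuming the XML package is installed.
The <run> root element and the makePathTracker() helper are hypothetical.
The other element names follow Jan's description.

```r
library(XML)

# Hypothetical sample document (the <run> root is an assumption).
doc <- '<run><spectrum id="2"><mzArrayBinary/><intenArrayBinary/></spectrum></run>'

makePathTracker <- function() {
  stack <- character()   # names of the currently open elements, outermost first
  seen  <- character()   # path recorded at each startElement event

  handlers <- list(
    startElement = function(name, attrs, ...) {
      stack <<- c(stack, name)                          # push
      seen  <<- c(seen, paste(stack, collapse = "/"))   # current path
    },
    endElement = function(name, ...) {
      stack <<- stack[-length(stack)]                   # pop
    }
  )

  list(handlers = handlers, paths = function() seen)
}

tracker <- makePathTracker()
invisible(xmlEventParse(doc, handlers = tracker$handlers, asText = TRUE))
paths <- tracker$paths()
```

A handler can then compare the current path with a target such as
"run/spectrum/mzArrayBinary" before deciding whether to record a value.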

The difference between maintaining state via environments/local
persistent scope (using <<- as in Seth's mail) and using the state argument
is more of a personal preference in R.  The state argument was added
for S-Plus since it does not support environments.  Using the state
argument might save an epsilon amount of time, but it is hopefully
negligible.

BTW, do you have a schema for the XML document you are working on?