[Bioc-devel] Extracting mzR compatible acquisition number from mzIdentML files

Hi

I'm in the final stage of preparing an mzIdentML parser for submission to Bioconductor (https://github.com/thomasp85/mzID). The parser is intended to be quite sparse and to interpret the content of the mzIdentML file as little as possible.

One feature I would like to include, though, is to annotate each scan with an mzR-compatible acquisition number for better interoperability between the two parsers.

The HUPO specification for the mzIdentML format states that each scan in the file is labelled with a spectrumID and a reference to the MS data file. Furthermore, each MS data file should have a spectrum ID format specified according to the controlled vocabulary.

The content of the spectrumID can thus be, for example, 'scanID=<someInteger>', 'spectrum=<someInteger>', 'scan=<someInteger>', or something more elaborate such as 'sample=<someInteger> period=<someInteger> cycle=<someInteger> experiment=<someInteger>', depending on the machine producing the MS data.

When an MS data file is parsed by mzR, all of this is conveniently dropped and replaced by an acquisitionNum that uniquely identifies the scan. This is easy to handle for spectrumIDs consisting of a single identifier, e.g. 'scan=<someInteger>', but for spectrumIDs with more than one identifier the mapping is less clear, and I don't like guessing.
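
To make the ambiguity concrete, here is a minimal sketch of the parsing step, written in Python purely for illustration. The function names are hypothetical and are not part of mzID or mzR. It assumes a spectrumID is a whitespace-separated list of 'key=<integer>' pairs, as in the examples above. It returns a number only when a single identifier is present, and it makes no claim that this number matches the acquisitionNum mzR would assign.

```python
import re

# Assumption: every field in a spectrumID looks like key=<integer>.
_PAIR = re.compile(r"(\w+)=(\d+)")


def parse_spectrum_id(spectrum_id):
    """Return a dict mapping each key in the spectrumID to its integer value."""
    return {key: int(value) for key, value in _PAIR.findall(spectrum_id)}


def unambiguous_number(spectrum_id):
    """Return the integer if exactly one field is present, otherwise None."""
    fields = parse_spectrum_id(spectrum_id)
    if len(fields) == 1:
        return next(iter(fields.values()))
    # Several identifiers: there is no safe way to pick one without guessing.
    return None
```

With this sketch, unambiguous_number("scan=42") gives 42, whereas unambiguous_number("sample=1 period=1 cycle=2 experiment=3") gives None. The second case is exactly the one I don't know how to resolve.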

So the question is: How can I ensure that I extract the right value from the spectrumID for an mzR compatible acquisitionNum? I realize that the generation of the acquisitionNum in mzR is probably handled by the RAMP module, but I hope some of the mzR folks (or others) can help.

best

Thomas Pedersen, PhD student at the Technical University of Denmark (DTU)