
[Bioc-devel] arabidopsis annotations

3 messages · nli at fhcrc.org, Hervé Pagès

#
Hi Bioc-developers,

In the process of migrating the arabidopsis annotations to the new sqlite-based
infrastructure, we found a problem with the current ENZYME/ENZYME2PROBE maps.
We'd like to know what you think (especially if you've been using these maps).

In the ag and ath1121501 packages the ENZYME/ENZYME2PROBE maps link probe ids
to enzyme names, and not to EC numbers as in _all_ other chip-based packages.
In addition, the man pages for those maps are incorrect: they claim that those 2 maps
are between manufacturer ids and EC numbers (not really a surprise, since
AnnBuilder generates the ENZYME/ENZYME2PROBE man pages from the same template
it uses for every other package).

This is not a satisfactory situation and we'd like to improve things a bit
for the upcoming ag.db and ath1121501.db packages. There are of course different
ways we could address the problem:

  A. just fix the man pages:
     - pro: easy and 100% compatible with the current (environment-based) ag and
            ath1121501 packages
     - con: for arabidopsis, the ENZYME/ENZYME2PROBE maps will remain different
            from what they are in all other chip-based packages + people that
            want the EC numbers still don't have them

  B. fix the ENZYME/ENZYME2PROBE maps so that they are consistent with all
     other ENZYME/ENZYME2PROBE maps
     - pro: consistency across all other chip-based packages
     - con: enzyme names are gone so the user code using the ENZYME/ENZYME2PROBE maps
            from ag and ath1121501 will need to be modified to work with ag.db and
            ath1121501.db

  C. rename the ENZYME/ENZYME2PROBE maps -> ECNAME/ECNAME2PROBE and deprecate the
     ENZYME/ENZYME2PROBE maps
     - pro: use the standard deprecation procedure for a smooth transition period
     - con: people who want the EC numbers right now still don't have them (they'll
            need to wait for BioC 2.2)

  D. fix the ENZYME/ENZYME2PROBE maps and add 2 new maps (e.g. ECNAME/ECNAME2PROBE)
     for the mapping between probe ids and enzyme names
     - pro: consistency and completeness
     - con: the user code using the ENZYME/ENZYME2PROBE maps from ag and ath1121501
            will need to use the ECNAME/ECNAME2PROBE maps instead (but here the
            impact on the user is not as bad as with B since the data they
            have been using so far is still available but under different names)

  E. anything else?

Thanks for your feedback!

H.
#
Hi, Herve,

I feel this is more of a data source problem than a data value problem. The
reason that we have this inconsistency in ag and ath1121501 is because we
extract enzyme information from AraCyc rather than from KEGG. KEGG provides
EC numbers but AraCyc only provides enzyme names. I tried to suggest using KEGG
instead of AraCyc when I updated AthPkgBuilder last year, but only got halfway
there: we added KEGG pathway annotation to the package but still kept the AraCyc
pathway data (post link:
http://article.gmane.org/gmane.science.biology.informatics.conductor/9527/match=arabidopsis
). Maybe you can use a similar solution: add the KEGG enzyme annotation and move
the AraCyc enzyme annotation to a differently named object.

I would also like to suggest posting this question on bioc so that you reach a
larger audience.

hope this helps

nianhua

3 days later
#
Hi Nianhua,
In addition to the data source problem, there is a map naming problem.
Thanks for the feedback. After reading it, we decided to go for solution D, i.e.
to provide both mappings (probes <-> enzyme names and probes <-> EC numbers).
The data currently in the ENZYME map (probes <-> enzyme names) will be moved
to a new ARACYCENZYME map, and from now on the ENZYME map will contain
the "probes <-> EC numbers" mapping.

Cheers,
H.
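
For readers who want to picture the layout chosen above (solution D), here is a
minimal toy sketch in Python. It is not the actual ag.db/ath1121501.db code or
API: the probe ids and the dictionary-based maps are hypothetical and only
illustrate the idea of moving the enzyme-name data under ARACYCENZYME, putting
EC numbers under ENZYME, and deriving ENZYME2PROBE as the reverse of ENZYME.

```python
from collections import defaultdict


def reverse_map(forward):
    """Invert a probe -> [values] map into a value -> [probes] map."""
    reverse = defaultdict(list)
    for probe, values in forward.items():
        for value in values:
            reverse[value].append(probe)
    # Sort probe lists so the result does not depend on insertion order.
    return {value: sorted(probes) for value, probes in reverse.items()}


# Hypothetical probe ids; what the old ENZYME map held (enzyme names).
old_enzyme = {
    "probe_1": ["alcohol dehydrogenase"],
    "probe_2": ["alcohol dehydrogenase", "catalase"],
}

# Hypothetical probe ids; EC numbers for the same probes.
ec_numbers = {
    "probe_1": ["1.1.1.1"],
    "probe_2": ["1.1.1.1", "1.11.1.6"],
}

# Solution D: enzyme names stay available under a new name,
# while ENZYME/ENZYME2PROBE now deal in EC numbers.
new_maps = {
    "ARACYCENZYME": old_enzyme,
    "ENZYME": ec_numbers,
}
new_maps["ENZYME2PROBE"] = reverse_map(new_maps["ENZYME"])
```

In this sketch, code that used to read enzyme names from ENZYME would only need
to switch to the ARACYCENZYME key; the data itself is unchanged.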