[Bioc-devel] RFC: Naming scheme for organism level annotation data packages

Hello all,

We are working on new and improved versions of humanLLmappings (along
with rat and mouse).  The contents will be similar, but we are making
some significant changes.  In particular, we are trying to make the
data maps as similar as possible to those found in the common
Affymetrix chip-based packages.  This will make programmatic use of the
packages easier.

For human, mouse, and rat, the central ID will be Entrez Gene.  This
will not be the case for all organism-level packages: for
S. cerevisiae, for example, Entrez Gene (EG) is not the ID chosen by
the research community.  Therefore, we propose the following naming
scheme for new organism-level annotation data packages:

    org.<organism>.db

where <organism> is the UniGene organism abbreviation [1].  To start
with, then, we will have:

    org.Hs.db
    org.Mm.db
    org.Rn.db

The 'org' prefix identifies the package as organism-wide and makes
these packages sort next to each other.  Using UniGene organism
abbreviations gives us short, specific, and reliable identifiers.
The 'db' suffix indicates that these packages will be backed by a
SQLite database and use the AnnotationDbi interface.
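
To make the scheme concrete, here is a rough sketch (in Python, purely
for illustration) of building and recognizing names of the form
org.<organism>.db.  The helper names are hypothetical, and the pattern
assumes abbreviations look like "Hs", "Mm", or "Rn" (one capital
letter followed by lowercase letters); neither is part of the
proposal itself.

```python
import re

# Assumption: abbreviations are one capital letter plus lowercase letters,
# as in the three examples above (Hs, Mm, Rn).
_ORG_PACKAGE = re.compile(r"^org\.([A-Z][a-z]+)\.db$")


def org_package_name(abbrev):
    """Build a package name from an organism abbreviation (hypothetical helper)."""
    return f"org.{abbrev}.db"


def organism_of(package):
    """Return the organism abbreviation, or None if the name does not fit the scheme."""
    match = _ORG_PACKAGE.match(package)
    return match.group(1) if match else None


# The shared 'org.' prefix keeps these names adjacent in a sorted listing.
names = sorted(org_package_name(a) for a in ("Rn", "Hs", "Mm"))
print(names)
print(organism_of("org.Mm.db"), organism_of("humanLLmappings"))
```

Names that do not follow the pattern, such as the current
humanLLmappings, are simply reported as not matching.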

One possible downside is that if an alternative primary ID emerges
(e.g. an Ensembl-based one), we would need to add a way to
distinguish the packages.  But we felt it would be easier to cross
that bridge when we get there.

Comments?  Suggestions?  Concerns?  Send them along.  If we don't hear
anything by next Wednesday, 25 July, we will move forward with this
proposal.

Best,

+ seth

[1] There is probably a more graceful way, but you can find an
abbreviation by browsing here__ and clicking on the number in the
right column in the row for the organism of interest.