[Bioc-devel] RFC: Naming scheme for organism level annotation data packages

9 messages · Wolfgang Huber, Sean Davis, Seth Falcon +1 more

#
Hello all,

We are working on new and improved versions of humanLLmappings (along
with rat and mouse).  The contents will be similar, but we are making
some significant changes.  In particular, we are trying to make the
data maps as similar as possible to those found in the common
Affymetrix chip-based packages.  This will make programmatic use of the
packages easier.

For human, mouse, and rat, the central ID will be Entrez Gene.  This
will not be the case for all organism level packages,
e.g. S. cerevisiae where EG is not the ID chosen by the research
community.  Therefore, we propose the following naming scheme for new
organism level annotation data packages:

    org.<organism>.db

where <organism> is the UniGene organism abbreviation [1].  To start
with, then, we will have:

   org.Hs.db
   org.Mm.db
   org.Rn.db

The 'org' prefix identifies the package as organism wide and will make
it easy for these packages to sort next to each other.  Using UniGene
organism abbreviations gives us a short, specific, and reliable
abbreviation.  The 'db' suffix indicates that these packages will be
backed by a DB (SQLite) and use the AnnotationDbi interface.
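
A minimal sketch of the proposed org.<organism>.db scheme, in Python for
illustration only.  The helper names (org_package_name, parse_org_package)
are hypothetical and not part of any Bioconductor package; the pattern
assumes the organism abbreviation is one capital letter followed by
lower-case letters, as in Hs, Mm, and Rn.

```python
import re

# Assumption: abbreviations look like "Hs", "Mm", "Rn" (capital + lower-case letters).
ORG_PKG = re.compile(r"^org\.([A-Z][a-z]+)\.db$")


def org_package_name(organism: str) -> str:
    """Build a package name following the proposed org.<organism>.db scheme."""
    return f"org.{organism}.db"


def parse_org_package(name: str):
    """Return the organism abbreviation, or None if the name does not fit the scheme."""
    match = ORG_PKG.match(name)
    return match.group(1) if match else None


names = [org_package_name(o) for o in ("Hs", "Mm", "Rn")]
print(names)
print(parse_org_package("org.Hs.db"))
print(parse_org_package("humanLLmappings"))
```

Because the prefix and suffix are fixed, a plain pattern match is enough to
tell organism-level packages apart from other annotation packages.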

One possible downside is that if an alternative primary ID emerges
(e.g. an Ensembl-based one), then we would need to add a way to
distinguish the packages.  But we felt it was easier to cross that
bridge when we get there.

Comments?  Suggestions?  Concerns?  Send them along.  If we don't hear
anything by next Wednesday 25 July, we will move forward with this
proposal.

Best,

+ seth

[1] There is probably a more graceful way, but you can find an
abbreviation by browsing here__ and clicking on the number in the
right column in the row for the organism of interest.
#
Hi Seth,

sounds good to me.

One possible option I wanted to throw into the ring to solve the
identifier system problem and at the same time be at least conceptually
prepared for annotations of multi-species systems (e.g. host-pathogen,
say, man/anopheles/plasmodium) would be to use the name of the
identifier system (EG) as the prefix rather than "org".

  Best wishes
	Wolfgang

  Falcon wrote:

  
    
#
Wolfgang Huber <huber at ebi.ac.uk> writes:
That was something we discussed.  The downsides of that are:

  - What would you put for an updated version of the YEAST package?

  - How would you identify organism-level packages?  [Perhaps your
    point is that this may not really be all that useful so isn't
    worth considering].

+ seth
#
Seth Falcon wrote:
Since Seth et al. have produced a wonderfully useful db-based system, it 
seems that these data packages could be much more flexible from an ID 
point of view.  One has a primary ID associated with the data package, 
but mappings, to the extent that they are available, could also be 
included.  Then, you could have something like:

primaryKey(org.Hs.mappings)
[1] "EntrezGene"

availableKeys(org.Hs.mappings)
    KeyType        ExampleValue
[1] "EntrezGene"   9923
[2] "EnsemblGene"  ENSG00000273213
[3] "HUGOSymbol"   BRCA1

And tools for getting data:

mget(mykeys, org.Hs.mappingsSYMBOL)  # expects mykeys to be EntrezGene

# does a lookup of EnsemblGene to EntrezGene and then does the mget;
# under the hood, this is a simple join in SQL
mget(mykeys, org.Hs.mappingsSYMBOL, keytype="EnsemblGene")

Software using such annotation packages automatically becomes hugely 
more powerful.  Alternatively, a MAPPINGENVIRONMENT could be included 
that could do the up-front mapping from one ID type to the primary key 
(and back again) and then software could remain largely unchanged from 
the current situation (assuming there is a single primary key).
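
The "simple join" behind a keytype-aware lookup could look roughly like the
following sketch, written in Python with the standard sqlite3 module.  The
table names (gene_symbol, alt_key), the function mget_symbol, and all IDs
are made-up placeholders, not the AnnotationDbi schema or real identifiers.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical schema: annotation keyed on the primary ID, plus one
-- table translating alternative IDs to the primary ID.
CREATE TABLE gene_symbol (primary_id TEXT, symbol TEXT);
CREATE TABLE alt_key (keytype TEXT, alt_id TEXT, primary_id TEXT);
INSERT INTO gene_symbol VALUES ('eg1', 'SYM_A'), ('eg2', 'SYM_B');
INSERT INTO alt_key VALUES ('EnsemblGene', 'ens1', 'eg1'),
                           ('EnsemblGene', 'ens2', 'eg2');
""")


def mget_symbol(keys, keytype="EntrezGene"):
    """Return {key: [symbols]}; non-primary keys are resolved through a join."""
    result = {}
    for key in keys:
        if keytype == "EntrezGene":
            # Primary keys are looked up directly.
            rows = conn.execute(
                "SELECT symbol FROM gene_symbol WHERE primary_id = ?", (key,)
            ).fetchall()
        else:
            # Alternative keys go through the translation table first.
            rows = conn.execute(
                "SELECT s.symbol FROM alt_key a "
                "JOIN gene_symbol s ON s.primary_id = a.primary_id "
                "WHERE a.keytype = ? AND a.alt_id = ?",
                (keytype, key),
            ).fetchall()
        result[key] = [row[0] for row in rows]
    return result


print(mget_symbol(["eg1"]))
print(mget_symbol(["ens2"], keytype="EnsemblGene"))
```

Returning a list per key leaves room for alternative IDs that map to zero or
several primary IDs.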

The reasons that I like this approach are:
1) Each organism package then need be created only once and the 
expectation would be that most of the appropriate mappings would be 
included.
2) Standardizes mappings between ID types--individual users can rely on
a standard mapping with version information (nothing is worse than an
external mapping source "updating" halfway through a project).
3) Allows one "pipeline" for the production of the annotation and
primary keys, while allowing flexibility in the production of secondary
mappings (an arbitrary number of mappings can be added; one could even
imagine allowing users to add their own mappings quite easily to the
database with a single function).
4) Software immediately becomes more useful without much increased
complexity.
5) Could be extended to have multiple primary keytypes in the same data 
package with automatic key conversions.

Of course, some attention would need to be paid to documenting the
source of the alternative mappings, but the alternative mappings are
readily available.  With a SQL backend for these packages, all of this
becomes doable by adding a single table (or two, if one includes the
"availableKey" information in a separate table--a good idea, in my
opinion) and some infrastructure for doing the lookups (API-level
infrastructure, since the backend is a simple join).
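
One way to picture the two-table idea, including the "single function" for
user-supplied mappings, is the sketch below (Python, standard sqlite3
module).  Everything here is hypothetical: the table names available_key
and alt_key, the functions make_db, add_mapping, available_keys, and
primary_key, and the placeholder IDs are illustrations, not an existing API.

```python
import sqlite3


def make_db():
    """Create an in-memory DB with a key-type registry and a mapping table."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
    CREATE TABLE available_key (
        keytype TEXT PRIMARY KEY,
        source TEXT,              -- documents where the mapping came from
        is_primary INTEGER
    );
    CREATE TABLE alt_key (keytype TEXT, alt_id TEXT, primary_id TEXT);
    INSERT INTO available_key VALUES ('EntrezGene', 'placeholder source', 1);
    """)
    return conn


def add_mapping(conn, keytype, source, pairs):
    """Register a secondary key type and its (alt_id, primary_id) pairs."""
    with conn:  # one transaction: registry entry and mapping rows together
        conn.execute(
            "INSERT OR IGNORE INTO available_key VALUES (?, ?, 0)",
            (keytype, source),
        )
        conn.executemany(
            "INSERT INTO alt_key VALUES (?, ?, ?)",
            [(keytype, alt_id, primary_id) for alt_id, primary_id in pairs],
        )


def available_keys(conn):
    """List key types, primary first, then alphabetically."""
    rows = conn.execute(
        "SELECT keytype FROM available_key ORDER BY is_primary DESC, keytype"
    )
    return [row[0] for row in rows]


def primary_key(conn):
    """Return the key type flagged as primary."""
    return conn.execute(
        "SELECT keytype FROM available_key WHERE is_primary = 1"
    ).fetchone()[0]


db = make_db()
add_mapping(db, "MyLabID", "user supplied", [("lab1", "eg1"), ("lab2", "eg2")])
print(primary_key(db))
print(available_keys(db))
```

Keeping the source column in the registry table is one place to record the
provenance of each alternative mapping.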

All of this said, I am not so intimately involved to know how much work 
this would actually entail, but I think since we are talking about 
making changes, it is worthwhile entertaining various options.

Sean
#
Dear Seth,
how about sgd.sc.db or sc.sgd.db or just sc.db?
given that SGD is responsible for the systematic names of the genes in 
the S.cerevisiae genome:
   http://www.yeastgenome.org/help/yeastGeneNomenclature.shtml

yeast is a very simple and clean example compared to the state of 
affairs in many other species and associated scientific communities, so 
in that case stating "SGD" might be unnecessary, due to its undisputed 
central role. So it may not necessarily be the best example for us for 
deciding how to do things in general.

In Sc, it is much more apparent what a "gene" is.  In other organisms,
the mapping between the actual proteins in the cell and the loci on
the DNA from which they are transcribed is more complex; in S.
cerevisiae it is so simple that one often doesn't even think about it.

   Best wishes
	Wolfgang
#
Wolfgang Huber <huber at ebi.ac.uk> writes:
Those are fine suggestions.  Do you (or others) think it is worth
having a common prefix for all organism level packages?

   org.sgd.Sc.db
   org.eg.Hs.db
   org.eg.Mm.db

Since we don't (yet) have a super fancy dynamic webapp that allows
users to slice and dice their package display, there is some immediate
benefit to having similar packages sort next to each other.

+ seth
#
wouldn't the sorting be more informative if we had, for example

org.Hs.eg.db -- entrez gene
org.Hs.en.db  -- ensembl, maybe

i.e., provider would nest within organism.  if it is not intended
to have multiple providers, take it out

org.Hs.db
org.Sc.db

i would expect to scan for organism first, then for provider within
organism if we are going to distinguish these.
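
A quick sketch of the sorting argument, in Python for illustration.  The
package names below are hypothetical combinations of the abbreviations
used in this thread (several of them were never proposed as such).

```python
# Organism before provider: org.<org>.<provider>.db
organism_first = ["org.Sc.sgd.db", "org.Hs.en.db", "org.Mm.eg.db", "org.Hs.eg.db"]

# Provider before organism: org.<provider>.<org>.db
provider_first = ["org.sgd.Sc.db", "org.en.Hs.db", "org.eg.Mm.db", "org.eg.Hs.db"]

sorted_by_organism = sorted(organism_first)
sorted_by_provider = sorted(provider_first)

print(sorted_by_organism)
# ['org.Hs.eg.db', 'org.Hs.en.db', 'org.Mm.eg.db', 'org.Sc.sgd.db']
print(sorted_by_provider)
# ['org.eg.Hs.db', 'org.eg.Mm.db', 'org.en.Hs.db', 'org.sgd.Sc.db']
```

With organism first, the two Hs packages sort next to each other; with
provider first, the Mm package lands between them.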

if the justification for the "eg" is clear in earlier posts, disregard
this ... i did not read them closely enough to remember now.
#
Vincent Carey 525-2265 <stvjc at channing.harvard.edu> writes:
Yes.
Well, we don't have multiple providers at this time [*] and it isn't clear
whether, for example, en-based data could simply be integrated into
the same DB.  So it is simply not clear whether including the provider
in the name is a good idea.

[*] We do have different providers for different organisms.

So I think the revised choice is between your two suggestions:

   org.Hs.eg.db   and in general org.<org>.<provider>.db

   OR

   org.<org>.db


+ seth
#
Hi Sean,

Sean Davis <sdavis2 at mail.nih.gov> writes:
Interesting.  Although I can see how this would work from a DB point
of view, it isn't clear to me that such a combined package would be
feasible/desirable.  If the IDs are more or less different names for
the same things, then no problem.  But if a new ID induces an entirely
new mapping of all the downstream relations, well, the resulting DB
size could be prohibitive.

Your pseudocode suggests the notion of a package-level object
"org.Hs.mappings".  That isn't something we've implemented in
AnnotationDbi, but I like the idea.

I'd like to point out that we have a number of the SQLite-based
annotation data packages available in devel and this would be a great
time for interested parties to give them a try and send us feedback.

The packages should work as drop-in replacements for the
environment-based packages.  There are some additional features which
currently are only documented in the AnnotationDbi vignette.

It seems to me that this only works if the IDs are nearly equivalent.
If not, each "primary ID" needs to be deeply involved in the process
of creating the DB tables.

Let me know if I'm misunderstanding, but here I think you are
describing a system that would define a mapping, say, from Ensembl to
EG, and it isn't clear to me that this is what someone wanting Ensembl
annotation would really want -- it would allow them to work with
Ensembl IDs, but using EG annotation.

Best Wishes,

+ seth