[Bioc-devel] RFC: Naming scheme for organism level annotation data packages
Hi Sean,

Sean Davis <sdavis2 at mail.nih.gov> writes:
Since Seth et al. have produced a wonderfully useful db-based system, it seems that these data packages could be much more flexible from an ID point of view. One has a primary ID associated with the data package, but mappings, to the extent that they are available, could also be included. Then, you could have something like:

    primaryKey(org.Hs.mappings)
    [1] "EntrezGene"

    availableKeys(org.Hs.mappings)
        KeyType        ExampleValue
    [1] "EntrezGene"   9923
    [2] "EnsemblGene"  ENSG00000273213
    [3] "HUGOSymbol"   BRCA1
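To make the proposal concrete, the key-type listing in the pseudocode above could be backed by a single small table. The following is only a minimal sketch: the table name `key_types`, its columns, and the Python helpers `primary_key` and `available_keys` are hypothetical stand-ins for the proposed `primaryKey` and `availableKeys`, and none of it is part of AnnotationDbi. The example values are the ones from the listing.

```python
import sqlite3

# Hypothetical schema: one row per ID type the data package knows about.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE key_types ("
    " key_type TEXT PRIMARY KEY,"
    " example_value TEXT NOT NULL,"
    " is_primary INTEGER NOT NULL DEFAULT 0)"
)
conn.executemany(
    "INSERT INTO key_types VALUES (?, ?, ?)",
    [
        ("EntrezGene", "9923", 1),  # the package's primary ID type
        ("EnsemblGene", "ENSG00000273213", 0),
        ("HUGOSymbol", "BRCA1", 0),
    ],
)


def primary_key(db):
    """Return the name of the package's primary ID type."""
    row = db.execute(
        "SELECT key_type FROM key_types WHERE is_primary = 1"
    ).fetchone()
    return row[0]


def available_keys(db):
    """Return (key_type, example_value) pairs, primary ID type first."""
    return db.execute(
        "SELECT key_type, example_value FROM key_types "
        "ORDER BY is_primary DESC, key_type"
    ).fetchall()


print(primary_key(conn))
for key_type, example in available_keys(conn):
    print(key_type, example)
```

Adding a further ID type in this sketch is one more row in `key_types`; whether the downstream annotation tables can stay unchanged is a separate question.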
Interesting. Although I can see how this would work from a DB point of view, it isn't clear to me that such a combined package would be feasible/desirable. If the IDs are more or less different names for the same things, then no problem. But if a new ID induces an entirely new mapping of all the downstream relations, well, the resulting DB size could be prohibitive.

Your pseudocode suggests the notion of a package-level object "org.Hs.mappings". That isn't something we've implemented in AnnotationDbi, but I like the idea.

I'd like to point out that we have a number of the SQLite-based annotation data packages available in devel, and this would be a great time for interested parties to give them a try and send us feedback. The packages should work as drop-in replacements for the environment-based packages. There are some additional features which are currently documented only in the AnnotationDbi vignette.
The reasons that I like this approach are:

1) Each organism package then need be created only once and the expectation would be that most of the appropriate mappings would be included.
It seems to me that this only works if the IDs are nearly equivalent. If not, each "primary ID" needs to be deeply involved in the process of creating the DB tables.
2) Standardizes mappings between ID types--individual users can rely on a standard mapping with version information (Nothing worse than an external mapping source "updating" halfway through a project)

3) Allows one "pipeline" for the production of the annotation and primary keys, while allowing flexibility in the production of secondary mappings (an arbitrary number of mappings can be added; one could even imagine allowing users to add their own mappings quite easily to the database with a single function)

4) Software immediately becomes more useful without much increased complexity

5) Could be extended to have multiple primary keytypes in the same data package with automatic key conversions.
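The "automatic key conversions" in the last point can be read as a translation step placed in front of the primary key's annotation tables. The sketch below assumes that reading. The tables `annotation` and `id_map`, the helper `annotate`, and every ID in it (`EG_1`, `ENS_A`, and so on) are hypothetical placeholders, not real identifiers and not an existing AnnotationDbi interface.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Annotation is stored once, keyed by the primary ID type.
    CREATE TABLE annotation (entrez_id TEXT PRIMARY KEY, description TEXT);
    -- Secondary ID types are attached only through a translation table.
    CREATE TABLE id_map (key_type TEXT, foreign_id TEXT, entrez_id TEXT);

    INSERT INTO annotation VALUES ('EG_1', 'placeholder annotation for gene 1');
    INSERT INTO annotation VALUES ('EG_2', 'placeholder annotation for gene 2');

    INSERT INTO id_map VALUES ('EnsemblGene', 'ENS_A', 'EG_1');
    INSERT INTO id_map VALUES ('EnsemblGene', 'ENS_B', 'EG_1');
    INSERT INTO id_map VALUES ('EnsemblGene', 'ENS_C', 'EG_2');
    """
)


def annotate(db, key_type, key):
    """Look up annotation for `key`, converting to the primary ID if needed."""
    if key_type == "EntrezGene":
        rows = db.execute(
            "SELECT description FROM annotation WHERE entrez_id = ?", (key,)
        )
    else:
        # Translate the secondary ID to the primary ID, then read the
        # annotation stored under the primary ID.
        rows = db.execute(
            "SELECT a.description FROM id_map AS m "
            "JOIN annotation AS a ON a.entrez_id = m.entrez_id "
            "WHERE m.key_type = ? AND m.foreign_id = ? "
            "ORDER BY a.entrez_id",
            (key_type, key),
        )
    return [row[0] for row in rows]


print(annotate(conn, "EnsemblGene", "ENS_A"))
print(annotate(conn, "EnsemblGene", "ENS_B"))
```

In this sketch `ENS_A` and `ENS_B` return the same result, because both resolve to `EG_1`: the caller supplies a secondary ID but receives the annotation attached to the primary ID.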
Let me know if I'm misunderstanding, but here I think you are describing a system that would define a mapping, say, from Ensembl to EG, and it isn't clear to me that this is what someone wanting Ensembl annotation would really want -- it would allow them to work with Ensembl IDs, but using EG annotation.

Best Wishes,

+ seth
Seth Falcon | Computational Biology | Fred Hutchinson Cancer Research Center http://bioconductor.org