I think these are all good observations, and we may benefit from a wider
discussion on the support site.
The abandonment of knownGene seems to have clear implications for changing
our most visible txdb examples. What should we change to? Can we make a more
future-proof design for these annotation selections?
On Mon, Jan 11, 2016 at 1:40 PM, Robert Castelo <robert.castelo at upf.edu>
wrote:
hi,
On 01/11/2016 04:07 PM, Vincent Carey wrote:
[...]
Is it true that there is an asymmetry between Entrez gene ID and Ensembl
gene ID for querying org.Hs.eg.db (I tend to prefer Homo.sapiens
as a symbol mapping resource)? Both ENTREZID and ENSEMBL are listed as
keytypes. My question is whether this "anchor" concept
holds in the current infrastructure.
you're right that the infrastructure is probably symmetric at least
between Entrez and Ensembl, so maybe i'm not using the term "anchor"
correctly here, i'm just referring to the fact that many packages
and use cases of BioC are based on, or illustrated using, Entrez IDs.
examples are:
head(AnnotationDbi::keys(org.Hs.eg.db))
[1] "1" "2" "3" "9" "10" "11"
i.e., by default the 'keytype' is 'ENTREZID'
genefilter::nsFilter() argument 'require.entrez' filters out features
without an Entrez Gene ID annotation.
Category::categoryToEntrezBuilder() returns a list mapping category ids to
the Entrez Gene ids annotated at the category id.
SummarizedExperiment::geneRangeMapper() takes a 'TxDb' object and a
keytype to map ranges to genes. By default the keytype is 'ENTREZID'
some of the workflows are also based on Entrez IDs, such as:
http://www.bioconductor.org/help/workflows/variants
so if the user just replaces the txdb object in one of those examples or
function arguments by a txdb object that does not have Entrez identifiers
as primary gene key, those functions, examples or workflows will require
modification. this is not necessarily bad, but may put more burden on the
user who is learning with a "default" TxDb human gene annotation package.
so far this has been *.UCSC.knownGene, which uses Entrez as gene identifier.
given the apparent discontinuation of the known gene track by UCSC, i
suggest making another default gene annotation package available at the
BioC site, one based on Entrez identifiers, given the amount of
legacy code and documentation using Entrez in one way or another.
an alternative to translating the default Ensembl Gencode identifiers to
Entrez would be to just take the NCBI RefSeq annotations as the human gene
annotation package available by default, i.e., replacing the current
*.UCSC.knownGene by *.UCSC.refGene
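for reference, a minimal sketch of what the translation step from Ensembl to Entrez could look like with the existing infrastructure. it assumes AnnotationDbi and org.Hs.eg.db are installed; the two Ensembl gene ids (TP53 and BRCA1) are only illustrative keys picked for this example, not part of any proposal.

```r
# sketch only: assumes the Bioconductor packages AnnotationDbi and
# org.Hs.eg.db are installed
library(AnnotationDbi)
library(org.Hs.eg.db)

# illustrative Ensembl gene ids (TP53 and BRCA1), chosen for this example
ens_ids <- c("ENSG00000141510", "ENSG00000012048")

# one Entrez id per Ensembl id; multiVals = "first" keeps only the first
# match when an Ensembl id maps to several Entrez ids
entrez <- AnnotationDbi::mapIds(org.Hs.eg.db, keys = ens_ids,
                                column = "ENTREZID", keytype = "ENSEMBL",
                                multiVals = "first")

# full mapping table, one row per Ensembl/Entrez pair, useful to spot
# ids that map to more than one Entrez gene
tab <- AnnotationDbi::select(org.Hs.eg.db, keys = ens_ids,
                             columns = c("ENTREZID", "SYMBOL"),
                             keytype = "ENSEMBL")
```

Ensembl ids without an Entrez mapping come back as NA from mapIds(), so any function that requires Entrez ids would drop those features after such a translation.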
robert.