Skip to content

[Bioc-devel] Gene annotation: TxDb vs ENSEMBL/NCBI inconsistency

9 messages · Ludwig Geistlinger, Robert M. Flight, Martin Morgan +1 more

#
Dear Bioc annotation team,

Querying TxDb.Hsapiens.UCSC.hg38.knownGene for gene coordinates, e.g. for

BRCA1; ENSG00000012048; entrez:672

via
gives me:

GRanges object with 1 range and 1 metadata column:
      seqnames               ranges strand |     gene_id
         <Rle>            <IRanges>  <Rle> | <character>
  672    chr17 [43044295, 43170403]      - |         672
  -------
  seqinfo: 455 sequences (1 circular) from hg38 genome


However, querying Ensembl and NCBI Gene
http://www.ensembl.org/Homo_sapiens/Gene/Summary?db=core;g=ENSG00000012048
http://www.ncbi.nlm.nih.gov/gene/672

the gene is located at (note the difference in the end position)

Chromosome 17: 43,044,295-43,125,483 reverse strand


How is the inconsistency explained and how to extract an ENSEMBL/NCBI
conform annotation from the TxDb object?
(I am aware of biomaRt, but I want to explicitely use the Bioc annotation
functionality).

Thanks!
Ludwig
#
Ludwig,

If you do this search on the UCSC genome browser (which this annotation
package is built from), you will see that the longest variant is what is
shown

http://genome.ucsc.edu/cgi-bin/hgTracks?clade=mammal&org=Human&db=hg38&position=brca1&hgt.positionInput=brca1&hgt.suggestTrack=knownGene&Submit=submit&hgsid=429339723_8sd4QD2jSAnAsa6cVCevtoOy4GAz&pix=1885

If instead of "genes" you do "transcripts", you will see 20 different
transcripts for this gene, including the one listed by NCBI.

I havent tried it yet (haven't upgraded R or bioconductor to latest
version), but there is now an Ensembl based annotation package as well,
that may work better??
http://bioconductor.org/packages/release/data/annotation/html/EnsDb.Hsapiens.v79.html

-Robert



On Wed, Jun 3, 2015 at 7:04 AM Ludwig Geistlinger <
Ludwig.Geistlinger at bio.ifi.lmu.de> wrote:

            

  
  
5 days later
#
dear Robert and Ludwig,

the EnsDb packages provide all the gene/transcript etc annotations for all genes defined in the Ensembl database (for a given species and Ensembl release). Except the column/attribute "entrezid" that is stored in the internal database there is however no link to NCBI or UCSC annotations.
So, basically, if you want to use "pure" Ensembl based annotations: use EnsDb, if you want to have the UCSC annotations: use the TxDb packages.

In case you need EnsDbs of other species or Ensembl versions, the ensembldb package provides functionality to generate such packages either using the Ensembl Perl API or using GTF files provided by Ensembl. If you have problems building the packages, just drop me a line and I'll do that.

cheers, jo
#
Dear Johannes,

Thx for providing the great EnsDb packages!

One question:

As of now, I am able to choose between TxDb and EnsDb for genomic
coordinates of genomic features such as genes, transcripts, and exons.
For the sequences themselves I need the corresponding BSgenome package.

While it is easy to automatically map from a specific TxDb package (eg
TxDb.Hsapiens.UCSC.<hg38>.knownGene) to the corresponding BSgenome package
(here: BSgenome.Hsapiens.UCSC.<hg38>), I wonder how to do that for an
EnsDb package as the package name (eg EnsDb.Hsapiens.v79) contains no
information about the genome build.

A cumbersome option would be to extract the genome_build from the metadata
of the EnsDb package (which would give me for EnsDb.Hsapiens.v79:
'GRCh38') and then ask all existing BSgenome.Hsapiens packages for their
metadata release name (eg 'GRCh38' for BSgenome.Hsapiens.UCSC.hg38).

This however needs all BSgenome.Hsapiens packages installed and takes thus
too much time and space for a programmatic access.

Can you suggest a better way to map from coordinates to sequence (within
the BioC annotation functionality)?

Thanks & Best,
Ludwig
#
dear Ludwig,
On 09 Jun 2015, at 10:46, Ludwig Geistlinger <Ludwig.Geistlinger at bio.ifi.lmu.de<mailto:Ludwig.Geistlinger at bio.ifi.lmu.de>> wrote:
Dear Johannes,

Thx for providing the great EnsDb packages!

One question:

As of now, I am able to choose between TxDb and EnsDb for genomic
coordinates of genomic features such as genes, transcripts, and exons.
For the sequences themselves I need the corresponding BSgenome package.

While it is easy to automatically map from a specific TxDb package (eg
TxDb.Hsapiens.UCSC.<hg38>.knownGene) to the corresponding BSgenome package
(here: BSgenome.Hsapiens.UCSC.<hg38>), I wonder how to do that for an
EnsDb package as the package name (eg EnsDb.Hsapiens.v79) contains no
information about the genome build.

A cumbersome option would be to extract the genome_build from the metadata
of the EnsDb package (which would give me for EnsDb.Hsapiens.v79:
'GRCh38') and then ask all existing BSgenome.Hsapiens packages for their
metadata release name (eg 'GRCh38' for BSgenome.Hsapiens.UCSC.hg38).

This however needs all BSgenome.Hsapiens packages installed and takes thus
too much time and space for a programmatic access.

Can you suggest a better way to map from coordinates to sequence (within
the BioC annotation functionality)?


agree, there's no easy mapping (yet). I'll implement a method "suggestGenomePackage" in the ensembldb package. In the long run I hope that also NCBI BSgenome packages (like the BSgenome.Hsapiens.NCBI.GRCh38) will become available for all species... that would make the mapping much easier...

cheers, jo

Thanks & Best,
Ludwig





dear Robert and Ludwig,

the EnsDb packages provide all the gene/transcript etc annotations for all
genes defined in the Ensembl database (for a given species and Ensembl
release). Except the column/attribute "entrezid" that is stored in the
internal database there is however no link to NCBI or UCSC annotations.
So, basically, if you want to use "pure" Ensembl based annotations: use
EnsDb, if you want to have the UCSC annotations: use the TxDb packages.

In case you need EnsDbs of other species or Ensembl versions, the
ensembldb package provides functionality to generate such packages either
using the Ensembl Perl API or using GTF files provided by Ensembl. If you
have problems building the packages, just drop me a line and I'll do
that.

cheers, jo
On 03 Jun 2015, at 15:56, Robert M. Flight <rflight79 at gmail.com<mailto:rflight79 at gmail.com>> wrote:
Ludwig,

If you do this search on the UCSC genome browser (which this annotation
package is built from), you will see that the longest variant is what
is
shown

http://genome.ucsc.edu/cgi-bin/hgTracks?clade=mammal&org=Human&db=hg38&position=brca1&hgt.positionInput=brca1&hgt.suggestTrack=knownGene&Submit=submit&hgsid=429339723_8sd4QD2jSAnAsa6cVCevtoOy4GAz&pix=1885

If instead of "genes" you do "transcripts", you will see 20 different
transcripts for this gene, including the one listed by NCBI.

I havent tried it yet (haven't upgraded R or bioconductor to latest
version), but there is now an Ensembl based annotation package as well,
that may work better??
http://bioconductor.org/packages/release/data/annotation/html/EnsDb.Hsapiens.v79.html

-Robert



On Wed, Jun 3, 2015 at 7:04 AM Ludwig Geistlinger <
Ludwig.Geistlinger at bio.ifi.lmu.de> wrote:
Dear Bioc annotation team,

Querying TxDb.Hsapiens.UCSC.hg38.knownGene for gene coordinates, e.g.
for

BRCA1; ENSG00000012048; entrez:672

via

genes(TxDb.Hsapiens.UCSC.hg38.knownGene, vals=list(gene_id="672"))

gives me:

GRanges object with 1 range and 1 metadata column:
    seqnames               ranges strand |     gene_id
       <Rle>            <IRanges>  <Rle> | <character>
672    chr17 [43044295, 43170403]      - |         672
-------
seqinfo: 455 sequences (1 circular) from hg38 genome


However, querying Ensembl and NCBI Gene
http://www.ensembl.org/Homo_sapiens/Gene/Summary?db=core;g=ENSG00000012048
http://www.ncbi.nlm.nih.gov/gene/672

the gene is located at (note the difference in the end position)

Chromosome 17: 43,044,295-43,125,483 reverse strand


How is the inconsistency explained and how to extract an ENSEMBL/NCBI
conform annotation from the TxDb object?
(I am aware of biomaRt, but I want to explicitely use the Bioc
annotation
functionality).

Thanks!
Ludwig


--
Dipl.-Bioinf. Ludwig Geistlinger

Lehr- und Forschungseinheit f?r Bioinformatik
Institut f?r Informatik
Ludwig-Maximilians-Universit?t M?nchen
Amalienstrasse 17, 2. Stock, B?ro A201
80333 M?nchen

Tel.: 089-2180-4067
eMail: Ludwig.Geistlinger at bio.ifi.lmu.de

_______________________________________________
Bioc-devel at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel



_______________________________________________
Bioc-devel at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel
#
On 06/08/2015 11:43 PM, Rainer Johannes wrote:
Two other sources of Ensembl TxDb's are GenomicFeatures::makeTxDbFromBiomart() 
and AnnotationHub. For the latter, I'll add a variant of the following to the 
AnnotationHub HOWTO vignette 
http://bioconductor.org/packages/devel/bioc/html/AnnotationHub.html later today.

## Gene models

_Bioconductor_ represents gene models using 'transcript'
databases. These are available via packages such as
[TxDb.Hsapiens.UCSC.hg38.knownGene](http://bioconductor.org/packages/TxDb.Hsapiens.UCSC.knownGene.html),
or can be constructed using functions such as
`[GenomicFeatures](http://bioconductor.org/packages/GenomicFeatures.html)::makeTxDbFromBiomart()` 
or `GenomicFeatures::makeTxDbFromGRanges()`.

_AnnotationHub_ provides an easy way to work with gene models
published by Ensembl. Here we discover the Ensemble release 80 r
esources for pufferfish,_Takifugu rubripes_

```{r takifugu-gene-models}
query(ah, c("Takifugu", "release-80"))
```

We see that there is a GTF file, as well as various DNA
sequences. Let's retrieve the GTF and top-level sequence files. The
GTF file is imported as a _GRanges_ instance, the DNA sequence as a
compressed, indexed Fasta file


```{r takifugi-data}
gtf <- ah[["AH47101"]]
dna <- ah[["AH47477"]]

head(gtf, 3)
dna
head(seqlevels(dna))
```

It is trivial to make a TxDb instance

```{r takifugi-txdb}
library(GenomicFeatures)
txdb <- makeTxDbFromGRanges(gtf)
````

and to use that in conjunction with the DNA sequence, e.g., to find
exon sequences of all annotated genes.

```{r takifugi-exons}
library(Rsamtools)     # for getSeq,FaFile-method
exons <- exons(txdb)
getSeq(dna, exons)
```

Some difficulties arise when working with this partly assembled genome
that require more advanced GenomicRanges skills, see the
[GenomicRanges](http://bioconductor.org/packages/GenomicRanges.html)
vignettes, especially "GenomicRanges HOWTOs" and "An Introduction to
GenomicRanges".
If instead of "genes" you do "transcripts", you will see 20 different
-Robert
http://www.ncbi.nlm.nih.gov/gene/672

  
    
#
Dear Johannes,

one follow-up question/comment on the EnsDb packages:

The reason they escaped my notice (and thus potentially will also others)
is that I expected such packages to be named "^TxDb...".

What actually argues against sticking to existing Bioc vocabulary and
naming eg EnsDb.Hsapiens.v79

TxDb.Hsapiens.Ensembl.hg38.ensGene

(or alternatively, if packages like BSgenome.Hsapiens.NCBI.GRCh38 will
indeed make it in the long run:  TxDb.Hsapiens.Ensembl.GRCh38.ensGene)

This would also have the advantage that genome build and idType could be
inferred right from the package name.

Best,
Ludwig
#
dear Ludwig,
On 10 Jun 2015, at 10:29, Ludwig Geistlinger <Ludwig.Geistlinger at bio.ifi.lmu.de<mailto:Ludwig.Geistlinger at bio.ifi.lmu.de>> wrote:
Dear Johannes,

one follow-up question/comment on the EnsDb packages:

The reason they escaped my notice (and thus potentially will also others)
is that I expected such packages to be named "^TxDb...".

What actually argues against sticking to existing Bioc vocabulary and
naming eg EnsDb.Hsapiens.v79

TxDb.Hsapiens.Ensembl.hg38.ensGene


the reason is that I defined the EnsDb class which is different from the TxDb class (I added some Ensembl specific informations, like gene biotype,  that are not covered by the TxDb). I just tried to implement the same functionality as for TxDb classes so that EnsDb can be integrated seamlessly in existing TxDb based workflow, just using the Ensembl annotations.
The naming convention for such packages is always: <class name>.<organism>.<version>, thus it was suggested to me to use the naming convention that I'm using at present.
For the version I opted to use the Ensembl version instead of the genome version as there are several Ensembl versions for the same genome release and the annotations can change sometimes considerably (or at least did in the past) from Ensembl version to Ensembl version. This naming also allows to have annotation packages from different Ensembl releases installed (or being used in the same R-session) and compare gene models between them etc.

(or alternatively, if packages like BSgenome.Hsapiens.NCBI.GRCh38 will
indeed make it in the long run:  TxDb.Hsapiens.Ensembl.GRCh38.ensGene)

This would also have the advantage that genome build and idType could be
inferred right from the package name.


I agree that that would ease the mapping between BSgenome and EnsDb, but, as explained above, I would like to stick to the Ensembl version instead. I really like the AnnotationHub approach from Martin to get the appropriate DNA sequence for an Ensembl release, so, once I figured out what caused the error in the previous mail, I'll implement a method that returns the correct DNA sequence object for a given EnsDb package.

cheers, jo

Best,
Ludwig


dear Robert and Ludwig,

the EnsDb packages provide all the gene/transcript etc annotations for all
genes defined in the Ensembl database (for a given species and Ensembl
release). Except the column/attribute "entrezid" that is stored in the
internal database there is however no link to NCBI or UCSC annotations.
So, basically, if you want to use "pure" Ensembl based annotations: use
EnsDb, if you want to have the UCSC annotations: use the TxDb packages.

In case you need EnsDbs of other species or Ensembl versions, the
ensembldb package provides functionality to generate such packages either
using the Ensembl Perl API or using GTF files provided by Ensembl. If you
have problems building the packages, just drop me a line and I'll do
that.

cheers, jo
On 03 Jun 2015, at 15:56, Robert M. Flight <rflight79 at gmail.com<mailto:rflight79 at gmail.com>> wrote:
Ludwig,

If you do this search on the UCSC genome browser (which this annotation
package is built from), you will see that the longest variant is what
is
shown

http://genome.ucsc.edu/cgi-bin/hgTracks?clade=mammal&org=Human&db=hg38&position=brca1&hgt.positionInput=brca1&hgt.suggestTrack=knownGene&Submit=submit&hgsid=429339723_8sd4QD2jSAnAsa6cVCevtoOy4GAz&pix=1885

If instead of "genes" you do "transcripts", you will see 20 different
transcripts for this gene, including the one listed by NCBI.

I havent tried it yet (haven't upgraded R or bioconductor to latest
version), but there is now an Ensembl based annotation package as well,
that may work better??
http://bioconductor.org/packages/release/data/annotation/html/EnsDb.Hsapiens.v79.html

-Robert



On Wed, Jun 3, 2015 at 7:04 AM Ludwig Geistlinger <
Ludwig.Geistlinger at bio.ifi.lmu.de> wrote:
Dear Bioc annotation team,

Querying TxDb.Hsapiens.UCSC.hg38.knownGene for gene coordinates, e.g.
for

BRCA1; ENSG00000012048; entrez:672

via

genes(TxDb.Hsapiens.UCSC.hg38.knownGene, vals=list(gene_id="672"))

gives me:

GRanges object with 1 range and 1 metadata column:
    seqnames               ranges strand |     gene_id
       <Rle>            <IRanges>  <Rle> | <character>
672    chr17 [43044295, 43170403]      - |         672
-------
seqinfo: 455 sequences (1 circular) from hg38 genome


However, querying Ensembl and NCBI Gene
http://www.ensembl.org/Homo_sapiens/Gene/Summary?db=core;g=ENSG00000012048
http://www.ncbi.nlm.nih.gov/gene/672

the gene is located at (note the difference in the end position)

Chromosome 17: 43,044,295-43,125,483 reverse strand


How is the inconsistency explained and how to extract an ENSEMBL/NCBI
conform annotation from the TxDb object?
(I am aware of biomaRt, but I want to explicitely use the Bioc
annotation
functionality).

Thanks!
Ludwig


--
Dipl.-Bioinf. Ludwig Geistlinger

Lehr- und Forschungseinheit f?r Bioinformatik
Institut f?r Informatik
Ludwig-Maximilians-Universit?t M?nchen
Amalienstrasse 17, 2. Stock, B?ro A201
80333 M?nchen

Tel.: 089-2180-4067
eMail: Ludwig.Geistlinger at bio.ifi.lmu.de

_______________________________________________
Bioc-devel at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel



_______________________________________________
Bioc-devel at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel
1 day later
#
update: this approach bases on Martin's suggestion and I think is ideal for EnsDb or any Ensembl based annotation packages. It's not required to install an explicit genome package, but load (and eventually cache) dynamically the correct genomic sequence from Ensembl via AnnotationHub. That way chromosome names, variants etc are perfectly aligned. Also, this guarantees that the genome version and gene definitions, both from Ensembl match.

Example code (is now also included in the ensembldb vignette):

  library(AnnotationHub)
  library(EnsDb.Hsapiens.v75)
  library(Rsamtools)
  ah <- AnnotationHub()

  edb <- EnsDb.Hsapiens.v75

  ## get the Ensembl version
  eVersion <- metadata(edb)[metadata(edb)$name=="ensembl_version", "value"]
  ## query all available files for the Ensembl version
  eData <- query(ah, c(organism(edb), paste0("release-", eVersion)))
  eData

  ## retrieve the *dna.toplevel.fa file; this might take some time.
  Dna <- ah[["AH20439"]]
  ## generate an index if none is available
  if(is.na(index(Dna))){
      indexFa(Dna)
      Dna <- FaFile(path(Dna))
  }

  ## get start/end coordinates of all genes
  genes <- genes(edb)
  ## subset to all genes that are encoded on chromosomes for which
  ## we do have DNA sequence available.
  genes <- genes[seqnames(genes) %in% seqnames(seqinfo(Dna))]
  ## get the gene sequences, i.e. the sequence including the sequence of
  ## all of the gene's exons and introns
  geneSeqs <- getSeq(Dna, genes)

cheers, jo
On 09 Jun 2015, at 11:38, Rainer Johannes <Johannes.Rainer at eurac.edu<mailto:Johannes.Rainer at eurac.edu>> wrote:
dear Ludwig,
On 09 Jun 2015, at 10:46, Ludwig Geistlinger <Ludwig.Geistlinger at bio.ifi.lmu.de<mailto:Ludwig.Geistlinger at bio.ifi.lmu.de>> wrote:
Dear Johannes,

Thx for providing the great EnsDb packages!

One question:

As of now, I am able to choose between TxDb and EnsDb for genomic
coordinates of genomic features such as genes, transcripts, and exons.
For the sequences themselves I need the corresponding BSgenome package.

While it is easy to automatically map from a specific TxDb package (eg
TxDb.Hsapiens.UCSC.<hg38>.knownGene) to the corresponding BSgenome package
(here: BSgenome.Hsapiens.UCSC.<hg38>), I wonder how to do that for an
EnsDb package as the package name (eg EnsDb.Hsapiens.v79) contains no
information about the genome build.

A cumbersome option would be to extract the genome_build from the metadata
of the EnsDb package (which would give me for EnsDb.Hsapiens.v79:
'GRCh38') and then ask all existing BSgenome.Hsapiens packages for their
metadata release name (eg 'GRCh38' for BSgenome.Hsapiens.UCSC.hg38).

This however needs all BSgenome.Hsapiens packages installed and takes thus
too much time and space for a programmatic access.

Can you suggest a better way to map from coordinates to sequence (within
the BioC annotation functionality)?


agree, there's no easy mapping (yet). I'll implement a method "suggestGenomePackage" in the ensembldb package. In the long run I hope that also NCBI BSgenome packages (like the BSgenome.Hsapiens.NCBI.GRCh38) will become available for all species... that would make the mapping much easier...

cheers, jo

Thanks & Best,
Ludwig





dear Robert and Ludwig,

the EnsDb packages provide all the gene/transcript etc annotations for all
genes defined in the Ensembl database (for a given species and Ensembl
release). Except the column/attribute "entrezid" that is stored in the
internal database there is however no link to NCBI or UCSC annotations.
So, basically, if you want to use "pure" Ensembl based annotations: use
EnsDb, if you want to have the UCSC annotations: use the TxDb packages.

In case you need EnsDbs of other species or Ensembl versions, the
ensembldb package provides functionality to generate such packages either
using the Ensembl Perl API or using GTF files provided by Ensembl. If you
have problems building the packages, just drop me a line and I'll do
that.

cheers, jo
On 03 Jun 2015, at 15:56, Robert M. Flight <rflight79 at gmail.com<mailto:rflight79 at gmail.com>> wrote:
Ludwig,

If you do this search on the UCSC genome browser (which this annotation
package is built from), you will see that the longest variant is what
is
shown

http://genome.ucsc.edu/cgi-bin/hgTracks?clade=mammal&org=Human&db=hg38&position=brca1&hgt.positionInput=brca1&hgt.suggestTrack=knownGene&Submit=submit&hgsid=429339723_8sd4QD2jSAnAsa6cVCevtoOy4GAz&pix=1885

If instead of "genes" you do "transcripts", you will see 20 different
transcripts for this gene, including the one listed by NCBI.

I havent tried it yet (haven't upgraded R or bioconductor to latest
version), but there is now an Ensembl based annotation package as well,
that may work better??
http://bioconductor.org/packages/release/data/annotation/html/EnsDb.Hsapiens.v79.html

-Robert



On Wed, Jun 3, 2015 at 7:04 AM Ludwig Geistlinger <
Ludwig.Geistlinger at bio.ifi.lmu.de<mailto:Ludwig.Geistlinger at bio.ifi.lmu.de>> wrote:
Dear Bioc annotation team,

Querying TxDb.Hsapiens.UCSC.hg38.knownGene for gene coordinates, e.g.
for

BRCA1; ENSG00000012048; entrez:672

via

genes(TxDb.Hsapiens.UCSC.hg38.knownGene, vals=list(gene_id="672"))

gives me:

GRanges object with 1 range and 1 metadata column:
    seqnames               ranges strand |     gene_id
       <Rle>            <IRanges>  <Rle> | <character>
672    chr17 [43044295, 43170403]      - |         672
-------
seqinfo: 455 sequences (1 circular) from hg38 genome


However, querying Ensembl and NCBI Gene
http://www.ensembl.org/Homo_sapiens/Gene/Summary?db=core;g=ENSG00000012048
http://www.ncbi.nlm.nih.gov/gene/672

the gene is located at (note the difference in the end position)

Chromosome 17: 43,044,295-43,125,483 reverse strand


How is the inconsistency explained and how to extract an ENSEMBL/NCBI
conform annotation from the TxDb object?
(I am aware of biomaRt, but I want to explicitely use the Bioc
annotation
functionality).

Thanks!
Ludwig


--
Dipl.-Bioinf. Ludwig Geistlinger

Lehr- und Forschungseinheit f?r Bioinformatik
Institut f?r Informatik
Ludwig-Maximilians-Universit?t M?nchen
Amalienstrasse 17, 2. Stock, B?ro A201
80333 M?nchen

Tel.: 089-2180-4067
eMail: Ludwig.Geistlinger at bio.ifi.lmu.de<mailto:Ludwig.Geistlinger at bio.ifi.lmu.de>

_______________________________________________
Bioc-devel at r-project.org<mailto:Bioc-devel at r-project.org> mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel



_______________________________________________
Bioc-devel at r-project.org<mailto:Bioc-devel at r-project.org> mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel