
[Bioc-devel] Txdb Issues - all exon names are NA's ?

6 messages · Arora, Sonali, Hervé Pagès, Marc Carlson

#
Hi everyone,

I was trying to get the exons by gene from a txdb object but I get NA's 
for all exon_name's. Please advise.

 > library(TxDb.Hsapiens.UCSC.hg19.knownGene)
 > txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
 > ebg2 <- exonsBy(txdb, by="gene")
 >
 > ebg2
GRangesList object of length 23459:
$1
GRanges object with 15 ranges and 2 metadata columns:
        seqnames               ranges strand   |   exon_id
           <Rle>            <IRanges>  <Rle>   | <integer>
    [1]    chr19 [58858172, 58858395]      -   |    250809
    [2]    chr19 [58858719, 58859006]      -   |    250810
    [3]    chr19 [58859832, 58860494]      -   |    250811
    [4]    chr19 [58860934, 58862017]      -   |    250812
    [5]    chr19 [58861736, 58862017]      -   |    250813
    ...      ...                  ...    ... ...       ...
   [11]    chr19 [58868951, 58869015]      -   |    250821
   [12]    chr19 [58869318, 58869652]      -   |    250822
   [13]    chr19 [58869855, 58869951]      -   |    250823
   [14]    chr19 [58870563, 58870689]      -   |    250824
   [15]    chr19 [58874043, 58874214]      -   |    250825
          exon_name
        <character>
    [1]        <NA>
    [2]        <NA>
    [3]        <NA>
    [4]        <NA>
    [5]        <NA>
    ...         ...
   [11]        <NA>
   [12]        <NA>
   [13]        <NA>
   [14]        <NA>
   [15]        <NA>

$10
GRanges object with 2 ranges and 2 metadata columns:
       seqnames               ranges strand | exon_id exon_name
   [1]     chr8 [18248755, 18248855]      + |  113603      <NA>
   [2]     chr8 [18257508, 18258723]      + |  113604      <NA>

...
<23457 more elements>
-------
seqinfo: 93 sequences (1 circular) from hg19 genome
 > testgr <- unlist(ebg2)
 > table(is.na(mcols(testgr)$exon_name))

   TRUE
272776
 > sessionInfo()
R version 3.2.2 RC (2015-08-09 r68965)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils
[7] datasets  methods   base

other attached packages:
[1] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.1
[2] GenomicFeatures_1.21.29
[3] AnnotationDbi_1.31.18
[4] Biobase_2.29.1
[5] GenomicRanges_1.21.28
[6] GenomeInfoDb_1.5.16
[7] IRanges_2.3.21
[8] S4Vectors_0.7.18
[9] BiocGenerics_0.15.6

loaded via a namespace (and not attached):
  [1] XVector_0.9.4              zlibbioc_1.15.0
  [3] GenomicAlignments_1.5.17   BiocParallel_1.3.52
  [5] tools_3.2.2                SummarizedExperiment_0.3.9
  [7] DBI_0.3.1                  lambda.r_1.1.7
  [9] futile.logger_1.4.1        rtracklayer_1.29.27
[11] futile.options_1.0.0       bitops_1.0-6
[13] RCurl_1.95-4.7             biomaRt_2.25.3
[15] RSQLite_1.0.0              Biostrings_2.37.8
[17] Rsamtools_1.21.17          XML_3.98-1.3
#
Hi Sonali,

UCSC doesn't provide names for the exons of their gene models.
See the table where this data is coming from:

 
https://genome.ucsc.edu/cgi-bin/hgTables?db=hg19&hgta_group=genes&hgta_track=knownGene&hgta_table=knownGene&hgta_doSchema=describe+table+schema

The exon information is all coming from the exonStarts and exonEnds
columns. No exon names!
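
To illustrate the point, here is a minimal base-R sketch of how exon ranges can be rebuilt from a knownGene-style row. The row itself is hypothetical: the transcript name is made up, and the coordinates are back-derived from the two chr8 ranges printed earlier in this thread, assuming UCSC's 0-based start / 1-based end convention. Nothing in such a row carries a per-exon name, so the name column can only be NA.

```r
# Hypothetical knownGene-style row. The name is invented for illustration;
# coordinates are back-derived from the chr8 ranges shown in this thread.
row <- list(
  name       = "uc_example.1",
  chrom      = "chr8",
  strand     = "+",
  exonStarts = "18248754,18257507,",  # 0-based starts, trailing comma as in UCSC tables
  exonEnds   = "18248855,18258723,"   # 1-based (inclusive) ends
)

# Split a comma-separated coordinate string into an integer vector.
# strsplit() drops the empty piece after the trailing comma.
parse_coords <- function(x) as.integer(strsplit(x, ",", fixed = TRUE)[[1]])

exons <- data.frame(
  chrom  = row$chrom,
  start  = parse_coords(row$exonStarts) + 1L,  # convert to 1-based starts
  end    = parse_coords(row$exonEnds),
  strand = row$strand
)

# The table has no exon-name field, so there is nothing to fill in here.
exons$exon_name <- NA_character_
print(exons)
```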

H.

PS: Maybe this would better be asked on the support site.
On 09/22/2015 04:44 PM, Arora, Sonali wrote:

  
    
#
Herve is right. UCSC doesn't give us this information. And actually, I
think it's pretty rare to see exon names from anybody. So it's weird to
me that they are a default return value for this method.

  Marc
On Tue, Sep 22, 2015 at 5:29 PM, Hervé Pagès <hpages at fredhutch.org> wrote:

            

  
  
#
I was following Mike's RNAseq workflow from here
http://www.bioconductor.org/help/workflows/rnaseqGene/

and it had exon_name's - but that's probably because the txdb is made 
from NCBI (GRCh37.75)

Thanks for the clarification Herve and Marc!

Sonali.
On 9/22/2015 5:39 PM, Marc Carlson wrote:

  
    
#
Hi Marc,
On 09/22/2015 05:39 PM, Marc Carlson wrote:
Ensembl does provide exon names/ids so any TxDb object that was created
with makeTxDbFromBiomart("ensembl", ...) should have them:

   library(GenomicFeatures)
   txdb <- makeTxDbFromBiomart("ensembl", dataset="celegans_gene_ensembl")
   exonsBy(txdb, use.names=TRUE)$Y74C9A.2a.2
   # GRanges object with 4 ranges and 3 metadata columns:
   #       seqnames         ranges strand |   exon_id          exon_name exon_rank
   #          <Rle>      <IRanges>  <Rle> | <integer>        <character> <integer>
   #   [1]        I [10413, 10585]      + |         1  WBGene00022276.e1         1
   #   [2]        I [11618, 11689]      + |         6  WBGene00022276.e6         2
   #   [3]        I [14951, 15160]      + |        11 WBGene00022276.e11         3
   #   [4]        I [16473, 16842]      + |        14 WBGene00022276.e14         4
   #   -------
   #   seqinfo: 7 sequences (1 circular) from an unspecified genome

Note that the *By() extractors don't let the user choose which column
to return at the moment so that's why it was decided (a long time ago)
to return exon internal ids *and* names (better more than less).
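
Since the columns cannot be chosen up front, one possible workaround is to drop the all-NA exon_name column after extraction. The sketch below is only an illustration under stated assumptions: it builds a small toy GRangesList by hand (a few ranges copied from the output earlier in this thread) as a stand-in for the result of exonsBy(txdb, by="gene"), and the names flat and ebg_clean are made up for the example.

```r
library(GenomicRanges)

# Toy stand-in for exonsBy(txdb, by = "gene"); ranges copied from the thread.
gr <- GRanges(
  seqnames  = c("chr19", "chr19", "chr8"),
  ranges    = IRanges(start = c(58858172, 58858719, 18248755),
                      end   = c(58858395, 58859006, 18248855)),
  strand    = c("-", "-", "+"),
  exon_id   = c(250809L, 250810L, 113603L),
  exon_name = rep(NA_character_, 3)
)
ebg <- split(gr, c("1", "1", "10"))  # group exons by (toy) gene id

# Flatten, drop exon_name only if it carries no information, then regroup.
flat <- unlist(ebg, use.names = FALSE)
if (all(is.na(mcols(flat)$exon_name))) {
  mcols(flat)$exon_name <- NULL
}
ebg_clean <- relist(flat, ebg)  # same grouping and names as ebg
ebg_clean
```

The same three steps should apply unchanged to a real exonsBy() result; for a TxDb built from a source that does provide exon names, the all(is.na(...)) guard leaves the column in place.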

H.

  
    
#
Works for me.

 Marc
On Tue, Sep 22, 2015 at 6:03 PM, Hervé Pagès <hpages at fredhutch.org> wrote: