Skip to content

[Bioc-devel] exonsBy dropping genes from TxDb

7 messages · Dario Strbenac, Mike Smith, Hervé Pagès +1 more

#
Dear bioc-devel,

I noticed exonsBy is dropping a lot of genes when run on a TxDb object
created with makeTxDbFromBiomart (see below). Please also see related post
on the Bioconductor support site:

https://support.bioconductor.org/p/101951/#102160

Thanks for your help.

Leonard

--
[1] 63967
[1] 36751
GRangesList object of length 1:
$ENSG00000136997
GRanges object with 9 ranges and 2 metadata columns:
      seqnames                 ranges strand |     tx_id         tx_name
         <Rle>              <IRanges>  <Rle> | <integer>     <character>
  [1]        8 [127735434, 127740477]      + |    101876 ENST00000259523
  [2]        8 [127735473, 127735817]      + |    101877 ENST00000641252
  [3]        8 [127735519, 127738772]      + |    101878 ENST00000517291
  [4]        8 [127736046, 127736612]      + |    101879 ENST00000641036
  [5]        8 [127736069, 127741434]      + |    101880 ENST00000621592
  [6]        8 [127736084, 127741434]      + |    101881 ENST00000377970
  [7]        8 [127736220, 127741372]      + |    101882 ENST00000524013
  [8]        8 [127736231, 127738475]      + |    101883 ENST00000520751
  [9]        8 [127736594, 127740958]      + |    101884 ENST00000613283

-------
seqinfo: 555 sequences (1 circular) from an unspecified genome
GRangesList object of length 0:
<0 elements>

-------
seqinfo: 555 sequences (1 circular) from an unspecified genome
R version 3.4.1 (2017-06-30)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: OS X El Capitan 10.11.6

Matrix products: default
BLAS:
/Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib
LAPACK:
/Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets
[8] methods   base

other attached packages:
[1] GenomicFeatures_1.28.5 AnnotationDbi_1.38.2   Biobase_2.36.2
[4] GenomicRanges_1.28.6   GenomeInfoDb_1.12.3    IRanges_2.10.5
[7] S4Vectors_0.14.7       BiocGenerics_0.22.1

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.13               XVector_0.16.0
 [3] GenomicAlignments_1.12.2   zlibbioc_1.22.0
 [5] BiocParallel_1.10.1        bit_1.1-12
 [7] lattice_0.20-35            rlang_0.1.2
 [9] blob_1.1.0                 tools_3.4.1
[11] grid_3.4.1                 SummarizedExperiment_1.6.5
[13] DBI_0.7                    matrixStats_0.52.2
[15] bit64_0.9-7                digest_0.6.12
[17] tibble_1.3.4               Matrix_1.2-11
[19] GenomeInfoDbData_0.99.0    rtracklayer_1.36.6
[21] bitops_1.0-6               RCurl_1.95-4.8
[23] biomaRt_2.32.1             memoise_1.1.0
[25] RSQLite_2.0                DelayedArray_0.2.7
[27] compiler_3.4.1             Rsamtools_1.28.0
[29] Biostrings_2.44.2          XML_3.98-1.9
[31] pkgconfig_2.0.1

  
  
#
Hi Leonard,

Sorry for missing your earlier posts about this. Will look into it.

Thanks,
H.
On 10/27/2017 09:07 AM, Leonard Goldstein wrote:

  
    
#
The problem seems to originate in the BioMart service itself:

   library(biomaRt)
   mart <- useMart(biomart="ensembl", dataset="hsapiens_gene_ensembl")

   attributes1 <- c("ensembl_transcript_id", "ensembl_exon_id")
   df1 <- getBM(attributes1, mart=mart)

   dim(df1)
   # [1] 1150974       2

   subset(df1, ensembl_transcript_id == "ENST00000485971")
   #         ensembl_transcript_id ensembl_exon_id
   # 1078905       ENST00000485971 ENSE00001952391
   # 1078906       ENST00000485971 ENSE00003639486
   # 1078907       ENST00000485971 ENSE00001881546

Correct. ENST00000485971 has 3 exons.

But if I request more attributes:

   attributes2 <- c(attributes1, "rank", "genomic_coding_start", 
"cds_start", "5_utr_start")
   df2 <- getBM(attributes2, mart=mart)

   dim(df2)
   # [1] 1051158       6

Uh, less rows?

   subset(df2, ensembl_transcript_id == "ENST00000485971")
   # [1] ensembl_transcript_id ensembl_exon_id       rank
   # [4] genomic_coding_start  cds_start             5_utr_start
<0 rows> (or 0-length row.names)

ENST00000485971 disappeared from the result!

It's like someone replaced a LEFT JOIN with an INNER JOIN somewhere
in the BioMart service.

The impact of this on makeTxDbFrombiomart() is that many exons indeed
don't make it into the TxDb object so we end up with many exon-less
transcripts. I'll add a sanity check and will error when this happens. 
Better fail with an informative error message than silently return
a broken TxDb.

Before I contact the BioMart people about this, I'd like to remove R
and the biomaRt package from the equation but I'm not sure how to do
that (I was not able to reproduce the problem by using their web
interface https://www.ensembl.org/biomart). Any help with this would
be greatly appreciated.

Thanks,
H.

 > sessionInfo()
R version 3.4.2 (2017-09-28)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.5 LTS

Matrix products: default
BLAS: /home/hpages/R/R-3.4.2-bioc35/lib/libRblas.so
LAPACK: /home/hpages/R/R-3.4.2-bioc35/lib/libRlapack.so

locale:
  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
  [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] biomaRt_2.32.1       BiocInstaller_1.26.1

loaded via a namespace (and not attached):
  [1] Rcpp_0.12.13         IRanges_2.10.5       XML_3.98-1.9
  [4] digest_0.6.12        bitops_1.0-6         DBI_0.7
  [7] stats4_3.4.2         RSQLite_2.0          rlang_0.1.2
[10] blob_1.1.0           S4Vectors_0.14.7     tools_3.4.2
[13] bit64_0.9-7          Biobase_2.36.2       RCurl_1.95-4.8
[16] bit_1.1-12           parallel_3.4.2       compiler_3.4.2
[19] BiocGenerics_0.22.1  AnnotationDbi_1.38.2 memoise_1.1.0
[22] tibble_1.3.4
On 10/27/2017 10:05 AM, Herv? Pag?s wrote:

  
    
#
Good day,

I stepped through the code until execution reached the end of postForm in RCurl which is called by getBM and obtains the textual result from the server. If I check the contents of write$value(), the example missing transcript is not there.

Browse[3]> grep("ENST00000485971", write$value())
integer(0)

write$value is a weird function. It's prototype is function (collapse = "", ...) but its body contains code such as

    if (is.null(collapse)) 
        return(txt)

I wonder where txt is created. It's not passed as an extra variable.

Browse[7]> print(list(...))
list()

Searching the R code reveals that txt is created as a global variable in another function named dynCurlReader by the code statement txt <<- character().

RCurl also uses functions that don't begin with a dot but are undocumented.

ans = encode(ans)
Browse[7]> ?encode
No documentation for ?encode? in specified packages and libraries

Anyway, the transcript ID is also missing from txt.

Browse[7]> grep("ENST00000485971", txt)
integer(0)

It's hard to know what the obfuscated code of RCurl is doing.

--------------------------------------
Dario Strbenac
University of Sydney
Camperdown NSW 2050
Australia
#
My feeling is that this isn't related to RCurl, since the example
transcript is missing if you use httr to submit the query instead.  You can
check this with the code at
https://gist.github.com/grimbough/7e7a47b7a4f64915220ce35cc1ce8f39


I wonder if this is related to BioMart's instability when you submit large
queries.  The web interface suggests no more than 500 entries when
filtering on something like a gene ID, but in these examples we're applying
no filter at all.  Problems related to this have been reported before (
https://support.bioconductor.org/p/86358/) and I patched the devel version
of biomaRt to submit queries in batches of 500 - although this currently
has no effect if there isn't a set of 'values' to split into batches.

Interestingly, if I first obtain a list of all gene IDs in Ensembl, and
then submit those in a batched query I get more exons than your first
approach.

all_genes <- getBM(attributes = "ensembl_gene_id", mart=mart)
attributes1 <- c("ensembl_transcript_id", "ensembl_exon_id")
attributes2 <- c(attributes1, "rank", "genomic_coding_start", "cds_start",
"5_utr_start")
df3 <- getBM(attributes1, filters = "ensembl_gene_id", values =
all_genes[,1], mart=mart)
df4 <- getBM(attributes2, filters = "ensembl_gene_id", values =
all_genes[,1], mart=mart)
[1] 1309356       6

The number of returned results is consistent for even with the larger
number of attributes
[1] TRUE

And some of these transcripts are clearly not present in the first query:
"ensembl_transcript_id"] ))
[1] 28128

It could be that my batched code is doing something wrong, but I'm starting
to suspect that even the first query is dropping data silently!

Mike


On 28 October 2017 at 13:00, Dario Strbenac <dstr7320 at uni.sydney.edu.au>
wrote:

  
  
1 day later
#
Thanks Dario and Mike for looking into this.

In the mean time I added makeTxDbFromEnsembl() for creating a TxDb
object by querying directly an Ensembl MySQL server. Only lightly
tested but so far seems to faster and more reliable than
makeTxDbFromBiomart():

   > library(GenomicFeatures)

   > system.time(txdb <- makeTxDbFromEnsembl("Homo sapiens", 
server="useastdb.ensembl.org"))
   Fetch transcripts and genes from Ensembl ... OK
   Fetch exons and CDS from Ensembl ... OK
   Fetch chromosome names and lengths from Ensembl ...OK
   Gather the metadata ... OK
   Make the TxDb object ... OK
      user  system elapsed
    46.420   1.271  58.270

   > txdb
   TxDb object:
   # Db type: TxDb
   # Supporting package: GenomicFeatures
   # Data source: Ensembl
   # Organism: Homo sapiens
   # Ensembl release: 90
   # Ensembl database: homo_sapiens_core_90_38
   # MySQL server: useastdb.ensembl.org
   # transcript_nrow: 220144
   # exon_nrow: 757782
   # cds_nrow: 789614
   # Db created by: GenomicFeatures package from Bioconductor
   # Creation time: 2017-10-29 22:13:21 -0700 (Sun, 29 Oct 2017)
   # GenomicFeatures version at creation time: 1.29.16
   # RSQLite version at creation time: 2.0
   # DBSCHEMAVERSION: 1.2

It's in GenomicFeatures 1.29.16.

Cheers,
H.
On 10/28/2017 03:32 PM, Mike Smith wrote:

  
    
#
Thanks Herv? and others for looking into this.

Leonard
On Sun, Oct 29, 2017 at 10:19 PM, Herv? Pag?s <hpages at fredhutch.org> wrote: