[Bioc-devel] exonsBy dropping genes from TxDb

7 messages · Dario Strbenac, Mike Smith, Hervé Pagès +1 more

Original

1

7

Leonard Goldstein

Fri, Oct 27, 2017 9:07 AM #

Dear bioc-devel,

I noticed exonsBy is dropping a lot of genes when run on a TxDb object
created with makeTxDbFromBiomart (see below). Please also see related post
on the Bioconductor support site:

https://support.bioconductor.org/p/101951/#102160

Thanks for your help.

Leonard

--

[1] 63967

[1] 36751

GRangesList object of length 1:
$ENSG00000136997
GRanges object with 9 ranges and 2 metadata columns:
      seqnames                 ranges strand |     tx_id         tx_name
         <Rle>              <IRanges>  <Rle> | <integer>     <character>
  [1]        8 [127735434, 127740477]      + |    101876 ENST00000259523
  [2]        8 [127735473, 127735817]      + |    101877 ENST00000641252
  [3]        8 [127735519, 127738772]      + |    101878 ENST00000517291
  [4]        8 [127736046, 127736612]      + |    101879 ENST00000641036
  [5]        8 [127736069, 127741434]      + |    101880 ENST00000621592
  [6]        8 [127736084, 127741434]      + |    101881 ENST00000377970
  [7]        8 [127736220, 127741372]      + |    101882 ENST00000524013
  [8]        8 [127736231, 127738475]      + |    101883 ENST00000520751
  [9]        8 [127736594, 127740958]      + |    101884 ENST00000613283

-------
seqinfo: 555 sequences (1 circular) from an unspecified genome

GRangesList object of length 0:
<0 elements>

-------
seqinfo: 555 sequences (1 circular) from an unspecified genome

R version 3.4.1 (2017-06-30)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: OS X El Capitan 10.11.6

Matrix products: default
BLAS:
/Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib
LAPACK:
/Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets
[8] methods   base

other attached packages:
[1] GenomicFeatures_1.28.5 AnnotationDbi_1.38.2   Biobase_2.36.2
[4] GenomicRanges_1.28.6   GenomeInfoDb_1.12.3    IRanges_2.10.5
[7] S4Vectors_0.14.7       BiocGenerics_0.22.1

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.13               XVector_0.16.0
 [3] GenomicAlignments_1.12.2   zlibbioc_1.22.0
 [5] BiocParallel_1.10.1        bit_1.1-12
 [7] lattice_0.20-35            rlang_0.1.2
 [9] blob_1.1.0                 tools_3.4.1
[11] grid_3.4.1                 SummarizedExperiment_1.6.5
[13] DBI_0.7                    matrixStats_0.52.2
[15] bit64_0.9-7                digest_0.6.12
[17] tibble_1.3.4               Matrix_1.2-11
[19] GenomeInfoDbData_0.99.0    rtracklayer_1.36.6
[21] bitops_1.0-6               RCurl_1.95-4.8
[23] biomaRt_2.32.1             memoise_1.1.0
[25] RSQLite_2.0                DelayedArray_0.2.7
[27] compiler_3.4.1             Rsamtools_1.28.0
[29] Biostrings_2.44.2          XML_3.98-1.9
[31] pkgconfig_2.0.1

Fri, Oct 27, 2017 10:05 AM #

Hi Leonard,

Sorry for missing your earlier posts about this. Will look into it.

Thanks,
H.

On 10/27/2017 09:07 AM, Leonard Goldstein wrote:

Dear bioc-devel,

I noticed exonsBy is dropping a lot of genes when run on a TxDb object
created with makeTxDbFromBiomart (see below). Please also see related post
on the Bioconductor support site:

https://urldefense.proofpoint.com/v2/url?u=https-3A__support.bioconductor.org_p_101951_-23102160&d=DwICAg&c=eRAMFD45gAfqt84VtBcfhQ&r=BK7q3XeAvimeWdGbWY_wJYbW0WYiZvSXAJJKaaPhzWA&m=ExVvVr6r4wUKgBmPNOQiJXSI2O1FYqFRz_llKQY_jD0&s=xUz41e2CIjMTXmO5LqkFp12ySvH9zzA0s5OMWzgWosU&e=

Thanks for your help.

Leonard

--

tx <- makeTxDbFromBiomart()
txs_by_gene <- transcriptsBy(tx, "gene")
exs_by_gene <- exonsBy(tx, "gene")
length(txs_by_gene)

[1] 63967

length(exs_by_gene)

[1] 36751

subsetByOverlaps(txs_by_gene, GRanges("8", IRanges(127735434,127741434)))

GRangesList object of length 1:
$ENSG00000136997
GRanges object with 9 ranges and 2 metadata columns:
       seqnames                 ranges strand |     tx_id         tx_name
          <Rle>              <IRanges>  <Rle> | <integer>     <character>
   [1]        8 [127735434, 127740477]      + |    101876 ENST00000259523
   [2]        8 [127735473, 127735817]      + |    101877 ENST00000641252
   [3]        8 [127735519, 127738772]      + |    101878 ENST00000517291
   [4]        8 [127736046, 127736612]      + |    101879 ENST00000641036
   [5]        8 [127736069, 127741434]      + |    101880 ENST00000621592
   [6]        8 [127736084, 127741434]      + |    101881 ENST00000377970
   [7]        8 [127736220, 127741372]      + |    101882 ENST00000524013
   [8]        8 [127736231, 127738475]      + |    101883 ENST00000520751
   [9]        8 [127736594, 127740958]      + |    101884 ENST00000613283

-------
seqinfo: 555 sequences (1 circular) from an unspecified genome

subsetByOverlaps(exs_by_gene, GRanges("8", IRanges(127735434,127741434)))

GRangesList object of length 0:
<0 elements>

-------
seqinfo: 555 sequences (1 circular) from an unspecified genome

sessionInfo()

R version 3.4.1 (2017-06-30)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: OS X El Capitan 10.11.6

Matrix products: default
BLAS:
/Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib
LAPACK:
/Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets
[8] methods   base

other attached packages:
[1] GenomicFeatures_1.28.5 AnnotationDbi_1.38.2   Biobase_2.36.2
[4] GenomicRanges_1.28.6   GenomeInfoDb_1.12.3    IRanges_2.10.5
[7] S4Vectors_0.14.7       BiocGenerics_0.22.1

loaded via a namespace (and not attached):
  [1] Rcpp_0.12.13               XVector_0.16.0
  [3] GenomicAlignments_1.12.2   zlibbioc_1.22.0
  [5] BiocParallel_1.10.1        bit_1.1-12
  [7] lattice_0.20-35            rlang_0.1.2
  [9] blob_1.1.0                 tools_3.4.1
[11] grid_3.4.1                 SummarizedExperiment_1.6.5
[13] DBI_0.7                    matrixStats_0.52.2
[15] bit64_0.9-7                digest_0.6.12
[17] tibble_1.3.4               Matrix_1.2-11
[19] GenomeInfoDbData_0.99.0    rtracklayer_1.36.6
[21] bitops_1.0-6               RCurl_1.95-4.8
[23] biomaRt_2.32.1             memoise_1.1.0
[25] RSQLite_2.0                DelayedArray_0.2.7
[27] compiler_3.4.1             Rsamtools_1.28.0
[29] Biostrings_2.44.2          XML_3.98-1.9
[31] pkgconfig_2.0.1

	[[alternative HTML version deleted]]

_______________________________________________
Bioc-devel at r-project.org mailing list
https://urldefense.proofpoint.com/v2/url?u=https-3A__stat.ethz.ch_mailman_listinfo_bioc-2Ddevel&d=DwICAg&c=eRAMFD45gAfqt84VtBcfhQ&r=BK7q3XeAvimeWdGbWY_wJYbW0WYiZvSXAJJKaaPhzWA&m=ExVvVr6r4wUKgBmPNOQiJXSI2O1FYqFRz_llKQY_jD0&s=8iOGgTunhaWEQ1FdrTFYVPCHLcoDa2DDO2pAluGQrOQ&e=

Herv? Pag?s

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fredhutch.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319

Fri, Oct 27, 2017 8:34 PM #

The problem seems to originate in the BioMart service itself:

   library(biomaRt)
   mart <- useMart(biomart="ensembl", dataset="hsapiens_gene_ensembl")

   attributes1 <- c("ensembl_transcript_id", "ensembl_exon_id")
   df1 <- getBM(attributes1, mart=mart)

   dim(df1)
   # [1] 1150974       2

   subset(df1, ensembl_transcript_id == "ENST00000485971")
   #         ensembl_transcript_id ensembl_exon_id
   # 1078905       ENST00000485971 ENSE00001952391
   # 1078906       ENST00000485971 ENSE00003639486
   # 1078907       ENST00000485971 ENSE00001881546

Correct. ENST00000485971 has 3 exons.

But if I request more attributes:

   attributes2 <- c(attributes1, "rank", "genomic_coding_start", 
"cds_start", "5_utr_start")
   df2 <- getBM(attributes2, mart=mart)

   dim(df2)
   # [1] 1051158       6

Uh, less rows?

   subset(df2, ensembl_transcript_id == "ENST00000485971")
   # [1] ensembl_transcript_id ensembl_exon_id       rank
   # [4] genomic_coding_start  cds_start             5_utr_start
<0 rows> (or 0-length row.names)

ENST00000485971 disappeared from the result!

It's like someone replaced a LEFT JOIN with an INNER JOIN somewhere
in the BioMart service.

The impact of this on makeTxDbFrombiomart() is that many exons indeed
don't make it into the TxDb object so we end up with many exon-less
transcripts. I'll add a sanity check and will error when this happens. 
Better fail with an informative error message than silently return
a broken TxDb.

Before I contact the BioMart people about this, I'd like to remove R
and the biomaRt package from the equation but I'm not sure how to do
that (I was not able to reproduce the problem by using their web
interface https://www.ensembl.org/biomart). Any help with this would
be greatly appreciated.

Thanks,
H.

 > sessionInfo()
R version 3.4.2 (2017-09-28)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.5 LTS

Matrix products: default
BLAS: /home/hpages/R/R-3.4.2-bioc35/lib/libRblas.so
LAPACK: /home/hpages/R/R-3.4.2-bioc35/lib/libRlapack.so

locale:
  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
  [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] biomaRt_2.32.1       BiocInstaller_1.26.1

loaded via a namespace (and not attached):
  [1] Rcpp_0.12.13         IRanges_2.10.5       XML_3.98-1.9
  [4] digest_0.6.12        bitops_1.0-6         DBI_0.7
  [7] stats4_3.4.2         RSQLite_2.0          rlang_0.1.2
[10] blob_1.1.0           S4Vectors_0.14.7     tools_3.4.2
[13] bit64_0.9-7          Biobase_2.36.2       RCurl_1.95-4.8
[16] bit_1.1-12           parallel_3.4.2       compiler_3.4.2
[19] BiocGenerics_0.22.1  AnnotationDbi_1.38.2 memoise_1.1.0
[22] tibble_1.3.4

On 10/27/2017 10:05 AM, Herv? Pag?s wrote:

Hi Leonard,

Sorry for missing your earlier posts about this. Will look into it.

Thanks,
H.

On 10/27/2017 09:07 AM, Leonard Goldstein wrote:

Dear bioc-devel,

I noticed exonsBy is dropping a lot of genes when run on a TxDb object
created with makeTxDbFromBiomart (see below). Please also see related
post
on the Bioconductor support site:

https://urldefense.proofpoint.com/v2/url?u=https-3A__support.bioconductor.org_p_101951_-23102160&d=DwICAg&c=eRAMFD45gAfqt84VtBcfhQ&r=BK7q3XeAvimeWdGbWY_wJYbW0WYiZvSXAJJKaaPhzWA&m=ExVvVr6r4wUKgBmPNOQiJXSI2O1FYqFRz_llKQY_jD0&s=xUz41e2CIjMTXmO5LqkFp12ySvH9zzA0s5OMWzgWosU&e=


Thanks for your help.

Leonard

--

tx <- makeTxDbFromBiomart()
txs_by_gene <- transcriptsBy(tx, "gene")
exs_by_gene <- exonsBy(tx, "gene")
length(txs_by_gene)

[1] 63967

length(exs_by_gene)

[1] 36751

subsetByOverlaps(txs_by_gene, GRanges("8",
IRanges(127735434,127741434)))

GRangesList object of length 1:
$ENSG00000136997
GRanges object with 9 ranges and 2 metadata columns:
       seqnames                 ranges strand |     tx_id         tx_name
          <Rle>              <IRanges>  <Rle> | <integer>     <character>
   [1]        8 [127735434, 127740477]      + |    101876 ENST00000259523
   [2]        8 [127735473, 127735817]      + |    101877 ENST00000641252
   [3]        8 [127735519, 127738772]      + |    101878 ENST00000517291
   [4]        8 [127736046, 127736612]      + |    101879 ENST00000641036
   [5]        8 [127736069, 127741434]      + |    101880 ENST00000621592
   [6]        8 [127736084, 127741434]      + |    101881 ENST00000377970
   [7]        8 [127736220, 127741372]      + |    101882 ENST00000524013
   [8]        8 [127736231, 127738475]      + |    101883 ENST00000520751
   [9]        8 [127736594, 127740958]      + |    101884 ENST00000613283

-------
seqinfo: 555 sequences (1 circular) from an unspecified genome

subsetByOverlaps(exs_by_gene, GRanges("8",
IRanges(127735434,127741434)))

GRangesList object of length 0:
<0 elements>

-------
seqinfo: 555 sequences (1 circular) from an unspecified genome

sessionInfo()

R version 3.4.1 (2017-06-30)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: OS X El Capitan 10.11.6

Matrix products: default
BLAS:
/Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib

LAPACK:
/Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib


locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets
[8] methods   base

other attached packages:
[1] GenomicFeatures_1.28.5 AnnotationDbi_1.38.2   Biobase_2.36.2
[4] GenomicRanges_1.28.6   GenomeInfoDb_1.12.3    IRanges_2.10.5
[7] S4Vectors_0.14.7       BiocGenerics_0.22.1

loaded via a namespace (and not attached):
  [1] Rcpp_0.12.13               XVector_0.16.0
  [3] GenomicAlignments_1.12.2   zlibbioc_1.22.0
  [5] BiocParallel_1.10.1        bit_1.1-12
  [7] lattice_0.20-35            rlang_0.1.2
  [9] blob_1.1.0                 tools_3.4.1
[11] grid_3.4.1                 SummarizedExperiment_1.6.5
[13] DBI_0.7                    matrixStats_0.52.2
[15] bit64_0.9-7                digest_0.6.12
[17] tibble_1.3.4               Matrix_1.2-11
[19] GenomeInfoDbData_0.99.0    rtracklayer_1.36.6
[21] bitops_1.0-6               RCurl_1.95-4.8
[23] biomaRt_2.32.1             memoise_1.1.0
[25] RSQLite_2.0                DelayedArray_0.2.7
[27] compiler_3.4.1             Rsamtools_1.28.0
[29] Biostrings_2.44.2          XML_3.98-1.9
[31] pkgconfig_2.0.1

    [[alternative HTML version deleted]]

_______________________________________________
Bioc-devel at r-project.org mailing list
https://urldefense.proofpoint.com/v2/url?u=https-3A__stat.ethz.ch_mailman_listinfo_bioc-2Ddevel&d=DwICAg&c=eRAMFD45gAfqt84VtBcfhQ&r=BK7q3XeAvimeWdGbWY_wJYbW0WYiZvSXAJJKaaPhzWA&m=ExVvVr6r4wUKgBmPNOQiJXSI2O1FYqFRz_llKQY_jD0&s=8iOGgTunhaWEQ1FdrTFYVPCHLcoDa2DDO2pAluGQrOQ&e=

Herv? Pag?s

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fredhutch.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319

Sat, Oct 28, 2017 4:00 AM #

Good day,

I stepped through the code until execution reached the end of postForm in RCurl which is called by getBM and obtains the textual result from the server. If I check the contents of write$value(), the example missing transcript is not there.

Browse[3]> grep("ENST00000485971", write$value())
integer(0)

write$value is a weird function. It's prototype is function (collapse = "", ...) but its body contains code such as

    if (is.null(collapse)) 
        return(txt)

I wonder where txt is created. It's not passed as an extra variable.

Browse[7]> print(list(...))
list()

Searching the R code reveals that txt is created as a global variable in another function named dynCurlReader by the code statement txt <<- character().

RCurl also uses functions that don't begin with a dot but are undocumented.

ans = encode(ans)
Browse[7]> ?encode
No documentation for ?encode? in specified packages and libraries

Anyway, the transcript ID is also missing from txt.

Browse[7]> grep("ENST00000485971", txt)
integer(0)

It's hard to know what the obfuscated code of RCurl is doing.

--------------------------------------
Dario Strbenac
University of Sydney
Camperdown NSW 2050
Australia

Mike Smith

Sat, Oct 28, 2017 3:32 PM #

My feeling is that this isn't related to RCurl, since the example
transcript is missing if you use httr to submit the query instead.  You can
check this with the code at
https://gist.github.com/grimbough/7e7a47b7a4f64915220ce35cc1ce8f39


I wonder if this is related to BioMart's instability when you submit large
queries.  The web interface suggests no more than 500 entries when
filtering on something like a gene ID, but in these examples we're applying
no filter at all.  Problems related to this have been reported before (
https://support.bioconductor.org/p/86358/) and I patched the devel version
of biomaRt to submit queries in batches of 500 - although this currently
has no effect if there isn't a set of 'values' to split into batches.

Interestingly, if I first obtain a list of all gene IDs in Ensembl, and
then submit those in a batched query I get more exons than your first
approach.

all_genes <- getBM(attributes = "ensembl_gene_id", mart=mart)
attributes1 <- c("ensembl_transcript_id", "ensembl_exon_id")
attributes2 <- c(attributes1, "rank", "genomic_coding_start", "cds_start",
"5_utr_start")
df3 <- getBM(attributes1, filters = "ensembl_gene_id", values =
all_genes[,1], mart=mart)
df4 <- getBM(attributes2, filters = "ensembl_gene_id", values =
all_genes[,1], mart=mart)

[1] 1309356       6

The number of returned results is consistent for even with the larger
number of attributes

[1] TRUE

And some of these transcripts are clearly not present in the first query:

"ensembl_transcript_id"] ))
[1] 28128

It could be that my batched code is doing something wrong, but I'm starting
to suspect that even the first query is dropping data silently!

Mike


On 28 October 2017 at 13:00, Dario Strbenac <dstr7320 at uni.sydney.edu.au>
wrote:

_______________________________________________
Bioc-devel at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel

1 day later

Sun, Oct 29, 2017 10:19 PM #

Thanks Dario and Mike for looking into this.

In the mean time I added makeTxDbFromEnsembl() for creating a TxDb
object by querying directly an Ensembl MySQL server. Only lightly
tested but so far seems to faster and more reliable than
makeTxDbFromBiomart():

   > library(GenomicFeatures)

   > system.time(txdb <- makeTxDbFromEnsembl("Homo sapiens", 
server="useastdb.ensembl.org"))
   Fetch transcripts and genes from Ensembl ... OK
   Fetch exons and CDS from Ensembl ... OK
   Fetch chromosome names and lengths from Ensembl ...OK
   Gather the metadata ... OK
   Make the TxDb object ... OK
      user  system elapsed
    46.420   1.271  58.270

   > txdb
   TxDb object:
   # Db type: TxDb
   # Supporting package: GenomicFeatures
   # Data source: Ensembl
   # Organism: Homo sapiens
   # Ensembl release: 90
   # Ensembl database: homo_sapiens_core_90_38
   # MySQL server: useastdb.ensembl.org
   # transcript_nrow: 220144
   # exon_nrow: 757782
   # cds_nrow: 789614
   # Db created by: GenomicFeatures package from Bioconductor
   # Creation time: 2017-10-29 22:13:21 -0700 (Sun, 29 Oct 2017)
   # GenomicFeatures version at creation time: 1.29.16
   # RSQLite version at creation time: 2.0
   # DBSCHEMAVERSION: 1.2

It's in GenomicFeatures 1.29.16.

Cheers,
H.

On 10/28/2017 03:32 PM, Mike Smith wrote:

My feeling is that this isn't related to RCurl, since the example
transcript is missing if you use httr to submit the query instead.  You can
check this with the code at
https://urldefense.proofpoint.com/v2/url?u=https-3A__gist.github.com_grimbough_7e7a47b7a4f64915220ce35cc1ce8f39&d=DwIGaQ&c=eRAMFD45gAfqt84VtBcfhQ&r=BK7q3XeAvimeWdGbWY_wJYbW0WYiZvSXAJJKaaPhzWA&m=njpXf5TkV-BLhepKQRWVTdAzBmWWCsDj7RIPgs7FWBM&s=orkmK9y7kVCLbLB6tlMJfgT5V_GKzVysZ00DMevuAm4&e=


I wonder if this is related to BioMart's instability when you submit large
queries.  The web interface suggests no more than 500 entries when
filtering on something like a gene ID, but in these examples we're applying
no filter at all.  Problems related to this have been reported before (
https://urldefense.proofpoint.com/v2/url?u=https-3A__support.bioconductor.org_p_86358_&d=DwIGaQ&c=eRAMFD45gAfqt84VtBcfhQ&r=BK7q3XeAvimeWdGbWY_wJYbW0WYiZvSXAJJKaaPhzWA&m=njpXf5TkV-BLhepKQRWVTdAzBmWWCsDj7RIPgs7FWBM&s=CxVV4WArQc4xcf74Az8V8vfQMr80Q8K2um6sKTggHw0&e=) and I patched the devel version
of biomaRt to submit queries in batches of 500 - although this currently
has no effect if there isn't a set of 'values' to split into batches.

Interestingly, if I first obtain a list of all gene IDs in Ensembl, and
then submit those in a batched query I get more exons than your first
approach.

all_genes <- getBM(attributes = "ensembl_gene_id", mart=mart)
attributes1 <- c("ensembl_transcript_id", "ensembl_exon_id")
attributes2 <- c(attributes1, "rank", "genomic_coding_start", "cds_start",
"5_utr_start")
df3 <- getBM(attributes1, filters = "ensembl_gene_id", values =
all_genes[,1], mart=mart)
df4 <- getBM(attributes2, filters = "ensembl_gene_id", values =
all_genes[,1], mart=mart)

dim(df3)

[1] 1309356       6

The number of returned results is consistent for even with the larger
number of attributes

identical(df3[,1], df4[,1])

[1] TRUE

And some of these transcripts are clearly not present in the first query:

length(which( !unique(df3[,"ensembl_transcript_id"]) %in% df1[,

"ensembl_transcript_id"] ))
[1] 28128

It could be that my batched code is doing something wrong, but I'm starting
to suspect that even the first query is dropping data silently!

Mike


On 28 October 2017 at 13:00, Dario Strbenac <dstr7320 at uni.sydney.edu.au>
wrote:

Good day,

I stepped through the code until execution reached the end of postForm in
RCurl which is called by getBM and obtains the textual result from the
server. If I check the contents of write$value(), the example missing
transcript is not there.

Browse[3]> grep("ENST00000485971", write$value())
integer(0)

write$value is a weird function. It's prototype is function (collapse =
"", ...) but its body contains code such as

     if (is.null(collapse))
         return(txt)

I wonder where txt is created. It's not passed as an extra variable.

Browse[7]> print(list(...))
list()

Searching the R code reveals that txt is created as a global variable in
another function named dynCurlReader by the code statement txt <<-
character().

RCurl also uses functions that don't begin with a dot but are undocumented.

ans = encode(ans)
Browse[7]> ?encode
No documentation for ?encode? in specified packages and libraries

Anyway, the transcript ID is also missing from txt.

Browse[7]> grep("ENST00000485971", txt)
integer(0)

It's hard to know what the obfuscated code of RCurl is doing.

--------------------------------------
Dario Strbenac
University of Sydney
Camperdown NSW 2050
Australia

_______________________________________________
Bioc-devel at r-project.org mailing list
https://urldefense.proofpoint.com/v2/url?u=https-3A__stat.ethz.ch_mailman_listinfo_bioc-2Ddevel&d=DwIGaQ&c=eRAMFD45gAfqt84VtBcfhQ&r=BK7q3XeAvimeWdGbWY_wJYbW0WYiZvSXAJJKaaPhzWA&m=njpXf5TkV-BLhepKQRWVTdAzBmWWCsDj7RIPgs7FWBM&s=1qsHGdbNcXIIZCJDMeyF5e0R7MZWxNlxOxbXiH1zLGA&e=

_______________________________________________
Bioc-devel at r-project.org mailing list
https://urldefense.proofpoint.com/v2/url?u=https-3A__stat.ethz.ch_mailman_listinfo_bioc-2Ddevel&d=DwIGaQ&c=eRAMFD45gAfqt84VtBcfhQ&r=BK7q3XeAvimeWdGbWY_wJYbW0WYiZvSXAJJKaaPhzWA&m=njpXf5TkV-BLhepKQRWVTdAzBmWWCsDj7RIPgs7FWBM&s=1qsHGdbNcXIIZCJDMeyF5e0R7MZWxNlxOxbXiH1zLGA&e=

Herv? Pag?s

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fredhutch.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319

Leonard Goldstein

Mon, Oct 30, 2017 11:58 AM #

Thanks Herv? and others for looking into this.

Leonard

On Sun, Oct 29, 2017 at 10:19 PM, Herv? Pag?s <hpages at fredhutch.org> wrote:

Thanks Dario and Mike for looking into this.

In the mean time I added makeTxDbFromEnsembl() for creating a TxDb
object by querying directly an Ensembl MySQL server. Only lightly
tested but so far seems to faster and more reliable than
makeTxDbFromBiomart():

  > library(GenomicFeatures)

  > system.time(txdb <- makeTxDbFromEnsembl("Homo sapiens", server="

useastdb.ensembl.org"))
  Fetch transcripts and genes from Ensembl ... OK
  Fetch exons and CDS from Ensembl ... OK
  Fetch chromosome names and lengths from Ensembl ...OK
  Gather the metadata ... OK
  Make the TxDb object ... OK
     user  system elapsed
   46.420   1.271  58.270

  > txdb

  TxDb object:
  # Db type: TxDb
  # Supporting package: GenomicFeatures
  # Data source: Ensembl
  # Organism: Homo sapiens
  # Ensembl release: 90
  # Ensembl database: homo_sapiens_core_90_38
  # MySQL server: useastdb.ensembl.org
  # transcript_nrow: 220144
  # exon_nrow: 757782
  # cds_nrow: 789614
  # Db created by: GenomicFeatures package from Bioconductor
  # Creation time: 2017-10-29 22:13:21 -0700 (Sun, 29 Oct 2017)
  # GenomicFeatures version at creation time: 1.29.16
  # RSQLite version at creation time: 2.0
  # DBSCHEMAVERSION: 1.2

It's in GenomicFeatures 1.29.16.

Cheers,
H.


On 10/28/2017 03:32 PM, Mike Smith wrote:

My feeling is that this isn't related to RCurl, since the example
transcript is missing if you use httr to submit the query instead.  You
can
check this with the code at
https://urldefense.proofpoint.com/v2/url?u=https-3A__gist.gi
thub.com_grimbough_7e7a47b7a4f64915220ce35cc1ce8f39&d=
DwIGaQ&c=eRAMFD45gAfqt84VtBcfhQ&r=BK7q3XeAvimeWdGbWY_wJYbW0W
YiZvSXAJJKaaPhzWA&m=njpXf5TkV-BLhepKQRWVTdAzBmWWCsDj7RIPgs7F
WBM&s=orkmK9y7kVCLbLB6tlMJfgT5V_GKzVysZ00DMevuAm4&e=


I wonder if this is related to BioMart's instability when you submit large
queries.  The web interface suggests no more than 500 entries when
filtering on something like a gene ID, but in these examples we're
applying
no filter at all.  Problems related to this have been reported before (
https://urldefense.proofpoint.com/v2/url?u=https-3A__support
.bioconductor.org_p_86358_&d=DwIGaQ&c=eRAMFD45gAfqt84VtBcfh
Q&r=BK7q3XeAvimeWdGbWY_wJYbW0WYiZvSXAJJKaaPhzWA&m=njpXf5TkV-
BLhepKQRWVTdAzBmWWCsDj7RIPgs7FWBM&s=CxVV4WArQc4xcf74Az8V8vfQ
Mr80Q8K2um6sKTggHw0&e=) and I patched the devel version

of biomaRt to submit queries in batches of 500 - although this currently
has no effect if there isn't a set of 'values' to split into batches.

Interestingly, if I first obtain a list of all gene IDs in Ensembl, and
then submit those in a batched query I get more exons than your first
approach.

all_genes <- getBM(attributes = "ensembl_gene_id", mart=mart)
attributes1 <- c("ensembl_transcript_id", "ensembl_exon_id")
attributes2 <- c(attributes1, "rank", "genomic_coding_start", "cds_start",
"5_utr_start")
df3 <- getBM(attributes1, filters = "ensembl_gene_id", values =
all_genes[,1], mart=mart)
df4 <- getBM(attributes2, filters = "ensembl_gene_id", values =
all_genes[,1], mart=mart)

dim(df3)

[1] 1309356       6

The number of returned results is consistent for even with the larger
number of attributes

identical(df3[,1], df4[,1])

[1] TRUE

And some of these transcripts are clearly not present in the first query:

length(which( !unique(df3[,"ensembl_transcript_id"]) %in% df1[,

"ensembl_transcript_id"] ))
[1] 28128

It could be that my batched code is doing something wrong, but I'm
starting
to suspect that even the first query is dropping data silently!

Mike


On 28 October 2017 at 13:00, Dario Strbenac <dstr7320 at uni.sydney.edu.au>
wrote:

Good day,

I stepped through the code until execution reached the end of postForm in
RCurl which is called by getBM and obtains the textual result from the
server. If I check the contents of write$value(), the example missing
transcript is not there.

Browse[3]> grep("ENST00000485971", write$value())
integer(0)

write$value is a weird function. It's prototype is function (collapse =
"", ...) but its body contains code such as

     if (is.null(collapse))
         return(txt)

I wonder where txt is created. It's not passed as an extra variable.

Browse[7]> print(list(...))
list()

Searching the R code reveals that txt is created as a global variable in
another function named dynCurlReader by the code statement txt <<-
character().

RCurl also uses functions that don't begin with a dot but are
undocumented.

ans = encode(ans)
Browse[7]> ?encode
No documentation for ?encode? in specified packages and libraries

Anyway, the transcript ID is also missing from txt.

Browse[7]> grep("ENST00000485971", txt)
integer(0)

It's hard to know what the obfuscated code of RCurl is doing.

--------------------------------------
Dario Strbenac
University of Sydney
Camperdown NSW 2050
Australia

_______________________________________________
Bioc-devel at r-project.org mailing list
https://urldefense.proofpoint.com/v2/url?u=https-3A__stat.et
hz.ch_mailman_listinfo_bioc-2Ddevel&d=DwIGaQ&c=eRAMFD45gAfqt
84VtBcfhQ&r=BK7q3XeAvimeWdGbWY_wJYbW0WYiZvSXAJJKaaPhzWA&m=nj
pXf5TkV-BLhepKQRWVTdAzBmWWCsDj7RIPgs7FWBM&s=1qsHGdbNcXIIZCJD
MeyF5e0R7MZWxNlxOxbXiH1zLGA&e=

_______________________________________________
Bioc-devel at r-project.org mailing list
https://urldefense.proofpoint.com/v2/url?u=https-3A__stat.et
hz.ch_mailman_listinfo_bioc-2Ddevel&d=DwIGaQ&c=eRAMFD45gAfqt
84VtBcfhQ&r=BK7q3XeAvimeWdGbWY_wJYbW0WYiZvSXAJJKaaPhzWA&m=nj
pXf5TkV-BLhepKQRWVTdAzBmWWCsDj7RIPgs7FWBM&s=1qsHGdbNcXIIZCJD
MeyF5e0R7MZWxNlxOxbXiH1zLGA&e=

_______________________________________________
Bioc-devel at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel