Thanks Dario and Mike for looking into this.
In the mean time I added makeTxDbFromEnsembl() for creating a TxDb
object by querying directly an Ensembl MySQL server. Only lightly
tested but so far seems to faster and more reliable than
makeTxDbFromBiomart():
> library(GenomicFeatures)
> system.time(txdb <- makeTxDbFromEnsembl("Homo sapiens", server="
useastdb.ensembl.org"))
Fetch transcripts and genes from Ensembl ... OK
Fetch exons and CDS from Ensembl ... OK
Fetch chromosome names and lengths from Ensembl ...OK
Gather the metadata ... OK
Make the TxDb object ... OK
user system elapsed
46.420 1.271 58.270
TxDb object:
# Db type: TxDb
# Supporting package: GenomicFeatures
# Data source: Ensembl
# Organism: Homo sapiens
# Ensembl release: 90
# Ensembl database: homo_sapiens_core_90_38
# MySQL server: useastdb.ensembl.org
# transcript_nrow: 220144
# exon_nrow: 757782
# cds_nrow: 789614
# Db created by: GenomicFeatures package from Bioconductor
# Creation time: 2017-10-29 22:13:21 -0700 (Sun, 29 Oct 2017)
# GenomicFeatures version at creation time: 1.29.16
# RSQLite version at creation time: 2.0
# DBSCHEMAVERSION: 1.2
It's in GenomicFeatures 1.29.16.
Cheers,
H.
On 10/28/2017 03:32 PM, Mike Smith wrote:
My feeling is that this isn't related to RCurl, since the example
transcript is missing if you use httr to submit the query instead. You
can
check this with the code at
https://urldefense.proofpoint.com/v2/url?u=https-3A__gist.gi
thub.com_grimbough_7e7a47b7a4f64915220ce35cc1ce8f39&d=
DwIGaQ&c=eRAMFD45gAfqt84VtBcfhQ&r=BK7q3XeAvimeWdGbWY_wJYbW0W
YiZvSXAJJKaaPhzWA&m=njpXf5TkV-BLhepKQRWVTdAzBmWWCsDj7RIPgs7F
WBM&s=orkmK9y7kVCLbLB6tlMJfgT5V_GKzVysZ00DMevuAm4&e=
I wonder if this is related to BioMart's instability when you submit large
queries. The web interface suggests no more than 500 entries when
filtering on something like a gene ID, but in these examples we're
applying
no filter at all. Problems related to this have been reported before (
https://urldefense.proofpoint.com/v2/url?u=https-3A__support
.bioconductor.org_p_86358_&d=DwIGaQ&c=eRAMFD45gAfqt84VtBcfh
Q&r=BK7q3XeAvimeWdGbWY_wJYbW0WYiZvSXAJJKaaPhzWA&m=njpXf5TkV-
BLhepKQRWVTdAzBmWWCsDj7RIPgs7FWBM&s=CxVV4WArQc4xcf74Az8V8vfQ
Mr80Q8K2um6sKTggHw0&e=) and I patched the devel version
of biomaRt to submit queries in batches of 500 - although this currently
has no effect if there isn't a set of 'values' to split into batches.
Interestingly, if I first obtain a list of all gene IDs in Ensembl, and
then submit those in a batched query I get more exons than your first
approach.
all_genes <- getBM(attributes = "ensembl_gene_id", mart=mart)
attributes1 <- c("ensembl_transcript_id", "ensembl_exon_id")
attributes2 <- c(attributes1, "rank", "genomic_coding_start", "cds_start",
"5_utr_start")
df3 <- getBM(attributes1, filters = "ensembl_gene_id", values =
all_genes[,1], mart=mart)
df4 <- getBM(attributes2, filters = "ensembl_gene_id", values =
all_genes[,1], mart=mart)
dim(df3)
[1] 1309356 6
The number of returned results is consistent for even with the larger
number of attributes
identical(df3[,1], df4[,1])
[1] TRUE
And some of these transcripts are clearly not present in the first query:
length(which( !unique(df3[,"ensembl_transcript_id"]) %in% df1[,
"ensembl_transcript_id"] ))
[1] 28128
It could be that my batched code is doing something wrong, but I'm
starting
to suspect that even the first query is dropping data silently!
Mike
On 28 October 2017 at 13:00, Dario Strbenac <dstr7320 at uni.sydney.edu.au>
wrote:
Good day,
I stepped through the code until execution reached the end of postForm in
RCurl which is called by getBM and obtains the textual result from the
server. If I check the contents of write$value(), the example missing
transcript is not there.
Browse[3]> grep("ENST00000485971", write$value())
integer(0)
write$value is a weird function. It's prototype is function (collapse =
"", ...) but its body contains code such as
if (is.null(collapse))
return(txt)
I wonder where txt is created. It's not passed as an extra variable.
Browse[7]> print(list(...))
list()
Searching the R code reveals that txt is created as a global variable in
another function named dynCurlReader by the code statement txt <<-
character().
RCurl also uses functions that don't begin with a dot but are
undocumented.
ans = encode(ans)
Browse[7]> ?encode
No documentation for ?encode? in specified packages and libraries
Anyway, the transcript ID is also missing from txt.
Browse[7]> grep("ENST00000485971", txt)
integer(0)
It's hard to know what the obfuscated code of RCurl is doing.
--------------------------------------
Dario Strbenac
University of Sydney
Camperdown NSW 2050
Australia