[Bioc-devel] faster gene id conversion?
On Sat, Nov 22, 2014 at 12:53 AM, Karl Stamm <karl.stamm at gmail.com> wrote:
Question regarding gene name conversions.

Once upon a time, I was doing a lot of gene name conversions, particularly from NM_#### to HGNC symbol or Entrez GeneID. I used biomaRt successfully, and developed a cache matrix so I could quickly merge() it instead of calling out to a web service repeatedly. Later the complexity of keeping the cache updated became overwhelming, and carrying around a few megabytes of possibly outdated identifiers is a bad idea.

Per Bioconductor guidelines, I switched to the built-in annotation packages. Now I'm using org.Hs.eg.db's lookup lists org.Hs.egREFSEQ2EG and org.Hs.egSYMBOL. These sometimes map to multiple values and sometimes map to nothing, causing errors in my code. To clean it up, I wrapped their accessors with some error checking. Things work again, assigning one human-readable name per transcript ID.

The problem is that this method is very slow. I thought it could be the error-checking code, but even streamlining that doesn't help. A profiler showed that most of my time was spent in .Call; it turns out that each access to the "list", like org.Hs.egSYMBOL[[eg]][1], was issuing a SQLite query. Since I am nesting these calls in a loop (NM to EG to HGNC, a few thousand times), these copious calls out to SQLite are killing me.
Hi, Karl.

It is a little hard to diagnose problems without code, but here is a little code to give a sense of how I might accomplish the task you are describing. I include timing information. If this isn't a representative workflow, perhaps you can show us some code and timing information.

Sean
library(org.Hs.eg.db)

# Get all human RefSeq accessions
refseqs = keys(org.Hs.eg.db, keytype="REFSEQ")

# Time the lookup for symbol and Entrez ID
system.time((symbols = select(org.Hs.eg.db, keytype="REFSEQ",
                              keys=refseqs,
                              columns=c('REFSEQ','SYMBOL','ENTREZID'))))
user system elapsed
2.170 0.071 2.259
head(symbols)
        REFSEQ SYMBOL ENTREZID
1    NM_130786   A1BG        1
2    NP_570602   A1BG        1
3    NM_000014    A2M        2
4    NP_000005    A2M        2
5 XM_006719056    A2M        2
6 XP_006719119    A2M        2
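
Because select() returns one row per match, an accession that maps to several genes appears on several rows, and a queried accession with no match comes back with NA columns. The following is a minimal base-R sketch of collapsing such a table to one row per accession and then looking up a batch of IDs in a single vectorized step. The data frame is a hand-typed stand-in: the first two rows come from the output above, while NM_TOY_MULTI, NM_TOY_NONE, TOY_A, TOY_B and their IDs are made up for illustration. "First match wins" is an assumed policy, not something the annotation package prescribes.

```r
# Stand-in for a data frame returned by select(). The NM_TOY_* rows are
# hypothetical: one multi-mapping accession and one unmapped accession.
symbols <- data.frame(
  REFSEQ   = c("NM_130786", "NM_000014", "NM_TOY_MULTI", "NM_TOY_MULTI", "NM_TOY_NONE"),
  SYMBOL   = c("A1BG", "A2M", "TOY_A", "TOY_B", NA),
  ENTREZID = c("1", "2", "9001", "9002", NA),
  stringsAsFactors = FALSE
)

# Keep only the first row seen for each accession (assumed policy).
one_per_key <- symbols[!duplicated(symbols$REFSEQ), ]

# Look up many IDs at once. match() yields NA for IDs absent from the table,
# so missing keys become NA rows instead of raising errors.
query  <- c("NM_000014", "NM_TOY_MULTI", "NM_TOY_NONE", "NM_NOT_IN_TABLE")
hits   <- one_per_key[match(query, one_per_key$REFSEQ), c("SYMBOL", "ENTREZID")]
result <- data.frame(REFSEQ = query, hits, row.names = NULL,
                     stringsAsFactors = FALSE)
print(result)
```

The result has exactly one row per queried ID, in query order, which is the shape a per-transcript loop would otherwise have to build one element at a time.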
I need a way to batch query these lookup tables, or to preload them into
memory. I tried using a hash, but checking whether a value is already loaded
into the hash cache is equally time-consuming, and preloading the whole of
org.Hs.eg.db takes a few hours. I could do it once and cache the .RData
object, but then we're back to the local, outdated cache problem.
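
One way to avoid both the per-key cache check and a stale .RData file is to rebuild the lookup in memory at the start of each session from a single batch query (about two seconds in the timing earlier in this thread) and then index it by name. A rough base-R sketch, assuming a two-column table like the one a batch query would return; the rows are typed in from the output shown earlier, and `tbl`, `lookup`, and `ids` are illustrative names:

```r
# Stand-in for the result of one batch query made at session start.
tbl <- data.frame(
  REFSEQ = c("NM_130786", "NP_570602", "NM_000014", "NP_000005"),
  SYMBOL = c("A1BG", "A1BG", "A2M", "A2M"),
  stringsAsFactors = FALSE
)

# Named character vector: names are the accessions, values are the symbols.
lookup <- setNames(tbl$SYMBOL, tbl$REFSEQ)

# Vectorized lookup of many IDs at once; unknown IDs come back as NA,
# so there is no need to test for membership before each access.
ids   <- c("NM_000014", "NM_130786", "NM_NOT_IN_TABLE")
found <- unname(lookup[ids])
print(found)
```

Because the table is rebuilt from the installed annotation package each session, it stays in step with whatever version of the package is installed.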
So I think the only solution would be to access the SQLite database
underlying org.Hs.eg.db myself, so I can use a batch query. Except that the
database is hidden under the R API of these AnnDbBimap objects like
org.Hs.egSYMBOL.
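
For what it is worth, the SQLite connection is reachable: the org.Hs.eg.db package exports org.Hs.eg_dbconn(), which returns a DBI connection. Below is a hedged sketch of a single batch query against it. The table and column names used here (refseq.accession, genes.gene_id, gene_info.symbol, joined on _id) are assumptions about the package schema and should be checked with DBI::dbListTables() and DBI::dbListFields() against the installed version. The function name refseq_batch_sql is made up for illustration.

```r
# Sketch only: needs the Bioconductor package org.Hs.eg.db and the DBI package.
# The table and column names in the SQL are assumptions; verify them first.
refseq_batch_sql <- function() {
  # Connection is owned by the annotation package; do not disconnect it.
  con <- org.Hs.eg.db::org.Hs.eg_dbconn()
  DBI::dbGetQuery(con, "
    SELECT r.accession AS REFSEQ,
           g.gene_id   AS ENTREZID,
           i.symbol    AS SYMBOL
    FROM refseq r
    JOIN genes g     ON g._id = r._id
    JOIN gene_info i ON i._id = r._id
  ")
}

# Run the query only when both packages are installed.
if (requireNamespace("org.Hs.eg.db", quietly = TRUE) &&
    requireNamespace("DBI", quietly = TRUE)) {
  all_maps <- refseq_batch_sql()
  print(head(all_maps))
}
```

The supported select() interface returns the same kind of table without depending on the schema, so direct SQL is probably only worth it for queries that select() cannot express.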
I assume this problem has been handled before, and ask for your guidance.
Thanks
_______________________________________________
Bioc-devel at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel