[Bioc-devel] faster gene id conversion?

Question regarding gene name conversions. Once upon a time, I was doing a
lot of gene name conversions, particularly from NM_#### to HGNC symbol or
Entrez GeneID. I used biomaRt successfully, and developed a cache matrix so
I could quickly merge() it instead of calling out to a webservice
repeatedly. Later the complexity of keeping the cache updated became
overwhelming, and carrying around a few megabytes of possibly outdated
identifiers is a bad idea. Per Bioconductor guidelines, I switched to the
built-in annotation packages. Now I'm using org.Hs.eg.db's lookup lists
org.Hs.egREFSEQ2EG and org.Hs.egSYMBOL.

These sometimes map to multiple values and sometimes map to nothing,
causing errors in my code. To clean it up, I wrapped their accessors with
some error checking. Things work again, assigning one human-readable name
per transcript ID. The problem is that this method is very slow. I thought
it could be the error-checking code, but even streamlining that doesn't
help. A profiler showed that most of my time was spent in .Call; it turns
out that each access to the "list", like org.Hs.egSYMBOL[[eg]][1], runs a
SQLite query. Since I am nesting these calls in a loop (NM to EG to HGNC, a
few thousand times), these copious calls out to SQLite are killing me.

Hi, Karl.

It is a little hard to diagnose problems without code, but here is a little
code to get a sense of how I might accomplish the task you are describing.
I include timing information.  If this isn't a representative workflow,
perhaps you can show us some code and timing information.

Sean
library(org.Hs.eg.db)  # also attaches AnnotationDbi (keys, select)

# Get all human RefSeq accessions
refseqs <- keys(org.Hs.eg.db, keytype = "REFSEQ")

# Time the lookup for symbol and Entrez ID (one batched call for all keys)
system.time(
  symbols <- select(org.Hs.eg.db, keytype = "REFSEQ",
                    keys = refseqs,
                    columns = c("REFSEQ", "SYMBOL", "ENTREZID"))
)
##    user  system elapsed
##   2.170   0.071   2.259

head(symbols)
##         REFSEQ SYMBOL ENTREZID
## 1    NM_130786   A1BG        1
## 2    NP_570602   A1BG        1
## 3    NM_000014    A2M        2
## 4    NP_000005    A2M        2
## 5 XM_006719056    A2M        2
## 6 XP_006719119    A2M        2
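
The select() result above can still contain several rows per accession, or none, which is the case the original wrapper code was guarding against. A minimal sketch of one way to reduce such a table to exactly one row per requested key, with NA for keys that have no match. The helper name first_match and the toy table are hypothetical, not part of AnnotationDbi, and the duplicated NM_000014 row is artificial, added only to show a one-to-many hit.

```r
# Hypothetical helper (not part of AnnotationDbi): keep the first row
# found for each key and return an NA-filled row for keys with no match.
first_match <- function(tbl, keys, keycol = "REFSEQ") {
  # match() gives the index of the first occurrence of each key, or NA
  out <- tbl[match(keys, tbl[[keycol]]), , drop = FALSE]
  out[[keycol]] <- keys   # restore the key on unmatched (all-NA) rows
  rownames(out) <- NULL
  out
}

# Toy stand-in for a select() result; the last row is made up
symbols <- data.frame(
  REFSEQ   = c("NM_130786", "NP_570602", "NM_000014", "NM_000014"),
  SYMBOL   = c("A1BG", "A1BG", "A2M", "A2M_dup"),
  ENTREZID = c("1", "1", "2", "2"),
  stringsAsFactors = FALSE
)

res <- first_match(symbols, c("NM_000014", "NM_999999", "NM_130786"))
```

Because match() is vectorized, this runs once over the whole table instead of once per transcript. Newer AnnotationDbi releases also provide mapIds(x, keys, column, keytype, multiVals = "first"), which should perform a similar collapse in a single call; whether it is available depends on the installed version.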
I need a way to batch query, or to preload these lookup tables into memory.
I tried using a hash, but checking whether a value is already loaded into
the hash cache is equally time-consuming, and preloading the whole of
org.Hs.eg.db takes a few hours. I could do it once and cache the .RData
object, but then we're back to the outdated local cache problem.
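
On preloading: AnnotationDbi's toTable() dumps a whole Bimap into a data.frame in one call, so the tables can be pulled into memory at the start of each session rather than cached on disk. A rough sketch under some assumptions: the helper build_lookup is hypothetical, and the column names gene_id, accession and symbol are what I expect toTable() to return for these two maps, so check them with colnames() before relying on them.

```r
# Hypothetical helper: join two long-format mapping tables on Entrez gene ID.
build_lookup <- function(refseq_tbl, symbol_tbl) {
  merge(refseq_tbl, symbol_tbl, by = "gene_id", all.x = TRUE)
}

# Toy tables in the shape toTable() is assumed to return
refseq_tbl <- data.frame(gene_id   = c("1", "2", "2"),
                         accession = c("NM_130786", "NM_000014", "NP_000005"),
                         stringsAsFactors = FALSE)
symbol_tbl <- data.frame(gene_id = c("1", "2"),
                         symbol  = c("A1BG", "A2M"),
                         stringsAsFactors = FALSE)
lookup <- build_lookup(refseq_tbl, symbol_tbl)

# With the real package installed: one toTable() call per map, no loop
if (requireNamespace("org.Hs.eg.db", quietly = TRUE)) {
  library(org.Hs.eg.db)
  full_lookup <- build_lookup(toTable(org.Hs.egREFSEQ2EG),
                              toTable(org.Hs.egSYMBOL))
}
```

Since the table is rebuilt from the installed annotation package each session, it stays as current as the package itself, which avoids the stale .RData problem.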

So I think the only solution would be to access the SQLite database
underlying org.Hs.eg.db myself, so I can use a batch query. Except that the
db is hidden behind the R API of these AnnDbBimap objects, like
org.Hs.egSYMBOL.
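
For completeness, the underlying database is reachable: the package exports org.Hs.eg_dbconn(), which returns a DBI connection to the SQLite file. A sketch of a single batched query follows. The helper build_refseq_query is hypothetical, and the table and column names (refseq, genes, gene_info, accession, gene_id, symbol, _id) are my assumptions about the schema; inspect them with DBI::dbListTables() and DBI::dbListFields() before using this.

```r
# Hypothetical helper: build one batched query for a vector of accessions.
# Table and column names are assumptions about the org.Hs.eg.db schema.
build_refseq_query <- function(accessions) {
  # Double any single quotes so each value is a valid SQL string literal
  quoted <- paste0("'", gsub("'", "''", accessions), "'", collapse = ", ")
  paste(
    "SELECT r.accession, g.gene_id, i.symbol",
    "FROM refseq AS r",
    "JOIN genes AS g ON g._id = r._id",
    "LEFT JOIN gene_info AS i ON i._id = r._id",
    sprintf("WHERE r.accession IN (%s)", quoted)
  )
}

sql <- build_refseq_query(c("NM_130786", "NM_000014"))

if (requireNamespace("org.Hs.eg.db", quietly = TRUE) &&
    requireNamespace("DBI", quietly = TRUE)) {
  # Connection is shared with the package: do not disconnect it
  con <- org.Hs.eg.db::org.Hs.eg_dbconn()
  print(DBI::dbListTables(con))   # check the assumed table names first
  hits <- DBI::dbGetQuery(con, sql)
}
```

In practice select() already issues this kind of batched query internally, so dropping to raw SQL is probably only worthwhile for joins the high-level interface does not cover.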

I assume this problem has been handled before, and ask for your guidance.

Thanks


_______________________________________________
Bioc-devel at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel
