[Bioc-devel] range-directed metadata management
4 messages · Vincent Carey, Steve Lianoglou, Michael Lawrence
Hi, On Thu, Jul 10, 2014 at 1:52 PM, Vincent Carey
<stvjc at channing.harvard.edu> wrote:
A new, more inclusive GWAS catalog is available (GRASP, from Andrew Johnson at NHLBI), with 6 million records and voluminous metadata (though the metadata seems sparse and can perhaps be trimmed or reshaped). I made a GRanges from it, and it takes 3 minutes to load. Even after stripping all the metadata, a GRanges with 6 million records takes 20 seconds to load. That's probably acceptable, but a managed, chromosome-specific distribution might be closer to interactive availability. The metadata would probably be best kept in SQLite. It occurred to me to consider an arrangement in which the GRanges manages the ranges plus a key into the database. Range operations can then generate queries that retrieve metadata, and metadata queries in the database can generate indices that retrieve the matching ranges. Is anyone doing something along these lines?
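The arrangement described above (ranges held in memory with a key into a database) could be sketched roughly as follows. This is Python rather than R/Bioconductor, with sorted per-chromosome lists standing in for the GRanges. The table name `meta`, the column `snp_key`, the helper names, and the toy records are all assumptions made for illustration; none of them come from GRASP or from any existing package.

```python
import bisect
import sqlite3

# Toy records (made up): (snp_key, chrom, position, phenotype, pvalue)
records = [
    (1, "chr1", 1000, "height", 1e-9),
    (2, "chr1", 5000, "BMI", 3e-8),
    (3, "chr2", 1200, "asthma", 2e-10),
    (4, "chr1", 20_000_000, "LDL", 5e-12),
]

# Metadata lives in SQLite, addressed only by key.
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE meta (snp_key INTEGER PRIMARY KEY, phenotype TEXT, pvalue REAL)"
)
con.executemany(
    "INSERT INTO meta VALUES (?, ?, ?)",
    [(k, ph, p) for k, _, _, ph, p in records],
)

# Positions live in memory, one sorted (position, key) list per chromosome.
ranges = {}
location = {}  # key -> (chrom, position), for the reverse lookup
for k, chrom, pos, _, _ in records:
    ranges.setdefault(chrom, []).append((pos, k))
    location[k] = (chrom, pos)
for hits in ranges.values():
    hits.sort()


def keys_in_window(chrom, start, end):
    """Range operation: keys of records with start <= position <= end."""
    hits = ranges.get(chrom, [])
    lo = bisect.bisect_left(hits, (start, -1))
    hi = bisect.bisect_right(hits, (end, float("inf")))
    return [k for _, k in hits[lo:hi]]


def metadata_for(keys):
    """Range result -> metadata: fetch rows for the given keys."""
    if not keys:
        return []
    marks = ",".join("?" * len(keys))
    sql = (
        "SELECT snp_key, phenotype, pvalue FROM meta "
        f"WHERE snp_key IN ({marks}) ORDER BY snp_key"
    )
    return con.execute(sql, list(keys)).fetchall()


def ranges_for_phenotype(phenotype):
    """Metadata query -> ranges: keys from the database index the ranges."""
    rows = con.execute(
        "SELECT snp_key FROM meta WHERE phenotype = ? ORDER BY snp_key",
        (phenotype,),
    ).fetchall()
    return [location[k] for (k,) in rows]


print(metadata_for(keys_in_window("chr1", 900, 6000)))
print(ranges_for_phenotype("asthma"))
```

Only the keys cross the boundary in either direction, so the in-memory part stays small. Splitting it by chromosome is one possible way to approach the chromosome-specific distribution mentioned above.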
You might consider just stuffing it all in the database. SQLite supports R-Trees, a spatial index, so in theory you could get fast overlap queries baked in without needing a parallel GRanges object to index into the database: http://www.sqlite.org/rtree.html

Before the reboot of the GenomicFeatures package (we're talking around 2008/2009?) I was doing something like that for genomic annotations.

The way Hadley has abstracted database access in dplyr, so that a database looks like a data.frame and responds to all the "data manipulation verbs" in the same way, makes me believe we can do the same: make the database look essentially like a GRanges / VRanges object and get cooking that way.

Hopefully this answer was at least minimally aligned with what you were asking ;-)

-steve
Steve Lianoglou Computational Biologist Genentech
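As a rough illustration of the R-Tree suggestion, here is a minimal Python sketch using the standard `sqlite3` module. It assumes the underlying SQLite library was compiled with the R*Tree module, which is common but not guaranteed. It uses the `rtree_i32` variant so that genomic coordinates are stored as 32-bit integers rather than 32-bit floats. The table names `snp` and `snp_range`, the column names, the `overlaps` helper, and the toy rows are hypothetical.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript(
    """
    -- Ordinary table holding the metadata (toy schema, assumed).
    CREATE TABLE snp (
        id INTEGER PRIMARY KEY, chrom TEXT, phenotype TEXT, pvalue REAL
    );
    -- One-dimensional R-Tree over [start_pos, end_pos], integer coordinates.
    CREATE VIRTUAL TABLE snp_range USING rtree_i32(id, start_pos, end_pos);
    """
)

# Made-up records: (id, chrom, start, end, phenotype, pvalue)
rows = [
    (1, "chr1", 1000, 1000, "height", 1e-9),
    (2, "chr1", 5000, 5000, "BMI", 3e-8),
    (3, "chr2", 1200, 1200, "asthma", 2e-10),
    (4, "chr1", 20_000_000, 20_000_000, "LDL", 5e-12),
]
con.executemany(
    "INSERT INTO snp VALUES (?, ?, ?, ?)",
    [(i, c, ph, p) for i, c, _, _, ph, p in rows],
)
con.executemany(
    "INSERT INTO snp_range VALUES (?, ?, ?)",
    [(i, s, e) for i, _, s, e, _, _ in rows],
)


def overlaps(con, chrom, start, end):
    """Return records on `chrom` whose range overlaps [start, end]."""
    sql = """
        SELECT s.id, s.chrom, r.start_pos, r.end_pos, s.phenotype
        FROM snp_range AS r
        JOIN snp AS s ON s.id = r.id
        WHERE s.chrom = ?
          AND r.start_pos <= ?   -- record starts before the query ends
          AND r.end_pos >= ?     -- record ends after the query starts
        ORDER BY r.start_pos
    """
    return con.execute(sql, (chrom, end, start)).fetchall()


for row in overlaps(con, "chr1", 900, 6000):
    print(row)
```

The chromosome is filtered through the joined table here to keep the index one-dimensional. Encoding the chromosome as an integer second dimension of the R-Tree would be another option.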