[Bioc-devel] RFC: IntervalTrees for GRanges objects
On 04/03/2013 10:28 AM, Kasper Daniel Hansen wrote:
Making IntervalTree chromosome would also be a great addition for organisms with many sequences, like bee (due to an incomplete genome; 10,000s of sequences). It does not matter for humans, but findOverlaps is excruciatingly slow for bee's. I have a couple of posts on this in the archive. I am predicting this will be the case in the future for most non-model organisms; finishing a genome is expensive and time consuming.
A cheap (both computationally and philosophically) hack is along the lines of
gr <- GRanges(c("A", "B", "A"), IRanges(c(2000, 1000, 3000), width=100))
rng <- range(gr)
off0 <- (width(rng)[-length(rng)] + 1L)
offset <- setNames(c(1L, cumsum(off0)) - start(rng), seqnames(rng))
shift(ranges(gr), offset[as.character(seqnames(gr))])
which shifts the ranges to be non-overlapping by seqname, so findOverlaps can
operate on the IRanges directly. This is a trick I used in selectMethod(disjoin,
"CompressedIRangesList"). There's the risk of integer overflow (when
sum(width(rng) + 1L) > .Machine$integer.max) but this could be handled (if it
occurs!) by partitioning into .Machine$integer.max-sized ranges. Unless IRanges
were updated to support longer integers...
Martin
Kasper On Wed, Apr 3, 2013 at 12:45 PM, Michael Lawrence <lawrence.michael at gene.com
wrote:
Some ideas: - Turn the IntervalTree into a list/array of nodes that can be subset/reordered with shallow copying (just copy the pointers to the nodes), and the index would be secondary. The index in the array could be stored in each node, for lookup during overlap queries. Right now, as far as I can tell, GIntervalTree will get confused if the user reorders e.g. via [. - Make IntervalTree aware of the sequence/chromosome, e.g., have a hash of trees, which is trivial since seqnames is already a factor. Michael On Wed, Apr 3, 2013 at 9:29 AM, Hector Corrada Bravo <hcorrada at gmail.com
wrote:
Yep, I didn't comment on that, but I agree that abstracting how GRanges stores ranges would make this more elegant. Right now ranges(GRanges) is specified to be of IRanges class instead of the abstract Ranges class. If it were the latter then GIntervalTree can be a subclass of GenomicRanges, in a similar way that IntervalTree is a subclass of Ranges. On Wed, Apr 3, 2013 at 12:23 PM, Michael Lawrence <lawrence.michael at gene.com> wrote:
Hi Hector, That's interesting, thanks for passing this along. I'm still wishing
that
somehow GRanges itself could abstract the way it stores ranges. I know
that
Herve/Patrick had some reasons for depending specifically on GRanges.
One
reason was probably convenience at the C level, but it wouldn't be hard
to
create a Ranges abstraction at the C level, as well. Michael On Tue, Apr 2, 2013 at 5:40 PM, Hector Corrada Bravo <
hcorrada at gmail.com
wrote:
Hello bioc-develers, I'm writing an application where lots findOverlap calls are made on static GRanges objects. For IRanges we can create persistent IntervalTree objects that would serve the multiple overlap query use-case. There is no equivalent for GenomicRanges objects, so I'm proposing an implementation for this. Please check http://github.com/hcorrada/GenomicIntervalTree There's a first cut implementation there you can test by installing this skeleton package. E.g,
library(devtools)
install_github("GenomicIntervalTree", username="hcorrada",
subdir="pkg")
library(GenomicIntervalTree)
Let me know what you think. Cheers, Hector
_______________________________________________ Bioc-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
[[alternative HTML version deleted]]
_______________________________________________ Bioc-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
[[alternative HTML version deleted]]
_______________________________________________ Bioc-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
Computational Biology / Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: Arnold Building M1 B861 Phone: (206) 667-2793