Skip to content

[Bioc-devel] RFC: IntervalTrees for GRanges objects

9 messages · Michael Lawrence, Hector Corrada Bravo, Kasper Daniel Hansen +2 more

#
Hello bioc-develers,

I'm writing an application where lots findOverlap calls are made on
static GRanges objects. For IRanges we can create persistent
IntervalTree objects that would serve the multiple overlap query
use-case. There is no equivalent for GenomicRanges objects, so I'm
proposing an implementation for this.

Please check
http://github.com/hcorrada/GenomicIntervalTree

There's a first cut implementation there you can test by installing
this skeleton package. E.g,
Let me know what you think.

Cheers,
Hector
#
Yep, I didn't comment on that, but I agree that abstracting how
GRanges stores ranges would make this more elegant. Right now
ranges(GRanges) is specified to be of IRanges class instead of the
abstract Ranges class.

If it were the latter then GIntervalTree can be a subclass of
GenomicRanges, in a similar way that IntervalTree is a subclass of
Ranges.

On Wed, Apr 3, 2013 at 12:23 PM, Michael Lawrence
<lawrence.michael at gene.com> wrote:
#
On 04/03/2013 10:28 AM, Kasper Daniel Hansen wrote:
A cheap (both computationally and philosophically) hack is along the lines of

gr <- GRanges(c("A", "B", "A"), IRanges(c(2000, 1000, 3000), width=100))

rng <- range(gr)
off0 <- (width(rng)[-length(rng)] + 1L)
offset <- setNames(c(1L, cumsum(off0)) - start(rng), seqnames(rng))
shift(ranges(gr), offset[as.character(seqnames(gr))])

which shifts the ranges to be non-overlapping by seqname, so findOverlaps can 
operate on the IRanges directly. This is a trick I used in selectMethod(disjoin, 
"CompressedIRangesList"). There's the risk of integer overflow (when 
sum(width(rng) + 1L) > .Machine$integer.max) but this could be handled (if it 
occurs!) by partitioning into .Machine$integer.max-sized ranges. Unless IRanges 
were updated to support longer integers...

Martin

  
    
#
On 2013-04-03 19:28, Kasper Daniel Hansen wrote:
I can only confirm: many many people do not close genomes, because for a 
number of applications this is just not worth the trouble... for now.
Given the number of cases I have seen, that's now concerning the very 
large majority of genomes.

L.
4 days later