[Bioc-devel] improvement to findOverlap for GRanges

In my experience, GRanges are very convenient and amazingly fast.

I have recently started to work on an organism with a more unfinished
genome.  Instead of the situation for model organisms with high
quality reference genomes where we have few, very long, chromosomes,
the organism I am working with has many, much smaller, contigs.  By
many I mean on the order of 10,000.  Given how hard it is to currently
finish a genome to high quality, we are likely to see many organisms
with such fragmented reference genomes.

findOverlaps is quite slow in such situations.  Essentially
findOverlaps splits the GRanges by the seqnames and then does a lapply
over the (unique) seqnames (with a findOverlaps).  Example:

library(GenomicRanges)

simul <- function(nSamples = 10^5, nChr = 1000) {
    start <- sample(1:10^5, size = nSamples, replace = TRUE)
    chr <- paste("chr", sample(1:nChr, size = nSamples, replace =
TRUE), sep = "")
    gr <- GRanges(seqnames = chr, ranges = IRanges(start = start, width = 1000))
    gr
}

set.seed(1234)
gr1 <- simul()
gr2 <- simul()
gr2.1 <- gr2[1]
seqlevels(gr2.1) <- as.character(seqnames(gr2.1))
system.time({
    findOverlaps(gr1, gr2)
})
system.time({
    findOverlaps(gr1, gr2[1])
})
system.time({
    findOverlaps(gr1, gr2.1)
})
source("findOverlaps-methods.R")

On my laptop the three calls take roughly 19.7sec, 16.1sec and
0.07sec.  The difference between the last two calls is unnecessary and
arises because findOverlaps ends up sapply over many empty seqnames.
A fix is

Index: findOverlaps-methods.R
===================================================================

--- findOverlaps-methods.R      (revision 60860)
+++ findOverlaps-methods.R      (working copy)
@@ -71,11 +71,11 @@
         } else {
             querySeqnames <- seqnames(query)
             querySplitRanges <- splitRanges(querySeqnames)
-            uniqueQuerySeqnames <- names(querySplitRanges)
+            uniqueQuerySeqnames <-
names(querySplitRanges)[sapply(querySplitRanges, length) > 0]

             subjectSeqnames <- seqnames(subject)
             subjectSplitRanges <- splitRanges(subjectSeqnames)
-            uniqueSubjectSeqnames <- names(subjectSplitRanges)
+            uniqueSubjectSeqnames <-
names(subjectSplitRanges)[sapply(subjectSplitRanges, length) > 0]

             commonSeqnames <-
               intersect(uniqueQuerySeqnames, uniqueSubjectSeqnames)

I am guessing that a fix like this might also make sense for coverage.

However, this does not solve the "slowness" of findOverlaps for
GRanges like gr1 and gr2 above (many seqlevels, but also long).
Perhaps there is some obvious way to deal with this?  It would be nice
if GRanges are also fast with many seqlevels.

Kasper