In my experience, GRanges are very convenient and amazingly fast.
I have recently started to work on an organism with a more unfinished
genome. Instead of the situation for model organisms with high
quality reference genomes where we have few, very long, chromosomes,
the organism I am working with has many, much smaller, contigs. By
many I mean on the order of 10,000. Given how hard it is to currently
finish a genome to high quality, we are likely to see many organisms
with such fragmented reference genomes.
findOverlaps is quite slow in such situations. Essentially
findOverlaps splits the GRanges by the seqnames and then does a lapply
over the (unique) seqnames (with a findOverlaps). Example:
library(GenomicRanges)
simul <- function(nSamples = 10^5, nChr = 1000) {
start <- sample(1:10^5, size = nSamples, replace = TRUE)
chr <- paste("chr", sample(1:nChr, size = nSamples, replace =
TRUE), sep = "")
gr <- GRanges(seqnames = chr, ranges = IRanges(start = start, width = 1000))
gr
}
set.seed(1234)
gr1 <- simul()
gr2 <- simul()
gr2.1 <- gr2[1]
seqlevels(gr2.1) <- as.character(seqnames(gr2.1))
system.time({
findOverlaps(gr1, gr2)
})
system.time({
findOverlaps(gr1, gr2[1])
})
system.time({
findOverlaps(gr1, gr2.1)
})
source("findOverlaps-methods.R")
On my laptop the three calls take roughly 19.7sec, 16.1sec and
0.07sec. The difference between the last two calls is unnecessary and
arises because findOverlaps ends up sapply over many empty seqnames.
A fix is
Index: findOverlaps-methods.R
===================================================================
--- findOverlaps-methods.R (revision 60860)
+++ findOverlaps-methods.R (working copy)
@@ -71,11 +71,11 @@
} else {
querySeqnames <- seqnames(query)
querySplitRanges <- splitRanges(querySeqnames)
- uniqueQuerySeqnames <- names(querySplitRanges)
+ uniqueQuerySeqnames <-
names(querySplitRanges)[sapply(querySplitRanges, length) > 0]
subjectSeqnames <- seqnames(subject)
subjectSplitRanges <- splitRanges(subjectSeqnames)
- uniqueSubjectSeqnames <- names(subjectSplitRanges)
+ uniqueSubjectSeqnames <-
names(subjectSplitRanges)[sapply(subjectSplitRanges, length) > 0]
commonSeqnames <-
intersect(uniqueQuerySeqnames, uniqueSubjectSeqnames)
I am guessing that a fix like this might also make sense for coverage.
However, this does not solve the "slowness" of findOverlaps for
GRanges like gr1 and gr2 above (many seqlevels, but also long).
Perhaps there is some obvious way to deal with this? It would be nice
if GRanges are also fast with many seqlevels.
Kasper
[Bioc-devel] improvement to findOverlap for GRanges
3 messages · Kasper Daniel Hansen, Michael Lawrence
An embedded and charset-unspecified text was scrubbed... Name: not available URL: <https://stat.ethz.ch/pipermail/bioc-devel/attachments/20111126/cd60b384/attachment.pl>
On Sat, Nov 26, 2011 at 6:58 PM, Michael Lawrence
<lawrence.michael at gene.com> wrote:
As an ultimate goal, it make sense to vectorize things over many seqnames. The IntervalTree could also be made to store data over multiple spaces (like seqname, strand combinations) and could then be cached within a GenomicRanges object. That way, the tree is only built once.
This would, of course, be awesome. One other piece of info is that organisms with many contigs typically have shorter contigs. One could perhaps start by ignoring the seqnames, do a findOverlaps on the underlying ranges and then use seqnames afterwards to sort it out. Not sure whether this is a good idea. Kasper
Michael On Sat, Nov 26, 2011 at 1:55 PM, Kasper Daniel Hansen <kasperdanielhansen at gmail.com> wrote:
In my experience, GRanges are very convenient and amazingly fast.
I have recently started to work on an organism with a more unfinished
genome. ?Instead of the situation for model organisms with high
quality reference genomes where we have few, very long, chromosomes,
the organism I am working with has many, much smaller, contigs. ?By
many I mean on the order of 10,000. ?Given how hard it is to currently
finish a genome to high quality, we are likely to see many organisms
with such fragmented reference genomes.
findOverlaps is quite slow in such situations. ?Essentially
findOverlaps splits the GRanges by the seqnames and then does a lapply
over the (unique) seqnames (with a findOverlaps). ?Example:
library(GenomicRanges)
simul <- function(nSamples = 10^5, nChr = 1000) {
? ?start <- sample(1:10^5, size = nSamples, replace = TRUE)
? ?chr <- paste("chr", sample(1:nChr, size = nSamples, replace =
TRUE), sep = "")
? ?gr <- GRanges(seqnames = chr, ranges = IRanges(start = start, width =
1000))
? ?gr
}
set.seed(1234)
gr1 <- simul()
gr2 <- simul()
gr2.1 <- gr2[1]
seqlevels(gr2.1) <- as.character(seqnames(gr2.1))
system.time({
? ?findOverlaps(gr1, gr2)
})
system.time({
? ?findOverlaps(gr1, gr2[1])
})
system.time({
? ?findOverlaps(gr1, gr2.1)
})
source("findOverlaps-methods.R")
On my laptop the three calls take roughly 19.7sec, 16.1sec and
0.07sec. ?The difference between the last two calls is unnecessary and
arises because findOverlaps ends up sapply over many empty seqnames.
A fix is
Index: findOverlaps-methods.R
===================================================================
--- findOverlaps-methods.R ? ? ?(revision 60860)
+++ findOverlaps-methods.R ? ? ?(working copy)
@@ -71,11 +71,11 @@
? ? ? ? } else {
? ? ? ? ? ? querySeqnames <- seqnames(query)
? ? ? ? ? ? querySplitRanges <- splitRanges(querySeqnames)
- ? ? ? ? ? ?uniqueQuerySeqnames <- names(querySplitRanges)
+ ? ? ? ? ? ?uniqueQuerySeqnames <-
names(querySplitRanges)[sapply(querySplitRanges, length) > 0]
? ? ? ? ? ? subjectSeqnames <- seqnames(subject)
? ? ? ? ? ? subjectSplitRanges <- splitRanges(subjectSeqnames)
- ? ? ? ? ? ?uniqueSubjectSeqnames <- names(subjectSplitRanges)
+ ? ? ? ? ? ?uniqueSubjectSeqnames <-
names(subjectSplitRanges)[sapply(subjectSplitRanges, length) > 0]
? ? ? ? ? ? commonSeqnames <-
? ? ? ? ? ? ? intersect(uniqueQuerySeqnames, uniqueSubjectSeqnames)
I am guessing that a fix like this might also make sense for coverage.
However, this does not solve the "slowness" of findOverlaps for
GRanges like gr1 and gr2 above (many seqlevels, but also long).
Perhaps there is some obvious way to deal with this? ?It would be nice
if GRanges are also fast with many seqlevels.
Kasper
_______________________________________________ Bioc-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel