[Bioc-devel] IRanges findOverlaps Result Different for Recent Update

Hi guys,

Indeed, the Hits object returned by findOverlaps() is not fully
sorted anymore. Now it's sorted by query hit *only* and not by query
hit *and* subject hit. Fully sorting a big Hits object has a high
cost, both in terms of time and memory footprint. The partial
sorting is *much* cheaper: it's done using a "tabulated sorting"
algo implemented in C that works in linear time.
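
To illustrate the idea (this is not the C code in IRanges, just a rough Python sketch assuming "tabulated sorting" behaves like a counting sort on the query index), the function below sorts hits by query hit only, in time linear in the number of hits plus the number of queries. The name sort_hits_by_query and its arguments are made up for the example; indices are 1-based as in R.

```python
def sort_hits_by_query(query_hits, subject_hits, n_query):
    """Stable counting sort of (query, subject) pairs by query index only.

    Hypothetical helper for illustration. Indices are 1-based, as in R.
    Runs in O(len(query_hits) + n_query) time.
    """
    # Tabulate: counts[q] = number of hits whose query index is q.
    counts = [0] * (n_query + 1)
    for q in query_hits:
        counts[q] += 1

    # Turn the counts into the starting offset of each query's block.
    offsets = [0] * (n_query + 2)
    for q in range(1, n_query + 1):
        offsets[q + 1] = offsets[q] + counts[q]

    # Place each hit in its block, keeping input order within a block.
    out_q = [0] * len(query_hits)
    out_s = [0] * len(subject_hits)
    next_slot = offsets[:]  # next free position for each query index
    for q, s in zip(query_hits, subject_hits):
        pos = next_slot[q]
        out_q[pos] = q
        out_s[pos] = s
        next_slot[q] += 1
    return out_q, out_s


sorted_q, sorted_s = sort_hits_by_query([3, 1, 3, 2, 1], [7, 9, 2, 5, 4], 3)
print(sorted_q)  # query hits are now non-decreasing
print(sorted_s)  # subject hits are NOT sorted within a query
```

Note that the subject hits within a given query keep whatever order they had in the input, which is exactly the "partial" part of the sorting.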

The partial sorting is important: it allows a very common
transformation like as(hits, "List") to be super fast. But the
full sorting was overkill and generally not needed. Also note that
the full sorting was never enforced via the validity method for
Hits objects (and t(hits) was breaking that order in BioC < 3.1).
Now the validity method for Hits enforces the partial sorting and
t(hits) preserves it.
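
Here is a rough Python sketch of why the partial sorting is enough for a list-like grouping. It is not how as(hits, "List") or the Hits validity method are actually implemented; the helper names are hypothetical and indices are 1-based as in R. The point is only that when hits are sorted by query, each query's subject hits form one contiguous run, so the grouping is a single pass.

```python
def is_sorted_by_query(query_hits):
    """Check the partial-sorting invariant: query hits never decrease."""
    return all(a <= b for a, b in zip(query_hits, query_hits[1:]))


def group_subjects_by_query(query_hits, subject_hits, n_query):
    """Split subject hits into one list per query (1-based query indices).

    Hypothetical helper for illustration. Relies on the hits being sorted
    by query, so each group is a contiguous slice of subject_hits.
    """
    if not is_sorted_by_query(query_hits):
        raise ValueError("hits must be sorted by query hit")
    groups = []
    start = 0
    for q in range(1, n_query + 1):
        # Advance to the end of the run of hits belonging to query q.
        end = start
        while end < len(query_hits) and query_hits[end] == q:
            end += 1
        groups.append(subject_hits[start:end])
        start = end
    return groups


groups = group_subjects_by_query([1, 1, 2, 3, 3], [9, 4, 5, 7, 2], 4)
print(groups)  # one list per query; a query with no hits gets an empty list
```

The subject hits inside each group do not need to be ordered for this to work, which is why the full sorting buys nothing here.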

There were only 3 or 4 packages that broke in devel because of
that change (typically the change broke their unit tests). I fixed
them (except Repitools, but it's still on my list). The fix is easy:
if having the hits fully sorted matters, just use sort() on the Hits
object. The findOverlaps man page (?findOverlaps) will soon be updated to
reflect these changes.

Cheers,
H.

On 01/15/2015 06:42 AM, Kasper Daniel Hansen wrote: