
[Bioc-devel] coverage as IntegerList

Hi,

Why not, but I don't expect a significant speedup. Here is why:

There are actually 2 algos implemented by coverage(): one called "sort"
that computes the coverage directly into "Rle space", and one called
"hash" that computes the coverage into an ordinary integer vector and
turns this vector into an Rle at the end (this conversion is cheap).
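
To make the two strategies concrete, here is a rough sketch in Python.
It is not the actual coverage() code: the function names, the
difference-array trick in the "hash" variant and the event sweep in the
"sort" variant are assumptions chosen for illustration. Ranges are
1-based, inclusive, and assumed to lie within 1..width. Both functions
return the coverage as a list of (value, run length) pairs, i.e. a
poor man's Rle.

```python
from itertools import groupby


def _push(runs, value, length):
    """Append a run, merging it with the previous run if the value is the same."""
    if length <= 0:
        return
    if runs and runs[-1][0] == value:
        runs[-1] = (value, runs[-1][1] + length)
    else:
        runs.append((value, length))


def coverage_hash(starts, ends, width):
    """Dense strategy: fill an ordinary integer vector, run-length encode at the end."""
    diff = [0] * (width + 1)
    for s, e in zip(starts, ends):
        diff[s - 1] += 1   # coverage goes up at the start of the range
        diff[e] -= 1       # and back down just past its end
    cov, depth = [], 0
    for d in diff[:width]:
        depth += d
        cov.append(depth)
    # Final conversion to runs: a single linear pass over the dense vector.
    return [(value, len(list(group))) for value, group in groupby(cov)]


def coverage_sort(starts, ends, width):
    """Sparse strategy: sort the range boundaries and emit runs directly."""
    events = sorted([(s - 1, 1) for s in starts] + [(e, -1) for e in ends])
    runs, pos, depth, i = [], 0, 0, 0
    while i < len(events):
        p = events[i][0]
        _push(runs, depth, p - pos)   # run of constant depth up to this boundary
        pos = p
        while i < len(events) and events[i][0] == p:
            depth += events[i][1]
            i += 1
    _push(runs, depth, width - pos)   # trailing run up to the end of the vector
    return runs
```

The dense variant costs time proportional to the length of the coverage
vector no matter how few ranges there are, while the sorted variant
costs time that depends only on the number of ranges. That is why
picking between them based on density makes sense.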

By default coverage() tries to pick the appropriate algo automatically:
"hash" when the data are dense, "sort" otherwise. The criterion used
to decide whether the data are dense is a little bit naive (and could
maybe be improved?): it just compares the number of ranges in the
input with the length of the coverage vector to return. If the number
of ranges > 0.25 * length-of-coverage-vector, the data are considered
dense. Clearly this formula is kind of arbitrary and I'm sure it could
be tweaked a little bit to do a better job.
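
The rule described above boils down to a one-liner. Here it is as a
Python sketch; pick_method and its threshold argument are hypothetical
names, and only the 0.25 factor and the strict ">" comparison come from
the description above.

```python
def pick_method(n_ranges, cov_length, threshold=0.25):
    """Return "hash" if the input looks dense, "sort" otherwise."""
    # Dense means: more ranges than a quarter of the coverage vector length.
    return "hash" if n_ranges > threshold * cov_length else "sort"
```

For example, 300 ranges over a coverage vector of length 1000 would be
treated as dense, while 10 ranges over the same length would not. Note
that the rule ignores the widths of the ranges entirely, which is one
reason it is only a rough guess.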

Note that the user can choose the algo to use via the 'method' arg.
If you know your data are dense, use method="hash". It will be almost
as fast as if coverage() were returning an IntegerList, except that
the coverage is turned into an Rle (but only at the end). I would
expect this final coercion to cost next to nothing compared to the
computation of the coverage itself. This would need to be confirmed
by some profiling though.

Anyway, maybe there are other benefits to returning an IntegerList:
smaller memory footprint when the data are dense, a more
beginner-friendly container, maybe slightly faster downstream
computations (can this be a bottleneck?), others?
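
On the memory footprint point, a back-of-the-envelope comparison can be
sketched in Python. It assumes a run-length encoding stores one value
and one length per run, that all of them are integers of the same size,
and it ignores any per-object overhead; storage_cells is a hypothetical
helper, not part of any package.

```python
def storage_cells(cov):
    """Return (dense_cells, rle_cells) for an integer coverage vector."""
    # A new run starts at position 0 and wherever the value changes.
    n_runs = sum(1 for i, v in enumerate(cov) if i == 0 or v != cov[i - 1])
    # Dense storage: one cell per position. Run-length storage: two cells per run.
    return len(cov), 2 * n_runs
```

Under these assumptions the run-length form only pays off when the
average run is longer than 2 positions. A coverage vector that changes
value at almost every position takes up to twice the space as runs.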

H.
On 02/11/2014 05:06 PM, Michael Lawrence wrote: