[Bioc-devel] coverage as IntegerList
Hi,

Why not. But I don't expect a significant speed-up. Here is why:

There are actually 2 algos implemented by coverage(): one called "sort" that computes the coverage directly in "Rle space", and one called "hash" that computes the coverage into an ordinary integer vector and turns this vector into an Rle at the end (this conversion is cheap).

By default coverage() tries to automatically pick the appropriate algo: "hash" when the data are dense, "sort" otherwise. The criterion used to decide whether the data are dense is a little bit naive (and could maybe be improved?): it just compares the number of ranges in the input with the length of the coverage vector to return. If the number of ranges > 0.25 * length-of-coverage-vector, the data are considered dense. Clearly this formula is kind of arbitrary and I'm sure it could be tweaked a little bit to do a better job.

Note that the user can choose the algo to use via the 'method' arg. If you know your data are dense, use method="hash". It will be almost as fast as if coverage() were returning an IntegerList, except that the coverage is turned into an Rle (but only at the end). I would expect this final coercion to cost almost nothing compared to the computation of the coverage itself. This would need to be confirmed by some profiling though.

Anyway, maybe there are other benefits of returning an IntegerList: smaller memory footprint when the data are dense, a more beginner-friendly container, maybe slightly faster downstream computations (can this be a bottleneck?), others?

H.
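For readers who want to see the two strategies and the density test side by side, here is a minimal sketch in Python. It is not the IRanges implementation: the function names (coverage_sort, coverage_hash, pick_method, to_rle) are hypothetical, the "hash" variant is approximated with a difference array plus cumulative sum, and runs are represented as plain (value, length) tuples rather than an Rle. Only the 0.25 threshold and the 'method' choices come from the message above.

```python
from itertools import groupby


def to_rle(values):
    """Compress a plain integer vector into (value, run_length) pairs."""
    return [(v, len(list(g))) for v, g in groupby(values)]


def _clip(start, width, length):
    """Clip a 1-based inclusive range to [1, length]; None if empty."""
    lo, hi = max(start, 1), min(start + width - 1, length)
    return (lo, hi) if lo <= hi else None


def coverage_hash(starts, widths, length):
    """Dense strategy (stand-in for "hash"): fill an ordinary integer
    vector first, convert to runs only at the very end."""
    diff = [0] * (length + 1)
    for s, w in zip(starts, widths):
        r = _clip(s, w, length)
        if r is None:
            continue
        diff[r[0] - 1] += 1   # depth goes up at the range start
        diff[r[1]] -= 1       # and back down just past the range end
    cov, depth = [], 0
    for d in diff[:length]:
        depth += d
        cov.append(depth)
    return to_rle(cov)


def _append_run(runs, value, n):
    """Append a run, merging with the previous run if the value matches."""
    if n <= 0:
        return
    if runs and runs[-1][0] == value:
        runs[-1] = (value, runs[-1][1] + n)
    else:
        runs.append((value, n))


def coverage_sort(starts, widths, length):
    """Sparse strategy (stand-in for "sort"): sort range boundaries and
    build the runs directly, never materialising the full vector."""
    events = []
    for s, w in zip(starts, widths):
        r = _clip(s, w, length)
        if r is None:
            continue
        events.append((r[0], 1))
        events.append((r[1] + 1, -1))
    events.sort()
    runs, pos, depth, i = [], 1, 0, 0
    while i < len(events):
        p = events[i][0]
        _append_run(runs, depth, p - pos)  # emit the run ending before p
        pos = p
        while i < len(events) and events[i][0] == p:
            depth += events[i][1]
            i += 1
    _append_run(runs, depth, length - pos + 1)  # trailing run, if any
    return runs


def pick_method(n_ranges, length):
    """Density test from the message: dense if n_ranges > 0.25 * length."""
    return "hash" if n_ranges > 0.25 * length else "sort"


def coverage(starts, widths, length, method="auto"):
    if method == "auto":
        method = pick_method(len(starts), length)
    if method == "hash":
        return coverage_hash(starts, widths, length)
    if method == "sort":
        return coverage_sort(starts, widths, length)
    raise ValueError("method must be 'auto', 'sort' or 'hash'")
```

Both strategies return the same runs for the same input. The difference is only in the intermediate representation, which is why the final vector-to-runs conversion in the dense path is expected to be a small part of the total cost.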
On 02/11/2014 05:06 PM, Michael Lawrence wrote:
Right, it would be a choice. The compression is not worth it when the data are dense.

On Tue, Feb 11, 2014 at 4:18 PM, Kasper Daniel Hansen <kasperdanielhansen at gmail.com> wrote:
Sounds reasonable, _especially_ if you think it is faster. You're the expert. I assume you will allow the user to choose the return value? Having the option of Rle's is still nice, for some use cases.

On Tue, Feb 11, 2014 at 7:12 PM, Michael Lawrence <lawrence.michael at gene.com> wrote:
Just a thought: support coverage calculation directly to IntegerList. Will
very often be faster than RleList, especially when limiting to regions
without long runs of zeros, and with WGS data.
Something to put on the TODO list?
Michael
[[alternative HTML version deleted]]
_______________________________________________ Bioc-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319