[Bioc-devel] viewMedians
9 messages · Hervé Pagès, Michael Lawrence, Peter Haverty
Hi Peter,

Seems like you have a pretty good implementation of the view* functions in genoset. Nice work! And great to hear that there is so much room for improvements to the implementation currently in IRanges. I'll try to give this a shot soon but first I want to move Rle's to the S4Vectors package.

Cheers,
H.
On 06/01/2014 07:58 PM, Peter Haverty wrote:
I think viewMedians would be great. While you have the hood up, there are some opportunities for some speedups and code simplification, I believe.

I did some experimentation with view* in the genoset package. I made an alternate version of the C for viewMeans and found about a 10X speedup. I hoisted the branching for the different types and did the NA handling with arithmetic rather than branching. The search for the Rle runs covered by each view is now done with findInterval. There are quite a few code sections that differ only in the type of the NA value and the pointers to the input/output vectors. I think it would be worth considering C++ templates.

On the R side, each view* function is pretty similar too. In genoset/R/RleDataFrame-views.R I tried to factor out all the shared pieces.

While we're on the topic, I think the view* functions should have range* equivalents that skip the View object and work on an Rle and an IRanges. If you already have a Views object around, view* are perfect. Otherwise, making the Views objects uses time that could be saved.

Overall I found about a 90X speedup over viewMeans(RleViewsList).

I hope there is some useful food for thought in these experiments. I have a vignette that shows some of the timings if anyone is interested.

Regards,
Pete
____________________
Peter M. Haverty, Ph.D.
Genentech, Inc.
phaverty at gene.com

_______________________________________________
Bioc-devel at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319
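
As a rough illustration of the findInterval idea Peter describes (locating the Rle runs covered by each range), here is a minimal base R sketch. It is not the genoset C code. It assumes the Rle is given as plain values/lengths vectors (as returned by base rle()), that every range lies within the vector, and that there are no NAs; the names prefix_sum_sketch and range_means_sketch are hypothetical. It also swaps in prefix sums over the runs, so each range needs only two run lookups, and it works directly on runs plus start/end positions in the spirit of the proposed range* functions.

```r
# Sum of the first p elements of the expanded vector, computed from the runs.
# Assumption: no NAs in `values`, and 0 <= p <= sum(lengths).
prefix_sum_sketch <- function(p, values, lengths) {
  run_starts <- cumsum(lengths) - lengths + 1L
  # before[k] is the total of all runs strictly before run k
  before <- c(0, cumsum(values * lengths))
  # Index of the run containing position p (0 when p is 0)
  k <- findInterval(p, run_starts)
  out <- numeric(length(p))
  ok <- k > 0L
  out[ok] <- before[k[ok]] + values[k[ok]] * (p[ok] - run_starts[k[ok]] + 1)
  out
}

# Mean of each closed range [start, end] without expanding the runs.
range_means_sketch <- function(values, lengths, start, end) {
  total <- prefix_sum_sketch(end, values, lengths) -
    prefix_sum_sketch(start - 1, values, lengths)
  total / (end - start + 1)
}

x <- c(1, 1, 1, 5, 5, 2, 2, 2, 2)
r <- rle(x)
m <- range_means_sketch(r$values, r$lengths,
                        start = c(1, 3, 6), end = c(4, 7, 9))

# Direct computation on the expanded vector, for comparison
direct <- c(mean(x[1:4]), mean(x[3:7]), mean(x[6:9]))
```

Here m and direct should agree. Handling NAs (for example by masking them out with arithmetic, as Peter mentions) would need extra bookkeeping that this sketch leaves out.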
There is a lot going on with the view* functions, and yes, it's
not just about Rles: they also need to work on atomic vectors.
Right now min(), max(), sum(), and mean() all work on IntegerList,
NumericList, RleList, XIntegerViews and XDoubleViews, but the
implementations are disparate and share almost nothing. Compare,
for example, the sum,XIntegerViews, sum,CompressedIntegerList, and
sum,SimpleIntegerList methods:

  library(XVector)
  set.seed(33)
  subject <- sample(50, 5400200, replace=TRUE)

  ## XIntegerViews:
  xiv <- successiveViews(as(subject, "XInteger"),
                         width=rep(200, 5400200/200))

  ## CompressedIntegerList:
  cil <- extractList(subject, ranges(xiv))

  ## SimpleIntegerList:
  sil <- IntegerList(unname(split(subject, togroup(ranges(xiv)))),
                     compress=FALSE)

Then:

  > system.time(res1 <- sum(xiv))
     user  system elapsed
    0.008   0.000   0.008
  > system.time(res2 <- sum(cil))
     user  system elapsed
    0.488   0.004   0.492
  > system.time(res3 <- sum(sil))
     user  system elapsed
    0.036   0.000   0.034

The 3 methods share zero code. sum,XIntegerViews is implemented in
C while sum,CompressedIntegerList and sum,SimpleIntegerList are
implemented in R. Just an example.
All this needs to be revisited. This is actually one of my goals for
BioC 3.0. viewMedians() on RleViews is just the tip of the iceberg.
H.
On 06/02/2014 01:24 PM, Michael Lawrence wrote:
While we rework things, what about adding support for atomic vectors, in
addition to Rles? Also, what about functions that are optimized for
partitionings? Those would be easy to write and would let us greatly
accelerate e.g. sum,CompressedIntegerList. Right now we rely on rowsum()
which is fast but could be much faster.
Michael
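
To make the partitioning idea concrete, here is a minimal base R sketch. It assumes the list is stored as one unlisted vector plus the cumulative end position of each list element, and that there are no NAs; partition_sums_sketch is a hypothetical name, not an IRanges function. It is one possible approach, using a single cumsum() pass rather than rowsum().

```r
# Per-element sums for a partitioned vector.
# `x` is the unlisted data; `ends` holds the cumulative end position of each
# list element (a repeated end marks an empty element).
# Assumption: no NAs in `x`.
partition_sums_sketch <- function(x, ends) {
  # as.numeric() avoids integer overflow in the running total
  cs <- c(0, cumsum(as.numeric(x)))  # cs[i + 1] is the sum of x[1..i]
  diff(cs[c(0L, ends) + 1L])
}

x <- 1:10
ends <- c(3L, 3L, 7L, 10L)  # elements: x[1:3], empty, x[4:7], x[8:10]
sums <- partition_sums_sketch(x, ends)

# The rowsum() route, for comparison; it returns one row per group that
# actually occurs, so the empty second element has no row.
group <- rep(seq_along(ends), diff(c(0L, ends)))
rs <- rowsum(x, group)
```

Because the partition is already sorted and contiguous, no grouping or hashing step is needed. Whether this beats rowsum() in practice would have to be measured.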