[Bioc-devel] mapping between original and reduced ranges
Hi reducers,
I agree it "feels wrong" to use findOverlaps() to extract the mapping
from original to reduced ranges. Even if it can be computed very easily
with:
findOverlaps(gr, reduce(gr), select="first")
(Note that using 'queryHits(findOverlaps(reduce(gr), gr))' only produces
the correct result if 'gr' is already sorted by increasing order.)
I think it would be easy for reduce() internal code to produce this
mapping. The question is: how do we give it back to the user?
Is it OK to use an attribute for this? reduce() already uses this
for returning some extra information about the reduction:
> ir
IRanges of length 5
start end width
[1] 1 5 5
[2] 6 10 5
[3] 12 16 5
[4] 24 28 5
[5] 27 31 5
> ir2 <- reduce(ir, with.inframe.attrib=TRUE)
> ir2
IRanges of length 3
start end width
[1] 1 10 10
[2] 12 16 5
[3] 24 31 8
> attr(ir2, "inframe")
IRanges of length 5
start end width
[1] 1 5 5
[2] 6 10 5
[3] 11 15 5
[4] 16 20 5
[5] 19 23 5
We could to the same thing for the mapping from original to reduced
ranges with e.g. an argument called 'with.mapping.attrib'.
Would that work?
Cheers,
H.
On 03/15/2012 05:44 AM, Kasper Daniel Hansen wrote:
So the key question is to what extent keeping track of where the ranges comes from would slow down the reduce operation. I am not familiar enough with the algorithm to know this, but given how fast IRanges is in general, I am not one for guessing on this. I agree with Florian that this is a very typical use case. Kasper On Thu, Mar 15, 2012 at 5:02 AM, Hahne, Florian <florian.hahne at novartis.com> wrote:
Hi all, It is true that this is not terribly slow when you deal with fairly large range objects: foo<- GRanges(seqnames=sample(1:4, 1e6, TRUE), ranges=IRanges(start=as.integer(runif(min=1, max=1e7, n=1e6)), width=50)) system.time(bar<- reduce(foo)) user system elapsed 0.918 0.174 1.091 system.time(foobar<- findOverlaps(foo, bar)) user system elapsed 2.051 0.402 2.453 However the whole process does take about 3x the time of just the reduce operation, and in my use case I want this to happen interactively, where waiting 3 seconds compared to 1 makes a huge difference... I wouldn't push this high up on the development agenda, but it seems to be something that is already 95% existing and could easily be added. But maybe I am wrong... Florian Florian Hahne Novartis Institute For Biomedical Research Translational Sciences / Preclinical Safety / PCS Informatics Expert Data Integration and Modeling Bioinformatics CHBS, WKL-135.2.26 Novartis Institute For Biomedical Research, Werk Klybeck Klybeckstrasse 141 CH-4057 Basel Switzerland Phone: +41 61 6967127 Email : florian.hahne at novartis.com On 3/14/12 9:40 PM, "Kasper Daniel Hansen"<kasperdanielhansen at gmail.com> wrote:
We have discussed this a couple of times. I routinely uses the reduce followed by findOverlaps paradigm. As Malcolm says it feels wrong, but from a practical point of view it is pretty fast, so I stopped worrying about it. I only think there is a reason to do this, if it is substantially faster. Kasper On Wed, Mar 14, 2012 at 3:46 PM, Cook, Malcolm<MEC at stowers.org> wrote:
Chiming in.... on a similar note.... A version of `disjoin` which returns a Hits/RangesMapping additional to the GRanges result would be most useful and probably not require much additional effort (assuming `disjoin` computes this internally) Of course, it is easy to live without since I can just perform the findOverlaps myself after the disjoin.... it just "feels wrong" (tm) Ahoy! ~Malcolm
-----Original Message----- From: bioc-devel-bounces at r-project.org [mailto:bioc-devel-bounces at r- project.org] On Behalf Of Hahne, Florian Sent: Wednesday, March 14, 2012 2:22 PM To: bioc-devel at r-project.org Subject: [Bioc-devel] mapping between original and reduced ranges This bounced before, guess the mailing list does not like HTML mails. So one more try: I had the following offline discussion with Michael about how one could retain a mapping of the ranges in a GRanges object before and after reduce. He suggested to take it to the list. Is that something that could be added to GenomicRanges/IRanges? Florian I have a slightly tricky application for which I need to reduce a GRanges object, but I would like to be able to process some of the original elementMetadata of the merged ranges later. The only way I was able to figure out which of the original ranges correspond to the merged ranges was to perform a findOverlaps operation, but of course that is rather costly. Is there a way to get the merge information out of the original reduce call? Here is a brief example: gr<- GRanges(seqnames="chr1", ranges=IRanges(start=c(1,6,12,24,27), width=5), foo=1:5, bar=letters[1:5]) gr2<- reduce(gr, min.gapwidth=1) ind<- queryHits(findOverlaps(gr2, gr)) split(values(gr), ind) Unfortunately, this is the idiom. I could see an improvement where reduce or a similarly named function would return a Hits object (in addition to the actual reduce result) that would indicate the mapping between the input and reduced ranges. The RangesMapping structure would be really close to what we would need. Michael
_______________________________________________ Bioc-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
_______________________________________________ Bioc-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
_______________________________________________ Bioc-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
Herv? Pag?s Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319