-----Original Message-----
From: bioc-devel-bounces at r-project.org [mailto:bioc-devel-bounces at r-
project.org] On Behalf Of Herv? Pag?s
Sent: Thursday, March 15, 2012 1:55 PM
To: Kasper Daniel Hansen
Cc: bioc-devel at r-project.org
Subject: Re: [Bioc-devel] mapping between original and reduced ranges
Hi reducers,
I agree it "feels wrong" to use findOverlaps() to extract the mapping
from original to reduced ranges. Even if it can be computed very easily
with:
findOverlaps(gr, reduce(gr), select="first")
(Note that using 'queryHits(findOverlaps(reduce(gr), gr))' only produces
the correct result if 'gr' is already sorted by increasing order.)
I think it would be easy for reduce() internal code to produce this
mapping. The question is: how do we give it back to the user?
Is it OK to use an attribute for this? reduce() already uses this
for returning some extra information about the reduction:
IRanges of length 5
start end width
[1] 1 5 5
[2] 6 10 5
[3] 12 16 5
[4] 24 28 5
[5] 27 31 5
> ir2 <- reduce(ir, with.inframe.attrib=TRUE)
> ir2
IRanges of length 3
start end width
[1] 1 10 10
[2] 12 16 5
[3] 24 31 8
IRanges of length 5
start end width
[1] 1 5 5
[2] 6 10 5
[3] 11 15 5
[4] 16 20 5
[5] 19 23 5
We could to the same thing for the mapping from original to reduced
ranges with e.g. an argument called 'with.mapping.attrib'.
Would that work?
Cheers,
H.
On 03/15/2012 05:44 AM, Kasper Daniel Hansen wrote:
So the key question is to what extent keeping track of where the
ranges comes from would slow down the reduce operation. I am not
familiar enough with the algorithm to know this, but given how fast
IRanges is in general, I am not one for guessing on this.
I agree with Florian that this is a very typical use case.
Kasper
On Thu, Mar 15, 2012 at 5:02 AM, Hahne, Florian
<florian.hahne at novartis.com> wrote:
Hi all,
It is true that this is not terribly slow when you deal with fairly large
range objects:
foo<- GRanges(seqnames=sample(1:4, 1e6, TRUE),
ranges=IRanges(start=as.integer(runif(min=1, max=1e7, n=1e6)),
system.time(bar<- reduce(foo))
user system elapsed
0.918 0.174 1.091
system.time(foobar<- findOverlaps(foo, bar))
user system elapsed
2.051 0.402 2.453
However the whole process does take about 3x the time of just the
operation, and in my use case I want this to happen interactively, where
waiting 3 seconds compared to 1 makes a huge difference...
I wouldn't push this high up on the development agenda, but it seems to
something that is already 95% existing and could easily be added. But
maybe I am wrong...
Florian
Florian Hahne
Novartis Institute For Biomedical Research
Translational Sciences / Preclinical Safety / PCS Informatics
Expert Data Integration and Modeling Bioinformatics
CHBS, WKL-135.2.26
Novartis Institute For Biomedical Research, Werk Klybeck
Klybeckstrasse 141
CH-4057 Basel
Switzerland
Phone: +41 61 6967127
Email : florian.hahne at novartis.com
On 3/14/12 9:40 PM, "Kasper Daniel
Hansen"<kasperdanielhansen at gmail.com>
We have discussed this a couple of times. I routinely uses the reduce
followed by findOverlaps paradigm. As Malcolm says it feels wrong,
but from a practical point of view it is pretty fast, so I stopped
worrying about it. I only think there is a reason to do this, if it
is substantially faster.
Kasper
On Wed, Mar 14, 2012 at 3:46 PM, Cook, Malcolm<MEC at stowers.org>
Chiming in....
on a similar note....
A version of `disjoin` which returns a Hits/RangesMapping additional to
the GRanges result would be most useful and probably not require
additional effort (assuming `disjoin` computes this internally)
Of course, it is easy to live without since I can just perform the
findOverlaps myself after the disjoin.... it just "feels wrong" (tm)
Ahoy!
~Malcolm
-----Original Message-----
From: bioc-devel-bounces at r-project.org [mailto:bioc-devel-
project.org] On Behalf Of Hahne, Florian
Sent: Wednesday, March 14, 2012 2:22 PM
To: bioc-devel at r-project.org
Subject: [Bioc-devel] mapping between original and reduced ranges
This bounced before, guess the mailing list does not like HTML mails.
So
one more try:
I had the following offline discussion with Michael about how one
retain a mapping of the ranges in a GRanges object before and after
reduce. He suggested to take it to the list. Is that something that
could
be added to GenomicRanges/IRanges?
Florian
I have a slightly tricky application for which I need to reduce a
GRanges
object, but I would like to be able to process some of the original
elementMetadata of the merged ranges later. The only way I was
figure out which of the original ranges correspond to the merged
was to perform a findOverlaps operation, but of course that is rather
costly. Is there a way to get the merge information out of the original
reduce call?
Here is a brief example:
gr<- GRanges(seqnames="chr1",
ranges=IRanges(start=c(1,6,12,24,27),
width=5), foo=1:5, bar=letters[1:5])
gr2<- reduce(gr, min.gapwidth=1)
ind<- queryHits(findOverlaps(gr2, gr))
split(values(gr), ind)
Unfortunately, this is the idiom. I could see an improvement where
reduce
or a similarly named function would return a Hits object (in addition
to
the actual reduce result) that would indicate the mapping between
input and reduced ranges. The RangesMapping structure would be
close to what we would need.
Michael