[Bioc-devel] mapping between original and reduced ranges
Hi Herve, the two are related but not similar issues:
On 3/16/12 6:07 AM, "Herv? Pag?s" <hpages at fhcrc.org> wrote:
On 03/15/2012 04:13 PM, Hahne, Florian wrote:
Would such a solution also allow to keep the original elementMetadata in the respective list representation?
Let see if I understand your question correctly. IIUC Michael is proposing to store the "reverse mapping" (i.e. the mapping from the reduced ranges to the original ranges) in the elementMetadata slot of the reduced object. Note that, unlike the "direct mapping" (i.e. the mapping from the original ranges to the reduced ranges), which is a many-to-1 relationship, the "reverse mapping is 1-to-many, hence the need to use an IntegerList to represent it. The good thing about this IntegerList is that it has the length of the reduced object so it fits naturally in its elementMetadata slot:
> ir
IRanges of length 5
start end width
[1] 24 28 5
[2] 27 31 5
[3] 1 5 5
[4] 6 10 5
[5] 12 16 5
> ir2 <- reduce(ir) > ir2
IRanges of length 3
start end width
[1] 1 10 10
[2] 12 16 5
[3] 24 31 8
> hits <- findOverlaps(ir, ir2) > rmap <- IntegerList(split(queryHits(hits), subjectHits(hits))) > rmap
CompressedIntegerList of length 3 [["1"]] 3 4 [["2"]] 5 [["3"]] 1 2 The "direct mapping" is a simpler object (just an integer vector) but it has the length of the original object so it cannot be stored in the elementMetadata of the reduced object. It has to go somewhere else (attribute or metadata). A good thing about storing the "reverse mapping" in the elementMetadata slot is that there is no risk of "getting in the way" (i.e. clash with other stuff in that slot) because reduce() drops the original elementMetadata anyway. It also plays nice with subsetting the reduced object and with the vectorized behaviour of the "reduce" method for GRangesList. The only small con I see is that, like in the use case you show in your original post, the user might actually need the direct mapping, not the reverse one. However it's not too hard to reverse a mapping. There is actually a function in Biobase for doing this on a regular list:
> as.integer(reverseSplit(as.list(rmap)))
[1] 3 3 1 1 2
> findOverlaps(ir, reduce(ir), select="first")
[1] 3 3 1 1 2 There is also revmap() in AnnotationDbi that does the same on Bimap objects. It also works on a regular list:
> as.integer(revmap(as.list(rmap)))
[1] 3 3 1 1 2 We could move the revmap() generic from AnnotationDbi to IRanges (or to BiocGenerics) and add a method for IntegerList objects. I think this is actually my preferred solution so far.
This sounds perfect to me and absolutely handles the mapping problem
Back to your question: Would such a solution also allow to keep the original elementMetadata in the respective list representation? Something like "folding" the original elementMetadata in a way that makes it fit in the elementMetadata of the reduced object? The original elementMetadata (a DataFrame) could actually be split in a SplitDataFrameList (i.e. conceptually a list of DataFrame's that all have the same columns), and this SplitDataFrameList stored in the elementMetadata of the reduced object. It would be stored as a single column though since elementMetadata must be DataFrame, it cannot be a SplitDataFrameList. That solution feels "heavy" to me i.e. it involves complex data structures that are not so easy to manipulate. I still like Michael's proposal better, it's much lighter.
The reason I was asking this is that with the mapping I would actually split the old elementMetadata and do some processing on it. So in addition to being able to add the mapping I thought it would be kind of useful to retain the original metadata in a "folded" structure similar to the IntegerList approach Michael is proposing. Essentially one would collapse each of the existing elementMetadata columns in a list representation. At first I found it a bit surprising that all my elementMetadata was gone after reduce, but after some thinking it actually makes sense, since the original ranges are gone, too, and having a reduced/folded elementMetadata slot in the default output would be very confusing. I was just thinking to have this as an option, mainly because I can see this as a general use case, and It would make subsequent operations a little more coherent compared to having a folded elementMetadata representation and the reduced ranges in two separate objects.
Cheers, H.
I assume that creates about the same overhead as keeping the index? Florian Hahne Novartis Institute For Biomedical Research Translational Sciences / Preclinical Safety / PCS Informatics Expert Data Integration and Modeling Bioinformatics CHBS, WKL-135.2.26 Novartis Institute For Biomedical Research, Werk Klybeck Klybeckstrasse 141 CH-4057 Basel Switzerland Phone: +41 61 6967127 Email : florian.hahne at novartis.com On 3/15/12 11:58 PM, "Michael Lawrence"<lawrence.michael at gene.com> wrote:
I would be in favor of either the attribute or metadata solution. I could see having an IntegerList element in the element metadata that indicates the original ranges that were reduced into the returned range, or a Hits object in the top-level metadata. A plus and minus of the metadata approach is that it is more familiar to the users than hiding stuff in attributes, which is pretty low-level. However, using the metadata will increase the probabilty of "getting in the way". The user does need to explicitly request it though. Michael On Thu, Mar 15, 2012 at 3:26 PM, Herv? Pag?s<hpages at fhcrc.org> wrote:
On 03/15/2012 02:40 PM, Kasper Daniel Hansen wrote:
I'll vote against the attribute solution and for a solution where the type of return object gets changed, for example into a list.
Thanks for voting!
Problem with this is how you handle 'with.mapping=TRUE' when the input
is GRangesList. Do you return
(1) a list of the same length as the input 'x', where the i-th
top-level element is itself the 2-element list returned
by reduce(x[[i]], with.mapping=TRUE)
(2) or a 2-element list where 1 element is the reduced GRangesList
and the other element is an IntegerList representing the
list of mappings?
(1) would be very inefficient because the returned object would need
to be populated with hundreds of thousands of S4 instances.
(2) disrupts too much how reduce() is expected to behave on a
GRangesList object i.e. it's expected to operate in a "vectorized"
fashion, that is, each top-level element in the input is reduced
independently of the others and all the results are stored in a
list-like object of the *same length* as the input. So we have nice
properties like
reduce(x, ...)[[i]] is identical to reduce(x[[i]], ...)
Here that would not be the case anymore :-/
More generally speaking, I would not give up on the "endomorphism"
nature of reduce() so easily. It gives us good things like for
example its behaviour on a GRangesList object can be explained
as easily as with
endoapply(x, reduce, ....)
*whatever* arguments/parameters/toggles are passed to it. This
makes the documentation *much* easier to write and also it makes
writing unit test much easier.
So if we really want to go for the list solution, I would suggest
that this is done outside reduce() e.g. in reduceAndMap() or
something like that.
Cheers,
H.
Kasper 2012/3/15 Herv? Pag?s<hpages at fhcrc.org>:
On 03/15/2012 12:45 PM, Cook, Malcolm wrote:
Hi Herve, I've not used attributes to return values before. I guess it would work, and I won't object further if you do it this way, but, since you asked Again, it "feels wrong" in violating functional I suspect there may be issues with memory management. When does the attribute get gc-ed? When the object does? If so, then, retaining the attribute in memory when not needed _could_ be a burden, no? Back in my lisp days, this is when I would use `values` and `multiple-value-bind` (and friends) when I wanted a function to (optionally) return multiple values. But this is R. Would you consider returning instead a list of values, keyed by `value` and `hits`, but only when with.hits BTW: with.inframe.attrib is documented as 'For internal use'. What does it return in the attr?
AFAIK, it's only supported by the "reduce" methods for IRanges objects. The "inframe" attribute contains an IRanges object of the same length as the input. For each range in the input it tells you the position of that range with respect to the "frame" i.e. the space obtained by pasting together the ranges in the reduce object:
> ir
IRanges of length 5
start end width
[1] 24 28 5
[2] 27 31 5
[3] 1 5 5
[4] 6 10 5
[5] 12 16 5
> ir2<- reduce(ir, with.inframe.attrib=TRUE) > ir2
IRanges of length 3
start end width
[1] 1 10 10
[2] 12 16 5
[3] 24 31 8
> attr(ir2, "inframe")
IRanges of length 5
start end width
[1] 16 20 5
[2] 19 23 5
[3] 1 5 5
[4] 6 10 5
[5] 11 15 5
1 1 2 2 3
1...5....0....5....0....5....**0.<- standard coordinate
system
ir[1] xxxxx
ir[2] xxxxx
ir[3] xxxxx
ir[4] xxxxx
ir[5] xxxxx
ir2: xxxxxxxxxx xxxxx xxxxxxxx
1...5....1 ....1 ....2...<- "frame" coordinate system
0 5 0
I'll document this.
H.
Thanks for listening! ~Malcolm -----Original Message-----
From: bioc-devel-bounces at r-project.**org<bioc-devel-bounces at r-project.org
[
mailto:
bioc-devel-bounces at r-
project.org] On Behalf Of Herv? Pag?s
Sent: Thursday, March 15, 2012 1:55 PM
To: Kasper Daniel Hansen
Cc: bioc-devel at r-project.org
Subject: Re: [Bioc-devel] mapping between original and reduced
ranges
Hi reducers,
I agree it "feels wrong" to use findOverlaps() to extract the
mapping
from original to reduced ranges. Even if it can be computed very
easily
with:
findOverlaps(gr, reduce(gr), select="first")
(Note that using 'queryHits(findOverlaps(**reduce(gr), gr))' only
produces
the correct result if 'gr' is already sorted by increasing order.)
I think it would be easy for reduce() internal code to produce
this
mapping. The question is: how do we give it back to the user?
Is it OK to use an attribute for this? reduce() already uses this
for returning some extra information about the reduction:
> ir
IRanges of length 5
start end width
[1] 1 5 5
[2] 6 10 5
[3] 12 16 5
[4] 24 28 5
[5] 27 31 5
> ir2<- reduce(ir, with.inframe.attrib=TRUE)
> ir2
IRanges of length 3
start end width
[1] 1 10 10
[2] 12 16 5
[3] 24 31 8
> attr(ir2, "inframe")
IRanges of length 5
start end width
[1] 1 5 5
[2] 6 10 5
[3] 11 15 5
[4] 16 20 5
[5] 19 23 5
We could to the same thing for the mapping from original to
reduced
ranges with e.g. an argument called 'with.mapping.attrib'.
Would that work?
Cheers,
H.
On 03/15/2012 05:44 AM, Kasper Daniel Hansen wrote:
So the key question is to what extent keeping track of where the ranges comes from would slow down the reduce operation. I am not familiar enough with the algorithm to know this, but given how fast IRanges is in general, I am not one for guessing on this. I agree with Florian that this is a very typical use case. Kasper On Thu, Mar 15, 2012 at 5:02 AM, Hahne, Florian <florian.hahne at novartis.com> wrote:
Hi all, It is true that this is not terribly slow when you deal with fairly large range objects: foo<- GRanges(seqnames=sample(1:4, 1e6, TRUE), ranges=IRanges(start=as.**integer(runif(min=1, max=1e7, n=1e6)),
width=50))
system.time(bar<- reduce(foo))
user system elapsed
0.918 0.174 1.091
system.time(foobar<- findOverlaps(foo, bar))
user system elapsed
2.051 0.402 2.453
However the whole process does take about 3x the time of just
the
reduce
operation, and in my use case I want this to happen interactively, where waiting 3 seconds compared to 1 makes a huge difference... I wouldn't push this high up on the development agenda, but it seems to
be
something that is already 95% existing and could easily be added. But maybe I am wrong... Florian Florian Hahne Novartis Institute For Biomedical Research Translational Sciences / Preclinical Safety / PCS Informatics Expert Data Integration and Modeling Bioinformatics CHBS, WKL-135.2.26 Novartis Institute For Biomedical Research, Werk Klybeck Klybeckstrasse 141 CH-4057 Basel Switzerland Phone: +41 61 6967127 Email : florian.hahne at novartis.com On 3/14/12 9:40 PM, "Kasper Daniel
Hansen"<kasperdanielhansen@**gmail.com <kasperdanielhansen at gmail.com>>
wrote: We have discussed this a couple of times. I routinely uses the
reduce followed by findOverlaps paradigm. As Malcolm says it feels wrong, but from a practical point of view it is pretty fast, so I stopped worrying about it. I only think there is a reason to do this, if it is substantially faster. Kasper On Wed, Mar 14, 2012 at 3:46 PM, Cook, Malcolm<MEC at stowers.org>
wrote:
Chiming in.... on a similar note.... A version of `disjoin` which returns a Hits/RangesMapping additional to the GRanges result would be most useful and probably not require
much
additional effort (assuming `disjoin` computes this internally) Of course, it is easy to live without since I can just perform the findOverlaps myself after the disjoin.... it just "feels wrong" (tm) Ahoy! ~Malcolm -----Original Message-----
From: bioc-devel-bounces at r-project.**org<bioc-devel-bounces at r-projec t. org>[mailto: bioc-devel-
bounces at r-
project.org] On Behalf Of Hahne, Florian Sent: Wednesday, March 14, 2012 2:22 PM To: bioc-devel at r-project.org Subject: [Bioc-devel] mapping between original and reduced ranges This bounced before, guess the mailing list does not like HTML mails. So one more try: I had the following offline discussion with Michael about how one
could
retain a mapping of the ranges in a GRanges object before and after reduce. He suggested to take it to the list. Is that something that could be added to GenomicRanges/IRanges? Florian I have a slightly tricky application for which I need to reduce a GRanges object, but I would like to be able to process some of the original elementMetadata of the merged ranges later. The only way I was
able to
figure out which of the original ranges correspond to the merged
ranges
was to perform a findOverlaps operation, but of course that is rather costly. Is there a way to get the merge information out of the original reduce call? Here is a brief example: gr<- GRanges(seqnames="chr1",
ranges=IRanges(start=c(1,6,12,**24,27),
width=5), foo=1:5, bar=letters[1:5]) gr2<- reduce(gr, min.gapwidth=1) ind<- queryHits(findOverlaps(gr2, gr)) split(values(gr), ind) Unfortunately, this is the idiom. I could see an improvement where reduce or a similarly named function would return a Hits object (in addition to the actual reduce result) that would indicate the mapping between
the
input and reduced ranges. The RangesMapping structure would be
really
close to what we would need. Michael
______________________________**_________________ Bioc-devel at r-project.org mailing list https://stat.ethz.ch/mailman/**listinfo/bioc-devel<https://sta t. ethz.ch/mailman/listinfo/bioc-devel>
______________________________**_________________ Bioc-devel at r-project.org mailing list https://stat.ethz.ch/mailman/**listinfo/bioc-devel<https://stat .e thz.ch/mailman/listinfo/bioc-devel>
______________________________**_________________ Bioc-devel at r-project.org mailing list https://stat.ethz.ch/mailman/**listinfo/bioc-devel<https://stat.et hz .ch/mailman/listinfo/bioc-devel>
-- Herv? Pag?s Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319
______________________________**_________________ Bioc-devel at r-project.org mailing list https://stat.ethz.ch/mailman/**listinfo/bioc-devel<https://stat.eth z. ch/mailman/listinfo/bioc-devel>
-- Herv? Pag?s Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319
-- Herv? Pag?s Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319
______________________________**_________________ Bioc-devel at r-project.org mailing list https://stat.ethz.ch/mailman/**listinfo/bioc-devel<https://stat.ethz.ch /m ailman/listinfo/bioc-devel>
[[alternative HTML version deleted]]
_______________________________________________ Bioc-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
-- Herv? Pag?s Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319