Hello. I am planning to develop a new package which extends the GenomicInteractions package. I would like some help/advice on implementing the following functionality.
Consider the follow GenomicInteractions object
GenomicInteractions object with 10 interactions and 1 metadata column:
seqnames1 ranges1 seqnames2 ranges2 | counts
<Rle> <IRanges> <Rle> <IRanges> | <integer>
[1] chrA 1-2 --- chrA 9-10 | 1
[2] chrA 1-2 --- chrA 15-16 | 1
[3] chrA 3-4 --- chrA 3-4 | 1
[4] chrA 5-6 --- chrA 7-8 | 1
[5] chrA 5-6 --- chrA 9-10 | 1
[6] chrA 7-8 --- chrA 7-8 | 1
[7] chrA 7-8 --- chrA 11-12 | 1
[8] chrA 7-8 --- chrA 17-18 | 1
[9] chrA 9-10 --- chrA 9-10 | 1
[10] chrA 9-10 --- chrA 15-16 | 1
-------
regions: 8 ranges and 0 metadata columns
seqinfo: 1 sequence from an unspecified genome; no seqlengths
Which is visually represented thusly
I would like to do the following:
1) I want to group the regions into bins of WxW (in this case, W will be 3), as in a quad-tree structure <https://en.wikipedia.org/wiki/Quadtree> with the final group being WxW (instead of 2x2). This will involve
- iteratively dividing the matrix into quadrants {upper-left (0), upper-right (1), lower-left (2), lower-right(3)} .
- labeling each subdivision in a new column until the final WxW resolution is reached.
- sorting by the columns
GenomicInteractions object with 10 interactions and 1 metadata column:
seqnames1 ranges1 seqnames2 ranges2 | counts quad1 quad2
<Rle> <IRanges> <Rle> <IRanges> | <integer> <integer> <integer>
[1] chrA 1-2 --- chrA 9-10 | 1 0 1
[2] chrA 1-2 --- chrA 15-16 | 1 1 0
[3] chrA 3-4 --- chrA 3-4 | 1 0 0
[4] chrA 5-6 --- chrA 7-8 | 1 0 1
[5] chrA 5-6 --- chrA 9-10 | 1 0 1
[6] chrA 7-8 --- chrA 7-8 | 1 0 3
[7] chrA 7-8 --- chrA 11-12 | 1 0 3
[8] chrA 7-8 --- chrA 17-18 | 1 1 2
[9] chrA 9-10 --- chrA 9-10 | 1 0 3
[10] chrA 9-10 --- chrA 15-16 | 1 1 2
-------
regions: 8 ranges and 0 metadata columns
seqinfo: 1 sequence from an unspecified genome; no seqlengths
Sorting by the two columns yields what I am after. Of course, I include the ?quadX? column for illustration only. Upon implementation, I would like these columns hidden from the user.
GenomicInteractions object with 10 interactions and 1 metadata column:
seqnames1 ranges1 seqnames2 ranges2 | counts quad1 quad2
<Rle> <IRanges> <Rle> <IRanges> | <integer> <integer> <integer>
[1] chrA 3-4 --- chrA 3-4 | 1 0 0
[2] chrA 1-2 --- chrA 9-10 | 1 0 1
[3] chrA 5-6 --- chrA 7-8 | 1 0 1
[4] chrA 5-6 --- chrA 9-10 | 1 0 1
[5] chrA 7-8 --- chrA 7-8 | 1 0 3
[6] chrA 7-8 --- chrA 11-12 | 1 0 3
[7] chrA 9-10 --- chrA 9-10 | 1 0 3
[8] chrA 1-2 --- chrA 15-16 | 1 1 0
[9] chrA 7-8 --- chrA 17-18 | 1 1 2
[10] chrA 9-10 --- chrA 15-16 | 1 1 2
-------
regions: 8 ranges and 0 metadata columns
seqinfo: 1 sequence from an unspecified genome; no seqlengths
The sorting gives me the quad-tree structure, and each unique quadrant sequence defines the group.
GenomicInteractions object with 10 interactions and 1 metadata column:
seqnames1 ranges1 seqnames2 ranges2 | counts
<Rle> <IRanges> <Rle> <IRanges> | <integer>
[1] chrA 3-4 --- chrA 3-4 | 1
[2] chrA 1-2 --- chrA 9-10 | 1
[3] chrA 5-6 --- chrA 7-8 | 1
[4] chrA 5-6 --- chrA 9-10 | 1
[5] chrA 7-8 --- chrA 7-8 | 1
[6] chrA 7-8 --- chrA 11-12 | 1
[7] chrA 9-10 --- chrA 9-10 | 1
[8] chrA 1-2 --- chrA 15-16 | 1
[9] chrA 7-8 --- chrA 17-18 | 1
[10] chrA 9-10 --- chrA 15-16 | 1
-------
regions: 8 ranges and 0 metadata columns
seqinfo: 1 sequence from an unspecified genome; no seqlengths
2) Then I would like to merge the WxW window (i.e. bin the regions), expanding the ranges accordingly and adding the counts.. This process will
- ***identify all range-pairs in the same window and merge them into a new range pair with appropriately expanded ranges*** (this is my primary goal)
- sum the counts for each of the aforementioned range-pairs (i have already figured a way to do this)
GenomicInteractions object with 5 interactions and 1 metadata column:
seqnames1 ranges1 seqnames2 ranges2 | counts
<Rle> <IRanges> <Rle> <IRanges> | <integer>
[1] chrA 1-6 --- chrA 1-6 | 1
[2] chrA 1-6 --- chrA 7-12 | 3
[3] chrA 7-12 --- chrA 7-12 | 3
[4] chrA 1-6 --- chrA 13-18 | 1
[5] chrA 7-12 --- chrA 13-18 | 2
-------
regions: 3 ranges and 0 metadata columns
seqinfo: 1 sequence from an unspecified genome; no seqlengths
NOTE that ranges1 and ranges2 MUST expand so that the region width is 6, though the counts will only change if there exists another subrange covered by this bin/expansion that contains a positive count.
As always, speed in a concern.
Best,
? Luke Klein
PhD Student
Department of Statistics
University of California, Riverside
lklei001 at ucr.edu
[Bioc-devel] Merging GInteraction/GenomicInteractions ranges
6 messages · Luke Klein, Shian Su, Liz Ing-Simmons +1 more
Note that your visual won't show up for many (all?) of us. Nonetheless, I think I know what you want to do. Your task does not lend itself to vectorization, which makes it difficult to write efficient R code. It's not impossible, but it would be quite hard to read and debug, and your maintenance programmer will be cursing you somewhere down the line. If speed is truly a concern, I would write this code in C++. This would probably be several lines' worth of code: 1. Compute a pair of bin IDs for each interaction by dividing each anchor coordinate by the bin width and truncating the result. (You'll need to decide if you want to use the midpoint/start/end, etc.) 2. Sort the interactions by the paired bin IDs, e.g., with std::sort. 3. Identify each "run" of interactions with the same paired IDs. 4. Repeat step 1 within each run (you'll need to offset the anchor coordinate before dividing this time). Append the current quadrant to the quadrant sequence for return to R at the end of recursion. Clear, concise, and can be slapped together in less than half an hour with Rcpp and C++11, if you know what you're doing. -A
On Tue, 2019-02-12 at 11:34 -0800, Luke Klein wrote:
Hello.??I am planning to develop a new package which extends the GenomicInteractions package.??I would like some help/advice on implementing the following functionality. Consider the follow GenomicInteractions object GenomicInteractions object with 10 interactions and 1 metadata column: ???????seqnames1???ranges1?????seqnames2???ranges2 |????counts ???????????<Rle> <IRanges>?????????<Rle> <IRanges> | <integer> ???[1]??????chrA???????1-2 ---??????chrA??????9-10 |?????????1 ???[2]??????chrA???????1-2 ---??????chrA?????15-16 |?????????1 ???[3]??????chrA???????3-4 ---??????chrA???????3-4 |?????????1 ???[4]??????chrA???????5-6 ---??????chrA???????7-8 |?????????1 ???[5]??????chrA???????5-6 ---??????chrA??????9-10 |?????????1 ???[6]??????chrA???????7-8 ---??????chrA???????7-8 |?????????1 ???[7]??????chrA???????7-8 ---??????chrA?????11-12 |?????????1 ???[8]??????chrA???????7-8 ---??????chrA?????17-18 |?????????1 ???[9]??????chrA??????9-10 ---??????chrA??????9-10 |?????????1 ? [10]??????chrA??????9-10 ---??????chrA?????15-16 |?????????1 ? ------- ? regions: 8 ranges and 0 metadata columns ? seqinfo: 1 sequence from an unspecified genome; no seqlengths Which is visually represented thusly I would like to do the following: 1) I want to group the regions into bins of WxW (in this case, W will be 3), as in a quad-tree structure <https://en.wikipedia.org/wiki/Qua dtree> with the final group being WxW (instead of 2x2).??This will involve? - iteratively dividing the matrix into quadrants {upper-left (0), upper-right (1), lower-left (2), lower-right(3)} . - labeling each subdivision in a new column until the final WxW resolution is reached. - sorting by the columns GenomicInteractions object with 10 interactions and 1 metadata column: ???????seqnames1???ranges1?????seqnames2???ranges2 |????counts?????quad1?????quad2 ???????????<Rle> <IRanges>?????????<Rle> <IRanges> | <integer> <integer> <integer> ???[1]??????chrA???????1-2 ---??????chrA??????9-10 |?????????1 ???????0 ?1 ???[2]??????chrA???????1-2 ---??????chrA?????15-16 |?????????1 ???????1 ?0 ???[3]??????chrA???????3-4 ---??????chrA???????3-4 |?????????1 ???????0 ?0 ???[4]??????chrA???????5-6 ---??????chrA???????7-8 |?????????1 ???????0 ?1 ???[5]??????chrA???????5-6 ---??????chrA??????9-10 |?????????1 ???????0 ?1 ???[6]??????chrA???????7-8 ---??????chrA???????7-8 |?????????1 ???????0 ?3 ? ???[7]??????chrA???????7-8 ---??????chrA?????11-12 |?????????1 ???????0 ?3 ???[8]??????chrA???????7-8 ---??????chrA?????17-18 |?????????1 ???????1 ?2 ???[9]??????chrA??????9-10 ---??????chrA??????9-10 |?????????1 ???????0 ?3 ? [10]??????chrA??????9-10 ---??????chrA?????15-16 |?????????1 ???????1 ?2 ? ------- ? regions: 8 ranges and 0 metadata columns ? seqinfo: 1 sequence from an unspecified genome; no seqlengths Sorting by the two columns yields what I am after.??Of course, I include the ?quadX? column for illustration only.??Upon implementation, I would like these columns hidden from the user. GenomicInteractions object with 10 interactions and 1 metadata column: ???????seqnames1???ranges1?????seqnames2???ranges2 |????counts?????quad1?????quad2 ???????????<Rle> <IRanges>?????????<Rle> <IRanges> | <integer> <integer> <integer> ???[1]??????chrA???????3-4 ---??????chrA???????3-4 |?????????1 ???????0 ?0 ???[2]??????chrA???????1-2 ---??????chrA??????9-10 |?????????1 ???????0 ?1 ???[3]??????chrA???????5-6 ---??????chrA???????7-8 |?????????1 ???????0 ?1 ???[4]??????chrA???????5-6 ---??????chrA??????9-10 |?????????1 ???????0 ?1 ???[5]??????chrA???????7-8 ---??????chrA???????7-8 |?????????1 ???????0 ?3 ???[6]??????chrA???????7-8 ---??????chrA?????11-12 |?????????1 ???????0 ?3 ? ???[7]??????chrA??????9-10 ---??????chrA??????9-10 |?????????1 ???????0 ?3 ???[8]??????chrA???????1-2 ---??????chrA?????15-16 |?????????1 ???????1 ?0 ???[9]??????chrA???????7-8 ---??????chrA?????17-18 |?????????1 ???????1 ?2 ? [10]??????chrA??????9-10 ---??????chrA?????15-16 |?????????1 ???????1 ?2 ? ------- ? regions: 8 ranges and 0 metadata columns ? seqinfo: 1 sequence from an unspecified genome; no seqlengths The sorting gives me the quad-tree structure, and each unique quadrant sequence defines the group. ?? GenomicInteractions object with 10 interactions and 1 metadata column: ???????seqnames1???ranges1?????seqnames2???ranges2 |????counts ???????????<Rle> <IRanges>?????????<Rle> <IRanges> | <integer> ???[1]??????chrA???????3-4 ---??????chrA???????3-4 |?????????1 ???[2]??????chrA???????1-2 ---??????chrA??????9-10 |?????????1 ???[3]??????chrA???????5-6 ---??????chrA???????7-8 |?????????1 ???[4]??????chrA???????5-6 ---??????chrA??????9-10 |?????????1 ???[5]??????chrA???????7-8 ---??????chrA???????7-8 |?????????1 ???[6]??????chrA???????7-8 ---??????chrA?????11-12 |?????????1 ???[7]??????chrA??????9-10 ---??????chrA??????9-10 |?????????1 ???[8]??????chrA???????1-2 ---??????chrA?????15-16 |?????????1 ???[9]??????chrA???????7-8 ---??????chrA?????17-18 |?????????1 ? [10]??????chrA??????9-10 ---??????chrA?????15-16 |?????????1 ? ------- ? regions: 8 ranges and 0 metadata columns ? seqinfo: 1 sequence from an unspecified genome; no seqlengths 2) Then I would like to merge the WxW window (i.e. bin the regions), expanding the ranges accordingly and adding the counts..??This process will - ***identify all range-pairs in the same window and merge them into a new range pair with appropriately expanded ranges*** (this is my primary goal) - sum the counts for each of the aforementioned range-pairs (i have already figured a way to do this) GenomicInteractions object with 5 interactions and 1 metadata column: ???????seqnames1???ranges1?????seqnames2???ranges2 |????counts ???????????<Rle> <IRanges>?????????<Rle> <IRanges> | <integer> ???[1]??????chrA???????1-6??---??????chrA??????1-6??|?????????1 ???[2]??????chrA???????1-6??---??????chrA??????7-12 |?????????3 ???[3]??????chrA???????7-12 ---??????chrA??????7-12 |?????????3 ???[4]??????chrA???????1-6??---??????chrA?????13-18 |?????????1 ???[5]??????chrA???????7-12 ---??????chrA?????13-18 |?????????2 ? ------- ? regions: 3 ranges and 0 metadata columns ? seqinfo: 1 sequence from an unspecified genome; no seqlengths NOTE that ranges1 and ranges2 MUST expand so that the region width is 6, though the counts will only change if there exists another subrange covered by this bin/expansion that contains a positive count. As always, speed in a concern. Best, ? Luke Klein ????PhD Student ????Department of Statistics ????University of California, Riverside ????lklei001 at ucr.edu
_______________________________________________ Bioc-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
1 day later
Hi Luke, I also don?t see the visual representation. It sounds like you want to summarise over disjoint boxes in the interaction heat map. It also sounds like your strategy is to use quad-trees, truncated at a certain depth with all the children accumulated to that depth. If so then what happens if you have a region that crosses two quads of the level you try to summarise at? What about the pathological case of a box right in the middle of your matrix? Do you have a readily available implementation of a rectangle quad-tree? This is more complicated than the point quad-tree, significantly more so if it?s to be cache-efficient and parallel. I would suggest you start with naive box intersections, it?s trivial to compute each summarising box in parallel for speed. If this approach is too slow then maybe you can have a try at the quad-tree. After all, you?ll want to compare the quad-tree to something to actually make sure it?s ?fast?. Kind regards, Shian
On 13 Feb 2019, at 10:34 pm, Aaron Lun <infinite.monkeys.with.keyboards at gmail.com> wrote: Note that your visual won't show up for many (all?) of us. Nonetheless, I think I know what you want to do. Your task does not lend itself to vectorization, which makes it difficult to write efficient R code. It's not impossible, but it would be quite hard to read and debug, and your maintenance programmer will be cursing you somewhere down the line. If speed is truly a concern, I would write this code in C++. This would probably be several lines' worth of code: 1. Compute a pair of bin IDs for each interaction by dividing each anchor coordinate by the bin width and truncating the result. (You'll need to decide if you want to use the midpoint/start/end, etc.) 2. Sort the interactions by the paired bin IDs, e.g., with std::sort. 3. Identify each "run" of interactions with the same paired IDs. 4. Repeat step 1 within each run (you'll need to offset the anchor coordinate before dividing this time). Append the current quadrant to the quadrant sequence for return to R at the end of recursion. Clear, concise, and can be slapped together in less than half an hour with Rcpp and C++11, if you know what you're doing. -A On Tue, 2019-02-12 at 11:34 -0800, Luke Klein wrote:
Hello. I am planning to develop a new package which extends the
GenomicInteractions package. I would like some help/advice on
implementing the following functionality.
Consider the follow GenomicInteractions object
GenomicInteractions object with 10 interactions and 1 metadata
column:
seqnames1 ranges1 seqnames2 ranges2 | counts
<Rle> <IRanges> <Rle> <IRanges> | <integer>
[1] chrA 1-2 --- chrA 9-10 | 1
[2] chrA 1-2 --- chrA 15-16 | 1
[3] chrA 3-4 --- chrA 3-4 | 1
[4] chrA 5-6 --- chrA 7-8 | 1
[5] chrA 5-6 --- chrA 9-10 | 1
[6] chrA 7-8 --- chrA 7-8 | 1
[7] chrA 7-8 --- chrA 11-12 | 1
[8] chrA 7-8 --- chrA 17-18 | 1
[9] chrA 9-10 --- chrA 9-10 | 1
[10] chrA 9-10 --- chrA 15-16 | 1
-------
regions: 8 ranges and 0 metadata columns
seqinfo: 1 sequence from an unspecified genome; no seqlengths
Which is visually represented thusly
I would like to do the following:
1) I want to group the regions into bins of WxW (in this case, W will
be 3), as in a quad-tree structure <https://en.wikipedia.org/wiki/Qua
dtree> with the final group being WxW (instead of 2x2). This will
involve
- iteratively dividing the matrix into quadrants {upper-left
(0), upper-right (1), lower-left (2), lower-right(3)} .
- labeling each subdivision in a new column until the final WxW
resolution is reached.
- sorting by the columns
GenomicInteractions object with 10 interactions and 1 metadata
column:
seqnames1 ranges1 seqnames2 ranges2
| counts quad1 quad2
<Rle> <IRanges> <Rle> <IRanges> | <integer>
<integer> <integer>
[1] chrA 1-2 --- chrA 9-10 | 1
0 1
[2] chrA 1-2 --- chrA 15-16 | 1
1 0
[3] chrA 3-4 --- chrA 3-4 | 1
0 0
[4] chrA 5-6 --- chrA 7-8 | 1
0 1
[5] chrA 5-6 --- chrA 9-10 | 1
0 1
[6] chrA 7-8 --- chrA 7-8 | 1
0 3
[7] chrA 7-8 --- chrA 11-12 | 1
0 3
[8] chrA 7-8 --- chrA 17-18 | 1
1 2
[9] chrA 9-10 --- chrA 9-10 | 1
0 3
[10] chrA 9-10 --- chrA 15-16 | 1
1 2
-------
regions: 8 ranges and 0 metadata columns
seqinfo: 1 sequence from an unspecified genome; no seqlengths
Sorting by the two columns yields what I am after. Of course, I
include the ?quadX? column for illustration only. Upon
implementation, I would like these columns hidden from the user.
GenomicInteractions object with 10 interactions and 1 metadata
column:
seqnames1 ranges1 seqnames2 ranges2
| counts quad1 quad2
<Rle> <IRanges> <Rle> <IRanges> | <integer>
<integer> <integer>
[1] chrA 3-4 --- chrA 3-4 | 1
0 0
[2] chrA 1-2 --- chrA 9-10 | 1
0 1
[3] chrA 5-6 --- chrA 7-8 | 1
0 1
[4] chrA 5-6 --- chrA 9-10 | 1
0 1
[5] chrA 7-8 --- chrA 7-8 | 1
0 3
[6] chrA 7-8 --- chrA 11-12 | 1
0 3
[7] chrA 9-10 --- chrA 9-10 | 1
0 3
[8] chrA 1-2 --- chrA 15-16 | 1
1 0
[9] chrA 7-8 --- chrA 17-18 | 1
1 2
[10] chrA 9-10 --- chrA 15-16 | 1
1 2
-------
regions: 8 ranges and 0 metadata columns
seqinfo: 1 sequence from an unspecified genome; no seqlengths
The sorting gives me the quad-tree structure, and each unique
quadrant sequence defines the group.
GenomicInteractions object with 10 interactions and 1 metadata
column:
seqnames1 ranges1 seqnames2 ranges2 | counts
<Rle> <IRanges> <Rle> <IRanges> | <integer>
[1] chrA 3-4 --- chrA 3-4 | 1
[2] chrA 1-2 --- chrA 9-10 | 1
[3] chrA 5-6 --- chrA 7-8 | 1
[4] chrA 5-6 --- chrA 9-10 | 1
[5] chrA 7-8 --- chrA 7-8 | 1
[6] chrA 7-8 --- chrA 11-12 | 1
[7] chrA 9-10 --- chrA 9-10 | 1
[8] chrA 1-2 --- chrA 15-16 | 1
[9] chrA 7-8 --- chrA 17-18 | 1
[10] chrA 9-10 --- chrA 15-16 | 1
-------
regions: 8 ranges and 0 metadata columns
seqinfo: 1 sequence from an unspecified genome; no seqlengths
2) Then I would like to merge the WxW window (i.e. bin the regions),
expanding the ranges accordingly and adding the counts.. This
process will
- ***identify all range-pairs in the same window and merge them
into a new range pair with appropriately expanded ranges*** (this is
my primary goal)
- sum the counts for each of the aforementioned range-pairs (i
have already figured a way to do this)
GenomicInteractions object with 5 interactions and 1 metadata column:
seqnames1 ranges1 seqnames2 ranges2 | counts
<Rle> <IRanges> <Rle> <IRanges> | <integer>
[1] chrA 1-6 --- chrA 1-6 | 1
[2] chrA 1-6 --- chrA 7-12 | 3
[3] chrA 7-12 --- chrA 7-12 | 3
[4] chrA 1-6 --- chrA 13-18 | 1
[5] chrA 7-12 --- chrA 13-18 | 2
-------
regions: 3 ranges and 0 metadata columns
seqinfo: 1 sequence from an unspecified genome; no seqlengths
NOTE that ranges1 and ranges2 MUST expand so that the region width is
6, though the counts will only change if there exists another
subrange covered by this bin/expansion that contains a positive
count.
As always, speed in a concern.
Best,
? Luke Klein
PhD Student
Department of Statistics
University of California, Riverside
lklei001 at ucr.edu
_______________________________________________ Bioc-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
_______________________________________________ Bioc-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
_______________________________________________ The information in this email is confidential and intended solely for the addressee. You must not disclose, forward, print or use it without the permission of the sender. The Walter and Eliza Hall Institute acknowledges the Wurundjeri people of the Kulin Nation as the traditional owners of the land where our campuses are located and the continuing connection to country and community. _______________________________________________
4 days later
Thank you for the tip, Aaron! I?m working on the sorting step right now.
One thing I?m not clear on is how to expand the ranges from one step to the next. The way GenomicInteractions are structured, there is a Granges object with all possible ranges, and the GInteractions object is populated by reference to said interactions.
What I am going to need is a new GRanges object with the new set of (expanded) ranges, and a way to map the prior ranges to the new, wider range.
For your edification, I?ve included the images in this email (which is addressed directly to you) so that you can see what I?d written in my first question.
Best,
? Luke Klein
PhD Student
Department of Statistics
University of California, Riverside
lklei001 at ucr.edu
On Feb 13, 2019, at 3:34 AM, Aaron Lun <infinite.monkeys.with.keyboards at gmail.com> wrote: Note that your visual won't show up for many (all?) of us. Nonetheless, I think I know what you want to do. Your task does not lend itself to vectorization, which makes it difficult to write efficient R code. It's not impossible, but it would be quite hard to read and debug, and your maintenance programmer will be cursing you somewhere down the line. If speed is truly a concern, I would write this code in C++. This would probably be several lines' worth of code: 1. Compute a pair of bin IDs for each interaction by dividing each anchor coordinate by the bin width and truncating the result. (You'll need to decide if you want to use the midpoint/start/end, etc.) 2. Sort the interactions by the paired bin IDs, e.g., with std::sort. 3. Identify each "run" of interactions with the same paired IDs. 4. Repeat step 1 within each run (you'll need to offset the anchor coordinate before dividing this time). Append the current quadrant to the quadrant sequence for return to R at the end of recursion. Clear, concise, and can be slapped together in less than half an hour with Rcpp and C++11, if you know what you're doing. -A On Tue, 2019-02-12 at 11:34 -0800, Luke Klein wrote:
Hello. I am planning to develop a new package which extends the
GenomicInteractions package. I would like some help/advice on
implementing the following functionality.
Consider the follow GenomicInteractions object
GenomicInteractions object with 10 interactions and 1 metadata
column:
seqnames1 ranges1 seqnames2 ranges2 | counts
<Rle> <IRanges> <Rle> <IRanges> | <integer>
[1] chrA 1-2 --- chrA 9-10 | 1
[2] chrA 1-2 --- chrA 15-16 | 1
[3] chrA 3-4 --- chrA 3-4 | 1
[4] chrA 5-6 --- chrA 7-8 | 1
[5] chrA 5-6 --- chrA 9-10 | 1
[6] chrA 7-8 --- chrA 7-8 | 1
[7] chrA 7-8 --- chrA 11-12 | 1
[8] chrA 7-8 --- chrA 17-18 | 1
[9] chrA 9-10 --- chrA 9-10 | 1
[10] chrA 9-10 --- chrA 15-16 | 1
-------
regions: 8 ranges and 0 metadata columns
seqinfo: 1 sequence from an unspecified genome; no seqlengths
Which is visually represented thusly
I would like to do the following:
1) I want to group the regions into bins of WxW (in this case, W will
be 3), as in a quad-tree structure <https://en.wikipedia.org/wiki/Qua <https://en.wikipedia.org/wiki/Qua>
dtree> with the final group being WxW (instead of 2x2). This will
involve
- iteratively dividing the matrix into quadrants {upper-left
(0), upper-right (1), lower-left (2), lower-right(3)} .
- labeling each subdivision in a new column until the final WxW
resolution is reached.
- sorting by the columns
GenomicInteractions object with 10 interactions and 1 metadata
column:
seqnames1 ranges1 seqnames2 ranges2
| counts quad1 quad2
<Rle> <IRanges> <Rle> <IRanges> | <integer>
<integer> <integer>
[1] chrA 1-2 --- chrA 9-10 | 1
0 1
[2] chrA 1-2 --- chrA 15-16 | 1
1 0
[3] chrA 3-4 --- chrA 3-4 | 1
0 0
[4] chrA 5-6 --- chrA 7-8 | 1
0 1
[5] chrA 5-6 --- chrA 9-10 | 1
0 1
[6] chrA 7-8 --- chrA 7-8 | 1
0 3
[7] chrA 7-8 --- chrA 11-12 | 1
0 3
[8] chrA 7-8 --- chrA 17-18 | 1
1 2
[9] chrA 9-10 --- chrA 9-10 | 1
0 3
[10] chrA 9-10 --- chrA 15-16 | 1
1 2
-------
regions: 8 ranges and 0 metadata columns
seqinfo: 1 sequence from an unspecified genome; no seqlengths
Sorting by the two columns yields what I am after. Of course, I
include the ?quadX? column for illustration only. Upon
implementation, I would like these columns hidden from the user.
GenomicInteractions object with 10 interactions and 1 metadata
column:
seqnames1 ranges1 seqnames2 ranges2
| counts quad1 quad2
<Rle> <IRanges> <Rle> <IRanges> | <integer>
<integer> <integer>
[1] chrA 3-4 --- chrA 3-4 | 1
0 0
[2] chrA 1-2 --- chrA 9-10 | 1
0 1
[3] chrA 5-6 --- chrA 7-8 | 1
0 1
[4] chrA 5-6 --- chrA 9-10 | 1
0 1
[5] chrA 7-8 --- chrA 7-8 | 1
0 3
[6] chrA 7-8 --- chrA 11-12 | 1
0 3
[7] chrA 9-10 --- chrA 9-10 | 1
0 3
[8] chrA 1-2 --- chrA 15-16 | 1
1 0
[9] chrA 7-8 --- chrA 17-18 | 1
1 2
[10] chrA 9-10 --- chrA 15-16 | 1
1 2
-------
regions: 8 ranges and 0 metadata columns
seqinfo: 1 sequence from an unspecified genome; no seqlengths
The sorting gives me the quad-tree structure, and each unique
quadrant sequence defines the group.
GenomicInteractions object with 10 interactions and 1 metadata
column:
seqnames1 ranges1 seqnames2 ranges2 | counts
<Rle> <IRanges> <Rle> <IRanges> | <integer>
[1] chrA 3-4 --- chrA 3-4 | 1
[2] chrA 1-2 --- chrA 9-10 | 1
[3] chrA 5-6 --- chrA 7-8 | 1
[4] chrA 5-6 --- chrA 9-10 | 1
[5] chrA 7-8 --- chrA 7-8 | 1
[6] chrA 7-8 --- chrA 11-12 | 1
[7] chrA 9-10 --- chrA 9-10 | 1
[8] chrA 1-2 --- chrA 15-16 | 1
[9] chrA 7-8 --- chrA 17-18 | 1
[10] chrA 9-10 --- chrA 15-16 | 1
-------
regions: 8 ranges and 0 metadata columns
seqinfo: 1 sequence from an unspecified genome; no seqlengths
2) Then I would like to merge the WxW window (i.e. bin the regions),
expanding the ranges accordingly and adding the counts.. This
process will
- ***identify all range-pairs in the same window and merge them
into a new range pair with appropriately expanded ranges*** (this is
my primary goal)
- sum the counts for each of the aforementioned range-pairs (i
have already figured a way to do this)
GenomicInteractions object with 5 interactions and 1 metadata column:
seqnames1 ranges1 seqnames2 ranges2 | counts
<Rle> <IRanges> <Rle> <IRanges> | <integer>
[1] chrA 1-6 --- chrA 1-6 | 1
[2] chrA 1-6 --- chrA 7-12 | 3
[3] chrA 7-12 --- chrA 7-12 | 3
[4] chrA 1-6 --- chrA 13-18 | 1
[5] chrA 7-12 --- chrA 13-18 | 2
-------
regions: 3 ranges and 0 metadata columns
seqinfo: 1 sequence from an unspecified genome; no seqlengths
NOTE that ranges1 and ranges2 MUST expand so that the region width is
6, though the counts will only change if there exists another
subrange covered by this bin/expansion that contains a positive
count.
As always, speed in a concern.
Best,
? Luke Klein
PhD Student
Department of Statistics
University of California, Riverside
lklei001 at ucr.edu
_______________________________________________ Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org> mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel <https://stat.ethz.ch/mailman/listinfo/bioc-devel>
Hello. I am planning to develop a new package which extends the GenomicInteractions package. I would like some help/advice on implementing the following functionality.
Consider the follow GenomicInteractions object
GenomicInteractions object with 10 interactions and 1 metadata column:
seqnames1 ranges1 seqnames2 ranges2 | counts
<Rle> <IRanges> <Rle> <IRanges> | <integer>
[1] chrA 1-2 --- chrA 9-10 | 1
[2] chrA 1-2 --- chrA 15-16 | 1
[3] chrA 3-4 --- chrA 3-4 | 1
[4] chrA 5-6 --- chrA 7-8 | 1
[5] chrA 5-6 --- chrA 9-10 | 1
[6] chrA 7-8 --- chrA 7-8 | 1
[7] chrA 7-8 --- chrA 11-12 | 1
[8] chrA 7-8 --- chrA 17-18 | 1
[9] chrA 9-10 --- chrA 9-10 | 1
[10] chrA 9-10 --- chrA 15-16 | 1
-------
regions: 8 ranges and 0 metadata columns
seqinfo: 1 sequence from an unspecified genome; no seqlengths
Which is visually represented thusly
I would like to do the following:
1) I want to group the regions into bins of WxW (in this case, W will be 3), as in a quad-tree structure <https://en.wikipedia.org/wiki/Quadtree> with the final group being WxW (instead of 2x2). This will involve
- iteratively dividing the matrix into quadrants {upper-left (0), upper-right (1), lower-left (2), lower-right(3)} .
- labeling each subdivision in a new column until the final WxW resolution is reached.
- sorting by the columns
GenomicInteractions object with 10 interactions and 1 metadata column:
seqnames1 ranges1 seqnames2 ranges2 | counts quad1 quad2
<Rle> <IRanges> <Rle> <IRanges> | <integer> <integer> <integer>
[1] chrA 1-2 --- chrA 9-10 | 1 0 1
[2] chrA 1-2 --- chrA 15-16 | 1 1 0
[3] chrA 3-4 --- chrA 3-4 | 1 0 0
[4] chrA 5-6 --- chrA 7-8 | 1 0 1
[5] chrA 5-6 --- chrA 9-10 | 1 0 1
[6] chrA 7-8 --- chrA 7-8 | 1 0 3
[7] chrA 7-8 --- chrA 11-12 | 1 0 3
[8] chrA 7-8 --- chrA 17-18 | 1 1 2
[9] chrA 9-10 --- chrA 9-10 | 1 0 3
[10] chrA 9-10 --- chrA 15-16 | 1 1 2
-------
regions: 8 ranges and 0 metadata columns
seqinfo: 1 sequence from an unspecified genome; no seqlengths
Sorting by the two columns yields what I am after. Of course, I include the ?quadX? column for illustration only. Upon implementation, I would like these columns hidden from the user.
GenomicInteractions object with 10 interactions and 1 metadata column:
seqnames1 ranges1 seqnames2 ranges2 | counts quad1 quad2
<Rle> <IRanges> <Rle> <IRanges> | <integer> <integer> <integer>
[1] chrA 3-4 --- chrA 3-4 | 1 0 0
[2] chrA 1-2 --- chrA 9-10 | 1 0 1
[3] chrA 5-6 --- chrA 7-8 | 1 0 1
[4] chrA 5-6 --- chrA 9-10 | 1 0 1
[5] chrA 7-8 --- chrA 7-8 | 1 0 3
[6] chrA 7-8 --- chrA 11-12 | 1 0 3
[7] chrA 9-10 --- chrA 9-10 | 1 0 3
[8] chrA 1-2 --- chrA 15-16 | 1 1 0
[9] chrA 7-8 --- chrA 17-18 | 1 1 2
[10] chrA 9-10 --- chrA 15-16 | 1 1 2
-------
regions: 8 ranges and 0 metadata columns
seqinfo: 1 sequence from an unspecified genome; no seqlengths
The sorting gives me the quad-tree structure, and each unique quadrant sequence defines the group.
GenomicInteractions object with 10 interactions and 1 metadata column:
seqnames1 ranges1 seqnames2 ranges2 | counts
<Rle> <IRanges> <Rle> <IRanges> | <integer>
[1] chrA 3-4 --- chrA 3-4 | 1
[2] chrA 1-2 --- chrA 9-10 | 1
[3] chrA 5-6 --- chrA 7-8 | 1
[4] chrA 5-6 --- chrA 9-10 | 1
[5] chrA 7-8 --- chrA 7-8 | 1
[6] chrA 7-8 --- chrA 11-12 | 1
[7] chrA 9-10 --- chrA 9-10 | 1
[8] chrA 1-2 --- chrA 15-16 | 1
[9] chrA 7-8 --- chrA 17-18 | 1
[10] chrA 9-10 --- chrA 15-16 | 1
-------
regions: 8 ranges and 0 metadata columns
seqinfo: 1 sequence from an unspecified genome; no seqlengths
2) Then I would like to merge the WxW window (i.e. bin the regions), expanding the ranges accordingly and adding the counts.. This process will
- ***identify all range-pairs in the same window and merge them into a new range pair with appropriately expanded ranges*** (this is my primary goal)
- sum the counts for each of the aforementioned range-pairs (i have already figured a way to do this)
GenomicInteractions object with 5 interactions and 1 metadata column:
seqnames1 ranges1 seqnames2 ranges2 | counts
<Rle> <IRanges> <Rle> <IRanges> | <integer>
[1] chrA 1-6 --- chrA 1-6 | 1
[2] chrA 1-6 --- chrA 7-12 | 3
[3] chrA 7-12 --- chrA 7-12 | 3
[4] chrA 1-6 --- chrA 13-18 | 1
[5] chrA 7-12 --- chrA 13-18 | 2
-------
regions: 3 ranges and 0 metadata columns
seqinfo: 1 sequence from an unspecified genome; no seqlengths
NOTE that ranges1 and ranges2 MUST expand so that the region width is 6, though the counts will only change if there exists another subrange covered by this bin/expansion that contains a positive count.
As always, speed in a concern.
Best,
? Luke Klein
PhD Student
Department of Statistics
University of California, Riverside
lklei001 at ucr.edu <mailto:lklei001 at ucr.edu>
Hi Luke,
If I understand correctly what you want to do, you can use `findOverlaps()`
to create a new GInteractions object from a set of expanded ranges and the
original interactions.
anchor.one <- GRanges(c("chr1", "chr1", "chr1", "chr1"),
IRanges(c(10, 20, 30, 20), width=5))
anchor.two <- GRanges(c("chr1", "chr1", "chr1", "chr2"),
IRanges(c(100, 200, 200, 50), width=5))
interaction_counts <- sample(1:10, 4)
test <- GInteractions(anchor1 = anchor.one,
anchor2 = anchor.two,
counts=interaction_counts)
wider_ranges <- reduce(resize(regions(test), fix = "center", width = 10))
first_ol <- findOverlaps(test, wider_ranges, use.region = "first")
second_ol <- findOverlaps(test, wider_ranges, use.region = "second")
new_gi <- GInteractions(anchor1 = subjectHits(first_ol),
anchor2 = subjectHits(second_ol),
regions = wider_ranges,
counts = test$counts)
This will give you a GInteractions object that's the same length as your
original object. You said you already had a way of summing the counts for
each group of interactions, but you could also incorporate that here. I
would do it with dplyr, but I'm sure there's alternatives that introduce
fewer dependencies...
new_gi_info_df <- data.frame(anchor1 = subjectHits(first_ol),
anchor2 = subjectHits(second_ol),
counts = test$counts) %>%
dplyr::group_by(anchor1, anchor2) %>%
dplyr::summarise(counts = sum(counts))
best wishes,
Liz
On Tue, Feb 19, 2019 at 6:15 PM Luke Klein <lklei001 at ucr.edu> wrote:
Thank you for the tip, Aaron! I?m working on the sorting step right now.
One thing I?m not clear on is how to expand the ranges from one step to
the next. The way GenomicInteractions are structured, there is a Granges
object with all possible ranges, and the GInteractions object is populated
by reference to said interactions.
What I am going to need is a new GRanges object with the new set of
(expanded) ranges, and a way to map the prior ranges to the new, wider
range.
For your edification, I?ve included the images in this email (which is
addressed directly to you) so that you can see what I?d written in my first
question.
Best,
? Luke Klein
PhD Student
Department of Statistics
University of California, Riverside
lklei001 at ucr.edu
On Feb 13, 2019, at 3:34 AM, Aaron Lun <
infinite.monkeys.with.keyboards at gmail.com> wrote:
Note that your visual won't show up for many (all?) of us. Nonetheless, I think I know what you want to do. Your task does not lend itself to vectorization, which makes it difficult to write efficient R code. It's not impossible, but it would be quite hard to read and debug, and your maintenance programmer will be cursing you somewhere down the line. If speed is truly a concern, I would write this code in C++. This would probably be several lines' worth of code: 1. Compute a pair of bin IDs for each interaction by dividing each anchor coordinate by the bin width and truncating the result. (You'll need to decide if you want to use the midpoint/start/end, etc.) 2. Sort the interactions by the paired bin IDs, e.g., with std::sort. 3. Identify each "run" of interactions with the same paired IDs. 4. Repeat step 1 within each run (you'll need to offset the anchor coordinate before dividing this time). Append the current quadrant to the quadrant sequence for return to R at the end of recursion. Clear, concise, and can be slapped together in less than half an hour with Rcpp and C++11, if you know what you're doing. -A On Tue, 2019-02-12 at 11:34 -0800, Luke Klein wrote:
Hello. I am planning to develop a new package which extends the
GenomicInteractions package. I would like some help/advice on
implementing the following functionality.
Consider the follow GenomicInteractions object
GenomicInteractions object with 10 interactions and 1 metadata
column:
seqnames1 ranges1 seqnames2 ranges2 | counts
<Rle> <IRanges> <Rle> <IRanges> | <integer>
[1] chrA 1-2 --- chrA 9-10 | 1
[2] chrA 1-2 --- chrA 15-16 | 1
[3] chrA 3-4 --- chrA 3-4 | 1
[4] chrA 5-6 --- chrA 7-8 | 1
[5] chrA 5-6 --- chrA 9-10 | 1
[6] chrA 7-8 --- chrA 7-8 | 1
[7] chrA 7-8 --- chrA 11-12 | 1
[8] chrA 7-8 --- chrA 17-18 | 1
[9] chrA 9-10 --- chrA 9-10 | 1
[10] chrA 9-10 --- chrA 15-16 | 1
-------
regions: 8 ranges and 0 metadata columns
seqinfo: 1 sequence from an unspecified genome; no seqlengths
Which is visually represented thusly
I would like to do the following:
1) I want to group the regions into bins of WxW (in this case, W will
be 3), as in a quad-tree structure <https://en.wikipedia.org/wiki/Qua <
dtree> with the final group being WxW (instead of 2x2). This will
involve
- iteratively dividing the matrix into quadrants {upper-left
(0), upper-right (1), lower-left (2), lower-right(3)} .
- labeling each subdivision in a new column until the final WxW
resolution is reached.
- sorting by the columns
GenomicInteractions object with 10 interactions and 1 metadata
column:
seqnames1 ranges1 seqnames2 ranges2
| counts quad1 quad2
<Rle> <IRanges> <Rle> <IRanges> | <integer>
<integer> <integer>
[1] chrA 1-2 --- chrA 9-10 | 1
0 1
[2] chrA 1-2 --- chrA 15-16 | 1
1 0
[3] chrA 3-4 --- chrA 3-4 | 1
0 0
[4] chrA 5-6 --- chrA 7-8 | 1
0 1
[5] chrA 5-6 --- chrA 9-10 | 1
0 1
[6] chrA 7-8 --- chrA 7-8 | 1
0 3
[7] chrA 7-8 --- chrA 11-12 | 1
0 3
[8] chrA 7-8 --- chrA 17-18 | 1
1 2
[9] chrA 9-10 --- chrA 9-10 | 1
0 3
[10] chrA 9-10 --- chrA 15-16 | 1
1 2
-------
regions: 8 ranges and 0 metadata columns
seqinfo: 1 sequence from an unspecified genome; no seqlengths
Sorting by the two columns yields what I am after. Of course, I
include the ?quadX? column for illustration only. Upon
implementation, I would like these columns hidden from the user.
GenomicInteractions object with 10 interactions and 1 metadata
column:
seqnames1 ranges1 seqnames2 ranges2
| counts quad1 quad2
<Rle> <IRanges> <Rle> <IRanges> | <integer>
<integer> <integer>
[1] chrA 3-4 --- chrA 3-4 | 1
0 0
[2] chrA 1-2 --- chrA 9-10 | 1
0 1
[3] chrA 5-6 --- chrA 7-8 | 1
0 1
[4] chrA 5-6 --- chrA 9-10 | 1
0 1
[5] chrA 7-8 --- chrA 7-8 | 1
0 3
[6] chrA 7-8 --- chrA 11-12 | 1
0 3
[7] chrA 9-10 --- chrA 9-10 | 1
0 3
[8] chrA 1-2 --- chrA 15-16 | 1
1 0
[9] chrA 7-8 --- chrA 17-18 | 1
1 2
[10] chrA 9-10 --- chrA 15-16 | 1
1 2
-------
regions: 8 ranges and 0 metadata columns
seqinfo: 1 sequence from an unspecified genome; no seqlengths
The sorting gives me the quad-tree structure, and each unique
quadrant sequence defines the group.
GenomicInteractions object with 10 interactions and 1 metadata
column:
seqnames1 ranges1 seqnames2 ranges2 | counts
<Rle> <IRanges> <Rle> <IRanges> | <integer>
[1] chrA 3-4 --- chrA 3-4 | 1
[2] chrA 1-2 --- chrA 9-10 | 1
[3] chrA 5-6 --- chrA 7-8 | 1
[4] chrA 5-6 --- chrA 9-10 | 1
[5] chrA 7-8 --- chrA 7-8 | 1
[6] chrA 7-8 --- chrA 11-12 | 1
[7] chrA 9-10 --- chrA 9-10 | 1
[8] chrA 1-2 --- chrA 15-16 | 1
[9] chrA 7-8 --- chrA 17-18 | 1
[10] chrA 9-10 --- chrA 15-16 | 1
-------
regions: 8 ranges and 0 metadata columns
seqinfo: 1 sequence from an unspecified genome; no seqlengths
2) Then I would like to merge the WxW window (i.e. bin the regions),
expanding the ranges accordingly and adding the counts.. This
process will
- ***identify all range-pairs in the same window and merge them
into a new range pair with appropriately expanded ranges*** (this is
my primary goal)
- sum the counts for each of the aforementioned range-pairs (i
have already figured a way to do this)
GenomicInteractions object with 5 interactions and 1 metadata column:
seqnames1 ranges1 seqnames2 ranges2 | counts
<Rle> <IRanges> <Rle> <IRanges> | <integer>
[1] chrA 1-6 --- chrA 1-6 | 1
[2] chrA 1-6 --- chrA 7-12 | 3
[3] chrA 7-12 --- chrA 7-12 | 3
[4] chrA 1-6 --- chrA 13-18 | 1
[5] chrA 7-12 --- chrA 13-18 | 2
-------
regions: 3 ranges and 0 metadata columns
seqinfo: 1 sequence from an unspecified genome; no seqlengths
NOTE that ranges1 and ranges2 MUST expand so that the region width is
6, though the counts will only change if there exists another
subrange covered by this bin/expansion that contains a positive
count.
As always, speed in a concern.
Best,
? Luke Klein
PhD Student
Department of Statistics
University of California, Riverside
lklei001 at ucr.edu
_______________________________________________ Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org> mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel <
https://stat.ethz.ch/mailman/listinfo/bioc-devel> Hello. I am planning to develop a new package which extends the GenomicInteractions package. I would like some help/advice on implementing the following functionality. Consider the follow GenomicInteractions object GenomicInteractions object with 10 interactions and 1 metadata column: seqnames1 ranges1 seqnames2 ranges2 | counts <Rle> <IRanges> <Rle> <IRanges> | <integer> [1] chrA 1-2 --- chrA 9-10 | 1 [2] chrA 1-2 --- chrA 15-16 | 1 [3] chrA 3-4 --- chrA 3-4 | 1 [4] chrA 5-6 --- chrA 7-8 | 1 [5] chrA 5-6 --- chrA 9-10 | 1 [6] chrA 7-8 --- chrA 7-8 | 1 [7] chrA 7-8 --- chrA 11-12 | 1 [8] chrA 7-8 --- chrA 17-18 | 1 [9] chrA 9-10 --- chrA 9-10 | 1 [10] chrA 9-10 --- chrA 15-16 | 1 ------- regions: 8 ranges and 0 metadata columns seqinfo: 1 sequence from an unspecified genome; no seqlengths Which is visually represented thusly I would like to do the following: 1) I want to group the regions into bins of WxW (in this case, W will be 3), as in a quad-tree structure <https://en.wikipedia.org/wiki/Quadtree> with the final group being WxW (instead of 2x2). This will involve - iteratively dividing the matrix into quadrants {upper-left (0), upper-right (1), lower-left (2), lower-right(3)} . - labeling each subdivision in a new column until the final WxW resolution is reached. - sorting by the columns GenomicInteractions object with 10 interactions and 1 metadata column: seqnames1 ranges1 seqnames2 ranges2 | counts quad1 quad2 <Rle> <IRanges> <Rle> <IRanges> | <integer> <integer> <integer> [1] chrA 1-2 --- chrA 9-10 | 1 0 1 [2] chrA 1-2 --- chrA 15-16 | 1 1 0 [3] chrA 3-4 --- chrA 3-4 | 1 0 0 [4] chrA 5-6 --- chrA 7-8 | 1 0 1 [5] chrA 5-6 --- chrA 9-10 | 1 0 1 [6] chrA 7-8 --- chrA 7-8 | 1 0 3 [7] chrA 7-8 --- chrA 11-12 | 1 0 3 [8] chrA 7-8 --- chrA 17-18 | 1 1 2 [9] chrA 9-10 --- chrA 9-10 | 1 0 3 [10] chrA 9-10 --- chrA 15-16 | 1 1 2 ------- regions: 8 ranges and 0 metadata columns seqinfo: 1 sequence from an unspecified genome; no seqlengths Sorting by the two columns yields what I am after. Of course, I include the ?quadX? column for illustration only. Upon implementation, I would like these columns hidden from the user. GenomicInteractions object with 10 interactions and 1 metadata column: seqnames1 ranges1 seqnames2 ranges2 | counts quad1 quad2 <Rle> <IRanges> <Rle> <IRanges> | <integer> <integer> <integer> [1] chrA 3-4 --- chrA 3-4 | 1 0 0 [2] chrA 1-2 --- chrA 9-10 | 1 0 1 [3] chrA 5-6 --- chrA 7-8 | 1 0 1 [4] chrA 5-6 --- chrA 9-10 | 1 0 1 [5] chrA 7-8 --- chrA 7-8 | 1 0 3 [6] chrA 7-8 --- chrA 11-12 | 1 0 3 [7] chrA 9-10 --- chrA 9-10 | 1 0 3 [8] chrA 1-2 --- chrA 15-16 | 1 1 0 [9] chrA 7-8 --- chrA 17-18 | 1 1 2 [10] chrA 9-10 --- chrA 15-16 | 1 1 2 ------- regions: 8 ranges and 0 metadata columns seqinfo: 1 sequence from an unspecified genome; no seqlengths The sorting gives me the quad-tree structure, and each unique quadrant sequence defines the group. GenomicInteractions object with 10 interactions and 1 metadata column: seqnames1 ranges1 seqnames2 ranges2 | counts <Rle> <IRanges> <Rle> <IRanges> | <integer> [1] chrA 3-4 --- chrA 3-4 | 1 [2] chrA 1-2 --- chrA 9-10 | 1 [3] chrA 5-6 --- chrA 7-8 | 1 [4] chrA 5-6 --- chrA 9-10 | 1 [5] chrA 7-8 --- chrA 7-8 | 1 [6] chrA 7-8 --- chrA 11-12 | 1 [7] chrA 9-10 --- chrA 9-10 | 1 [8] chrA 1-2 --- chrA 15-16 | 1 [9] chrA 7-8 --- chrA 17-18 | 1 [10] chrA 9-10 --- chrA 15-16 | 1 ------- regions: 8 ranges and 0 metadata columns seqinfo: 1 sequence from an unspecified genome; no seqlengths 2) Then I would like to merge the WxW window (i.e. bin the regions), expanding the ranges accordingly and adding the counts.. This process will - ***identify all range-pairs in the same window and merge them into a new range pair with appropriately expanded ranges*** (this is my primary goal) - sum the counts for each of the aforementioned range-pairs (i have already figured a way to do this) GenomicInteractions object with 5 interactions and 1 metadata column: seqnames1 ranges1 seqnames2 ranges2 | counts <Rle> <IRanges> <Rle> <IRanges> | <integer> [1] chrA 1-6 --- chrA 1-6 | 1 [2] chrA 1-6 --- chrA 7-12 | 3 [3] chrA 7-12 --- chrA 7-12 | 3 [4] chrA 1-6 --- chrA 13-18 | 1 [5] chrA 7-12 --- chrA 13-18 | 2 ------- regions: 3 ranges and 0 metadata columns seqinfo: 1 sequence from an unspecified genome; no seqlengths NOTE that ranges1 and ranges2 MUST expand so that the region width is 6, though the counts will only change if there exists another subrange covered by this bin/expansion that contains a positive count. As always, speed in a concern. Best, ? Luke Klein PhD Student Department of Statistics University of California, Riverside lklei001 at ucr.edu <mailto:lklei001 at ucr.edu> _______________________________________________ Bioc-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
One thing I?m not clear on is how to expand the ranges from one step to the next. ?The way GenomicInteractions are structured, there is a Granges object with all possible ranges, and the GInteractions object is populated by reference to said interactions.
If you're already in C++, then life is easy. Just compute the minimum bounding box within each run of interactions (if you want empirical bounds) or compute the theoretical bounds based on the bin widths and offsets. Collect all of these ranges and report them at the end of the function to construct a new G(enomic)Interactions object. Note that the theoretical bounds will be better at leveraging the memory-saving internal structure of the GInteractions object, as it is more likely that the anchor regions will be shared between different bin pairs at the same height of the tree. With empirical bounds, this is unlikely to be the case.
What I am going to need is a new GRanges object with the new set of (expanded) ranges, and a way to map the prior ranges to the new, wider range.
If you're constructing the quad-tree using the recursive algorithm discussed below, mapping is trivial. At each step of the recursion, you pass the identity of the parent interaction and record it for each new child interaction that is generated. -A
On Feb 13, 2019, at 3:34 AM, Aaron Lun <infinite.monkeys.with.keybo
ards at gmail.com> wrote: Note that your visual won't show up for many (all?) of us. Nonetheless, I think I know what you want to do. Your task does not lend itself to vectorization, which makes it difficult to write efficient R code. It's not impossible, but it would be quite hard to read and debug, and your maintenance programmer will be cursing you somewhere down the line. If speed is truly a concern, I would write this code in C++. This would probably be several lines' worth of code: 1. Compute a pair of bin IDs for each interaction by dividing each anchor coordinate by the bin width and truncating the result. (You'll need to decide if you want to use the midpoint/start/end, etc.) 2. Sort the interactions by the paired bin IDs, e.g., with std::sort. 3. Identify each "run" of interactions with the same paired IDs. 4. Repeat step 1 within each run (you'll need to offset the anchor coordinate before dividing this time). Append the current quadrant to the quadrant sequence for return to R at the end of recursion. Clear, concise, and can be slapped together in less than half an hour with Rcpp and C++11, if you know what you're doing. -A On Tue, 2019-02-12 at 11:34 -0800, Luke Klein wrote:
Hello.??I am planning to develop a new package which extends the GenomicInteractions package.??I would like some help/advice on implementing the following functionality. Consider the follow GenomicInteractions object GenomicInteractions object with 10 interactions and 1 metadata column: ???????seqnames1???ranges1?????seqnames2???ranges2 |????counts ???????????<Rle> <IRanges>?????????<Rle> <IRanges> | <integer> ???[1]??????chrA???????1-2 ---??????chrA??????9-10 |?????????1 ???[2]??????chrA???????1-2 ---??????chrA?????15-16 |?????????1 ???[3]??????chrA???????3-4 ---??????chrA???????3-4 |?????????1 ???[4]??????chrA???????5-6 ---??????chrA???????7-8 |?????????1 ???[5]??????chrA???????5-6 ---??????chrA??????9-10 |?????????1 ???[6]??????chrA???????7-8 ---??????chrA???????7-8 |?????????1 ???[7]??????chrA???????7-8 ---??????chrA?????11-12 |?????????1 ???[8]??????chrA???????7-8 ---??????chrA?????17-18 |?????????1 ???[9]??????chrA??????9-10 ---??????chrA??????9-10 |?????????1 ??[10]??????chrA??????9-10 ---??????chrA?????15-16 |?????????1 ??------- ??regions: 8 ranges and 0 metadata columns ??seqinfo: 1 sequence from an unspecified genome; no seqlengths Which is visually represented thusly I would like to do the following: 1) I want to group the regions into bins of WxW (in this case, W will be 3), as in a quad-tree structure <https://en.wikipedia.org/wiki /Qua dtree> with the final group being WxW (instead of 2x2).??This will involve? - iteratively dividing the matrix into quadrants {upper-left (0), upper-right (1), lower-left (2), lower-right(3)} . - labeling each subdivision in a new column until the final WxW resolution is reached. - sorting by the columns GenomicInteractions object with 10 interactions and 1 metadata column: ???????seqnames1???ranges1?????seqnames2???ranges2 |????counts?????quad1?????quad2 ???????????<Rle> <IRanges>?????????<Rle> <IRanges> | <integer> <integer> <integer> ???[1]??????chrA???????1-2 ---??????chrA??????9-10 |?????????1 ???????0 ?1 ???[2]??????chrA???????1-2 ---??????chrA?????15-16 |?????????1 ???????1 ?0 ???[3]??????chrA???????3-4 ---??????chrA???????3-4 |?????????1 ???????0 ?0 ???[4]??????chrA???????5-6 ---??????chrA???????7-8 |?????????1 ???????0 ?1 ???[5]??????chrA???????5-6 ---??????chrA??????9-10 |?????????1 ???????0 ?1 ???[6]??????chrA???????7-8 ---??????chrA???????7-8 |?????????1 ???????0 ?3 ? ???[7]??????chrA???????7-8 ---??????chrA?????11-12 |?????????1 ???????0 ?3 ???[8]??????chrA???????7-8 ---??????chrA?????17-18 |?????????1 ???????1 ?2 ???[9]??????chrA??????9-10 ---??????chrA??????9-10 |?????????1 ???????0 ?3 ??[10]??????chrA??????9-10 ---??????chrA?????15-16 |?????????1 ???????1 ?2 ??------- ??regions: 8 ranges and 0 metadata columns ??seqinfo: 1 sequence from an unspecified genome; no seqlengths Sorting by the two columns yields what I am after.??Of course, I include the ?quadX? column for illustration only.??Upon implementation, I would like these columns hidden from the user. GenomicInteractions object with 10 interactions and 1 metadata column: ???????seqnames1???ranges1?????seqnames2???ranges2 |????counts?????quad1?????quad2 ???????????<Rle> <IRanges>?????????<Rle> <IRanges> | <integer> <integer> <integer> ???[1]??????chrA???????3-4 ---??????chrA???????3-4 |?????????1 ???????0 ?0 ???[2]??????chrA???????1-2 ---??????chrA??????9-10 |?????????1 ???????0 ?1 ???[3]??????chrA???????5-6 ---??????chrA???????7-8 |?????????1 ???????0 ?1 ???[4]??????chrA???????5-6 ---??????chrA??????9-10 |?????????1 ???????0 ?1 ???[5]??????chrA???????7-8 ---??????chrA???????7-8 |?????????1 ???????0 ?3 ???[6]??????chrA???????7-8 ---??????chrA?????11-12 |?????????1 ???????0 ?3 ? ???[7]??????chrA??????9-10 ---??????chrA??????9-10 |?????????1 ???????0 ?3 ???[8]??????chrA???????1-2 ---??????chrA?????15-16 |?????????1 ???????1 ?0 ???[9]??????chrA???????7-8 ---??????chrA?????17-18 |?????????1 ???????1 ?2 ??[10]??????chrA??????9-10 ---??????chrA?????15-16 |?????????1 ???????1 ?2 ??------- ??regions: 8 ranges and 0 metadata columns ??seqinfo: 1 sequence from an unspecified genome; no seqlengths The sorting gives me the quad-tree structure, and each unique quadrant sequence defines the group. ?? GenomicInteractions object with 10 interactions and 1 metadata column: ???????seqnames1???ranges1?????seqnames2???ranges2 |????counts ???????????<Rle> <IRanges>?????????<Rle> <IRanges> | <integer> ???[1]??????chrA???????3-4 ---??????chrA???????3-4 |?????????1 ???[2]??????chrA???????1-2 ---??????chrA??????9-10 |?????????1 ???[3]??????chrA???????5-6 ---??????chrA???????7-8 |?????????1 ???[4]??????chrA???????5-6 ---??????chrA??????9-10 |?????????1 ???[5]??????chrA???????7-8 ---??????chrA???????7-8 |?????????1 ???[6]??????chrA???????7-8 ---??????chrA?????11-12 |?????????1 ???[7]??????chrA??????9-10 ---??????chrA??????9-10 |?????????1 ???[8]??????chrA???????1-2 ---??????chrA?????15-16 |?????????1 ???[9]??????chrA???????7-8 ---??????chrA?????17-18 |?????????1 ??[10]??????chrA??????9-10 ---??????chrA?????15-16 |?????????1 ??------- ??regions: 8 ranges and 0 metadata columns ??seqinfo: 1 sequence from an unspecified genome; no seqlengths 2) Then I would like to merge the WxW window (i.e. bin the regions), expanding the ranges accordingly and adding the counts..??This process will - ***identify all range-pairs in the same window and merge them into a new range pair with appropriately expanded ranges*** (this is my primary goal) - sum the counts for each of the aforementioned range-pairs (i have already figured a way to do this) GenomicInteractions object with 5 interactions and 1 metadata column: ???????seqnames1???ranges1?????seqnames2???ranges2 |????counts ???????????<Rle> <IRanges>?????????<Rle> <IRanges> | <integer> ???[1]??????chrA???????1-6??---??????chrA??????1-6??|?????????1 ???[2]??????chrA???????1-6??---??????chrA??????7-12 |?????????3 ???[3]??????chrA???????7-12 ---??????chrA??????7-12 |?????????3 ???[4]??????chrA???????1-6??---??????chrA?????13-18 |?????????1 ???[5]??????chrA???????7-12 ---??????chrA?????13-18 |?????????2 ??------- ??regions: 3 ranges and 0 metadata columns ??seqinfo: 1 sequence from an unspecified genome; no seqlengths NOTE that ranges1 and ranges2 MUST expand so that the region width is 6, though the counts will only change if there exists another subrange covered by this bin/expansion that contains a positive count. As always, speed in a concern. Best, ? Luke Klein ????PhD Student ????Department of Statistics ????University of California, Riverside ????lklei001 at ucr.edu
_______________________________________________ Bioc-devel at r-project.org?mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
Hello. ?I am planning to develop a new package which extends the
GenomicInteractions package. ?I would like some help/advice on
implementing the following functionality.
Consider the follow GenomicInteractions object
GenomicInteractions object with 10 interactions and 1 metadata
column:
???????seqnames1???ranges1?????seqnames2???ranges2 |????counts
???????????<Rle> <IRanges>?????????<Rle> <IRanges> | <integer>
???[1]??????chrA???????1-2 ---??????chrA??????9-10 |?????????1
???[2]??????chrA???????1-2 ---??????chrA?????15-16 |?????????1
???[3]??????chrA???????3-4 ---??????chrA???????3-4 |?????????1
???[4]??????chrA???????5-6 ---??????chrA???????7-8 |?????????1
???[5]??????chrA???????5-6 ---??????chrA??????9-10 |?????????1
???[6]??????chrA???????7-8 ---??????chrA???????7-8 |?????????1
???[7]??????chrA???????7-8 ---??????chrA?????11-12 |?????????1
???[8]??????chrA???????7-8 ---??????chrA?????17-18 |?????????1
???[9]??????chrA??????9-10 ---??????chrA??????9-10 |?????????1
? [10]??????chrA??????9-10 ---??????chrA?????15-16 |?????????1
? -------
? regions: 8 ranges and 0 metadata columns
? seqinfo: 1 sequence from an unspecified genome; no seqlengths
Which is visually represented thusly
I would like to do the following:
1) I want to group the regions into bins of WxW (in this case, W will
be 3), as in a?quad-tree structure?with the final group being WxW
(instead of 2x2). ?This will involve?
-?iteratively dividing the matrix into quadrants {upper-left
(0), upper-right (1), lower-left (2), lower-right(3)} .
- labeling each subdivision in a new column until the final WxW
resolution is reached.
- sorting by the columns
GenomicInteractions object with 10 interactions and 1 metadata
column:
???????seqnames1???ranges1?????seqnames2???ranges2
|????counts?????quad1?????quad2
???????????<Rle> <IRanges>?????????<Rle> <IRanges> | <integer>
<integer> <integer>
???[1]??????chrA???????1-2 ---??????chrA??????9-10 |?????????1
???????0 ?1
???[2]??????chrA???????1-2 ---??????chrA?????15-16 |?????????1
???????1 ?0
???[3]??????chrA???????3-4 ---??????chrA???????3-4 |?????????1
???????0 ?0
???[4]??????chrA???????5-6 ---??????chrA???????7-8 |?????????1
???????0 ?1
???[5]??????chrA???????5-6 ---??????chrA??????9-10 |?????????1
???????0 ?1
???[6]??????chrA???????7-8 ---??????chrA???????7-8 |?????????1
???????0 ?3 ?
???[7]??????chrA???????7-8 ---??????chrA?????11-12 |?????????1
???????0 ?3
???[8]??????chrA???????7-8 ---??????chrA?????17-18 |?????????1
???????1 ?2
???[9]??????chrA??????9-10 ---??????chrA??????9-10 |?????????1
???????0 ?3
? [10]??????chrA??????9-10 ---??????chrA?????15-16 |?????????1
???????1 ?2
? -------
? regions: 8 ranges and 0 metadata columns
? seqinfo: 1 sequence from an unspecified genome; no seqlengths
Sorting?by the two columns yields what I am after. ?Of course, I
include the??quadX? column for illustration only. ?Upon
implementation, I would like these columns hidden from the user.
GenomicInteractions object with 10 interactions and 1 metadata
column:
???????seqnames1???ranges1?????seqnames2???ranges2
|????counts?????quad1?????quad2
???????????<Rle> <IRanges>?????????<Rle> <IRanges> | <integer>
<integer> <integer>
???[1]??????chrA???????3-4 ---??????chrA???????3-4 |?????????1
???????0 ?0
???[2]??????chrA???????1-2 ---??????chrA??????9-10 |?????????1
???????0 ?1
???[3]??????chrA???????5-6 ---??????chrA???????7-8 |?????????1
???????0 ?1
???[4]??????chrA???????5-6 ---??????chrA??????9-10 |?????????1
???????0 ?1
???[5]??????chrA???????7-8 ---??????chrA???????7-8 |?????????1
???????0 ?3
???[6]??????chrA???????7-8 ---??????chrA?????11-12 |?????????1
???????0 ?3 ?
???[7]??????chrA??????9-10 ---??????chrA??????9-10 |?????????1
???????0 ?3
???[8]??????chrA???????1-2 ---??????chrA?????15-16 |?????????1
???????1 ?0
???[9]??????chrA???????7-8 ---??????chrA?????17-18 |?????????1
???????1 ?2
? [10]??????chrA??????9-10 ---??????chrA?????15-16 |?????????1
???????1 ?2
? -------
? regions: 8 ranges and 0 metadata columns
? seqinfo: 1 sequence from an unspecified genome; no seqlengths
The sorting gives me the quad-tree structure, and each unique
quadrant sequence defines the group.
??
GenomicInteractions object with 10 interactions and 1 metadata
column:
???????seqnames1???ranges1?????seqnames2???ranges2 |????counts
???????????<Rle> <IRanges>?????????<Rle> <IRanges> | <integer>
???[1]??????chrA???????3-4 ---??????chrA???????3-4 |?????????1
???[2]??????chrA???????1-2 ---??????chrA??????9-10 |?????????1
???[3]??????chrA???????5-6 ---??????chrA???????7-8 |?????????1
???[4]??????chrA???????5-6 ---??????chrA??????9-10 |?????????1
???[5]??????chrA???????7-8 ---??????chrA???????7-8 |?????????1
???[6]??????chrA???????7-8 ---??????chrA?????11-12 |?????????1
???[7]??????chrA??????9-10 ---??????chrA??????9-10 |?????????1
???[8]??????chrA???????1-2 ---??????chrA?????15-16 |?????????1
???[9]??????chrA???????7-8 ---??????chrA?????17-18 |?????????1
? [10]??????chrA??????9-10 ---??????chrA?????15-16 |?????????1
? -------
? regions: 8 ranges and 0 metadata columns
? seqinfo: 1 sequence from an unspecified genome; no seqlengths
2) Then I would like to merge the WxW window (i.e. bin the regions),
expanding the ranges accordingly and adding the counts.. ?This
process will
- ***identify all range-pairs in the same window and merge them
into a new range pair with appropriately expanded ranges*** (this is
my primary goal)
-?sum the counts for each of the aforementioned range-pairs (i
have already figured a way to do this)
GenomicInteractions object with 5 interactions and 1 metadata column:
???????seqnames1???ranges1?????seqnames2???ranges2 |????counts
???????????<Rle> <IRanges>?????????<Rle> <IRanges> | <integer>
???[1]??????chrA???????1-6??---??????chrA??????1-6??|?????????1
???[2]??????chrA???????1-6??---??????chrA??????7-12 |?????????3
???[3]??????chrA???????7-12 ---??????chrA??????7-12 |?????????3
???[4]??????chrA???????1-6??---??????chrA?????13-18 |?????????1
???[5]??????chrA???????7-12 ---??????chrA?????13-18 |?????????2
? -------
? regions: 3 ranges and 0 metadata columns
? seqinfo: 1 sequence from an unspecified genome; no seqlengths
NOTE that ranges1 and ranges2 MUST expand so that the region width is
6, though the counts will only change if there exists another
subrange covered by this bin/expansion that contains a positive
count.
As always, speed in a concern.
Best,
??Luke Klein
? ? PhD Student
? ? Department of Statistics
? ? University of California, Riverside
? ??lklei001 at ucr.edu