
[Bioc-devel] GenomicFiles: chunking

5 messages · Valerie Obenchain, Michael Love

hi Kasper,

For a concrete example, I posted a R and Rout file here:

https://gist.github.com/mikelove/deaff999984dc75f125d

Things to note: 'ranges' is a GRangesList, I cbind() the numeric
vectors in the REDUCE, and then rbind() the final list to get the
desired matrix.
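
For readers without the gist handy, a rough sketch of that pattern (the
MAP/REDUCE bodies here are illustrative, not the gist's actual code;
'ranges' is a GRangesList, 'fls' a BamFileList):

```r
library(GenomicFiles)
library(Rsamtools)

## One GRangesList element and one file at a time; returns a numeric
MAP <- function(range, file, ...)
    mean(coverage(file, param=ScanBamParam(which=range))[range])

## cbind() the per-file numeric vectors for one range element
REDUCE <- function(mapped, ...) do.call(cbind, mapped)

res <- reduceByRange(ranges, fls, MAP, REDUCE)  ## list, one element per range
mat <- do.call(rbind, res)                      ## rbind the final list -> matrix
```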

Other than the weird column name 'init', does this give you what you want?

best,

Mike

On Tue, Sep 30, 2014 at 2:08 PM, Michael Love
<michaelisaiahlove at gmail.com> wrote:
Hi Kasper and Mike,

I've added 2 new functions to GenomicFiles and deprecated the old 
classes. The vignette now has a graphic which (hopefully) clarifies the 
MAP / REDUCE mechanics of the different functions.

Below is some performance testing for the new functions and answers to 
leftover questions from previous emails.


Major changes to GenomicFiles (in devel):

- *FileViews classes have been deprecated:

The idea is to use the GenomicFiles class to hold any type of file, be 
it BAM, BigWig, a character vector of file paths, etc., instead of 
having file-type-specific classes like BigWigFileViews. Currently 
GenomicFiles does not inherit from SummarizedExperiment, but it may in 
the future.

- Add reduceFiles() and reduceRanges():

These functions pass all ranges or all files to MAP at once, vs the 
lapply approach taken in reduceByFile() and reduceByRange().
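
Schematically, a toy (non-Bioconductor) illustration of the difference
in calling pattern:

```r
## Stand-ins: 'ranges' is any vector, MAP just reports what it was given.
ranges <- 1:100
file <- "f1"
MAP <- function(r, f) length(r)

## reduceByFile()-style: MAP is lapply'd over individual ranges
byRange <- lapply(ranges, MAP, f=file)  ## 100 MAP calls, each sees 1 range

## reduceFiles()-style: all ranges are passed in a single call
allAtOnce <- MAP(ranges, file)          ## 1 MAP call, sees all 100 ranges
```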


(1) Performance:

When testing with reduceByFile() you noted "GenomicFiles is 10-20x 
slower than the straightforward approach". You also noted this was 
probably because of the lapply over all ranges - true. (Most likely 
there was overhead in creating the SE object as well.) With the new 
reduceFiles(), passing all ranges at once, we see performance very 
similar to that of the 'by hand' approach.

In the test code I've used BAM instead of BigWig. Both test functions 
output lists, have comparable MAP and REDUCE steps, etc.

I used 5 files ('fls') and a GRanges ('grs') of length 100.
 > length(grs)
[1] 100

 > sum(width(grs))
[1] 1000000

FUN1 is the 'by hand' version. These results are similar to what you 
saw: just under a 4x difference between 10 and 100 ranges.

 >> microbenchmark(FUN1(grs[1:10], fls), FUN1(grs[1:100], fls), times=10)
 > Unit: seconds
 >                   expr      min       lq     mean   median       uq      max neval
 >   FUN1(grs[1:10], fls) 1.177858 1.190239 1.206127 1.201331 1.222298 1.256741    10
 >  FUN1(grs[1:100], fls) 4.145503 4.163404 4.249619 4.208486 4.278463 4.533846    10

FUN2 is the reduceFiles() approach and the results are very similar to FUN1.

 >> microbenchmark(FUN2(grs[1:10], fls), FUN2(grs[1:100], fls), times=10)
 > Unit: seconds
 >                   expr      min       lq     mean   median       uq      max neval
 >   FUN2(grs[1:10], fls) 1.242767 1.251188 1.257531 1.253154 1.267655 1.275698    10
 >  FUN2(grs[1:100], fls) 4.251010 4.340061 4.390290 4.361007 4.384064 4.676068    10


(2) Request for "chunking of the mapping of ranges":

For now we decided not to add a 'yieldSize' argument for chunking. There 
are 2 approaches to chunking through ranges *within* the same file. In 
both cases the user splits the ranges, either before calling the 
function or in the MAP step.

i) reduceByFile() with a GRangesList:

The user provides a GRangesList as the 'ranges' arg. On each worker, 
lapply() applies MAP to the single file and each element of the 
GRangesList.
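
For example, a 100-range GRanges can be pre-split into a GRangesList
before the call (the chunk size of 10 here is arbitrary):

```r
## Pre-chunk: 100 ranges -> a GRangesList of 10 chunks of 10 ranges,
## then pass 'grl' as the 'ranges' argument to reduceByFile().
grl <- split(grs, ceiling(seq_along(grs) / 10))
length(grl)  ## 10 elements, each holding 10 ranges
```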

ii) reduceFiles() with a MAP that handles chunking:

The user splits the ranges in MAP and uses lapply() or another loop to 
iterate. For example,

MAP <- function(range, file, ...) {
     lst <- split(range, someFactor)  ## 'someFactor' defines the chunks
     someFUN <- function(rngs, file, ...) {
         ## do something with one chunk of ranges
     }
     lapply(lst, FUN=someFUN, file=file, ...)
}

The same ideas apply for chunking through ranges *across* files with 
reduceByRange() and reduceRanges().

iii) reduceByRange() with a GRangesList:

Mike has a good example here:
https://gist.github.com/mikelove/deaff999984dc75f125d

iv) reduceRanges():

'ranges' should be a GRangesList. The MAP step will operate on an 
element of the GRangesList and all files. Unless you want to operate on 
all files at once I'd use reduceByRange() instead.


(3) Return objects have different shape:

Previous question:

"...
Why is the return object of reduceByFile vs reduceByRange (with
summarize = FALSE) different?  I understand why internally you have
different nesting schemes (files and ranges) for the two functions, but 
it is not clear to me that it is desirable to have the return object 
depend on how the computation was done.
..."

reduceByFile() and reduceFiles() output a list the same length as the 
number of files while reduceByRange() and reduceRanges() output a list 
the same length as the number of ranges.

Reduction is different depending on which function is chosen; data are 
collapsed either within a file or across files. When REDUCE does 
something substantial the outputs are not equivalent.

While it's possible to get the same result (REDUCE simply unlists or 
isn't used), the two approaches were not intended to be equivalent ways 
of arriving at the same end. The idea was that the user had a specific 
use case in mind - they either wanted to collapse the data across or 
within files.


(4) return object from coverage(BigWigFileViews):

Previous comment:

"...
coverage(BigWigFileViews) returns a "wrong" assay object in my opinion,
...
Specifically, each (i,j) entry in the object is an RleList with a single
element with a name equal to the seqnames of the i'th entry in the query
GRanges.  To me, this extra nestedness is unnecessary; I would have
expected an Rle instead of an RleList with 1 element.
..."

The return value from coverage(x) is an RleList with one coverage vector 
per seqlevel in 'x'. Even if there is only one seqlevel, the result 
still comes back as an RleList. This is just the default behavior.


(5) separate the 'read' function from the MAP step

Previous comment:

"...
Also, something completely different, it seems like it would be 
convenient for stuff like BigWigFileViews to not have to actually parse 
the file in the MAP step.  Somehow I would envision some kind of reading 
function, stored inside the object, which just returns an Rle when I ask 
for a (range, file).  Perhaps this is better left for later.
..."

The current approach for the reduce* functions is for MAP to both 
extract and manipulate data. The idea of separating the extraction step 
is actually implemented in reduceByYield(). (This function used to be 
yieldReduce() in Rsamtools in past releases.) For reduceByYield() the 
user must specify YIELD (a reader function), MAP, REDUCE and DONE 
(criteria to stop iteration).
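
A sketch of the reduceByYield() pattern (the BAM file name is
hypothetical; DONE is left at its default, which stops when the reader
returns an empty result):

```r
library(GenomicFiles)
library(GenomicAlignments)

bf <- BamFile("example.bam", yieldSize=100000)  ## hypothetical file

YIELD  <- function(x, ...) readGAlignments(x)   ## the reader function
MAP    <- function(value, ...) length(value)    ## per-chunk work: count records
REDUCE <- `+`                                   ## running total across chunks

reduceByYield(bf, YIELD, MAP, REDUCE)           ## total record count
```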

I'm not sure what is best here. I thought the many-argument approach of 
reduceByYield() was possibly confusing or burdensome and so didn't use 
it in the other GenomicFiles functions. Maybe it's not confusing but 
instead makes the individual steps more clear. What do you think:

- Should the reader function be separate from the MAP? What are the 
advantages?

- Should READER, MAP, REDUCE be stored inside the GenomicFiles object or 
supplied as arguments to the functions?


(6) unnamed assay in SummarizedExperiment

Previous comment:

"...
The return object of reduceByRange / reduceByFile with summarize = TRUE 
is a SummarizedExperiment with an unnamed assay.  I was surprised to see 
that this is even possible.
..."

There is no default assay name for SummarizedExperiment in general. 
I've named the assay 'data' for lack of a better term. We could also go 
with 'reducedData' or another suggestion.
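
For reference, an assay can always be named explicitly at construction
time (toy matrix; SummarizedExperiment lived in GenomicRanges at the
time of this thread):

```r
library(GenomicRanges)

mat <- matrix(1:6, nrow=3)
se <- SummarizedExperiment(assays=list(data=mat))
names(assays(se))  ## the assay is named "data"
```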


Thanks for the feedback.

Valerie
On 10/01/2014 08:30 AM, Michael Love wrote:
Looks like the test code didn't make it through. Attaching again ...
On 10/27/2014 11:35 AM, Valerie Obenchain wrote:
-------------- next part --------------
library(GenomicFiles)
library(microbenchmark)

## 5 bam files ranging from 4e+08 to 8e+08 records:
fls <- BamFileList(c("exp_srx036692.bam", "exp_srx036695.bam",
                     "exp_srx036696.bam", "exp_srx036697.bam",
                     "exp_srx036692.bam")) ## re-use one file

## GRanges with 100 ranges and total width 1e+06:
starts <- sample(1:1e7, 100)
chr <- paste0("chr", rep(1:22, length.out=100))
grs <- GRanges(chr,  IRanges(starts, width=1e4))


## By hand:
FUN1 <- function(ranges, files) {
    ## equivalent to MAP step
    cvg <- lapply(files,
        FUN = function(file, range) {
            param = ScanBamParam(which=range)
            coverage(file, param=param)[range]
        }, range=ranges)
    ## equivalent to REDUCE step
    do.call(cbind, lapply(cvg, mean))
}

microbenchmark(FUN1(grs[1:10], fls), FUN1(grs[1:100], fls), times=10)
## GenomicFiles:
MAP = function(range, file, ...) {
    param = ScanBamParam(which=range)
    coverage(file, param=param)[range]
}
REDUCE <- function(mapped, ...) do.call(cbind, lapply(mapped, mean))

FUN2 <- function(ranges, files) {
    reduceFiles(ranges, files, MAP, REDUCE, BPPARAM=SerialParam())
}

microbenchmark(FUN2(grs[1:10], fls), FUN2(grs[1:100], fls), times=10)
hi Valerie,

this sounds good to me.

I am thinking of working on a function (here or elsewhere) that helps
decide, for reduce by range, how to optimally chunk GRanges into a
GRangesList. Practically, this could involve sampling the size of the
imported data for a subset of cells in the (ranges, files) matrix, and
asking/estimating the amount of memory available for each worker.
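
One possible shape for such a helper (everything here is hypothetical:
the function name, the memory budget, the sampling scheme, and
rtracklayer::import() as the reader):

```r
library(rtracklayer)

## Sample n (range, file) cells, measure the in-memory size of each
## imported result, and derive a chunk size from an assumed per-worker
## memory budget (in bytes).
estimateChunkSize <- function(gr, file, budget=1e9, n=5) {
    idx <- sample(length(gr), n)
    sizes <- vapply(idx, function(i)
        as.numeric(object.size(import(file, which=gr[i]))), numeric(1))
    max(1L, floor(budget / mean(sizes)))
}
```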

best,

Mike


On Mon, Oct 27, 2014 at 2:35 PM, Valerie Obenchain <vobencha at fhcrc.org>
wrote:

  
  
Sounds great. I think GenomicFiles is a good place for such a function - 
it's along the lines of what we wanted to accomplish with pack / unpack.

Maybe your new function can be used by pack once finished. There's 
definitely room for expanding that functionality.

Valerie
On 10/27/2014 12:07 PM, Michael Love wrote: