[Bioc-devel] GenomicFiles: chunking
On Mon, Sep 29, 2014 at 9:09 AM, Kasper Daniel Hansen
<kasperdanielhansen at gmail.com> wrote:
I don't fully understand "the use case for reducing by range is when the entire dataset won't fit into memory". The basic assumption of these functions (as far as I can see) is that the output data fits in memory. What may not fit in memory is various earlier "iterations" of the data. For example, in my use case, if I just read in all the data in all the ranges in all the samples it is basically Rle's across 450MB times 38 files, which is not small. What fits in memory is smaller chunks of this; that is true for every application.
I was unclear. I meant that, in approach1, you have an object, all.Rle, which contains Rles for every range over every file. Can you actually run this approach on the full dataset?
Reducing by range (or file) only makes sense when the final output includes one entity for several ranges/files ... right? So I don't see how reduce would help me.
Yes, I think we agree. This is not a good use case for reduce by range as now implemented. This is a use case which would benefit from the user-facing function internally calling pack()/unpack() to reduce the number of import() calls, and then in the end giving back the mean coverage over the input ranges. I want this too. https://github.com/Bioconductor/GenomicFileViews/issues/2#issuecomment-32625456 (link to the old github repo, the new github repo is named GenomicFiles)
As I see the pack()/unpack() paradigm, it just re-orders the query ranges (which is super nice and matters a lot for speed in some applications). As I understand the code (and my understanding is still developing), we need an extra layer to support processing multiple ranges in one operation. I am happy to help, beyond just complaining.

Best,
Kasper

On Mon, Sep 29, 2014 at 8:55 AM, Michael Love <michaelisaiahlove at gmail.com> wrote:
Thanks for checking it out and benchmarking. We should be clearer in the docs that the use case for reducing by range is when the entire dataset won't fit into memory. Also, we had some discussion, and Valerie had written up methods for packing the ranges supplied by the user into a better form for querying files. In your case it would have packed many ranges together, reducing the number of import() calls as in your naive approach. See the pack/unpack functions, which are not in the vignette but are in the man pages. If I remember correctly, writing code to unpack() the result was not so simple, and development of these functions was set aside for the moment.

Mike

On Sun, Sep 28, 2014 at 10:49 PM, Kasper Daniel Hansen <kasperdanielhansen at gmail.com> wrote:
I am testing GenomicFiles.
My use case: I have 231k ranges of average width 1.9kb and total width 442 MB. I also have 38 BigWig files. I want to compute the average coverage of the 38 BigWig files inside each range. This is similar to wanting to get coverage of, say, all promoters in the human genome.
My data resides on a file server connected to the compute node over ethernet, so basically I have very slow file access. Also, the BigWig files are (perhaps unusually) big: ~2GB apiece.
Below I have two approaches, one using basically straightforward code and the other using GenomicFiles (examples are 10 / 100 ranges on 5 files). GenomicFiles is 10-20x slower than the straightforward approach, most likely because reduceByRange/reduceByFile processes each (range, file) pair separately.
It seems natural (to me) to allow some chunking of the mapping of the ranges. My naive approach is fast (I assume) because I read multiple ranges with one import() command. I know I have random access to the BigWig files, but there must still be some overhead from seeking and, perhaps more importantly, from instantiating/moving stuff back and forth to R. So basically I would like to be able to write a MAP function which takes
  ranges, file
instead of just
  range, file
and then chunk over, say, 1,000s of ranges. I could then pass an argument to reduceByXX called something like rangeSize, which is akin to yieldSize. Perhaps this is what is intended for reduceByYield on BigWig files?
In a way, this is what is done in the vignette with the coverage(BAMFILE) example, where tileGenome is essentially constructed by the user to chunk the coverage computation.
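To make the proposal concrete, here is a sketch of what a batch-aware MAP plus a driver with a rangeSize-style argument could look like. This is not an existing GenomicFiles API; MAPPER_chunked, chunkedMeans, and rangeSize are names I made up for illustration:

```r
## Sketch only: a MAP that receives a *batch* of ranges per file, so a
## single import() call serves many ranges.  Nothing here is part of
## GenomicFiles; the chunking driver and 'rangeSize' are hypothetical.
library(rtracklayer)
library(GenomicRanges)

MAPPER_chunked <- function(ranges, file, ...) {
    ## one import() for the whole chunk of ranges
    rle <- import(file, which = ranges, as = "Rle", format = "bw")[ranges]
    mean(rle)  # one mean per range in the chunk
}

chunkedMeans <- function(ranges, file, rangeSize = 1000) {
    ## split the query into blocks of 'rangeSize' ranges, map each block
    idx <- ceiling(seq_along(ranges) / rangeSize)
    unlist(lapply(split(ranges, idx), MAPPER_chunked, file = file),
           use.names = FALSE)
}
```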
I think the example of a set of regions queried across files will be an extremely common use case going forward. The raw data for all the regions together is "too big", but I do a computation on each region to reduce the size. In this situation, all the focus is on the MAP step. I see the need for REDUCE in the case of the t-test example in the vignette, where the return object is a single "thing" for each base. But in general, I think we will use these file structures a lot to construct things without REDUCE (across neither files nor ranges).
Also, something completely different: it seems like it would be convenient for stuff like BigWigFileViews to not have to actually parse the file in the MAP step. Somehow I would envision some kind of reading function, stored inside the object, which just returns an Rle when I ask for a (range, file). Perhaps this is better left for later.
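For what it's worth, a closure per file would be one way to sketch that idea; makeBigWigReader below is my invention, not anything in GenomicFiles or rtracklayer:

```r
## Sketch: store a reader function per file, so the MAP step never
## calls rtracklayer directly.  All names here are invented.
library(rtracklayer)

makeBigWigReader <- function(file) {
    force(file)  # capture the file path in the closure
    function(range) {
        ## returns coverage as an Rle(List) for 'range', read on demand
        import(file, which = range, as = "Rle", format = "bw")[range]
    }
}

## one lazy reader per file; the file is only touched when called
readers <- lapply(all.files, makeBigWigReader)
## inside MAP:  rle <- readers[[j]](range)
```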
Best,
Kasper
Examples
approach1 <- function(ranges, files) {
    ## By hand: one import() per file, covering all ranges at once
    all.Rle <- lapply(files, function(file) {
        import(file, as = "Rle", which = ranges, format = "bw")[ranges]
    })
    print(object.size(all.Rle), units = "auto")
    mat <- do.call(cbind, lapply(all.Rle, function(xx) {
        sapply(xx, mean)
    }))
    invisible(mat)
}
system.time({
    mat1 <- approach1(all.grs[1:10], all.files[1:5])
})
## 160.9 Kb
##    user  system elapsed
##   1.109   0.001   1.113

system.time({
    mat1 <- approach1(all.grs[1:100], all.files[1:5])
})  # less than 4x slower than the previous call
## 3 Mb
##    user  system elapsed
##   4.101   0.019   4.291
approach2 <- function(ranges, files) {
    gf <- GenomicFiles(rowData = ranges, files = files)
    MAPPER <- function(range, file, ...) {
        library(rtracklayer)
        rle <- import(file, which = range, as = "Rle", format = "bw")[range]
        mean(rle)
    }
    sExp <- reduceByRange(gf, MAP = MAPPER, summarize = TRUE,
                          BPPARAM = SerialParam())
    sExp
}
system.time({
    mat2 <- approach2(all.grs[1:10], all.files[1:5])
})  # 8-9x slower than approach1
##    user  system elapsed
##   9.349   0.280   9.581

system.time({
    mat2 <- approach2(all.grs[1:100], all.files[1:5])
})  # 9x slower than the previous call, 20x slower than approach1 on the same input
##    user  system elapsed
##  89.310   0.627  91.044
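For completeness, here is a chunked variant of approach1 that bounds memory by processing the ranges in blocks, so only one block of Rles per file is alive at a time. approach3 and chunkSize are my names; this is just a sketch of the chunking idea, not a proposed API:

```r
## Sketch: approach1 with bounded memory -- ranges processed in blocks.
library(rtracklayer)
library(GenomicRanges)

approach3 <- function(ranges, files, chunkSize = 1000) {
    idx <- ceiling(seq_along(ranges) / chunkSize)
    blocks <- lapply(split(ranges, idx), function(rng) {
        do.call(cbind, lapply(files, function(file) {
            rle <- import(file, which = rng, as = "Rle", format = "bw")[rng]
            sapply(rle, mean)
        }))
    })
    do.call(rbind, blocks)  # ranges x files matrix of mean coverage
}
```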