[Bioc-devel] Proof-of-concept parallel preloading FastqStreamer
2 messages · Gregoire Pau, Martin Morgan
On 10/02/2013 11:58 AM, Gregoire Pau wrote:
Hello Ryan, You may be interested in the function sclapply(...) located in the HTSeqGenie package. sclapply is a multicore dispatcher that takes three main arguments (inext, fun, max.parallel.jobs). The data produced by the function inext, executed in the main thread, is dispatched to fun(), executed in a child thread. A built-in scheduler caps the number of concurrent threads. In HTSeqGenie, inext(...) is typically an iterator that reads chunks of FastQ reads, which are passed to a function that processes them (for counting, QC, alignment...) in a child thread. sclapply(...) enables multicore processing of iterator flows and offers performance gains almost proportional to the number of cores. Moreover, the function is robust and takes extra arguments to handle exceptions and periodic tracing (e.g. to check memory usage). Hope this helps,
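The inext/fun contract Greg describes can be sketched in plain R. The closure below and the commented sclapply call are illustrative assumptions (toy chunk data, made-up function names), not HTSeqGenie's actual code:

```r
## Sketch of an inext-style iterator: each call returns the next chunk,
## or NULL once the input is exhausted.
make_inext <- function(chunks) {
  i <- 0L
  function() {
    i <<- i + 1L
    if (i > length(chunks)) NULL else chunks[[i]]
  }
}

inext <- make_inext(list(1:3, 4:6, 7:9))  # toy stand-in for FastQ chunks
## Hypothetical dispatch, per the argument names in the message:
## sclapply(inext, fun = process_chunk, max.parallel.jobs = 4)
```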
I'd like to incorporate these ideas (distilling Ryan's and Greg's) into BiocParallel, as bpiterate or maybe bpstream (though I think in the literature "stream" carries a notion of indeterminate length, which isn't quite accurate here). Let me know (on or off list) if that's not ok. It would be interesting to see support for other back-ends, and to come up with a consistent error-handling model, incorporating the work Michel has recently completed (not yet in BiocParallel) as part of GSoC: https://github.com/Bioconductor/BiocParallel/pull/19 Martin
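The iterator contract a bpiterate-style function might adopt can be written down as a serial reference implementation. The names ITER and FUN, and the return-NULL-when-done convention, are assumptions here, not a settled BiocParallel API:

```r
## Serial reference semantics for a hypothetical bpiterate(): repeatedly
## draw a chunk from ITER(), apply FUN, and stop when ITER() returns NULL.
## A parallel back-end would dispatch FUN calls to workers instead.
serial_iterate <- function(ITER, FUN) {
  out <- list()
  repeat {
    chunk <- ITER()
    if (is.null(chunk)) return(out)
    out[[length(out) + 1L]] <- FUN(chunk)
  }
}
```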
Cheers, Greg On Mon, Sep 30, 2013 at 5:00 PM, Ryan <rct at thompsonclan.org> wrote:
Hi all, I have previously written an Rscript to read, filter, and write large FastQ files, using FastqStreamer to read. Through some complicated tricks, I was able to make the input happen in parallel with the processing and output (using parallel::mcparallel and friends). In other words, while my script was processing and writing out the nth block of reads, another process was reading the (n+1)th block at the same time. This almost doubled the speed of my script (the server had sufficient I/O bandwidth to parallelize reads and writes to disk). Since then, I've been wanting to generalize this pattern, and I have just now made a working proof of concept. It is a wrapper for FastqStreamer that runs in a separate process and uses parallel:::sendMaster to send each block to the main script, then calls yield on the FastqStreamer to preload the next block while the script is processing the current one. You can view and download the script here: https://gist.github.com/DarwinAwardWinner/6771922

I have strategically placed print statements in the code in order to demonstrate that preloading is happening. For example, I get the following when I run the script on my machine:

CHILD: Preloaded 1 yields.
CHILD: Sent 1 yields.
CHILD: Preloaded 2 yields.
CHILD: Sent 2 yields.
MAIN: Received 1 yields.
MAIN: Processing reads
CHILD: Preloaded 3 yields.
MAIN: Processed 1 yields.
CHILD: Sent 3 yields.
MAIN: Received 2 yields.
MAIN: Processing reads
CHILD: Preloaded 4 yields.
MAIN: Processed 2 yields.
CHILD: Sent 4 yields.
MAIN: Received 3 yields.
MAIN: Processing reads
CHILD: Preloaded 5 yields.
MAIN: Processed 3 yields.
CHILD: Sent 5 yields.
MAIN: Received 4 yields.
MAIN: Processing reads
CHILD: Preloaded 6 yields.
MAIN: Processed 4 yields.
CHILD: Sent 6 yields.
MAIN: Received 5 yields.
MAIN: Processing reads
MAIN: Processed 5 yields.
MAIN: Received 6 yields.
MAIN: Processing reads
MAIN: Processed 6 yields.
In the script, the child is reading the fastq file, and the main process is doing the "calculation" (which is just a sleep). As you can see, the child is always a step or two ahead of the main script, so whenever the main script asks for the next yield, it gets it immediately instead of waiting for the child to read from disk. So, is this kind of feature appropriate for inclusion in Bioconductor? -Ryan Thompson
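The preloading pattern can also be sketched with only the documented parallel API (mcparallel/mccollect), rather than the persistent child process and internal sendMaster used in the gist. To keep the forked children stateless, this sketch reads chunks by index via a caller-supplied read_chunk(i); the function names and the fixed chunk count are assumptions for illustration:

```r
library(parallel)

## While the parent processes chunk i, a forked child is already
## reading chunk i + 1; mccollect() blocks until it is ready.
prefetch_apply <- function(read_chunk, process, n_chunks) {
  job <- mcparallel(read_chunk(1L))            # preload the first chunk
  for (i in seq_len(n_chunks)) {
    chunk <- mccollect(job)[[1]]               # wait for preloaded chunk i
    if (i < n_chunks)
      job <- mcparallel(read_chunk(i + 1L))    # preload the next chunk...
    process(chunk)                             # ...while processing this one
  }
}
```

Unlike the gist, this cannot advance a stateful FastqStreamer across forks (each child gets a copy of the parent's state), which is exactly why the proof of concept keeps one long-lived reader process instead.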
_______________________________________________
Bioc-devel at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel
Computational Biology / Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: Arnold Building M1 B861 Phone: (206) 667-2793