
[Bioc-devel] Parallel processing of reads in a single fastq file

Hi Jeff,

See my replies below inline.
On 8/6/14, 7:16 AM, Johnston, Jeffrey wrote:
That's right, we don't currently have anything in R comparable to Python's 
multiprocessing.Pool: 
https://docs.python.org/2/library/multiprocessing.html#using-a-pool-of-workers

Although we wouldn't want something exactly like that, because Python's 
implementation exhausts the input stream as fast as possible and buffers 
all the results in memory if they are not consumed quickly enough, as 
described here: 
http://stackoverflow.com/questions/5318936/python-multiprocessing-pool-lazy-iteration

I think it would be ideal to have a function that takes an input stream, 
a BPPARAM, and a maximum number of chunks to buffer, and returns another 
input stream of the results.
This is the problem, I think. A general solution would let you 
stream the processed results back into R, where you could pass them into 
another stream filter or finally consume them. That was the idea behind 
the example code I demonstrated, but my code worked a little 
differently: the task of *reading* the fastq file was delegated 
to a subprocess. So my solution also doesn't generalize to multi-step 
parallel pipelines.
-Ryan