"chunking" parallel tasks
One other approach, were the computation per chunk runs into several(tens of ) minutes, is to monitor the running time of long running tasks(each working on a chunk), if greater than a cut off, split the chunk and assign to unused (or lesser loaded) machines. If a task for a particular chunk finishes earlier than some task for a duplicated task, invalidate the latter and kill it. Of course, the run time for a chunk should be greater(much) than the cost of duplicating a chunk, reading it in and starting new tasks. To implement this, one would have write a system which actually monitors the running time tasks, the chunking and duplication. I think Hadoop Mapreduce does something similar, though it is most certainly not the best tool for some tasks. Regards Saptarshi
On Tue, Jan 26, 2010 at 12:54 PM, Martin Morgan <mtmorgan at fhcrc.org> wrote:
On 01/26/2010 09:51 AM, Norm Matloff wrote:
There are some very sophisticated strategies for chunking out there in the parallel processing world, aimed at dealing with the load balancing problem. ?The idea is to use very large chunks at the beginning and middle stages of the computation, to minimize communication overhead and the like, but then switch to smaller ones near the end in order to keep all the processes busy. ?For example, the "guided" option in OpenMP does this. However, in my experience, it is seldom necessary to resort to this, as load balancing is typically not a problem. ?One can even show this theoretically: ?Say task times are T[1], T[2], etc., and are i.i.d. ?The standard deviation of sum(T[1:k]), divided by the mean, goes to 0 as k goes to infinity--i.e. the sum is essentially constant. ?Of course, that is an idealized model, but again in practice I have found that load balancing is not much of an issue. For that reason and because of the communication overhead, in most cases it is actually faster to statically assign 1/p of the tasks to each of the p processes, i.e. not do chunking.
Agreed, and for the matrix operations Mark hinted at one likely gets the most benefit by using R's vectorized matrix operations (rather than *apply), then by using vectorized blas/lapack libraries, then multicore. Probably distributing tasks across nodes in a cluster (e.g., mpi) for matrix operations is almost always a losing proposition -- communication costs are just too high. Martin
Norm Matloff
_______________________________________________ R-sig-hpc mailing list R-sig-hpc at r-project.org https://stat.ethz.ch/mailman/listinfo/r-sig-hpc
-- Martin Morgan Computational Biology / Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: Arnold Building M1 B861 Phone: (206) 667-2793
_______________________________________________ R-sig-hpc mailing list R-sig-hpc at r-project.org https://stat.ethz.ch/mailman/listinfo/r-sig-hpc