snow's clusterApplyLB quits when a slave node stops
A (partially written) file would be somewhat annoying, but otherwise, no, I do not care what happens to an unfinished job. I detect these while collecting the results from the thousands of resulting files.

To give you an idea of how crude my SNOW usage is: for years I got by with a script that spawned more scripts on each machine (one for each CPU I wanted to use) that read from a common file containing a list of shell commands. However, SNOW is much nicer to work with. Many of the people I work with use SNOW in this way: not so much HPC as trying to run repeated experiments on hundreds of multi-core workstations.

So, while I am at it, R's limit of around 127 socket connections is also an irritation. I am using sockets via makeSOCKcluster.

Thank you.

- dan

On Fri, May 21, 2010 at 10:45 AM, Stephen Weston
<stephen.b.weston at gmail.com> wrote:
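For context, the setup Dan describes looks roughly like the following sketch. The hostnames and task function are hypothetical; each socket worker holds open one R connection, which is what the connection limit he mentions bounds:

```r
library(snow)

# Hypothetical lab machines: repeat each hostname once per CPU to use.
hosts <- rep(c("lab01", "lab02", "lab03"), each = 4)

cl <- makeSOCKcluster(hosts)   # one socket connection per worker
res <- clusterApplyLB(cl, 1:1000, function(i) sqrt(i))
stopCluster(cl)
```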
Dan,

So you don't care what is in the result list returned by clusterApplyLB? You only care about the files created by the tasks, just as if you were submitting a bunch of independent batch jobs? Do you care if a result file is partially written when it gets killed? Or is that condition easy to detect?

Also, what snow transport are you using?

- Steve

On Fri, May 21, 2010 at 11:14 AM, Daniel Elliott <danelliottster at gmail.com> wrote:
Thanks, Steve.

I know what I want, and it is a good starting point for the other solutions you mentioned: just keep sending jobs to the slave nodes even when one of them dies. For me, each job is totally independent and the results are saved to a file, so I just want the thing to keep going.

I would be happy to be a part of any solution...

- dan

On Fri, May 21, 2010 at 8:36 AM, Stephen Weston <stephen.b.weston at gmail.com> wrote:
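A sketch of the pattern Dan describes: each task writes its own result file, and task-level R errors are caught with try() so one bad task does not poison the result list. Note that try() only guards against errors raised inside R; it cannot help when the worker process itself is killed, which is the failure mode under discussion. All names here (run_experiment, filenames, host list) are hypothetical:

```r
library(snow)

# Hypothetical task: run one experiment and save its result to a file.
run_experiment <- function(i) {
    res <- sqrt(i)                                 # stand-in for real work
    saveRDS(res, sprintf("result_%05d.rds", i))
    TRUE
}

cl <- makeSOCKcluster(rep("localhost", 4))
clusterExport(cl, "run_experiment")                # make the task visible on workers

# try() turns an R-level error in a task into a "try-error" object
# instead of aborting the whole run.
out <- clusterApplyLB(cl, 1:100, function(i) try(run_experiment(i), silent = TRUE))
stopCluster(cl)

# Unfinished or failed tasks are then detected from the files, as Dan does:
done <- file.exists(sprintf("result_%05d.rds", 1:100))
```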
Dan,

The snow package wasn't designed to be fault tolerant, so I don't think it is surprising that clusterApplyLB hangs when a job gets killed. I don't know why you've only started seeing this behavior lately, especially since the version of snow hasn't changed in quite a while.

You might want to investigate the snowFT package, which is available on CRAN. It depends on PVM/rpvm, however, so you can't use MPI with snowFT, for example.

You might also want to think about what behavior you'd like to see when a job is killed. Some people want the job to be automatically resubmitted, but maybe you just want an appropriate error reported for that job. Some people are happy as long as the whole run doesn't hang forever, even if all of the results are lost. Depending on your needs, you might be able to figure out a solution to the problem. It could also help you to evaluate whether someone else's proposed solution meets your needs.

- Steve

On Thu, May 20, 2010 at 10:38 PM, Daniel Elliott <danelliottster at gmail.com> wrote:
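For completeness, the fault-tolerant entry point in snowFT is performParallel; a rough sketch, assuming PVM/rpvm are installed and a PVM virtual machine is already running (check the package documentation for the exact arguments):

```r
library(snowFT)

# 20 PVM workers; snowFT handles resubmission if a node fails.
res <- performParallel(count = 20,
                       x = 1:1000,
                       fun = function(i) sqrt(i))
```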
Hello,

I use SNOW to run a large number of processes on a large number of computers, many of which are in a lab. Sometimes my jobs are killed for various reasons. Lately, this has caused the clusterApplyLB function to stop running, which means no additional jobs are run. Is there something I can do about this? I am pretty sure that this was not happening a few months ago (clusterApplyLB would keep running jobs even when one of the slaves went down).

Thanks.

- dan elliott
_______________________________________________
R-sig-hpc mailing list
R-sig-hpc at r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-hpc