multicore: when a core wanders off... - R-SIG-HPC

Mon, Feb 28, 2011 1:35 PM

I have been working with a system in which, most of the time,
a long-running mclapply will fail ostensibly because at least
one node has simply lost the child process assigned
to it.  Other nodes succeed in writing some data;
stderr from the R process in which mclapply was invoked has no
interesting information; it seems R simply
dies.  Any suggestions on how to get more diagnostic
information? gdb on the master R process doesn't seem relevant.

It seems feasible, using parallel/collect, to write a fault-tolerant
mclapply-like function that would attempt to fill in list cells
that failed to populate in the expected time.  Has anyone undertaken this?
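
To make the idea concrete, here is a minimal sketch of the "fill in the failed cells" loop. It is an illustration under stated assumptions, not a tested solution: it uses the parallel package (where multicore's parallel/collect live on as mcparallel/mccollect), it simply re-runs mclapply on the missing cells rather than driving parallel/collect directly, and it does not enforce a time limit, so it only helps when a child dies or errors, not when it hangs. The name retry_mclapply and the max_tries argument are hypothetical.

```r
library(parallel)  # forking via mclapply requires a Unix-alike

# Hypothetical helper: re-run only the cells whose child died or errored.
retry_mclapply <- function(X, FUN, ..., mc.cores = 2L, max_tries = 3L) {
  # Wrap each value so a legitimate NULL result can be told apart from
  # the NULL that mclapply leaves behind when a child delivers nothing.
  wrapped <- function(x, ...) list(value = FUN(x, ...))

  results <- vector("list", length(X))
  pending <- seq_along(X)

  for (attempt in seq_len(max_tries)) {
    if (!length(pending)) break
    out <- suppressWarnings(
      mclapply(X[pending], wrapped, ..., mc.cores = mc.cores)
    )
    # A cell is good only if it carries our wrapper; try-error objects
    # and NULLs from lost children fail this check.
    ok <- vapply(out,
                 function(r) is.list(r) && identical(names(r), "value"),
                 logical(1))
    results[pending[ok]] <- lapply(out[ok], `[[`, "value")
    pending <- pending[!ok]
  }

  if (length(pending))
    warning("cells still missing after ", max_tries, " tries: ",
            paste(pending, collapse = ", "))
  results
}
```

Something like res <- retry_mclapply(1:100, slow_fun, mc.cores = 8L), with slow_fun standing in for the real worker, would then return a list in which only the cells that never succeeded remain NULL. Because mclapply preschedules by default, an error in one cell marks every cell on that core as failed, so a retry may recompute more than the one cell that went wrong.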