I have been working with a system in which, most of the time, a long-running mclapply fails, ostensibly because at least one node has simply lost the child process assigned to it. Other nodes succeed in writing some data. The stderr output from the R process in which mclapply was invoked has no interesting information; it seems R simply dies. Any suggestions on how to get more diagnostic information? gdb on the master R process doesn't seem relevant. It seems feasible, using parallel/collect, to write a fault-tolerant mclapply-like function that would attempt to fill in list cells that failed to populate in the expected time. Has anyone undertaken this?
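To make the idea concrete, here is a rough sketch of what such a wrapper might look like. It is only an illustration under several assumptions: it uses `mcparallel` and `mccollect` from the parallel package (the successors of multicore's `parallel` and `collect`), so it needs a Unix-alike; the names `ft_mclapply`, `n_workers` and `max_tries` are hypothetical; and it only re-runs cells whose child died or raised an error. It does not enforce a deadline, which would presumably need `mccollect(wait = FALSE, timeout = ...)` and explicit killing of stragglers.

```r
library(parallel)

# Hypothetical fault-tolerant lapply: one forked child per element,
# with elements whose child was lost or errored re-run up to max_tries times.
ft_mclapply <- function(X, FUN, ..., n_workers = 2L, max_tries = 3L) {
  n <- length(X)
  results <- vector("list", n)
  pending <- seq_len(n)              # indices still lacking a result

  for (attempt in seq_len(max_tries)) {
    if (length(pending) == 0L) break
    failed <- integer(0)
    # Run at most n_workers children at a time.
    batches <- split(pending, ceiling(seq_along(pending) / n_workers))
    for (batch in batches) {
      jobs <- vector("list", length(batch))
      for (k in seq_along(batch)) {
        i <- batch[k]
        # Wrap the value in a list so that a legitimate NULL result can be
        # told apart from the NULL that a dead child leaves behind.
        jobs[[k]] <- mcparallel(list(value = FUN(X[[i]], ...)),
                                name = as.character(i))
      }
      res <- mccollect(jobs, wait = TRUE)
      for (i in batch) {
        r <- res[[as.character(i)]]
        if (is.list(r) && !inherits(r, "try-error")) {
          results[i] <- list(r$value)   # keeps NULL values in place
        } else {
          failed <- c(failed, i)        # child died or FUN raised an error
        }
      }
    }
    pending <- failed
  }

  if (length(pending) > 0L) {
    warning("no result after ", max_tries, " tries for element(s): ",
            paste(pending, collapse = ", "))
  }
  results
}
```

Forking one child per element costs more than mclapply's prescheduling, but it means a lost child takes only one cell with it. Cells that never succeed are left as NULL and reported in a warning, so the caller can decide what to do with them.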
multicore: when a core wanders off...
1 message · Vincenzo Carey