RFC: Checkpoint-Restart for R/HPC (DMTCP)
Hi Chirag,
This should work. In my case, I would probably try running
a job on a cloud as follows:
[ copy DMTCP executables to job submission directory ]
path_to_dmtcp_root/bin/dmtcp_launch -i 30 Rscript myscript.R
This would create a checkpoint every 30 seconds. So, every 30 seconds,
we get a new version of the following files:
ckpt_myscript.R_*.dmtcp
dmtcp_restart_script_*.sh
dmtcp_restart_script.sh (symbolic link to dmtcp_restart_script_*.sh)
If a job crashes, one copies the above files to a new directory and
submits a new cloud job:
[ copy DMTCP executables to job submission directory ]
./dmtcp_restart_script.sh -i 30
The script should automatically link to the file ckpt_myscript.R_*.dmtcp.
An alternative approach would be:
path_to_dmtcp_root/bin/dmtcp_restart -i 30 ckpt_myscript.R_*.dmtcp
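If it helps, the two cases above (fresh launch versus restart) could be
folded into a single job script. This is only a rough sketch: the function
name choose_dmtcp_command is made up for illustration, path_to_dmtcp_root is
the same placeholder as above, and it assumes the restart symlink sits in the
job directory. It only prints the command to run, so you can inspect it first.

```shell
#!/bin/sh
# Hypothetical helper (not part of DMTCP): print the command a job should run.
# $1 = job directory, $2 = checkpoint interval in seconds.
choose_dmtcp_command() {
    dir=$1
    interval=$2
    if [ -e "$dir/dmtcp_restart_script.sh" ]; then
        # A checkpoint exists: resume from it, keeping the same interval.
        echo "./dmtcp_restart_script.sh -i $interval"
    else
        # No checkpoint yet: start the R script under DMTCP.
        echo "path_to_dmtcp_root/bin/dmtcp_launch -i $interval Rscript myscript.R"
    fi
}
```

Calling choose_dmtcp_command . 30 from the job directory prints the restart
command if a checkpoint is present there, and the launch command otherwise.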
Please don't hesitate to ask if I can help further.
Best,
- Gene
On Mon, Jan 25, 2016 at 05:26:58PM +0530, Chirag Anand wrote:
This can indeed be very useful, especially when using one of the cloud services. Cloud VMs often crash because of an error on the main system, thereby losing the state of the program (the R computations). I think Google Compute Engine supports live migration of VMs, though I am not sure which technology it uses, but AWS does not.
...
-- Chirag Anand http://atvariance.in/chiraganand