RFC: Checkpoint-Restart for R/HPC (DMTCP)
Hi Gene, I know DMTCP from the SciPy conference, where your colleague showed a Python binding. I have also tried invoking DMTCP from inside R, in the same way as your Python binding. As I remember, it was not difficult. Best, KK
On Mon, Jan 25, 2016 at 8:03 PM, Gene Cooperman <gene at ccs.neu.edu> wrote:
Hi Chirag,
This should work. In my case, I would probably try running
a job on a cloud as follows:
[ copy DMTCP executables to job submission directory ]
path_to_dmtcp_root/bin/dmtcp_launch -i 30 Rscript myscript.R
This would create a checkpoint every 30 seconds. So, every 30 seconds,
we get a new version of the following files:
ckpt_myscript.R_*.dmtcp
dmtcp_restart_script_*.sh
dmtcp_restart_script.sh (symbolic link to dmtcp_restart_script_*.sh)
If a job crashes, one copies the above files to a new directory, and
submits a new Cloud job:
[ copy DMTCP executables to job submission directory ]
./dmtcp_restart_script.sh -i 30
The script should automatically link to the file ckpt_myscript.R_*.dmtcp .
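As a rough sketch of the two cases above (first launch versus restart after a crash), a job wrapper could pick its command by checking whether a restart script is already present in the job directory. The function name choose_command and the variables DMTCP_BIN and INTERVAL are hypothetical names for illustration, not part of DMTCP; path_to_dmtcp_root is the same placeholder as above. The function only prints the command and does not run it.

```shell
# Hypothetical settings; adjust to the actual installation.
DMTCP_BIN="${DMTCP_BIN:-path_to_dmtcp_root/bin}"
INTERVAL="${INTERVAL:-30}"

# Print the command a job should run in the given directory:
# restart if a restart script exists there, otherwise launch fresh.
choose_command() {
    workdir="${1:-.}"
    if [ -e "$workdir/dmtcp_restart_script.sh" ]; then
        # Checkpoint files were copied in: resume from them.
        echo "./dmtcp_restart_script.sh -i $INTERVAL"
    else
        # No checkpoint yet: start the R script under DMTCP.
        echo "$DMTCP_BIN/dmtcp_launch -i $INTERVAL Rscript myscript.R"
    fi
}
```

A job script could then run the result from the job directory, for example with `eval "$(choose_command .)"`.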
An alternative approach would be:
path_to_dmtcp_root/bin/dmtcp_restart -i 30 ckpt_myscript.R_*.dmtcp
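If the directory happens to hold checkpoint images from more than one run, the wildcard above would match all of them. A minimal sketch of one way to pass only the newest image, assuming a single-process job (such as one Rscript) and ordinary file names without newlines; the helper name latest_ckpt is hypothetical and not part of DMTCP:

```shell
# Hypothetical helper: print the most recently modified checkpoint
# image in a directory, or print nothing if there is none.
latest_ckpt() {
    dir="${1:-.}"
    # ls -t sorts by modification time, newest first;
    # errors for "no match" are discarded.
    ls -t "$dir"/ckpt_*.dmtcp 2>/dev/null | head -n 1
}
```

It could be used as `path_to_dmtcp_root/bin/dmtcp_restart -i 30 "$(latest_ckpt .)"`. A computation with several processes writes one image per process, so this shortcut would not apply there.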
Please don't hesitate to ask, if I can help further.
Best,
- Gene
On Mon, Jan 25, 2016 at 05:26:58PM +0530, Chirag Anand wrote:
This can indeed be very useful, especially when using one of the cloud services. Cloud VMs often crash because of an error on the main system, thereby losing the state of the program (R computations). I think Google Compute Engine supports live migration of VMs, though I am not sure which technology it uses, but AWS does not.
...
-- Chirag Anand http://atvariance.in/chiraganand
_______________________________________________ R-sig-hpc mailing list R-sig-hpc at r-project.org https://stat.ethz.ch/mailman/listinfo/r-sig-hpc
Qiang Kou qkou at umail.iu.edu School of Informatics and Computing, Indiana University