RFC: Checkpoint-Restart for R/HPC (DMTCP)

Mon, Jan 25, 2016 5:03 PM

Hi Chirag,

    This should work.  In my case, I would probably try running
a job on a cloud as follows:

    [ copy DMTCP executables to job submission directory ]
    path_to_dmtcp_root/bin/dmtcp_launch -i 30 Rscript myscript.R

This would create a checkpoint every 30 seconds.  So, every 30 seconds,
we get a new version of the following files:

    ckpt_myscript.R_*.dmtcp
    dmtcp_restart_script_*.sh
    dmtcp_restart_script.sh  (symbolic link to dmtcp_restart_script_*.sh)
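
If several checkpoint images have accumulated in a directory, something
like the following could be used to see which one is newest.  This is only
a rough sketch: the function name latest_checkpoint is my own invention and
not part of DMTCP, and it assumes the images match ckpt_*.dmtcp as above.

```shell
# Hypothetical helper, not part of DMTCP: print the newest checkpoint image
# in a directory (default: current directory), judged by modification time.
latest_checkpoint() {
    dir=${1:-.}
    # ls -t sorts newest first; head keeps only the first entry.
    # Errors are discarded so that an empty directory prints nothing.
    ls -t "$dir"/ckpt_*.dmtcp 2>/dev/null | head -n 1
}
```

Calling "latest_checkpoint ." in the job submission directory prints one
path, or nothing if no checkpoint has been written yet.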

If a job crashes, one copies the above files to a new directory and
submits a new cloud job:

    [ copy DMTCP executables to job submission directory ]
    ./dmtcp_restart_script.sh -i 30

The script should automatically link to the file ckpt_myscript.R_*.dmtcp .
An alternative approach would be:

    path_to_dmtcp_root/bin/dmtcp_restart -i 30 ckpt_myscript.R_*.dmtcp
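
The two cases (fresh start versus restart) could be folded into one job
script, roughly as sketched here.  The names job_command, DMTCP_ROOT and
CKPT_INTERVAL are assumptions made for illustration; only the dmtcp_launch
and dmtcp_restart_script.sh command lines are taken from the above.  The
function prints the command rather than running it, so that it can be
inspected first.

```shell
# Placeholders: point DMTCP_ROOT at wherever the DMTCP executables were
# copied.  CKPT_INTERVAL is the checkpoint interval in seconds.
DMTCP_ROOT=${DMTCP_ROOT:-path_to_dmtcp_root}
CKPT_INTERVAL=${CKPT_INTERVAL:-30}

# Hypothetical helper, not part of DMTCP: print the command to run for a
# job directory.  $1 = job directory, $2 = R script to launch.
job_command() {
    jobdir=$1
    rscript=$2
    if [ -e "$jobdir/dmtcp_restart_script.sh" ]; then
        # A restart script is present, so resume from the checkpoint.
        printf '%s\n' "$jobdir/dmtcp_restart_script.sh -i $CKPT_INTERVAL"
    else
        # No checkpoint yet, so start the R script under DMTCP.
        printf '%s\n' "$DMTCP_ROOT/bin/dmtcp_launch -i $CKPT_INTERVAL Rscript $rscript"
    fi
}
```

For example, "job_command . myscript.R" prints the dmtcp_launch line in a
fresh directory, and the restart line once dmtcp_restart_script.sh has been
copied in.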

Please don't hesitate to ask if I can help further.

Best,
- Gene

On Mon, Jan 25, 2016 at 05:26:58PM +0530, Chirag Anand wrote:

...
