From: Ross Boylan <ross at biostat.ucsf.edu>
In answer to the other question about using OS checkpointing
facilities, I haven't tried them since the application will be running
on a cluster. More precisely, the optimization will be driven from a
single machine, but the calculation of the objective function will be
distributed. So checkpointing at the level of the optimization
function is a good fit for my needs. There are some cluster OSes that
provide a kind of unified process space across the processors (Scyld,
MOSIX), but we're not using them, and checkpointing them is an unsolved
problem, or at least it was a couple of years ago when I looked into
it.
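
As a rough illustration of checkpointing at the level of the
optimization loop, here is a minimal Python sketch. It is not the
actual application: the thread pool merely stands in for the cluster,
and partial_objective, the toy one-dimensional pattern search, the
JSON state layout, and the checkpoint path are all hypothetical names
chosen for the example. The point is only that the driver saves its
small state after every iteration, so a restart resumes from the last
saved step while the workers stay stateless.

```python
import json
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def partial_objective(x, shard):
    # Hypothetical per-worker piece of the objective: squared error on one shard.
    return sum((x - d) ** 2 for d in shard)


def objective(x, shards, pool):
    # Fan the shards out to the workers and add up the partial results.
    return sum(pool.map(lambda s: partial_objective(x, s), shards))


def save_checkpoint(path, state):
    # Write to a temporary file, then rename it over the old checkpoint,
    # so a crash cannot leave a half-written file behind.
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(os.path.abspath(path)))
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)


def load_checkpoint(path, default):
    # Resume from the saved state if there is one, else start fresh.
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return default


def minimize(shards, path, x0=0.0, step=1.0, max_iter=200, tol=1e-8):
    # Toy 1-D pattern search, assumed only for illustration.
    state = load_checkpoint(path, {"x": x0, "step": step, "iter": 0})
    with ThreadPoolExecutor(max_workers=4) as pool:
        while state["iter"] < max_iter and state["step"] > tol:
            x, h = state["x"], state["step"]
            # x is listed first so that ties keep the current point.
            best = min((x, x - h, x + h),
                       key=lambda c: objective(c, shards, pool))
            if best == x:
                state["step"] = h / 2  # no improvement: shrink the step
            else:
                state["x"] = best      # improvement: move there
            state["iter"] += 1
            save_checkpoint(path, state)  # driver-level checkpoint
    return state
```

Calling minimize again with the same checkpoint path after an
interruption picks up at the saved iteration rather than at x0. In a
real setup the state would also need to hold whatever the optimizer
itself carries between iterations (gradients, Hessian approximations,
and so on), which this sketch does not attempt.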