Newly developed CUDA tool to keep track of. It sounds good initially, but there are some items to check and test out.
Also need to track DMTCP CRAC tool that almost worked.
CRIU (Checkpoint/Restore in Userspace) is an open-source checkpointing utility. Works with CUDA driver 550 and higher, although I do not see it in exx96's cuda-12.4 installation. It is present, however, in test's (11.6) and mwgpu256's (12.6) cuda installations.
NVIDIA GPUs provide functionality beyond that of a standard Linux kernel. cuda-checkpoint provides capabilities to support GPU application checkpointing by supporting those functionalities.
Download the tool from the URL above, which points to the git source.
But …
That's weird. What if the CPU and GPU work together to solve a problem, primarily on the GPU but helped by the CPU (as in LAMMPS)? Does CRIU take care of that?
Heh? Why not continue running? We just want a checkpoint to fall back on in the event of a crash.
Also weird, how does this work within a scheduler environment?
Questions to answer and test out.
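One thing to try for the "why not continue running" question is sketched below. It only builds and prints the command lines for a single checkpoint cycle and runs nothing. Assumptions, none of them tested here: `cuda-checkpoint --toggle --pid` suspends or resumes the CUDA state of a process, `criu dump --leave-running` leaves the process alive after the dump, and a second toggle is then needed to resume GPU work. The function name `checkpoint_commands`, the PID 12345, and the images directory `ckpt` are placeholders.

```python
import shlex


def checkpoint_commands(pid, images_dir, leave_running=True):
    """Return argv lists for one checkpoint cycle. Nothing is executed here."""
    # Assumed flags: toggle the CUDA state of the process (suspend or resume).
    toggle = ["cuda-checkpoint", "--toggle", "--pid", str(pid)]
    # CRIU dump of the process tree into images_dir.
    dump = ["criu", "dump", "--shell-job",
            "--images-dir", images_dir, "--tree", str(pid)]
    if leave_running:
        # Ask CRIU not to kill the process after the dump.
        dump.append("--leave-running")
    commands = [toggle, dump]
    if leave_running:
        # Assumption: a second toggle resumes CUDA in the still-running process.
        commands.append(list(toggle))
    return commands


if __name__ == "__main__":
    for argv in checkpoint_commands(12345, "ckpt"):
        print(shlex.join(argv))
```

Within a scheduler, a sequence like this would presumably have to run on the compute node against the job's PID, which is one of the items still to test.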