
cuda-checkpoint

Newly developed CUDA tool to keep track of. It sounds good initially, but there are some items to check and test out.

Also need to track DMTCP CRAC tool that almost worked.

CRIU (Checkpoint/Restore in Userspace) is an open-source checkpointing utility. Works with CUDA driver 550 and higher, although I do not see it in exx96's cuda-12.4 installation. It is present, however, in test's (11.6) and mwgpu256's (12.6) cuda installations.
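A quick way to check a node against the driver threshold mentioned above is to compare the major driver version with 550. This is a minimal sketch, not a tested tool: the helper names are made up for illustration, and it assumes nvidia-smi is on the PATH and accepts the --query-gpu=driver_version query.

```python
import subprocess

MIN_DRIVER_MAJOR = 550  # threshold taken from the note above


def driver_major(version):
    """Return the major number of a driver version string such as '550.54.15'."""
    return int(version.strip().split(".")[0])


def driver_supports_checkpoint(version):
    """True if the major driver version meets the 550 threshold."""
    return driver_major(version) >= MIN_DRIVER_MAJOR


def query_driver_versions():
    """Ask nvidia-smi for the driver version; it prints one line per GPU."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        stdout=subprocess.PIPE,
        universal_newlines=True,
        check=True,
    ).stdout
    return [line.strip() for line in out.splitlines() if line.strip()]
```

On a GPU node, looping over query_driver_versions() and passing each entry to driver_supports_checkpoint() shows which hosts are candidates for testing.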

NVIDIA GPUs provide functionality beyond that of a standard Linux kernel. cuda-checkpoint adds the capabilities needed to checkpoint GPU applications by handling that extra functionality.

Download the tool from the URL above, which points to the git source.

But …

  • cuda-checkpoint does not suspend CPU threads

That's weird. What if the cpu and gpu work together to solve a problem? Primarily on gpu but helped by cpu (like in lammps). CRIU takes care of that?

  • When suspending an application it frees the application's GPU resources

Heh? Why not continue running? We just want a checkpoint to fall back on in the event of a crash.

  • GPUs are re-acquired by the process on a resume

Also weird, how does this work within a scheduler environment?

Questions to answer and test out.
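As a starting point for testing, the suspend/dump and restore/resume cycle could be scripted roughly as below. This is a sketch under assumptions: the helper names and the "demo" directory are hypothetical, and the flags are the ones I recall from the cuda-checkpoint demo and the CRIU command line, so verify them with cuda-checkpoint --help and criu --help before relying on them. By default the script only prints the commands (dry run).

```python
import shlex
import subprocess


def checkpoint_commands(pid, images_dir):
    """Suspend CUDA state first, then dump the process tree with CRIU."""
    return [
        ["cuda-checkpoint", "--toggle", "--pid", str(pid)],
        ["criu", "dump", "--shell-job", "--images-dir", images_dir,
         "--tree", str(pid)],
    ]


def restore_commands(pid, images_dir):
    """Restore the process with CRIU, then let it re-acquire the GPU."""
    return [
        ["criu", "restore", "--shell-job", "--restore-detached",
         "--images-dir", images_dir],
        ["cuda-checkpoint", "--toggle", "--pid", str(pid)],
    ]


def run_all(commands, dry_run=True):
    """Return the command lines; execute them only when dry_run is False."""
    lines = []
    for cmd in commands:
        lines.append(" ".join(shlex.quote(part) for part in cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)
    return lines


def show_plan(pid, images_dir="demo"):
    """Print the full cycle for one process without running anything."""
    for line in run_all(checkpoint_commands(pid, images_dir)):
        print(line)
    for line in run_all(restore_commands(pid, images_dir)):
        print(line)
```

One thing to try for the "why not continue running" question: criu dump has a --leave-running option. Whether a dump with that option followed by a second toggle gives a usable "checkpoint and keep going" cycle for a GPU job, and how that interacts with the scheduler's GPU allocation, is exactly what needs testing.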



cluster/230.txt · Last modified: 2025/03/24 19:37 by hmeij07