\\
**[[cluster:0|Back]]**

===== cuda-checkpoint =====

A newly developed CUDA tool to keep track of. It sounds good initially, but there are some items to check and test. Also need to keep tracking the [[cluster:211|DMTCP CRAC]] tool, which almost worked.

  * https://developer.nvidia.com/blog/checkpointing-cuda-applications-with-criu/

CRIU (Checkpoint/Restore in Userspace) is an open-source checkpointing utility. ''cuda-checkpoint'' works with CUDA driver 550 and higher, although I do not see it in exx96's cuda-12.4 installation. It is present in test's (11.6) and mwgpu256's (12.6) CUDA installations.

NVIDIA GPUs provide functionality beyond that of a standard Linux kernel, so CRIU alone cannot manage their state. ''cuda-checkpoint'' adds the capabilities needed to checkpoint GPU applications by handling that functionality.

Download the tool from the git source linked in the URL above. But ...

  * cuda-checkpoint does not suspend CPU threads

That's weird. What if the CPU and GPU work together to solve a problem, primarily on the GPU but helped by the CPU (as in LAMMPS)? Does CRIU take care of that?

  * When suspending an application, it frees the GPU resources

Heh? Why not continue running? We just want a checkpoint to fall back on in the event of a crash.

  * GPUs are re-acquired by the process on a resume

Also weird. How does this work within a scheduler environment?

Questions to answer and test out.
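
As a starting point for those tests, here is a minimal Python sketch of the command sequence as I read it in the blog post linked above: toggle the CUDA state with ''cuda-checkpoint'', dump with CRIU, then restore with CRIU and toggle again. The helper names (''checkpoint_commands'', ''restore_commands'', ''run''), the PID ''1234'' and the image directory ''demo'' are placeholders of mine, not part of either tool. The flags are the ones I recall from the blog and should be verified against ''cuda-checkpoint --help'' and ''criu --help'' on our nodes. The sketch only prints the commands unless ''dry_run=False'' is passed.

```python
import shlex
import subprocess


def checkpoint_commands(pid, images_dir):
    """Build the suspend-then-dump sequence (flags assumed from the NVIDIA blog)."""
    return [
        # Suspend CUDA state: device memory is moved to host, GPU is released.
        ["cuda-checkpoint", "--toggle", "--pid", str(pid)],
        # Dump the (now CPU-only) process tree to images_dir with CRIU.
        ["criu", "dump", "--shell-job", "--images-dir", str(images_dir),
         "--tree", str(pid)],
    ]


def restore_commands(pid, images_dir):
    """Build the restore-then-resume sequence (flags assumed from the NVIDIA blog)."""
    return [
        # Restore the process from the CRIU images; it comes back CUDA-suspended.
        ["criu", "restore", "--shell-job", "--restore-detached",
         "--images-dir", str(images_dir)],
        # Resume CUDA state: the process re-acquires the GPU.
        ["cuda-checkpoint", "--toggle", "--pid", str(pid)],
    ]


def run(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print(shlex.join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Placeholder PID and directory; dry run only, nothing is executed.
    run(checkpoint_commands(1234, "demo"))
    run(restore_commands(1234, "demo"))
```

Keeping the sequence in a small script makes it easier to see where a scheduler hook would have to run each half, which is one of the open questions above.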