GPU checkpoint/restart

Why I thought this was an easy problem to solve, I do not know. CPU checkpoint/restart has come a long way with DMTCP, for both serial and parallel jobs (including multi-host).

A good overview of the history of GPU checkpoint/restart efforts can be found in this presentation.

An excellent in-depth explanation can be found in this article.

Some quotes from the article: "… CRAC provides the ability to save and restore the state of CUDA by first using CUDA-specific save/restore operations, and then delegating to a traditional checkpoint-restart package…. In the end, the support of DMTCP for process virtualization and plugins [20] makes it easier to add modular support for CUDA without having to excessively understand details of the internals of the host checkpointing package"
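
To make the two-phase idea in that quote concrete for myself, here is a toy sketch in Python. It is not CRAC or DMTCP code and uses no real CUDA calls: FakeDevice, checkpoint and restart are names I made up, and pickle merely stands in for the host checkpointing package. The only thing it tries to show is the ordering: first pull device state back into host memory with device-specific operations, then hand everything to an ordinary host-side checkpointer, and on restart rebuild the device state from the saved copy.

```python
import pickle


class FakeDevice:
    """Hypothetical stand-in for a GPU: memory the host checkpointer cannot see."""

    def __init__(self):
        self._buffers = {}
        self._next_handle = 1

    def malloc(self, size):
        # Allocate a zero-filled device buffer and return an opaque handle.
        handle = self._next_handle
        self._next_handle += 1
        self._buffers[handle] = bytearray(size)
        return handle

    def copy_to_device(self, handle, data):
        # Host -> device copy into an existing buffer.
        self._buffers[handle][:len(data)] = data

    def copy_to_host(self, handle):
        # Device -> host copy; returns an immutable snapshot.
        return bytes(self._buffers[handle])


def checkpoint(device, handles, host_state):
    # Phase 1: device-specific save. Drain every device buffer into host memory.
    saved = {name: device.copy_to_host(h) for name, h in handles.items()}
    # Phase 2: delegate to the host checkpointer (pickle is only a placeholder).
    return pickle.dumps({"host": host_state, "device": saved})


def restart(image):
    state = pickle.loads(image)
    # The old device and its handles are gone, so allocate fresh buffers
    # and copy the saved contents back in.
    device = FakeDevice()
    handles = {}
    for name, data in state["device"].items():
        handle = device.malloc(len(data))
        device.copy_to_device(handle, data)
        handles[name] = handle
    return device, handles, state["host"]
```

Note that restart returns new handles rather than reusing the old ones. In this toy model that is just bookkeeping. With a real GPU, keeping the application's old handles and pointers valid after a restart is a large part of what makes the problem hard.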


