I do not know why I thought this would be an easy problem to solve. CPU checkpoint/restart has come a long way with DMTCP, which handles both serial and parallel jobs (including multi-host).
A good overview of the history of GPU checkpoint/restart efforts can be found in this presentation.
An excellent in-depth explanation can be found in this article.
Some quotes from the article: … CRAC provides the ability to save and restore the state of CUDA by first using CUDA-specific save/restore operations, and then delegating to a traditional checkpoint-restart package…. In the end, the support of DMTCP for process virtualization and plugins [20] makes it easier to add modular support for CUDA without having to excessively understand details of the internals of the host checkpointing package.