GPU checkpoint/restart

Why I thought this was an easy problem to solve I do not know. CPU checkpoint/restart has come a long way with DMTCP, which handles both serial and parallel jobs (including multi-host).

A good overview of the history of GPU checkpoint/restart efforts can be found in this presentation:

Transparent Checkpoint and Restart Technology for CUDA® applications

An excellent in-depth explanation can be found in this article:

CRAC: Checkpoint-Restart Architecture for CUDA with Streams and UVM, arXiv:2008.10596v1 [cs.DC], 24 Aug 2020

CRAC git site (note: early development): https://github.com/DMTCP-CRAC/CRAC-early-development

Some quotes from the article: "… CRAC provides the ability to save and restore the state of CUDA by first using CUDA-specific save/restore operations, and then delegating to a traditional checkpoint-restart package …. In the end, the support of DMTCP for process virtualization and plugins [20] makes it easier to add modular support for CUDA without having to excessively understand details of the internals of the host checkpointing package."
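To make the two-phase idea in the quote concrete, here is a toy model in Python. It is only a sketch under loud assumptions: `FakeDevice`, `App`, `checkpoint` and `restart` are hypothetical names invented for illustration, `pickle` merely stands in for a host checkpointer such as DMTCP, and nothing here is CRAC's or CUDA's actual API. The point is the ordering: first drain device-resident state to the host with device-specific operations, then hand the now host-only state to a generic checkpointer. On restart, buffers are re-allocated and the virtual ids are remapped to whatever new handles the device returns.

```python
import pickle


class FakeDevice:
    """Toy stand-in for a GPU: holds memory a host checkpointer cannot see."""

    def __init__(self, first_handle=1):
        self._buffers = {}
        self._next = first_handle

    def malloc(self, nbytes):
        handle = self._next
        self._next += 1
        self._buffers[handle] = bytearray(nbytes)
        return handle

    def write(self, handle, data):
        # Copy host bytes into the start of the device buffer.
        self._buffers[handle][: len(data)] = data

    def read(self, handle):
        # Copy the device buffer back to the host.
        return bytes(self._buffers[handle])


class App:
    """Host-side state: the only part a host checkpointer would capture."""

    def __init__(self, device):
        self.device = device
        self.handles = {}  # virtual id -> real device handle (may change on restart)
        self.sizes = {}    # virtual id -> allocation size in bytes

    def alloc(self, vid, nbytes):
        self.handles[vid] = self.device.malloc(nbytes)
        self.sizes[vid] = nbytes


def checkpoint(app):
    # Phase 1 (device-specific): drain every device buffer to host memory.
    drained = {vid: app.device.read(h) for vid, h in app.handles.items()}
    # Phase 2 (generic): delegate to the host checkpointer; pickle is a stand-in.
    return pickle.dumps({"sizes": app.sizes, "drained": drained})


def restart(image, device):
    state = pickle.loads(image)
    app = App(device)
    for vid, nbytes in state["sizes"].items():
        # Re-create the allocation; the real handle may differ from before,
        # but the virtual id seen by the application stays the same.
        app.alloc(vid, nbytes)
        device.write(app.handles[vid], state["drained"][vid])
    return app
```

A usage sketch: allocate a buffer under a virtual id, write to it, call `checkpoint(app)`, then call `restart(image, FakeDevice(first_handle=100))`. The restored app exposes the same virtual id with the same contents, even though the underlying handle is different. A real implementation also has to deal with streams, UVM and in-flight kernels, which this toy ignores entirely.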
