This is an old revision of the document!



Back

GPU checkpoint/restart

Why I thought this was an easy problem to solve I do not know. CPU checkpoint/restart has come a long way with DMTCP for serial and parallel jobs (including multi-host).

A good overview of the history of GPU checkpoint/restart efforts can be found in this presentation

An excellent in-depth explanation can be found in this article

Some quotes from the article: "… CRAC provides the ability to save and restore the state of CUDA by first using CUDA-specific save/restore operations, and then delegating to a traditional checkpoint-restart package…. In the end, the support of DMTCP for process virtualization and plugins [20] makes it easier to add modular support for CUDA without having to excessively understand details of the internals of the host checkpointing package"
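
To make the two-phase idea in the quote concrete for myself, here is a toy sketch in Python. It is not CRAC's or DMTCP's API and it touches no real GPU: `ToyDevice`, `checkpoint` and `restart` are hypothetical names, and `pickle` merely stands in for the "traditional checkpoint-restart package". It only illustrates the ordering, that is, a device-specific save into host memory first, then a generic host-side checkpoint, with the reverse on restart.

```python
import pickle


class ToyDevice:
    """Hypothetical stand-in for a GPU: state the host checkpointer cannot see."""

    def __init__(self):
        self._buffers = {}

    def upload(self, name, values):
        # Copy host data into "device" memory.
        self._buffers[name] = list(values)

    def download(self, name):
        # Copy "device" memory back to the host.
        return list(self._buffers[name])

    def names(self):
        return list(self._buffers)


def checkpoint(device, host_state):
    # Phase 1: device-specific save, stage every device buffer in host memory.
    staged = {name: device.download(name) for name in device.names()}
    # Phase 2: delegate to a generic host-side serializer
    # (pickle is an assumption here, standing in for a checkpoint package).
    return pickle.dumps({"host": host_state, "device": staged})


def restart(image):
    # Restore host state first, then rebuild device state on a fresh device.
    snapshot = pickle.loads(image)
    device = ToyDevice()
    for name, values in snapshot["device"].items():
        device.upload(name, values)
    return device, snapshot["host"]


if __name__ == "__main__":
    dev = ToyDevice()
    dev.upload("x", [1, 2, 3])
    image = checkpoint(dev, {"step": 7})
    new_dev, host = restart(image)
    print(new_dev.download("x"), host)
```

The point of the split is that the host-side package never needs to understand device internals; it only sees ordinary host memory that the device-specific phase has already filled in.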


cluster/198.1606919363.txt.gz · Last modified: 2020/12/02 09:29 by hmeij07