Differences
This shows you the differences between two versions of the page.
cluster:198 [2020/12/02 09:21] hmeij07 created |
cluster:198 [2020/12/02 15:19] hmeij07 [GPU checkpoint/restart] |
===== GPU checkpoint/restart ===== | ===== GPU checkpoint/restart ===== |
| |
Why I thought this was an easy problem to solve I do not know. CPU checkpoint/restart has come a long way with [[cluster:190|DMTCP]] for serial and parallel jobs (including multi-host). | Why I thought this was an easy problem to solve I do not know. CPU checkpoint/restart has come a long way with [[cluster:190|DMTCP]] for serial and parallel jobs (including multi-host). But the CPU/GPU environment adds much complexity. |
| |
A good overview of the history of GPU checkpoint/restart efforts can be found in this presentation | A good **overview** of the history of GPU checkpoint/restart efforts can be found in this presentation |
| |
* [[https://on-demand.gputechconf.com/gtc/2016/presentation/s6429-akira-nukada-transparen-checkpoint-restart-technology-cuda-applications.pdf|Transparent Checkpoint and Restart Technology for CUDA® applications]] | * [[https://on-demand.gputechconf.com/gtc/2016/presentation/s6429-akira-nukada-transparen-checkpoint-restart-technology-cuda-applications.pdf|Transparent Checkpoint and Restart Technology for CUDA® applications]] |
| |
| An **excellent** in-depth explanation can be found in this article |
| |
| * [[https://arxiv.org/pdf/2008.10596.pdf|CRAC: Checkpoint-Restart Architecture for CUDA with Streams and UVM]] |
| |
| * Git repository (note the __early development__ status): https://github.com/DMTCP-CRAC/CRAC-early-development |
| |
| Some quotes from the article ... //CRAC provides the ability to save and restore the state of CUDA by first using CUDA-specific save/restore operations, and then delegating to a traditional checkpoint-restart package.... In the end, the support of DMTCP for process virtualization and plugins makes it easier to add modular support for CUDA without having to excessively understand details of the internals of the host checkpointing package.// |
| |
| |
| * Slurm DMTCP integration via plugin |
| * https://slurm.schedmd.com/SLUG16/ciemat-cr.pdf |
| |
| |
\\ | \\ |