cluster:198

Differences

This shows you the differences between two versions of the page.

cluster:198 [2020/12/02 09:29]
hmeij07 [GPU checkpoint/restart]
cluster:198 [2020/12/03 08:09]
hmeij07 [GPU checkpoint/restart]
Line 4: Line 4:
 ===== GPU checkpoint/restart =====
 
-Why I thought this was an easy problem to solve I do not know. CPU checkpoint/restart has come a long way with [[cluster:190|DMTCP]] for serial and parallel jobs (including multi-host).
+Why I thought this was an easy problem to solve I do not know. CPU checkpoint/restart has come a long way with [[cluster:190|DMTCP]] for serial and parallel jobs (including multi-host). But the CPU/GPU environment adds much complexity.
 
-A good overview of the history of GPU checkpoint/restart efforts can be found at this presentation
+A good **overview** of the history of GPU checkpoint/restart efforts can be found at this presentation
 
   * [[https://on-demand.gputechconf.com/gtc/2016/presentation/s6429-akira-nukada-transparen-checkpoint-restart-technology-cuda-applications.pdf|Transparent Checkpoint and Restart Technology for CUDA® applications]]
Line 12: Line 12:
 An **excellent** in depth explanation can be found in this article
 
-  * [[https://arxiv.org/pdf/2008.10596.pdf|arXiv:2008.10596v1  [cs.DC]  24 Aug 2020CRAC: Checkpoint-Restart Architecture for CUDA with Streamsand UVM]]
+  * [[https://arxiv.org/pdf/2008.10596.pdf|CRAC: Checkpoint-Restart Architecture for CUDA with Streamsand UVM]]
 
-  * CRAC git site (notice early development): https://github.com/DMTCP-CRAC/CRAC-early-development
+  * git site (notice __early development__): https://github.com/DMTCP-CRAC/CRAC-early-development
 
-Some quotes from the article ...  //CRAC provides the ability to save and restore the state of CUDA by first using CUDA-specificsave/restore operations, and then delegating to a traditional checkpoint-restart package.... In the end, the support of DMTCP for processvirtualization and plugins [20] makes it easier to add modular support for CUDA without hav-ing to excessively understand details of the internals of the host checkpointing package//
+Some quotes from the article ...  //CRAC provides the ability to save and restore the state of CUDA by first using CUDA-specific save/restore operations, and then delegating to a traditional checkpoint-restart package.... In the end, the support of DMTCP for process virtualization and plugins makes it easier to add modular support for CUDA without having to excessively understand details of the internals of the host checkpointing package.//  This plugin will likely only work with recent gpu models.
 
 
-
+  * Slurm DMTCP integration with plugin
+    * https://slurm.schedmd.com/SLUG16/ciemat-cr.pdf
   ​   ​
  
cluster/198.txt · Last modified: 2020/12/03 08:09 by hmeij07