cluster:198 [DokuWiki]


Differences

This shows you the differences between two versions of the page.

cluster:198 [2020/12/02 09:30]
hmeij07 [GPU checkpoint/restart]
cluster:198 [2020/12/02 14:22]
hmeij07 [GPU checkpoint/restart]
Line 4: Line 4:
 ===== GPU checkpoint/restart =====
 
-Why I thought this was an easy problem to solve I do not know. CPU checkpoint/restart has come a long way with [[cluster:190|DMTCP]] for serial and parallel jobs (including multi-host).
+Why I thought this was an easy problem to solve I do not know. CPU checkpoint/restart has come a long way with [[cluster:190|DMTCP]] for serial and parallel jobs (including multi-host). But the CPU/GPU environment adds much complexity.
 
-A good overview of the history of GPU checkpoint/restart efforts can be found at this presentation
+A good **overview** of the history of GPU checkpoint/restart efforts can be found at this presentation
 
   * [[https://on-demand.gputechconf.com/gtc/2016/presentation/s6429-akira-nukada-transparen-checkpoint-restart-technology-cuda-applications.pdf|Transparent Checkpoint and Restart Technology for CUDA® applications]]
Line 19: Line 19:
  
  
-
+  * Slurm DMTCP integration via plugin
+    * https://slurm.schedmd.com/SLUG16/ciemat-cr.pdf
      
  
cluster/198.txt · Last modified: 2020/12/03 08:09 by hmeij07