This shows you the differences between two versions of the page.
cluster:190 [2020/01/14 19:50] hmeij07 created |
cluster:190 [2020/01/17 15:23] hmeij07 [DMTCP] |
||
---|---|---|---|
Line 4: | Line 4: | ||
==== DMTCP ==== | ==== DMTCP ==== | ||
+ | * https:// | ||
+ | * DMTCP (Distributed MultiThreaded Checkpointing) | ||
+ | |||
+ | DMTCP Checkpoint/Restart allows one to transparently checkpoint to disk | ||
+ | a distributed computation. It works under Linux, with no modifications | ||
+ | to the Linux kernel nor to the application binaries. It can be used by | ||
+ | unprivileged users (no root privilege needed). One can later restart | ||
+ | from a checkpoint, or even migrate the processes by moving the checkpoint | ||
+ | files to another host prior to restarting. | ||
+ | |||
+ | This replaces the BLCR methods we used previously. | ||
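The restart-from-checkpoint behaviour described above lends itself to a "resume if possible, otherwise start fresh" wrapper in a job script. The following is a minimal sketch, not the site's actual script: `CKPT_DIR`, `APP`, `choose_action`, and `run_job` are hypothetical names chosen for illustration, and it assumes `dmtcp_launch` is on the PATH and that DMTCP has written its `dmtcp_restart_script.sh` link into the checkpoint directory.

```shell
#!/bin/bash
# Sketch only: CKPT_DIR and APP are hypothetical placeholders, not paths from this page.
CKPT_DIR="${CKPT_DIR:-$PWD/checkpoints}"
APP="${APP:-./a.out}"

# Print "restart" if an executable DMTCP restart script exists in the
# given directory, otherwise print "launch".
choose_action() {
    local dir="$1"
    if [ -x "$dir/dmtcp_restart_script.sh" ]; then
        echo "restart"
    else
        echo "launch"
    fi
}

# Resume from the last checkpoint, or start a fresh run under a new coordinator.
run_job() {
    mkdir -p "$CKPT_DIR"
    if [ "$(choose_action "$CKPT_DIR")" = "restart" ]; then
        # Run the generated restart script from the checkpoint directory.
        (cd "$CKPT_DIR" && ./dmtcp_restart_script.sh)
    else
        # Same options as the sample steps on this page, with a 5 minute interval.
        dmtcp_launch --new-coordinator --coord-port 0 --port-file port.txt \
            --interval 300 --ckptdir "$CKPT_DIR" "$APP"
    fi
}

# Nothing is started unless the script is called with the argument "run".
if [ "${1:-}" = "run" ]; then
    run_job
fi
```

Calling the script with `run` on first submission launches the application under a new coordinator; calling it again after an interruption picks up the restart script instead. Keeping the decision in a separate function makes it easy to check without DMTCP installed.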
+ | |||
+ | < | ||
+ | |||
+ | # sample steps to put in your scripts | ||
+ | |||
+ | # make a directory (done by scheduler) | ||
+ | mkdir -p / | ||
+ | cd / | ||
+ | time ./a.out | ||
+ | |||
+ | [1]+ Done time ./ | ||
+ | real 64m0.513s | ||
+ | user 64m0.813s | ||
+ | sys | ||
+ | |||
+ | total 1108 | ||
+ | -rwxr--r-- 1 hmeij its 1126428 Jan 14 14:36 a.out | ||
+ | -rw-r--r-- 1 hmeij its 49 Jan 14 15:41 fid.txt | ||
+ | |||
+ | [hmeij@cottontail2 111]$ cat fid.txt | ||
+ | 0.999578495067887 | ||
+ | | ||
+ | # run command below with interval 300 (every 5 mins) | ||
+ | [hmeij@cottontail2 111]$ ps | ||
+ | PID TTY TIME CMD | ||
+ | 20004 pts/1 00:00:00 time | ||
+ | 20008 pts/1 00:00:00 dmtcp_coordinat | ||
+ | 20010 pts/1 00:03:02 a.out | ||
+ | 20060 pts/1 00:00:00 ps | ||
+ | 20891 pts/1 00:00:00 bash | ||
+ | |||
+ | # the coordinator to which the application process is attached | ||
+ | 1 S hmeij 20008 | ||
+ | / | ||
+ | # the random port (in case somebody else is also checkpointing on this host) | ||
+ | [hmeij@cottontail2 111]$ cat port.txt | ||
+ | 38913 | ||
+ | |||
+ | # 5 mins checkpoints, | ||
+ | [hmeij@cottontail2 111]$ sleep 6m; ll / | ||
+ | total 2960 | ||
+ | -rw------- 1 hmeij its 3011726 Jan 16 13:53 ckpt_a.out_24945f6ae3823bbf-40000-fb2d136b9b544.dmtcp | ||
+ | -rwxr--r-- 1 hmeij its 12440 Jan 16 13:53 dmtcp_restart_script_24945f6ae3823bbf-40000-fb2d11ea62fa0.sh | ||
+ | lrwxrwxrwx 1 hmeij its 60 Jan 16 13:53 dmtcp_restart_script.sh -> dmtcp_restart_script_24945f6ae3823bbf-40000-fb2d11ea62fa0.sh | ||
+ | [hmeij@cottontail2 111]$ | ||
+ | real 63m57.287s | ||
+ | user 63m54.703s | ||
+ | sys | ||
+ | |||
+ | # let's try terminating the processes and restarting | ||
+ | [hmeij@cottontail2 111]$ time dmtcp_launch \ | ||
+ | --new-coordinator --coord-port 0 --port-file port.txt \ | ||
+ | -i 300 --ckptdir / | ||
+ | [1] 29201 | ||
+ | [hmeij@cottontail2 111]$ ps | ||
+ | PID TTY TIME CMD | ||
+ | 20891 pts/1 00:00:00 bash | ||
+ | 29201 pts/1 00:00:00 bash | ||
+ | 29202 pts/1 00:00:01 a.out | ||
+ | 29210 pts/1 00:00:00 dmtcp_coordinat | ||
+ | 29212 pts/1 00:00:00 ps | ||
+ | [hmeij@cottontail2 111]$ sleep 32m; kill -9 29202 29210 | ||
+ | |||
+ | # restart from checkpoints directory (or copy files into desired location) | ||
+ | cd / | ||
+ | ./ | ||
+ | # ps | ||
+ | 0 S hmeij 20891 20890 -bash | ||
+ | 0 R hmeij 31136 20891 [DMTCP: | ||
+ | 1 S hmeij 31201 | ||
+ | |||
+ | |||
+ | |||
+ | # launch a new coordinator on a random port, log the port to port.txt, | ||
+ | # and checkpoint every 24 hours (86400 seconds) | ||
+ | # make sure the destination directory for checkpoints exists first | ||
+ | |||
+ | dmtcp_launch --new-coordinator \ | ||
+ | --coord-port 0 --port-file port.txt --interval 86400 \ | ||
+ | --ckptdir / | ||
+ | time ./a.out | ||
+ | |||
+ | |||
+ | </ | ||
==== Quick-Start Guide ==== | ==== Quick-Start Guide ==== | ||