This shows you the differences between two versions of the page.
cluster:190 [2020/01/16 13:05] hmeij07 |
cluster:190 [2020/01/21 19:37] hmeij07 |
||
---|---|---|---|
Line 7: | Line 7: | ||
* DMTCP (Distributed MultiThreaded Checkpointing) | * DMTCP (Distributed MultiThreaded Checkpointing) | ||
- | DMTCP Checkpoint/ | + | DMTCP Checkpoint/ |
- | a distributed computation. | + | unprivileged users (no root privilege needed). |
- | to the Linux kernel nor to the application binaries. | + | |
- | unprivileged users (no root privilege needed). | + | |
- | from a checkpoint, or even migrate the processes by moving the checkpoint | + | |
- | files to another host prior to restarting. | + | |
- | This is a replacement for the BLCR methods we used to use. | + | This is a replacement for the BLCR methods we used [[cluster: |
< | < | ||
Line 20: | Line 16: | ||
# sample steps to put in your scripts | # sample steps to put in your scripts | ||
- | # make a directory (done by scheduler) | + | # make a directory (first one done by scheduler, second one done by your script) |
mkdir -p / | mkdir -p / | ||
cd / | cd / | ||
Line 37: | Line 33: | ||
0.999578495067887 | 0.999578495067887 | ||
| | ||
- | # run command below with interval | + | # run command below with interval |
[hmeij@cottontail2 111]$ ps | [hmeij@cottontail2 111]$ ps | ||
PID TTY TIME CMD | PID TTY TIME CMD | ||
Line 47: | Line 43: | ||
# the coordinator to which application process was attached | # the coordinator to which application process was attached | ||
- | 1 S hmeij 20008 | + | 1 S hmeij 20008 |
+ | / | ||
# the random port (in case somebody else is also checkpointing on this host) | # the random port (in case somebody else is also checkpointing on this host) | ||
[hmeij@cottontail2 111]$ cat port.txt | [hmeij@cottontail2 111]$ cat port.txt | ||
38913 | 38913 | ||
- | 3840.51 | + | # 5 mins checkpoints, |
- | 0.07 system | + | [hmeij@cottontail2 111]$ sleep 6m; ll / |
- | 1:03:59 elapsed | + | total 2960 |
+ | -rw------- 1 hmeij its 3011726 Jan 16 13:53 ckpt_a.out_24945f6ae3823bbf-40000-fb2d136b9b544.dmtcp\\ | ||
+ | -rwxr--r-- 1 hmeij its 12440 Jan 16 13:53 dmtcp_restart_script_24945f6ae3823bbf-40000-fb2d11ea62fa0.sh\\ | ||
+ | lrwxrwxrwx 1 hmeij its 60 Jan 16 13:53 dmtcp_restart_script.sh -> dmtcp_restart_script_24945f6ae3823bbf-40000-fb2d11ea62fa0.sh\\ | ||
+ | [hmeij@cottontail2 111]$ | ||
+ | real 63m57.287s | ||
+ | user | ||
+ | sys | ||
+ | |||
+ | # let's try a termination of processes and restart | ||
+ | [hmeij@cottontail2 111]$ time dmtcp_launch \\ | ||
+ | --new-coordinator --coord-port | ||
+ | -i 300 --ckptdir / | ||
+ | [1] 29201 | ||
+ | [hmeij@cottontail2 111]$ ps | ||
+ | PID TTY TIME CMD | ||
+ | 20891 pts/1 00:00:00 bash | ||
+ | 29201 pts/1 00:00:00 bash | ||
+ | 29202 pts/1 00:00:01 a.out | ||
+ | 29210 pts/1 00:00:00 dmtcp_coordinat | ||
+ | 29212 pts/1 00:00:00 ps | ||
+ | [hmeij@cottontail2 111]$ sleep 32m; kill -9 29202 29210 | ||
+ | |||
+ | # restart from checkpoints directory (or copy files into desired location) | ||
+ | cd / | ||
+ | ./ | ||
+ | # ps | ||
+ | 0 S hmeij 20891 20890 -bash | ||
+ | 0 R hmeij 31136 20891 [DMTCP: | ||
+ | 1 S hmeij 31201 | ||
+ | |||
+ | # You must make sure the old directory and file exist, otherwise | ||
+ | [40000] ERROR at fileconnection.cpp: | ||
+ | _path = / | ||
+ | Message: File not found. | ||
+ | a.out (40000): Terminating... | ||
+ | # that is because at checkpoint time that file was opened by a.out | ||
# launch new coordinator on random port, log port, 24 hour checkpoints | # launch new coordinator on random port, log port, 24 hour checkpoints |