This shows you the differences between two versions of the page.
cluster:190 [2020/01/15 12:53] hmeij07 [DMTCP] |
cluster:190 [2020/01/21 19:34] hmeij07 |
||
---|---|---|---|
Line 14: | Line 14: | ||
files to another host prior to restarting. | files to another host prior to restarting. | ||
- | This is a replacement for THE BLCR methods we used to use. | + | This is a replacement for THE BLCR methods we used to use [[cluster: |
< | < | ||
Line 23: | Line 23: | ||
mkdir -p / | mkdir -p / | ||
cd / | cd / | ||
- | time ./ | + | time ./a.out |
- | [1]+ Done time ./ | + | [1]+ Done time ./ |
real 64m0.513s | real 64m0.513s | ||
user 64m0.813s | user 64m0.813s | ||
Line 32: | Line 32: | ||
total 1108 | total 1108 | ||
-rwxr--r-- 1 hmeij its 1126428 Jan 14 14:36 a.out | -rwxr--r-- 1 hmeij its 1126428 Jan 14 14:36 a.out | ||
- | -rw-r--r-- 1 hmeij its 0 Jan 14 14:37 a.out.time | ||
-rw-r--r-- 1 hmeij its 49 Jan 14 15:41 fid.txt | -rw-r--r-- 1 hmeij its 49 Jan 14 15:41 fid.txt | ||
Line 38: | Line 37: | ||
0.999578495067887 | 0.999578495067887 | ||
| | ||
- | # run command below with interval | + | # run command below with interval |
[hmeij@cottontail2 111]$ ps | [hmeij@cottontail2 111]$ ps | ||
PID TTY TIME CMD | PID TTY TIME CMD | ||
Line 48: | Line 47: | ||
# the coordinator to which application process was attached | # the coordinator to which application process was attached | ||
- | 1 S hmeij 20008 | + | 1 S hmeij 20008 |
+ | / | ||
# the random port (in case somebody else is also checkpointing on this host) | # the random port (in case somebody else is also checkpointing on this host) | ||
[hmeij@cottontail2 111]$ cat port.txt | [hmeij@cottontail2 111]$ cat port.txt | ||
38913 | 38913 | ||
+ | |||
+ | # 5 mins checkpoints, | ||
+ | [hmeij@cottontail2 111]$ sleep 6m; ll / | ||
+ | total 2960 | ||
+ | -rw------- 1 hmeij its 3011726 Jan 16 13:53 ckpt_a.out_24945f6ae3823bbf-40000-fb2d136b9b544.dmtcp\\ | ||
+ | -rwxr--r-- 1 hmeij its 12440 Jan 16 13:53 dmtcp_restart_script_24945f6ae3823bbf-40000-fb2d11ea62fa0.sh\\ | ||
+ | lrwxrwxrwx 1 hmeij its 60 Jan 16 13:53 dmtcp_restart_script.sh -> dmtcp_restart_script_24945f6ae3823bbf-40000-fb2d11ea62fa0.sh\\ | ||
+ | [hmeij@cottontail2 111]$ | ||
+ | real 63m57.287s | ||
+ | user 63m54.703s | ||
+ | sys | ||
+ | |||
+ | # let's try terminating the processes and restarting
+ | [hmeij@cottontail2 111]$ time dmtcp_launch \\ | ||
+ | --new-coordinator --coord-port 0 --port-file port.txt \\ | ||
+ | -i 300 --ckptdir / | ||
+ | [1] 29201 | ||
+ | [hmeij@cottontail2 111]$ ps | ||
+ | PID TTY TIME CMD | ||
+ | 20891 pts/1 00:00:00 bash | ||
+ | 29201 pts/1 00:00:00 bash | ||
+ | 29202 pts/1 00:00:01 a.out | ||
+ | 29210 pts/1 00:00:00 dmtcp_coordinat | ||
+ | 29212 pts/1 00:00:00 ps | ||
+ | [hmeij@cottontail2 111]$ sleep 32m; kill -9 29202 29210 | ||
+ | |||
+ | # restart from checkpoints directory (or copy files into desired location) | ||
+ | cd / | ||
+ | ./ | ||
+ | # ps | ||
+ | 0 S hmeij 20891 20890 -bash | ||
+ | 0 R hmeij 31136 20891 [DMTCP: | ||
+ | 1 S hmeij 31201 | ||
+ | |||
+ | # You must make sure the old directory and file exist, otherwise
+ | [40000] ERROR at fileconnection.cpp: | ||
+ | _path = / | ||
+ | Message: File not found. | ||
+ | a.out (40000): Terminating... | ||
+ | # that is because at checkpoint time that file was opened by a.out | ||
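
The "File not found" failure above can be caught before the restart script is run. The following is a minimal sketch, not part of the original page: `check_restart_paths` and its arguments are hypothetical names, and the directory and file passed to it stand in for whatever a.out had open at checkpoint time (the actual path is elided in the error output above).

```shell
#!/bin/sh
# Pre-flight check before running a DMTCP restart script.
# Assumption: run_dir and file_name are placeholders for the directory and
# file that a.out had open at checkpoint time; they are not from the page.

check_restart_paths() {
    run_dir=$1
    file_name=$2

    # Recreate the original working directory if it was removed.
    mkdir -p "$run_dir" || return 2

    # The file itself must be restored by the user (for example copied back
    # from a backup); this function only reports whether it is present.
    if [ ! -e "$run_dir/$file_name" ]; then
        echo "missing: $run_dir/$file_name" >&2
        return 1
    fi
    return 0
}

# Hypothetical usage:
#   check_restart_paths /path/to/old/rundir fid.txt && ./dmtcp_restart_script.sh
```

The function deliberately does not create an empty placeholder file; whether the restarted process would accept one depends on how the application uses the file, so restoring the original is the safer assumption.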