cluster:190
Differences
| cluster:190 [2020/01/17 13:07] – hmeij07 | cluster:190 [2020/09/28 11:38] (current) – hmeij07 | ||
|---|---|---|---|
| Line 7: | Line 7: | ||
| * DMTCP (Distributed MultiThreaded Checkpointing) | * DMTCP (Distributed MultiThreaded Checkpointing) | ||
| - | DMTCP Checkpoint/ | + | DMTCP Checkpoint/ |
| - | a distributed computation. | + | unprivileged users (no root privilege needed). |
| - | to the Linux kernel nor to the application binaries. | + | |
| - | unprivileged users (no root privilege needed). | + | |
| - | from a checkpoint, or even migrate the processes by moving the checkpoint | + | |
| - | files to another host prior to restarting. | + | |
| - | This is a replacement for THE BLCR methods we used to use. | + | This is a replacement for THE BLCR methods we used [[cluster: |
| + | |||
| + | Write your checkpoint files in ''/ | ||
| < | < | ||
| Line 20: | Line 18: | ||
| # sample steps to put in your scripts | # sample steps to put in your scripts | ||
| - | # make a directory (done by scheduler) | + | # make a directory (first one done by scheduler, second one done by your script) |
| - | mkdir -p / | + | mkdir -p / |
| cd / | cd / | ||
| + | |||
| + | # invoke sample program which generates one line of output | ||
| time ./a.out | time ./a.out | ||
| Line 36: | Line 36: | ||
| [hmeij@cottontail2 111]$ cat fid.txt | [hmeij@cottontail2 111]$ cat fid.txt | ||
| 0.999578495067887 | 0.999578495067887 | ||
| + | |||
| + | # for example, with a 24-hour checkpoint interval | ||
| + | # launch new coordinator on random port, log port, 24 hour checkpoints | ||
| + | # make sure you create destination dir for checkpoints | ||
| + | dmtcp_launch --new-coordinator \ | ||
| + | --coord-port 0 --port-file port.txt --interval 86400 \ | ||
| + | --ckptdir / | ||
| + | time ./a.out | ||
| + | |||
| | | ||
| - | # run command | + | # run command |
| [hmeij@cottontail2 111]$ ps | [hmeij@cottontail2 111]$ ps | ||
| PID TTY TIME CMD | PID TTY TIME CMD | ||
| Line 49: | Line 59: | ||
| 1 S hmeij 20008 | 1 S hmeij 20008 | ||
| / | / | ||
| + | |||
| # the random port (in case somebody else is also checkpointing on this host) | # the random port (in case somebody else is also checkpointing on this host) | ||
| [hmeij@cottontail2 111]$ cat port.txt | [hmeij@cottontail2 111]$ cat port.txt | ||
| Line 59: | Line 70: | ||
| -rwxr--r-- 1 hmeij its 12440 Jan 16 13:53 dmtcp_restart_script_24945f6ae3823bbf-40000-fb2d11ea62fa0.sh\\ | -rwxr--r-- 1 hmeij its 12440 Jan 16 13:53 dmtcp_restart_script_24945f6ae3823bbf-40000-fb2d11ea62fa0.sh\\ | ||
| lrwxrwxrwx 1 hmeij its 60 Jan 16 13:53 dmtcp_restart_script.sh -> dmtcp_restart_script_24945f6ae3823bbf-40000-fb2d11ea62fa0.sh\\ | lrwxrwxrwx 1 hmeij its 60 Jan 16 13:53 dmtcp_restart_script.sh -> dmtcp_restart_script_24945f6ae3823bbf-40000-fb2d11ea62fa0.sh\\ | ||
| + | |||
| [hmeij@cottontail2 111]$ | [hmeij@cottontail2 111]$ | ||
| real 63m57.287s | real 63m57.287s | ||
| Line 69: | Line 81: | ||
| -i 300 --ckptdir / | -i 300 --ckptdir / | ||
| [1] 29201 | [1] 29201 | ||
| + | |||
| [hmeij@cottontail2 111]$ ps | [hmeij@cottontail2 111]$ ps | ||
| PID TTY TIME CMD | PID TTY TIME CMD | ||
| Line 76: | Line 89: | ||
| 29210 pts/1 00:00:00 dmtcp_coordinat | 29210 pts/1 00:00:00 dmtcp_coordinat | ||
| 29212 pts/1 00:00:00 ps | 29212 pts/1 00:00:00 ps | ||
| + | |||
| + | # terminate halfway through | ||
| [hmeij@cottontail2 111]$ sleep 32m; kill -9 29202 29210 | [hmeij@cottontail2 111]$ sleep 32m; kill -9 29202 29210 | ||
| + | # restart from checkpoints directory (or copy files into desired location) | ||
| + | cd / | ||
| + | ./ | ||
| - | # launch new coordinator on random port, log port, 24 hour checkpoints | + | # ps |
| - | # make sure you create destination dir for checkpoints | + | 0 S hmeij 20891 20890 -bash |
| + | 0 R hmeij 31136 20891 [DMTCP: | ||
| + | 1 S hmeij 31201 | ||
| - | dmtcp_launch --new-coordinator \ | + | # You must make sure the old directory and file exist; otherwise the restart fails with: | ||
| - | --coord-port 0 --port-file port.txt --interval 86400 \ | + | [40000] ERROR at fileconnection.cpp:737 in refill; |
| - | | + | REASON=' |
| - | | + | _path = / |
| + | Message: File not found. | ||
| + | a.out (40000): Terminating... | ||
| + | # that is because at checkpoint time that file was opened by a.out | ||
| + | |||
| + | # The process will pick up from the last checkpoint | ||
| + | # and write output to original work directory | ||
| </ | </ | ||
| + | |||
| ==== Quick-Start Guide ==== | ==== Quick-Start Guide ==== | ||
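
The launch-or-restart decision walked through in the diff above can be sketched as a small helper for a job script. This is a minimal sketch under stated assumptions, not the cluster's actual script: the function names `choose_action` and `launch_cmd` and the variables `CKPT_DIR` and `INTERVAL` are hypothetical. Only the `dmtcp_launch` options and the `dmtcp_restart_script.sh` file name are taken from the page. The helper prints the launch command instead of running it, so it can be checked on a host without DMTCP installed.

```shell
#!/bin/bash
# Hypothetical defaults (assumptions, not from the page): checkpoints go
# to ./checkpoints and are taken every 24 hours (86400 seconds).
CKPT_DIR="${CKPT_DIR:-$PWD/checkpoints}"
INTERVAL="${INTERVAL:-86400}"

# Print "restart" if the given checkpoint directory already holds an
# executable dmtcp_restart_script.sh, otherwise print "launch".
choose_action() {
    if [ -x "$1/dmtcp_restart_script.sh" ]; then
        echo restart
    else
        echo launch
    fi
}

# Print (do not run) the dmtcp_launch command line for the given program,
# using the same options as the sample steps on the page.
launch_cmd() {
    echo dmtcp_launch --new-coordinator \
        --coord-port 0 --port-file port.txt --interval "$INTERVAL" \
        --ckptdir "$CKPT_DIR" "$@"
}

# Example use inside a job script (left commented out on purpose):
#   mkdir -p "$CKPT_DIR"
#   if [ "$(choose_action "$CKPT_DIR")" = restart ]; then
#       "$CKPT_DIR/dmtcp_restart_script.sh"
#   else
#       $(launch_cmd ./a.out)
#   fi
```

As the page notes, a restart fails if a file that was open at checkpoint time is no longer at its original path, so the original work directory needs to be recreated before the restart script is called.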
cluster/190.1579266475.txt.gz · Last modified: by hmeij07
