\\
**[[cluster:0|Back]]**
==== DMTCP ====
* https://sourceforge.net/projects/dmtcp/
* DMTCP (Distributed MultiThreaded Checkpointing)
DMTCP Checkpoint/Restart allows one to transparently checkpoint to disk a distributed computation. It works under Linux, with no modifications to the Linux kernel nor to the application binaries. It can be used by
unprivileged users (no root privilege needed). One can later restart from a checkpoint, or even migrate the processes by moving the checkpoint files to another host prior to restarting.
This is a replacement for the BLCR methods we used in [[cluster:147|BLCR Checkpoint in OL3 -serial]] or [[cluster:148|BLCR Checkpoint in OL3 - parallel]]. BLCR is no longer being developed, and today's brief power outage removed the BLCR kernel module HPCC-wide. So learn DMTCP. I have not provided wrappers, but you can follow the same logic as we used with BLCR.
Write your checkpoint files to ''/sanscratch/checkpoints/JOBPID'' so they do not count against your quota. The scheduler will not create this directory for you; you must create it in your submit job. Directories are automatically deleted once they are 120 days old.
# sample steps to put in your scripts
# make a directory (first one done by scheduler, second one done by your script)
mkdir -p /sanscratch/111 /sanscratch/checkpoints/111
cd /sanscratch/111
# invoke sample program which generates one line of output
time ./a.out
[1]+ Done time ./a.out
real 64m0.513s
user 64m0.813s
sys 0m0.022s
total 1108
-rwxr--r-- 1 hmeij its 1126428 Jan 14 14:36 a.out
-rw-r--r-- 1 hmeij its 49 Jan 14 15:41 fid.txt
[hmeij@cottontail2 111]$ cat fid.txt
0.999578495067887 7.978578277139264E-005
# for example, with 24 hours checkpoint interval
# launch new coordinator on random port, log port, 24 hour checkpoints
# make sure you create destination dir for checkpoints
dmtcp_launch --new-coordinator \
--coord-port 0 --port-file port.txt --interval 86400 \
--ckptdir /sanscratch/checkpoints/111 \
time ./a.out
# run command above with interval 300 (every 5 mins)
[hmeij@cottontail2 111]$ ps
PID TTY TIME CMD
20004 pts/1 00:00:00 time
20008 pts/1 00:00:00 dmtcp_coordinat
20010 pts/1 00:03:02 a.out
20060 pts/1 00:00:00 ps
20891 pts/1 00:00:00 bash
# the coordinator to which application process was attached
1 S hmeij 20008 1 0 80 0 - 4665 ep_pol 07:46 pts/1 00:00:00 \\
/usr/bin/dmtcp_coordinator --quiet --exit-on-last --daemon
# the random port (in case somebody else is also checkpointing on this host)
[hmeij@cottontail2 111]$ cat port.txt
38913
# 5 mins checkpoints, little impact
[hmeij@cottontail2 111]$ sleep 6m; ll /sanscratch/checkpoints/111/
total 2960
-rw------- 1 hmeij its 3011726 Jan 16 13:53 ckpt_a.out_24945f6ae3823bbf-40000-fb2d136b9b544.dmtcp\\
-rwxr--r-- 1 hmeij its 12440 Jan 16 13:53 dmtcp_restart_script_24945f6ae3823bbf-40000-fb2d11ea62fa0.sh\\
lrwxrwxrwx 1 hmeij its 60 Jan 16 13:53 dmtcp_restart_script.sh -> dmtcp_restart_script_24945f6ae3823bbf-40000-fb2d11ea62fa0.sh\\
[hmeij@cottontail2 111]$
real 63m57.287s
user 63m54.703s
sys 0m0.444s
# let's try a termination of processes and restart
[hmeij@cottontail2 111]$ time dmtcp_launch \
--new-coordinator --coord-port 0 --port-file port.txt \
-i 300 --ckptdir /sanscratch/checkpoints/111 ./a.out &
[1] 29201
[hmeij@cottontail2 111]$ ps
PID TTY TIME CMD
20891 pts/1 00:00:00 bash
29201 pts/1 00:00:00 bash
29202 pts/1 00:00:01 a.out
29210 pts/1 00:00:00 dmtcp_coordinat
29212 pts/1 00:00:00 ps
# terminate half way through
[hmeij@cottontail2 111]$ sleep 32m; kill -9 29202 29210
# restart from checkpoints directory (or copy files into desired location)
cd /sanscratch/checkpoints/111
./dmtcp_restart_script.sh
# ps
0 S hmeij 20891 20890 -bash
0 R hmeij 31136 20891 [DMTCP:a.out]
1 S hmeij 31201 1 /usr/bin/dmtcp_coordinator --quiet --exit-on-last --daemon
# You must make sure the old directory and file exist, otherwise the restart fails with
[40000] ERROR at fileconnection.cpp:737 in refill;
REASON='JASSERT(jalib::Filesystem::FileExists(_path)) failed'
_path = /sanscratch/111/fid.txt
Message: File not found.
a.out (40000): Terminating...
# that is because at checkpoint time that file was opened by a.out
# The process will pick up from last checkpoint
# and write output to original work directory
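The steps above can be folded into a submit script. What follows is only a minimal sketch, assuming the directory layout used on this page. The function names ''prepare_dirs'', ''ckpt_mode'' and ''run_checkpointed'' are made up for illustration and are not part of DMTCP.

```shell
#!/bin/bash
# Hypothetical helpers for a submit script; names and layout are assumptions.

# Create the work directory and the checkpoint directory for a job.
# The scheduler normally creates the first; the second is up to the job.
prepare_dirs() {
    local workdir="$1" ckptdir="$2"
    mkdir -p "$workdir" "$ckptdir"
}

# Print "restart" if a DMTCP restart script exists in the checkpoint
# directory, otherwise print "launch".
ckpt_mode() {
    local ckptdir="$1"
    if [ -e "$ckptdir/dmtcp_restart_script.sh" ]; then
        echo restart
    else
        echo launch
    fi
}

# Resume from the last checkpoint if there is one, else start fresh.
# Arguments: work dir, checkpoint dir, interval in seconds, then the command.
run_checkpointed() {
    local workdir="$1" ckptdir="$2" interval="$3"
    shift 3
    prepare_dirs "$workdir" "$ckptdir"
    if [ "$(ckpt_mode "$ckptdir")" = restart ]; then
        # Same as the manual restart shown above.
        cd "$ckptdir" && ./dmtcp_restart_script.sh
    else
        cd "$workdir" || return 1
        dmtcp_launch --new-coordinator --coord-port 0 --port-file port.txt \
            --interval "$interval" --ckptdir "$ckptdir" "$@"
    fi
}
```

Usage would look like ''run_checkpointed /sanscratch/111 /sanscratch/checkpoints/111 86400 ./a.out''. The sketch only recreates the work directory; any file that was open at checkpoint time (''fid.txt'' in the example) must still be put back in place before the restart.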
==== Quick-Start Guide ====
Copied from the DMTCP tarball as a handy reference.
# Overview of DMTCP
To install DMTCP, see [INSTALL.md](INSTALL.md).
## Concepts:
DMTCP Checkpoint/Restart allows one to transparently checkpoint to disk
a distributed computation. It works under Linux, with no modifications
to the Linux kernel nor to the application binaries. It can be used by
unprivileged users (no root privilege needed). One can later restart
from a checkpoint, or even migrate the processes by moving the checkpoint
files to another host prior to restarting.
There is one DMTCP coordinator for each computation that you wish to
checkpoint. By specifying `--coord-host` and `--coord-port` (or the environment
variables `DMTCP_COORD_HOST` and `DMTCP_COORD_PORT`), you can add a process
to a coordinator different from the default coordinator. If you don't
specify, the default coordinator is always at (`localhost:7779`).
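A wrapper script may want to know which coordinator it is about to talk to. The sketch below simply mirrors the lookup described above (environment variables first, then `localhost:7779`); the function name `coord_address` is hypothetical and this is not DMTCP's own code.

```shell
#!/bin/bash
# Print the coordinator address a script would target: the environment
# variables named above if set, else the documented default localhost:7779.
coord_address() {
    local host="${DMTCP_COORD_HOST:-localhost}"
    local port="${DMTCP_COORD_PORT:-7779}"
    echo "${host}:${port}"
}
```

Calling `coord_address` before `dmtcp_launch` or `dmtcp_command` is a cheap way to log which coordinator a job used.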
A DMTCP coordinator process is started on one host. Application binaries
are started under the `dmtcp_launch` command, causing them to connect
to the coordinator upon startup. As threads are spawned, child processes
are forked, remote processes are spawned via ssh, libraries are dynamically
loaded, DMTCP transparently and automatically tracks them.
By default, DMTCP uses gzip to compress the checkpoint images. This can
be turned off (`dmtcp_launch --no-gzip` ; or setting an
environment variable to 0: `DMTCP_GZIP=0`). This will be faster, and if
your memory is dominated by incompressible data, this can be helpful.
Gzip can add seconds for large checkpoint images. Typically, checkpoint
and restart is less than one second without gzip.
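If you prefer the environment variable over the `--no-gzip` flag, a small wrapper keeps the setting local to one command. A minimal sketch; the name `without_gzip` is made up for illustration.

```shell
#!/bin/bash
# Run the given command with DMTCP_GZIP=0 in its environment only,
# leaving the calling shell's environment untouched.
without_gzip() {
    env DMTCP_GZIP=0 "$@"
}
```

For example, `without_gzip dmtcp_launch ./a.out` would start `a.out` with checkpoint compression turned off, assuming `dmtcp_launch` is on your `PATH`.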
A DMTCP checkpoint image includes any libraries (`.so` files) that it may
have been using. This strategy is used for greater portability of
the checkpoint images --- and in some cases, it even allows migration of
the checkpoint images (and hence, processes) to hosts with different
Linux distributions, different Linux kernels, etc.
To run a program with checkpointing:
1. Run `dmtcp_coordinator` in a separate terminal/window
```
bin/dmtcp_coordinator
```
2. In separate terminal(s), replace each command with `dmtcp_launch [command]`
```
bin/dmtcp_launch ./a.out
```
3. To checkpoint, type `c` into `dmtcp_coordinator`
In `dmtcp_coordinator` window:
```
h for help
c for checkpoint
l for list of processes to be checkpointed
k to kill processes to be checkpointed
q to kill processes to be checkpointed and quit the coordinator
```
4. Restart:
Creating a checkpoint causes the `dmtcp_coordinator` to write
a script, `dmtcp_restart_script.sh`, along with a
checkpoint file (file type: `.dmtcp`) for each client process.
The simplest way to restart a previously checkpointed computation is:
```
./dmtcp_restart_script.sh
```
* `./dmtcp_restart_script.sh` usually works "as is", but it can be edited.
* Alternatively, if all processes were on the same processor,
and there were no .dmtcp files prior to this checkpoint:
```
./bin/dmtcp_restart ckpt_*.dmtcp
```
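The caution about older `.dmtcp` files can be checked before using the glob form. This is only a sketch: `count_ckpt_images` and `safe_glob_restart` are hypothetical helper names, and comparing against an expected count (one image per checkpointed process) is an assumption about your job, not a DMTCP feature.

```shell
#!/bin/bash
# Count checkpoint images (ckpt_*.dmtcp) in a directory.
count_ckpt_images() {
    local dir="$1" n=0 f
    for f in "$dir"/ckpt_*.dmtcp; do
        # The unexpanded pattern is skipped when nothing matches.
        [ -e "$f" ] && n=$((n + 1))
    done
    echo "$n"
}

# Use the glob restart only when the number of images matches the number
# of processes expected; a mismatch suggests stale images are present.
safe_glob_restart() {
    local dir="$1" expected="$2" found
    found=$(count_ckpt_images "$dir")
    if [ "$found" -ne "$expected" ]; then
        echo "found $found images in $dir, expected $expected" >&2
        return 1
    fi
    dmtcp_restart "$dir"/ckpt_*.dmtcp
}
```

When the count does not match, falling back to `./dmtcp_restart_script.sh` is the safer route, since that script names the images of the latest checkpoint explicitly.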
## Convenience commands and debugging restarted processes:
1. Help exists:
```bash
# bin/dmtcp_coordinator --help ; bin/dmtcp_launch --help ;
# bin/dmtcp_command --help, etc.
# Automatically start a coordinator in background
bin/dmtcp_launch ./a.out &
# Checkpoint all processes of the default coordinator
bin/dmtcp_command --checkpoint
# Kill a.out (same as `k` in the coordinator window)
bin/dmtcp_command --kill
# Kill a.out and quit the coordinator (same as `q`)
bin/dmtcp_command --quit
# Restart directly from local checkpoint images (.dmtcp files)
./dmtcp_restart_script.sh
# Or else, directly restart from the ckpt images in the current directory.
# (Be sure there are no old ckpt_a.out_*.dmtcp files.
# Ensure that the restarted process is running, and not suspended.)
bin/dmtcp_restart ckpt_a.out_*.dmtcp &
# Have gdb attach to a restarted process, and debug
# NOTE: You must specify 'mtcp_restart', not 'dmtcp_restart'
gdb ./a.out `pgrep -n MTCP`
# force a.out to exit any low level libraries and return to a known location
# set a breakpoint on a common function and continue:
(gdb) break write
(gdb) continue
```
2. To enable debug statements for DMTCP, configure with: `./configure
--enable-debug` (or `./configure --help`, in general).
The flag `--enable-debug` both prints to stderr and writes files.
```
$DMTCP_TMPDIR/dmtcp-$USER@$HOST/jassertlog.*
```
where `$DMTCP_TMPDIR` is `/tmp` by default on most distributions.
In reading this, it's useful to know that
DMTCP sets up barriers so that all processes proceed to the
following states together during checkpoint: `RUNNING`, `SUSPENDED`,
`FD_LEADER_ELECTION`, `DRAINED`, `CHECKPOINTED`, `REFILLED`, `RUNNING`.
3. `util/gdb-add-symbol-file` may be a useful debugging tool. It computes
the arguments for the `add-symbol-file` command of gdb, to import
symbol information about a dynamic library. It is most useful in
combination with *-dbg Linux packages and prefix to `dmtcp_launch`:
```
env LD_LIBRARY_PATH=/usr/lib/debug dmtcp_launch ...
```
followed by `attach` in gdb.
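The `dmtcp_command --checkpoint` call above talks to the default coordinator. When the coordinator was started on a random port with `--port-file`, as in the session at the top of this page, the port has to be passed along. A minimal sketch using the `DMTCP_COORD_PORT` variable from the Concepts section; the function names `read_coord_port` and `checkpoint_now` are hypothetical.

```shell
#!/bin/bash
# Read the coordinator port that `dmtcp_launch --port-file FILE` recorded.
# Prints the port, or fails if the file is missing or not a number.
read_coord_port() {
    local file="$1" port
    [ -r "$file" ] || return 1
    port=$(tr -d '[:space:]' < "$file")
    case "$port" in
        ''|*[!0-9]*) return 1 ;;
    esac
    echo "$port"
}

# Ask the coordinator behind that port file for a checkpoint right now.
checkpoint_now() {
    local port
    port=$(read_coord_port "$1") || {
        echo "no usable port in $1" >&2
        return 1
    }
    DMTCP_COORD_PORT="$port" dmtcp_command --checkpoint
}
```

With the earlier example this would be `checkpoint_now /sanscratch/111/port.txt`, run on the same host as the coordinator.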
## Command-line options:
`dmtcp_launch`, `dmtcp_command`, and `dmtcp_restart` print
their options when run with no command-line arguments. `dmtcp_coordinator`
offers help when run (Type `h` for help.).
Options through environment variables:
1. `dmtcp_coordinator`:
* `DMTCP_CHECKPOINT_INTERVAL=
\\
**[[cluster:0|Back]]**