We need a day of downtime to switch file server functionality from greentail to sharptail. It would be nice if nobody lost any computational progress. To do that, we need to learn to checkpoint at the application level. If a node crashes or power is lost, those applications can then restart the job from the last checkpoint.
I've decided to support one checkpoint/restart utility, the Berkeley Lab Checkpoint/Restart (BLCR) tool. Hence this page.
BLCR consists of two kernel modules, some user-level libraries, and several command-line executables. No kernel patching is required. The modules are loaded at boot via /etc/rc.local. The modules depend on the kernel source they were compiled against, so for our first supported BLCR modules I've chosen the mw256 queue and its nodes. Here is some documentation on BLCR
# are modules loaded
[hmeij@n33 blcr]$ lsmod | grep blcr
blcr                  115529  0
blcr_imports           10715  1 blcr

# set env
export PATH=/share/apps/blcr/0.8.5/mw256/bin:$PATH
export LD_LIBRARY_PATH=/share/apps/blcr/0.8.5/mw256/lib:$LD_LIBRARY_PATH

# is it all working
[hmeij@n33 blcr]$ cr_checkpoint --help
Usage: cr_checkpoint [options] ID

Options:
General options:
  -v, --verbose           print progress messages to stderr.
  -q, --quiet             suppress error/warning messages to stderr.
  -?, --help              print this message and exit.
      --version           print version information and exit.
...
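The three checks above can be folded into one function, for instance at the top of a job script. What follows is a minimal sketch, not something BLCR ships: the function name `blcr_ready` is made up for illustration, and it assumes only that `lsmod` reports the `blcr` and `blcr_imports` modules as shown above and that `PATH` has been set as above.

```shell
#!/bin/bash
# Sketch: verify that BLCR is usable on this node before starting a job.
# blcr_ready is a hypothetical helper name, not a BLCR command.

blcr_ready() {
    local mod tool mods
    mods=$(lsmod) || return 1

    # Both kernel modules must appear in the first column of lsmod output.
    for mod in blcr blcr_imports; do
        if ! printf '%s\n' "$mods" |
            awk -v m="$mod" '$1 == m { found = 1 } END { exit !found }'; then
            echo "BLCR module not loaded: $mod" >&2
            return 1
        fi
    done

    # The user-level tools must be reachable through PATH.
    for tool in cr_run cr_checkpoint cr_restart; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "BLCR tool not found in PATH: $tool" >&2
            return 1
        fi
    done
    return 0
}
```

A job script could then begin with `blcr_ready || exit 1`, so that a job landing on a node without BLCR fails early instead of running without checkpoint support.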
# run a test program under BLCR control
[hmeij@n33 blcr]$ cr_run ./t-20001030-01 > context 2>&1 &
[1] 12789
[hmeij@n33 blcr]$ ps
  PID TTY          TIME CMD
12789 pts/29   00:00:00 t-20001030-01
12817 pts/29   00:00:00 ps
28257 pts/29   00:00:00 bash

# checkpoint the process and terminate it
[hmeij@n33 blcr]$ sleep 30
[hmeij@n33 blcr]$ cr_checkpoint --term 12789
[1]+  Terminated              cr_run ./t-20001030-01 > context 2>&1
[hmeij@n33 blcr]$ mv context context.save

# restart from the checkpoint file
[hmeij@n33 blcr]$ cr_restart ./context.12789 > context 2>&1 &
[1] 13579
[hmeij@n33 blcr]$ sleep 30
[hmeij@n33 blcr]$ kill %1
[1]+  Terminated              cr_restart ./context.12789 > context 2>&1

# compare the end of the old output with the start of the new output
[hmeij@n33 blcr]$ tail context.save
* * * * *
[hmeij@n33 blcr]$ head context
* * * * *
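For a planned downtime, the manual session above could be wrapped in two helpers: one to checkpoint and stop a job beforehand, one to resume it afterwards. This is only a sketch under stated assumptions. The function names `checkpoint_and_stop`, `latest_context` and `resume_latest` are hypothetical. It assumes that `cr_checkpoint` writes `context.<pid>` into the current directory, as it did in the session above. It also appends program output to a separate file rather than reusing the name `context`.

```shell
#!/bin/bash
# Sketch of the checkpoint/restart cycle shown above as helper functions.
# checkpoint_and_stop, latest_context and resume_latest are hypothetical
# names; only cr_checkpoint --term and cr_restart come from BLCR.

# Checkpoint a process that was started with cr_run, then terminate it.
# Prints the name of the context file on success.
checkpoint_and_stop() {
    local pid="$1"
    cr_checkpoint --term "$pid" || return 1
    # Assumption: the checkpoint lands in ./context.<pid>, as in the session.
    if [ ! -s "context.$pid" ]; then
        echo "missing or empty context.$pid" >&2
        return 1
    fi
    echo "context.$pid"
}

# Print the most recently modified context.<pid> file, if there is one.
# The [0-9] in the pattern skips files such as context.save.
latest_context() {
    ls -t context.[0-9]* 2>/dev/null | head -n 1
}

# Restart from the newest checkpoint in the background, appending the
# program's output to the file given as the first argument.
# Prints the PID of the cr_restart process.
resume_latest() {
    local out="$1"
    local ctx
    ctx=$(latest_context)
    if [ -z "$ctx" ]; then
        echo "no context file found" >&2
        return 1
    fi
    cr_restart "./$ctx" >> "$out" 2>&1 &
    echo "$!"
}
```

Before the downtime one would run `checkpoint_and_stop 12789` in the job's working directory, and afterwards `resume_latest job.out` from the same directory. Since the modules are tied to the kernel they were built against, the restart should happen on a node with the same BLCR build, which for now means the mw256 nodes.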