So we need a day of downtime to switch file server functionality from greentail to sharptail. It would be nice if nobody lost any computational progress. To do that, we need to learn to checkpoint at the application level. If a node crashes or power is lost, those applications can then restart the job from the last checkpoint.
I've decided to support one checkpoint/restart utility, the Berkeley Laboratory Checkpoint/Restart (BLCR) tool. Hence this page.
BLCR consists of two kernel modules, some user-level libraries, and several command-line executables. No kernel patching is required. The modules are loaded at boot via /etc/rc.local. They are tied to the kernel source they were compiled against, so for our first supported BLCR modules I've chosen the mw256 queue and nodes. Here is some documentation on BLCR.
First, let's test on a node to grasp the concept.
# are modules loaded
[hmeij@n33 blcr]$ lsmod | grep blcr
blcr                  115529  0
blcr_imports           10715  1 blcr

# set env
export PATH=/share/apps/blcr/0.8.5/mw256/bin:$PATH
export LD_LIBRARY_PATH=/share/apps/blcr/0.8.5/mw256/lib:$LD_LIBRARY_PATH

# is it all working
[hmeij@n33 blcr]$ cr_checkpoint --help
Usage: cr_checkpoint [options] ID

Options:
General options:
  -v, --verbose    print progress messages to stderr.
  -q, --quiet      suppress error/warning messages to stderr.
  -?, --help       print this message and exit.
      --version    print version information and exit.
...

# and here is our application and output (one extra character per second)
[hmeij@n33 blcr]$ ./t-20001030-01
*
**
***
****
*****
******
...
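The module check above can also be scripted, which is handy before a job relies on BLCR. The following is a minimal sketch, not part of BLCR: the function name `blcr_modules_loaded` is hypothetical, and the only thing it assumes is that both module names from the listing above (`blcr` and `blcr_imports`) must appear in the first column of `lsmod` output.

```shell
#!/bin/bash
# Hypothetical helper: reads lsmod-style text on stdin and succeeds only if
# both module names seen on this page (blcr, blcr_imports) are in column one.
blcr_modules_loaded() {
    awk '$1 == "blcr"         { a = 1 }
         $1 == "blcr_imports" { b = 1 }
         END { exit !(a && b) }'
}

# Demo on the sample output from this page; on a node, pipe in lsmod instead.
sample='blcr                  115529  0
blcr_imports           10715  1 blcr'
if printf '%s\n' "$sample" | blcr_modules_loaded; then
    echo "BLCR modules present"
else
    echo "BLCR modules missing" >&2
fi
```

On a compute node the call would be `lsmod | blcr_modules_loaded`, for example at the top of a batch script so the job can bail out early.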
So now let's run this under BLCR and observe what happens. After we define the proper environment, we use cr_run to launch our application. Standard output and error are written into the output file context. We then observe the PID of our process and use cr_checkpoint to write a checkpoint file and immediately terminate the process.
# start application
[hmeij@n33 blcr]$ cr_run ./t-20001030-01 > context 2>&1 &
[1] 12789

# observe PID
[hmeij@n33 blcr]$ ps
  PID TTY          TIME CMD
12789 pts/29   00:00:00 t-20001030-01
12817 pts/29   00:00:00 ps
28257 pts/29   00:00:00 bash

# wait, then checkpoint and terminate process
[hmeij@n33 blcr]$ sleep 30
[hmeij@n33 blcr]$ cr_checkpoint --term 12789
[1]+  Terminated              cr_run ./t-20001030-01 > context 2>&1

# save the output
[hmeij@n33 blcr]$ mv context context.save
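The session above does not show where the checkpoint went. The restart step on this page reads it from `context.<PID>` in the working directory, so it may be worth confirming that file exists before the original output is moved aside. A minimal sketch, assuming that naming; the helper name `have_checkpoint` is hypothetical and the demo uses a stand-in file rather than a real checkpoint.

```shell
#!/bin/bash
# Hypothetical helper: succeeds if the checkpoint file for PID exists in DIR
# (default: current directory) and is non-empty. The name context.<PID> is
# taken from the cr_restart call shown on this page.
have_checkpoint() {
    local pid=$1 dir=${2:-.}
    [ -s "$dir/context.$pid" ]
}

# Demo with a stand-in file; a real one would be written by cr_checkpoint.
demo_dir=$(mktemp -d)
printf 'stand-in checkpoint data\n' > "$demo_dir/context.12789"
if have_checkpoint 12789 "$demo_dir"; then
    echo "checkpoint file found"
fi
```

In the interactive session this would be called as `have_checkpoint 12789` right after `cr_checkpoint --term 12789`.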
Ok. Next we use cr_restart to restart our application by pointing it to the checkpoint file generated. Then we'll wait a bit and terminate the restart.
# restart in background
[hmeij@n33 blcr]$ cr_restart ./context.12789 > context 2>&1 &
[1] 13579

# wait and terminate the restart
[hmeij@n33 blcr]$ sleep 30
[hmeij@n33 blcr]$ kill %1
[1]+  Terminated              cr_restart ./context.12789 > context 2>&1
So what we're interested in is the boundary between the first termination and the subsequent restart. It looks like this:
[hmeij@n33 blcr]$ tail context.save
*****************************************
******************************************
*******************************************
********************************************
*********************************************
**********************************************
***********************************************
************************************************
*************************************************
**************************************************

[hmeij@n33 blcr]$ head context
***************************************************
****************************************************
*****************************************************
******************************************************
*******************************************************
********************************************************
*********************************************************
**********************************************************
***********************************************************
************************************************************

# pretty nifty!
# but be forewarned that there are binary characters lurking at this boundary
# you can strip them out with ''sed'' or ''tr''
# it looks like this
^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@***************************************************
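The `^@` characters shown above are how NUL bytes are usually displayed. A minimal sketch of the `tr` approach, run on a small stand-in file rather than the real output; the file names `context` and `context.clean` inside the temporary directory are only for illustration.

```shell
#!/bin/bash
workdir=$(mktemp -d)

# Stand-in for the restarted output: a few NUL bytes, then one line of stars
head -c 5 /dev/zero > "$workdir/context"
printf '%s\n' '***' >> "$workdir/context"

# Delete every NUL byte (octal 000); all other bytes pass through unchanged
tr -d '\000' < "$workdir/context" > "$workdir/context.clean"

cat "$workdir/context.clean"
```

Writing to a second file keeps the raw output around in case the boundary needs to be inspected later.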
Now we can write a batch script for the scheduler. We need to do several things
run.serial
#!/bin/bash
# submit via 'bsub < run.serial'
rm -f *err *out *shell
#BSUB -q mw256chkpnt
#BSUB -n 1
#BSUB -J test
#BSUB -o out
#BSUB -e err

export PATH=/share/apps/blcr/0.8.5/mw256/bin:$PATH
export LD_LIBRARY_PATH=/share/apps/blcr/0.8.5/mw256/lib:$LD_LIBRARY_PATH

# checkpoint directory is /sanscratch/JOBPID
MYSANSCRATCH=/sanscratch/$LSB_JOBID
MYLOCALSCRATCH=/localscratch/$LSB_JOBID
export MYSANSCRATCH MYLOCALSCRATCH
cd $MYSANSCRATCH

# stage the application (plus data if needed)
cp -rp ~/blcr/t-20001030-01 .

# start the application and remember the working directory
cr_run ./t-20001030-01 > context 2>&1 &
process_id=`ps -u hmeij | grep t-20001030-01 | grep -v grep | awk '{print $1}'`
pwd > pwd.$process_id

# on restart, give cr_restart some time to set up
# WARNING: it will overwrite the checkpoint file, save it
# you need to find the process_id and supply it
#process_id=9089
#cp -p ~/blcr/checkpoint.$process_id ~/blcr/checkpoint.$process_id.saved
#mv ~/blcr/context ~/blcr/context.save
#ln -s $MYSANSCRATCH `cat ~/blcr/pwd.$process_id`
#cr_restart ~/blcr/checkpoint.$process_id > context 2>&1 &
#sleep 60

echo "process_id=$process_id"
while [ $process_id -gt 0 ]; do
        # checkpoint time interval, make it an hour or larger (small for testing)
        sleep 120
        # save the checkpoint file outside of sanscratch
        cr_checkpoint -f ~/blcr/checkpoint.$process_id $process_id
        # remember the PID; the lookup below returns nothing once the process is gone
        last_id=$process_id
        # if the application has crashed, exit
        process_id=`ps -u hmeij | grep t-20001030-01 | grep -v grep | awk '{print $1}'`
        if [ "${process_id}x" = 'x' ]; then
                # save some stuff for checking
                cp -p pwd* *.shell *.out *.err context ~/blcr/
                rm -f `cat ~/blcr/pwd.$last_id`
                exit
        fi
done
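The script above finds the PID by filtering `ps` output on user and program name, which can return more than one PID if another copy of the program is running on the node. In the interactive session on this page, the job PID printed by the shell matched the PID of the application in `ps`, so the shell variable `$!` may be a simpler source for it. A minimal sketch under that assumption, using `sleep` as a stand-in for the real launch line; the helper name `is_running` is hypothetical.

```shell
#!/bin/bash
# Stand-in for the real launch line, which on this page would be:
#   cr_run ./t-20001030-01 > context 2>&1 &
sleep 1 &
process_id=$!    # PID of the most recently started background job

# kill -0 delivers no signal; it only reports whether the PID can be signalled
is_running() {
    kill -0 "$1" 2>/dev/null
}

while is_running "$process_id"; do
    # the real loop would sleep for the checkpoint interval and then run
    # cr_checkpoint -f <file> $process_id, as in run.serial
    sleep 1
done

# collect the exit status of the finished background job
wait "$process_id"
echo "process $process_id has exited"
```

Because `process_id` is never overwritten, it is still available after the loop for cleanup steps that need the PID in a file name.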