When we move to Openlava 3.x, all queues will support checkpointing. This means you can run your job inside a "wrapper" and, if the job or the cluster crashes, restart your job from the last checkpoint file.
Checkpointing is an expensive operation, so do not checkpoint more often than every 6 hours. For example, if your job runs for a month, checkpoint once a day; if your job runs for a week, checkpoint every 12 hours. From this point on I expect all users to checkpoint. Some software does this internally (Amber, Gaussian). For other applications or home-grown code you can use BLCR. (Too bad it does not work out of the box with Openlava.)
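The guidance above can be encoded as a small helper for picking the wrapper's checkpoint interval (the `$cpti` variable in the script below). This is only a sketch; the thresholds are illustrative, not official policy:

```shell
#!/bin/bash
# Pick a checkpoint interval from the expected job runtime in hours.
# A month-long job checkpoints daily, a week-long job every 12 hours,
# and nothing checkpoints more often than every 6 hours.
pick_cpti() {
    local runtime_hours=$1
    if   [ "$runtime_hours" -ge 336 ]; then echo 1d    # two weeks or longer
    elif [ "$runtime_hours" -ge 72  ]; then echo 12h   # several days
    else                                    echo 6h    # the minimum interval
    fi
}
```

For a month-long job, `pick_cpti 720` yields `1d`, matching the rule of thumb above.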
You need to test checkpointing before you rely on it. I've noticed that with some local code, BLCR does not track files the application opens for output. The code below has such an example (file fid.txt). Hopefully future versions of BLCR will fix this, or maybe we should open files differently; this needs further investigation.
BLCR, Berkeley Lab Checkpoint/Restart, remembers file paths and process IDs. The wrapper stages the STDOUT and STDERR files Openlava generates and invokes BLCR's relocation feature while ignoring the old process IDs. If an application is large (for example 10 GB) and static, it is advisable not to save the application binary inside the checkpoint file.
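The wrapper below chooses its `cr_checkpoint` save options based on `$save_exec`. That decision can be sketched in isolation (this mirrors the wrapper's logic and assumes BLCR's documented `--save-all` / `--save-private` / `--save-shared` flags):

```shell
#!/bin/bash
# Choose cr_checkpoint save options. With save_exec="n" the executable's
# own memory regions are left out of the checkpoint file (you restage the
# binary yourself via $pre_cmd); otherwise everything is saved.
choose_save_options() {
    local save_exec=$1
    if [ "$save_exec" = "n" ]; then
        echo "--save-private --save-shared"   # smaller checkpoint, binary restaged
    else
        echo "--save-all"                     # self-contained checkpoint file
    fi
}
```

For a 10 GB static binary, `choose_save_options n` keeps that 10 GB out of every checkpoint file.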
At the bottom of this page is v0.1 of the blcr_wrapper program, which hides this complexity for you. blcr_watcher is a program already in your PATH; it terminates the wrapper if the application finishes within a checkpoint time interval. I will work with any group interested in customizing blcr_wrapper for your lab/group.
```shell
# location
[hmeij@petaltail 187]$ pwd
/sanscratch/187

# start the application with BLCR
[hmeij@petaltail 187]$ cr_run ./a.out &
[1] 15559

# the application opens a file
[hmeij@petaltail 187]$ ll
total 1104
-rwxr--r-- 1 hmeij its 1126428 Mar 15 15:38 a.out
-rw-r--r-- 1 hmeij its       0 Mar 15 15:41 fid.txt

# wait 5 mins, then make a checkpoint file; pid 15559 is a.out running
[hmeij@petaltail 187]$ cr_checkpoint --save-all -f /home/hmeij/checkpoints/chk.15559 15559

# the application runs for an hour; after a few mins pull the rug from underneath it
[hmeij@petaltail 187]$ pkill -u hmeij
Connection to petaltail lost.

# that was not too clever; log back in, restart the application in another directory
[hmeij@petaltail 187]$ cd ..
[hmeij@petaltail sanscratch]$ mkdir 188
[hmeij@petaltail sanscratch]$ cd 188

# restart the application from the checkpoint file in the background
# note the relocate directive
[1]+ cr_restart --relocate /sanscratch/187=/sanscratch/188 \
     --no-restore-pid /home/hmeij/checkpoints/chk.15559 &

# note that a.out is missing from the directory
[hmeij@petaltail 188]$ ll
total 0
-rw-r--r-- 1 hmeij its 0 Mar 15 15:45 fid.txt

# but a.out is running upon restart
# one of these sleep processes determines when the next checkpoint gets created (set in blcr_wrapper)
# the other determines when blcr_watcher next checks if the application has finished (every 10 mins)
[hmeij@petaltail 188]$ ps
  PID TTY          TIME CMD
24936 ?        00:00:00 res
24942 ?        00:00:00 1458238185.218
24945 ?        00:00:00 1458238185.218.
24960 ?        00:00:00 cr_restart
24961 ?        00:00:00 blcr_watcher
24962 ?        00:00:00 sleep
24964 ?        00:00:09 a.out
24965 ?        00:00:00 sleep
24983 ?        00:00:00 sshd
24984 ?        00:00:00 ps

# after an hour
[1]+  Done    cr_restart --relocate /sanscratch/187=/sanscratch/188 \
              --no-restore-pid /home/hmeij/checkpoints/chk.15559

# the result
[hmeij@petaltail 188]$ cat fid.txt
  0.999577738717693        8.112048998602431E-005
```
The blcr_wrapper will perform a "change directory" to $MYSANSCRATCH, which is /sanscratch/JOBPID, so think in those terms. Copy the application (and input data) to '.' using $pre_cmd in the script. If $save_exec="n", then upon a restart the script will copy the application back in.
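For example, the staging block at the top of the wrapper might be edited like this for a job with a large, static binary (the file names here are made up; substitute your own):

```shell
#!/bin/bash
# Hypothetical staging block for a large, static application: with
# save_exec="n" the binary is excluded from the checkpoint file, and
# $pre_cmd restages the binary and input data into $MYSANSCRATCH (cwd)
# both on start and on restart.
save_exec="n"
pre_cmd=" scp $HOME/mylab/bigmodel.exe $HOME/mylab/input.dat . "
```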
You edit the top part of blcr_wrapper to match your job's needs, then uncomment either the START block or the RESTART block. When you restart a job (the cluster assigns a new JOBPID), the script needs the old JOBPID of the crashed job that has the latest checkpoint file in /sanscratch/checkpoints/.
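For a restart, the edited top of the wrapper might look like this (orgjobpid=188 is a hypothetical old job id; use the directory name under /sanscratch/checkpoints/ that holds your latest chk.* and pwd.* files):

```shell
#!/bin/bash
# RESTART mode: leave the START block commented out, uncomment the
# RESTART block, and point orgjobpid at the crashed job's checkpoint
# directory. (orgjobpid=188 is a made-up example id.)
mode=restart
queue=test
orgjobpid=188
```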
Then submit to the scheduler as usual:

```shell
[hmeij@cottontail ~/ynam]$ bsub < blcr_wrapper
```
blcr_wrapper v0.1 at /home/hmeij/ynam/blcr_wrapper (copy this file; I've added comments to the script below, it may contain a typo):

```shell
#!/bin/bash

# at job finish, all content in /sanscratch/JOBPID
# will be copied to /sanscratch/checkpoints/JOBPID
# clean this area up once in a while
# content older than 6 months will be removed

# SCHEDULER
#BSUB -q test
#BSUB -n 1
#BSUB -J test
#BSUB -o out
#BSUB -e err

# CHECK POINT TIME INTERVAL: 6h 12h 18h 1d
cpti=6h

# COPY APPLICATION TO WORK DIR $MYSANSCRATCH (cwd)
# always stage the application (and data if needed)
# if application is large and static save_exec="n"
pre_cmd=" scp $HOME/ynam/a.out . "

# IF START OF JOB, UNCOMMENT
# it's either start or restart
mode=start
queue=test
cmd="./a.out"
save_exec="y"

# IF RESTART OF JOB, UNCOMMENT
# you must have pwd.JOBPID and chk.JOBPID in $orgjobpid/
#mode=restart
#queue=test
#orgjobpid=188

############### NOTHING TO EDIT BELOW THIS LINE ##################

# checkpoints
checkpoints=/sanscratch/checkpoints

# work directory, avoid /home
MYSANSCRATCH=/sanscratch/$LSB_JOBID
MYLOCALSCRATCH=/localscratch/$LSB_JOBID
export MYSANSCRATCH MYLOCALSCRATCH
cd $MYSANSCRATCH

# kernel modules
mods=`lsmod | grep ^blcr | wc -l`
if [ $mods -ne 2 ]; then
    echo "Error: BLCR modules not loaded on `hostname`"
    exit
fi

# blcr setup
restore_options="--no-restore-pid --no-restore-pgid --no-restore-sid"
if [ $save_exec == "n" ]; then
    save_options="--save-private --save-shared"
else
    save_options="--save-all"
fi
export PATH=/share/apps/blcr/0.8.5/${queue}/bin:$PATH
export LD_LIBRARY_PATH=/share/apps/blcr/0.8.5/${queue}/lib:$LD_LIBRARY_PATH

# setup checkpoints dir
if [ ! -d $checkpoints/$LSB_JOBID ]; then
    mkdir -p $checkpoints/$LSB_JOBID
else
    echo "Error: $checkpoints/$LSB_JOBID already exists, exiting"
    exit
fi

# save process id and path and start application
if [ "$mode" == "start" ]; then

    $pre_cmd
    cr_run $cmd &
    pid=$!
    pwd > $checkpoints/$LSB_JOBID/pwd.$pid
    orgjobpid=0

# otherwise restart the job
elif [ "$mode" == "restart" ]; then

    orgpid=`ls $checkpoints/$orgjobpid/pwd.* | awk -F\. '{print $2}'`
    orgpwd=`cat $checkpoints/$orgjobpid/pwd.$orgpid`
    scp $checkpoints/$orgjobpid/*.$orgjobpid.err \
        $checkpoints/$orgjobpid/*.$orgjobpid.out $HOME/.lsbatch/
    if [ $save_exec == "n" ]; then
        $pre_cmd
    fi
    # why? --save-all?
    scp $checkpoints/$orgjobpid/fid.txt $MYSANSCRATCH
    cr_restart $restore_options --relocate $orgpwd=$MYSANSCRATCH \
        $checkpoints/$orgjobpid/chk.$orgpid &
    pid=$!
    pwd > $checkpoints/$LSB_JOBID/pwd.$pid

# obviously
else
    echo "Error: startup mode not defined correctly"
    exit
fi

# if $cmd disappears during $cpti, terminate wrapper
blcr_watcher $pid $$ $LSB_JOBID $orgjobpid &

# always run this block
while [ true ]; do
    # checkpoint time interval
    sleep $cpti
    # checkpoint file outside of sanscratch
    cr_checkpoint $save_options -f $checkpoints/$LSB_JOBID/chk.$pid $pid
    scp $HOME/.lsbatch/*.$LSB_JOBID.err \
        $HOME/.lsbatch/*.$LSB_JOBID.out $checkpoints/$LSB_JOBID/
    # why? --save-all?
    scp $MYSANSCRATCH/fid.txt $checkpoints/$LSB_JOBID/
done
```
blcr_watcher v0.1 at /share/apps/bin/:

```shell
#!/bin/bash

# watch a process during the checkpoint time interval
# if it disappears, terminate the blcr_wrapper
checkpoints=/sanscratch/checkpoints

watch_pid=$1
watch_wrapper=$2
jobpid=$3
orgjobpid=$4

while [ $watch_pid -gt 0 ]; do
    sleep 600
    nopid=`ps -u $USER | grep $watch_pid | awk '{print $1}'`
    if [ "${nopid}x" == 'x' ]; then
        # save output
        scp -rp $MYSANSCRATCH/* $checkpoints/$LSB_JOBID/
        if [ $orgjobpid -gt 0 ]; then
            rm -f $HOME/.lsbatch/*.$orgjobpid.err \
                  $HOME/.lsbatch/*.$orgjobpid.out
        fi
        $POST_CMD
        kill $watch_wrapper
        exit
    fi
done
```
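One caveat with the `ps | grep` test above: it can match any process whose pid or name merely contains the watched pid as a substring. A more robust liveness check (a sketch only, not part of the installed script) uses `kill -0`, which performs existence and permission checks without actually sending a signal:

```shell
#!/bin/bash
# Succeed if the given process id still exists and is signalable by us.
# Signal 0 sends nothing; it only checks whether the pid is valid, so this
# works for our own processes (which is the blcr_watcher use case).
is_alive() {
    kill -0 "$1" 2>/dev/null
}
```

The watcher loop could then read `if ! is_alive $watch_pid; then ... fi` instead of parsing `ps` output.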