BLCR Checkpoint in OL3

Deprecated since we did OS upgrades (see OS Update).
We will install DMTCP as a replacement… (see DMTCP).

Henk 2020/01/14 14:28

  • This page concerns SERIAL jobs only; SERIAL jobs can restart on any node
  • Installation and what it does: see BLCR

All queues will support checkpointing, which means you can run your job in a “wrapper”, and if the job or cluster crashes you can restart your job from the last checkpoint file.

Checkpointing is an expensive operation, so do not checkpoint more often than every 6 hours. For example, if your job runs for a month, checkpoint once a day; if it runs for a week, checkpoint every 12 hours. From this point on I expect all users to checkpoint. Some software does this internally (Amber, Gaussian). For other applications or home-grown code you can use BLCR.
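As a rough way to sanity-check an interval before submitting, here is a minimal sketch that converts a value such as 10m, 6h or 1d to seconds and compares it against the 6 hour minimum. The function names cpti_to_seconds and cpti_is_long_enough are made up for this illustration and are not part of blcr_wrapper; the suffixes are assumed to follow sleep(1).

```shell
#!/bin/bash
# Hypothetical helper: convert an interval like 10m, 6h or 1d to seconds.
# Suffixes follow sleep(1): s, m, h, d; no suffix means seconds.
cpti_to_seconds() {
    local value unit
    value=${1%[smhd]}        # numeric part, e.g. "12" from "12h"
    unit=${1#"$value"}       # suffix, e.g. "h" (empty if none)
    case "$value" in
        ''|*[!0-9]*) return 1 ;;   # reject anything that is not a plain number
    esac
    case "$unit" in
        ''|s) echo "$value" ;;
        m)    echo $(( value * 60 )) ;;
        h)    echo $(( value * 3600 )) ;;
        d)    echo $(( value * 86400 )) ;;
    esac
}

# Hypothetical helper: exit status 0 if the interval is at least 6 hours,
# 1 if it is shorter, 2 if it could not be parsed.
cpti_is_long_enough() {
    local secs
    secs=$(cpti_to_seconds "$1") || return 2
    [ "$secs" -ge $(( 6 * 3600 )) ]
}

cpti=12h
if cpti_is_long_enough "$cpti"; then
    echo "cpti=$cpti is fine for production runs"
else
    echo "cpti=$cpti is below 6 hours, use it for debugging only"
fi
```

A short value such as 10m is still useful while testing that checkpoint and restart work at all; the check only flags it so it does not end up in a month-long job by accident.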

You need to test checkpointing before you rely on it. I've noticed that BLCR does not always pick up files that some local code opens for output. The code below has such an example (file fid.txt). Hopefully future versions of BLCR will fix this. Or maybe we should open files differently; this needs further investigation.

BLCR, Berkeley Lab Checkpoint/Restart, remembers file paths and process ids. The code stages the necessary STDOUT and STDERR files the scheduler generates and invokes the relocation feature while ignoring old process ids. If an application is large and static, it is advisable not to save the application inside the checkpoint file.
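To make the save and relocate choices concrete, here is a small sketch that only prints the BLCR command lines rather than running them. The flags are the ones used elsewhere on this page; the function names checkpoint_options and restart_command are hypothetical, and the paths are the ones from the sample run.

```shell
#!/bin/bash
# Sketch only: print the BLCR command lines instead of executing them.

# Pick checkpoint flags the same way the wrapper does:
# save_exec="n" saves private and shared mappings, anything else saves all.
checkpoint_options() {
    if [ "$1" = "n" ]; then
        echo "--save-private --save-shared"
    else
        echo "--save-all"
    fi
}

# Compose a restart command that maps the old work directory to the new one
# and does not try to reclaim the old process id.
restart_command() {
    local old_dir=$1 new_dir=$2 chk_file=$3
    echo "cr_restart --no-restore-pid --relocate $old_dir=$new_dir $chk_file"
}

echo "cr_checkpoint $(checkpoint_options n) -f /home/hmeij/checkpoints/chk.15559 15559"
restart_command /sanscratch/187 /sanscratch/188 /home/hmeij/checkpoints/chk.15559
```

When the application is not saved inside the checkpoint file, it has to be present in the new work directory before the restart, which is why the wrapper stages it again with $pre_cmd.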

At the bottom of this page is the current version of the blcr_wrapper program, which hides the complexity for you. blcr_watcher is a program that is already in your PATH and terminates the wrapper if the application finishes within a checkpoint time interval. I will work with any interested group to customize blcr_wrapper for your lab/group.

  • Here is a simple interactive sample run.
# location
[hmeij@petaltail 187]$ pwd
/sanscratch/187

# start the application with BLCR
[hmeij@petaltail 187]$ cr_run ./a.out &
[1] 15559

# the application opens a file
[hmeij@petaltail 187]$ ll
total 1104
-rwxr--r-- 1 hmeij its 1126428 Mar 15 15:38 a.out
-rw-r--r-- 1 hmeij its       0 Mar 15 15:41 fid.txt

# wait 5 mins, make a checkpoint file, pid 15559 is a.out running
[hmeij@petaltail 187]$ cr_checkpoint --save-all -f /home/hmeij/checkpoints/chk.15559 15559

#  the application runs for an hour, after a few mins pull the rug from underneath it
[hmeij@petaltail 187]$ pkill -u hmeij
Connection to petaltail lost.

# ooops, that was not too clever, log back in, restart application in another directory
[hmeij@petaltail 187]$ cd ..
[hmeij@petaltail sanscratch]$ mkdir 188
[hmeij@petaltail sanscratch]$ cd 188

# restart the application from checkpoint file in background
# note the relocate directive
[hmeij@petaltail 188]$ cr_restart --relocate /sanscratch/187=/sanscratch/188 \
--no-restore-pid /home/hmeij/checkpoints/chk.15559 &

# note that a.out is missing from directory
[hmeij@petaltail 188]$ ll
total 0
-rw-r--r-- 1 hmeij its 0 Mar 15 15:45 fid.txt

# but a.out is running upon restart 
# One of these sleep processes determines when the next checkpoint gets created (set in blcr_wrapper)
# the other sleep process determines when blcr_watcher next checks if application has finished (every 10 mins)

[hmeij@petaltail 188]$ ps
  PID TTY          TIME CMD                                                      
24936 ?        00:00:00 res                                                      
24942 ?        00:00:00 1458238185.218                                           
24945 ?        00:00:00 1458238185.218.                                          
24960 ?        00:00:00 cr_restart                                               
24961 ?        00:00:00 blcr_watcher                                             
24962 ?        00:00:00 sleep                                                    
24964 ?        00:00:09 a.out                                                    
24965 ?        00:00:00 sleep                                                    
24983 ?        00:00:00 sshd                                                     
24984 ?        00:00:00 ps 

# after an hour
[1]+  Done                    cr_restart --relocate /sanscratch/187=/sanscratch/188 \
 --no-restore-pid /home/hmeij/checkpoints/chk.15559

# the result
[hmeij@petaltail 188]$ cat fid.txt
  0.999577738717693       8.112048998602431E-005

Putting it all Together

The blcr_wrapper will perform a “change directory” to $MYSANSCRATCH, which is /sanscratch/JOBPID, so think in those terms. Copy the application (and input data) to '.' using $pre_cmd in the script. If save_exec="n", the script will copy the application back upon a restart.

Edit the top part of blcr_wrapper to match your job's needs. Then uncomment either the START block or the RESTART block, never both. When you restart a job (the cluster assigns a new JOBPID), the script needs the old JOBPID of the crashed job, whose latest checkpoint file is in /sanscratch/checkpoints/.

Then submit to the scheduler as usual.

[hmeij@cottontail ~/ynam]$ bsub < blcr_wrapper_serial
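Before submitting a restart it may be worth confirming that the old job's checkpoint directory really holds what the wrapper will look for: one pwd.PID file (the saved work directory, named after the application's process id) and the matching chk.PID file. The sketch below is one way to do that; check_restart_files is a hypothetical helper and not part of blcr_wrapper.

```shell
#!/bin/bash
# Sketch: verify a checkpoint directory before a restart.
# Prints the recorded process id and returns 0 when exactly one pwd.PID file
# exists, is not empty, and a chk.PID file with the same PID sits next to it.
check_restart_files() {
    local dir=$1 count=0 pwd_file= pid f
    for f in "$dir"/pwd.*; do
        [ -e "$f" ] || continue      # glob matched nothing
        pwd_file=$f
        count=$((count + 1))
    done
    if [ "$count" -ne 1 ]; then
        echo "Error: expected one pwd.PID file in $dir, found $count" >&2
        return 1
    fi
    pid=${pwd_file##*.}              # text after the last dot is the PID
    if [ ! -s "$pwd_file" ] || [ ! -f "$dir/chk.$pid" ]; then
        echo "Error: empty $pwd_file or missing $dir/chk.$pid" >&2
        return 1
    fi
    echo "$pid"
}

# Example, assuming orgjobpid=250 as in the RESTART block of the wrapper:
# orgpid=$(check_restart_files /sanscratch/checkpoints/250) || exit 1
```

Failing early like this is cheaper than finding out on a compute node that cr_restart has nothing to restart from.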

Files v0.2

  • blcr_wrapper_serial at /home/hmeij/jobs/blcr/ for non-MPI jobs
#!/bin/bash 
# work dir and cwd
export MYSANSCRATCH=/sanscratch/$LSB_JOBID
cd $MYSANSCRATCH

# at job finish, all content in /sanscratch/JOBPID
# will be copied to /sanscratch/checkpoints/JOBPID
# content older than 3 months will be removed

# SCHEDULER 
#BSUB -q test
#BSUB -n 1
#BSUB -J test
#BSUB -o out
#BSUB -e err

# CHECK POINT TIME INTERVAL: 10m (debug) 6h 12h 18h 1d 
cpti=10m

# COPY APPLICATION TO WORK DIR $MYSANSCRATCH (cwd)
# always stage the application (and data if needed)
# if application is large and static save_exec="n"
save_exec="n"
pre_cmd=" scp $HOME/ynam/a.out . "
post_cmd=" scp $MYSANSCRATCH/fid.txt $HOME/ynam "

# IF START OF JOB, UNCOMMENT
# it is either the start or the restart block, never both
mode=start
queue=test
cmd="./a.out"

# IF RESTART OF JOB, UNCOMMENT
# you must have pwd.PID and chk.PID in /sanscratch/checkpoints/$orgjobpid/
#mode=restart
#queue=test
#orgjobpid=250

# buglines: if your group/lab is mentioned set value to "y", else "n"
# "y" for rblumel/ynam
do_bug1a_cmd="y"
do_bug1b_cmd="y" 



############### NOTHING TO EDIT BELOW THIS LINE ##################



# checkpoints
checkpoints=/sanscratch/checkpoints

# bug commands
bug1a_cmd="scp $MYSANSCRATCH/fid.txt $checkpoints/$LSB_JOBID/"
bug1b_cmd="scp $checkpoints/$orgjobpid/fid.txt $MYSANSCRATCH"

# kernel modules
mods=`lsmod | grep ^blcr | wc -l`
if [ $mods -ne 2 ]; then
        echo "Error: BLCR modules not loaded on `hostname`"
        kill $$
fi

# blcr setup
restore_options="--no-restore-pid --no-restore-pgid --no-restore-sid"
if [ $save_exec == "n" ]; then
        save_options="--save-private --save-shared"
else
        save_options="--save-all"
fi
export PATH=/share/apps/blcr/0.8.5/${queue}/bin:$PATH
export LD_LIBRARY_PATH=/share/apps/blcr/0.8.5/${queue}/lib:$LD_LIBRARY_PATH

# setup checkpoints dir
if [ ! -d $checkpoints/$LSB_JOBID ]; then
        mkdir -p $checkpoints/$LSB_JOBID 
else
        echo "Error: $checkpoints/$LSB_JOBID already exists, exiting"
        kill $$
fi

# save process id and path and start application
if [ "$mode" == "start" ];  then
        $pre_cmd
        cr_run $cmd &
        pid=$!
        pwd > $checkpoints/$LSB_JOBID/pwd.$pid
        orgjobpid=0
        if [ "X$do_bug1a_cmd" == "Xy" ]; then
                $bug1a_cmd
        fi

# otherwise restart the job
elif [ "$mode" == "restart" ]; then
        orgpid=`ls $checkpoints/$orgjobpid/pwd.* | awk -F\. '{print $2}'`
        orgpwd=`cat $checkpoints/$orgjobpid/pwd.$orgpid`
        #if [ "X$orgpid" == "X" -o "X$orgpwd" == "X" ]; then
        #       echo "Error: problem with missing orgpid or orgpwd values"
        #       kill $$
        #fi
        scp $checkpoints/$orgjobpid/*.$orgjobpid.err $checkpoints/$orgjobpid/*.$orgjobpid.out $HOME/.lsbatch/
        if [ $save_exec == "n" ]; then
                $pre_cmd
        fi
        if [ "X$do_bug1b_cmd" == "Xy" ]; then
                $bug1b_cmd
        fi
        cr_restart $restore_options --relocate $orgpwd=$MYSANSCRATCH $checkpoints/$orgjobpid/chk.$orgpid &
        pid=$!
        pwd > $checkpoints/$LSB_JOBID/pwd.$pid

# obviously
else
        echo "Error: startup mode not defined correctly"
        kill $$
fi

# if $cmd disappears during $cpti, terminate wrapper
export POST_CMD="$post_cmd"
blcr_watcher $pid $$ $LSB_JOBID $orgjobpid &

# always run this block
while [ true ]; do
        # checkpoint time interval
        sleep $cpti
        # checkpoint file outside of sanscratch
        cr_checkpoint $save_options -f $checkpoints/$LSB_JOBID/chk.$pid $pid
        scp $HOME/.lsbatch/*.$LSB_JOBID.err $HOME/.lsbatch/*.$LSB_JOBID.out $checkpoints/$LSB_JOBID/
        if [ "X$do_bug1a_cmd" == "Xy" ]; then
                $bug1a_cmd
        fi
done
  • blcr_watcher v01 at /share/apps/bin/
#!/bin/bash

# watch a process during check point time interval
# if it disappears, terminate the blcr_wrapper

checkpoints=/sanscratch/checkpoints

watch_pid=$1
watch_wrapper=$2
jobpid=$3
orgjobpid=$4

while [ $watch_pid -gt 0 ]; do
        sleep 600
        # match the PID column exactly; a plain grep would also match longer PIDs
        nopid=`ps -u $USER | awk -v pid="$watch_pid" '$1 == pid {print $1}'`
        if [ "${nopid}x" == 'x' ]; then
                # save output
                scp -rp $MYSANSCRATCH/* $checkpoints/$LSB_JOBID/
                if [ $orgjobpid -gt 0 ]; then
                        rm -f $HOME/.lsbatch/*.$orgjobpid.err $HOME/.lsbatch/*.$orgjobpid.out 
                fi
                $POST_CMD
                kill $watch_wrapper
                exit;
        fi
done
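
The liveness test in the watcher can also be written without parsing ps output. The sketch below uses kill -0, which sends no signal and only reports whether the process id exists and can be signalled by the current user. is_running is a hypothetical helper name, not part of blcr_watcher, and the sleep command merely stands in for the application.

```shell
#!/bin/bash
# Sketch: exact-PID liveness test with "kill -0".
# Returns 0 if the process exists (and we may signal it), non-zero otherwise.
is_running() {
    kill -0 "$1" 2>/dev/null
}

# Stand-in for the application: a short background process.
sleep 1 &
child=$!

is_running "$child" && echo "pid $child is running"
wait "$child"
is_running "$child" || echo "pid $child has finished"
```

One caveat: a finished process that its parent has not yet collected (a defunct process) still answers kill -0, so this is a sketch to test on your own jobs rather than a drop-in replacement.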

cluster/147.txt · Last modified: 2020/02/27 13:06 by hmeij07