Queue tinymem supports BLCR — Henk 2016/03/03 13:57

Adjust your PATH and LD_LIBRARY_PATH accordingly.
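
For the tinymem build that would look roughly like this (an assumption based on the layout of the mw256 and test builds further down; verify the actual path):

export PATH=/share/apps/blcr/0.8.5/tinymem/bin:$PATH
export LD_LIBRARY_PATH=/share/apps/blcr/0.8.5/tinymem/lib:$LD_LIBRARY_PATH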

BLCR

So we need a day of downtime to switch file server functionality from greentail to sharptail. It would be nice if nobody lost any computational progress. To do that, we need to learn to checkpoint at the application level. If a node crashes or power is lost, those applications can then restart the job from the last checkpoint.

I've decided to support one checkpoint/restart utility, the Berkeley Lab Checkpoint/Restart (BLCR) tool. Hence this page.

BLCR consists of two kernel modules, some user-level libraries, and several command-line executables. No kernel patching is required. The modules are loaded at boot via /etc/rc.local. They are tied to the kernel source they were compiled against, so for our first supported BLCR modules I've chosen the mw256 queue and nodes. Here is some documentation on BLCR
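
For reference, the boot-time loading amounts to a couple of insmod lines in /etc/rc.local. A rough sketch, assuming the modules live where a default BLCR install puts them (adjust the paths to the actual build for the running kernel):

# /etc/rc.local excerpt (sketch); blcr_imports must load before blcr
/sbin/insmod /share/apps/blcr/0.8.5/mw256/lib/blcr/`uname -r`/blcr_imports.ko
/sbin/insmod /share/apps/blcr/0.8.5/mw256/lib/blcr/`uname -r`/blcr.ko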

First, let's test on a node to grasp the concept.

# are modules loaded
[hmeij@n33 blcr]$ lsmod | grep blcr
blcr                  115529  0
blcr_imports           10715  1 blcr


# set env
export PATH=/share/apps/blcr/0.8.5/mw256/bin:$PATH
export LD_LIBRARY_PATH=/share/apps/blcr/0.8.5/mw256/lib:$LD_LIBRARY_PATH

# is it all working
[hmeij@n33 blcr]$ cr_checkpoint --help
Usage: cr_checkpoint [options] ID           

Options:
General options:
  -v, --verbose          print progress messages to stderr.
  -q, --quiet            suppress error/warning messages to stderr.
  -?, --help             print this message and exit.              
      --version          print version information and exit.  
...

# and here is our application and output (one extra character per second)
[hmeij@n33 blcr]$ ./t-20001030-01
*
**
***
****
*****
******
...

So now let's run this under BLCR and observe what happens. After we define the proper environment, we use cr_run to launch our application. Standard output and error are written to the file context. We then note the PID of our process and use cr_checkpoint to write a checkpoint file and immediately terminate the process.

# start application
[hmeij@n33 blcr]$ cr_run ./t-20001030-01 > context 2>&1 &
[1] 12789

# observe PID
[hmeij@n33 blcr]$ ps
  PID TTY          TIME CMD
12789 pts/29   00:00:00 t-20001030-01
12817 pts/29   00:00:00 ps
28257 pts/29   00:00:00 bash

# wait, then checkpoint and terminate process
[hmeij@n33 blcr]$ sleep 30
[hmeij@n33 blcr]$ cr_checkpoint --term 12789
[1]+  Terminated              cr_run ./t-20001030-01 > context 2>&1

# save the output
[hmeij@n33 blcr]$ mv context context.save

Ok. Next we use cr_restart to restart our application by pointing it at the checkpoint file that was generated. Then we'll wait a bit and terminate the restart.

# restart in background
[hmeij@n33 blcr]$ cr_restart ./context.12789 > context 2>&1 &
[1] 13579

# wait and terminate the restart
[hmeij@n33 blcr]$ sleep 30
[hmeij@n33 blcr]$ kill %1
[1]+  Terminated              cr_restart ./context.12789 > context 2>&1

So what we're interested in is the boundary between the first termination and the subsequent restart. It looks like this:

[hmeij@n33 blcr]$ tail context.save
*****************************************
******************************************
*******************************************
********************************************
*********************************************
**********************************************
***********************************************
************************************************
*************************************************
**************************************************
[hmeij@n33 blcr]$ head context
***************************************************
****************************************************
*****************************************************
******************************************************
*******************************************************
********************************************************
*********************************************************
**********************************************************
***********************************************************
************************************************************

# pretty nifty!
# but be forewarned that there are binary characters lurking at this boundary
# you can strip them out with ''sed'' or ''tr''
# it looks like this

^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@***************************************************
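
A quick way to clean that up, as a sketch (both commands write a cleaned copy instead of editing in place):

# strip the NUL bytes from the combined output
tr -d '\000' < context > context.clean
# or, with GNU sed
sed 's/\x00//g' context > context.clean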

Now we can write a batch script for the scheduler. We need to do several things:

  • The job will always end up in /sanscratch/$LSB_JOBID so we need to stage and save our data
  • The checkpoint file should be written to a safe place, like /home
  • The time interval for checkpointing should be sufficiently large so it does not slow the job down
    • for example, set it to 12 hours or even 24 hours
    • the small interval in the script below is just for testing
  • Then there are 2 blocks of lines to (un)comment
    • one to invoke cr_run
    • one to invoke cr_restart
  • For a restart we need two things
    • create a link from the old working directory to the new working directory (the old one is saved in the pwd text file)
    • edit the script: change the comment blocks and set the process_id (see the sketch below)
      • the restart job may end up on another node but will use the same process_id
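
For example, preparing a restart would look roughly like this (12789 is a made-up pid; the file names follow the run.serial script below):

# files saved to ~/blcr by the first run: checkpoint.<pid>, context, pwd.<pid>
ls ~/blcr
# the old working directory the symlink must point to
cat ~/blcr/pwd.12789
# then edit run.serial: comment out the cr_run block, uncomment the
# cr_restart block, and set process_id to the pid found above (12789 here)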

After you have restarted, you can watch the application pick up from the checkpoint file you pointed it to. To simulate a crash while your first submission is running under cr_run, find the node it is running on and the process ID (both are in the *out file), then issue the command ssh node_name kill process_id and wait for the next while iteration to detect the missing process and exit. The scheduler will think the job finished fine (job status DONE). Or just issue a bkill command; you should be able to recover from that too.
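
For example (node name, pid and job id are made up):

# simulate a node crash: kill the running process directly on the node
ssh n33 kill 12789
# or kill the whole job through the scheduler
bkill 490001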

It would be even sweeter if the scheduler could be told to do the checkpointing at intervals. I'm investigating that, but in the meantime you can do it manually.

run.serial

#!/bin/bash 
# submit via 'bsub < run.serial'
#BSUB -q test
#BSUB -n 1
#BSUB -J test
#BSUB -o out
#BSUB -e err

# clean up leftovers from a previous run
# (kept below the #BSUB lines so the scheduler still reads them)
rm -f *err *out *shell

export PATH=/share/apps/blcr/0.8.5/test/bin:$PATH
export LD_LIBRARY_PATH=/share/apps/blcr/0.8.5/test/lib:$LD_LIBRARY_PATH

# checkpoint file is defined in while loop
MYSANSCRATCH=/sanscratch/$LSB_JOBID
MYLOCALSCRATCH=/localscratch/$LSB_JOBID
export MYSANSCRATCH MYLOCALSCRATCH
cd $MYSANSCRATCH

# stage the application (plus data if needed)
cp -rp ~/blcr/t-20001030-01 .

# on first start of application, remember the working directory
# save some stuff for checking later and restart
#cr_run ./t-20001030-01 > context 2>&1 &
#sleep 60
#process_id=`ps -u hmeij | grep t-20001030-01 | grep -v grep | awk '{print $1}'`
#pwd > pwd.$process_id
#cp -p pwd* *.shell *.out *.err ~/blcr/

# on restart, give cr_restart some time to set up
# WARNING: it will overwrite the checkpoint file, save it
# you need to find the process_id and supply it
process_id=4711
cp -p ~/blcr/checkpoint.$process_id ~/blcr/checkpoint.$process_id.saved
mv ~/blcr/context  ~/blcr/context.save
ln -s $MYSANSCRATCH `cat ~/blcr/pwd.$process_id`
cr_restart ~/blcr/checkpoint.$process_id > context 2>&1 &
sleep 60

# always uncommented
echo "process_id=$process_id"
# remember the original pid, it names the checkpoint and pwd files
orig_pid=$process_id
while [ $process_id -gt 0 ]; do
        # checkpoint time interval, make it very large (small for testing)
        sleep 120
        # save the checkpoint file outside of sanscratch
        cr_checkpoint -f ~/blcr/checkpoint.$process_id $process_id
        cp -p context ~/blcr/
        # if the application has crashed, or finished, exit
        process_id=`ps -u hmeij | grep t-20001030-01 | grep -v grep | awk '{print $1}'`
        if [ "${process_id}x" = 'x' ]; then
                # remove the symlink to the old working directory (named after the original pid)
                rm -f `cat ~/blcr/pwd.$orig_pid`
                exit;
        fi 
done
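
To try it out, submit the script and watch the files collect in ~/blcr (a sketch; the job id reported by bjobs will differ):

# submit and monitor
bsub < run.serial
bjobs
# checkpoint file, saved output and pwd file end up here
ls -l ~/blcr/checkpoint.* ~/blcr/context ~/blcr/pwd.*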


