cluster:148

This is an old revision of the document!




BLCR Checkpoint in OL3

  • This page concerns PARALLEL mpirun jobs only; there are some restrictions
    • all MPI processes need to be confined to one node
    • restarted jobs must use the same node (not sure why)
  • For the installation of BLCR and what it does, see BLCR

Checkpointing parallel jobs is a bit more complex than checkpointing a serial job. The MPI workers (the -n count) are started by worker 0 of mpirun, and all workers may open files and communicate socket to socket. A restart therefore needs to restore all file IDs, process IDs, etc., and may fail if one of those process IDs is already in use. Restarted jobs also behave as if the old JOBPID is still running: they write results to the old STDERR and STDOUT files and use the old hostfile.
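The process ID conflict described above can be checked by hand before a restart. The following is a minimal sketch, assuming a Linux node with /proc mounted; the function name pid_in_use is made up for this illustration and is not part of BLCR or of the wrapper.

```shell
#!/bin/bash
# Hypothetical helper, not part of BLCR or of the wrapper on this page.
# Returns 0 if a process with the given PID currently exists on this node.
pid_in_use() {
    # on Linux every existing process has a directory /proc/<pid>
    [ -d "/proc/$1" ]
}

# Example: the shell running this snippet is certainly alive.
if pid_in_use $$; then
    echo "PID $$ is in use"
fi
```

Before resubmitting, one could run such a check on the execution node for each process ID the old job owned.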

The 'blcr_wrapper_parallel' script below will manage all this for you. As with the serial wrapper, edit only the top of the file and provide the necessary information. But first, your software needs to be compiled with a special, older version of OpenMPI, because MPI checkpointing support was removed in later versions.
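To make the version requirement concrete, here is a small sketch that tests a version string against the "older than 1.7" cutoff given in the admin notes on this page. The function name ompi_supports_cr is hypothetical, and an old version number alone does not prove that the build was configured with checkpoint support.

```shell
#!/bin/bash
# Hypothetical helper: succeeds if an Open MPI version string is older than 1.7.
# It only compares numbers; it does not inspect how Open MPI was configured.
ompi_supports_cr() {
    local major minor
    major=${1%%.*}              # text before the first dot
    minor=${1#*.}               # drop the "major." prefix
    minor=${minor%%.*}          # keep text up to the next dot
    [ "$major" -lt 1 ] || { [ "$major" -eq 1 ] && [ "$minor" -lt 7 ]; }
}

for v in 1.6.5 1.7.0 1.10.2; do
    if ompi_supports_cr "$v"; then
        echo "$v: older than 1.7"
    else
        echo "$v: too new"
    fi
done
```

The version string itself would come from the mpirun found first in $PATH, for example from the first line of `mpirun --version`.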

Here is the admin stuff.

# from eric at lbl, configure openmpi, I chose 1.6.5 (version needs to be < 1.7)
./configure \
            --enable-ft-thread \
            --with-ft=cr \
            --enable-opal-multi-threads \
            --with-blcr=/share/apps/blcr/0.8.5/test \
            --without-tm \
            --prefix=/share/apps/CENTOS6/openmpi/1.6.5.cr
make
make install

# next download cr_mpirun from LBL
https://upc-bugs.lbl.gov/blcr-dist/cr_mpirun/cr_mpirun-295.tar.gz

# configure and test cr_mpirun
export PATH=/share/apps/CENTOS6/openmpi/1.6.5.cr/bin:$PATH
export LD_LIBRARY_PATH=/share/apps/CENTOS6/openmpi/1.6.5.cr/lib:$LD_LIBRARY_PATH

./configure --with-blcr=/share/apps/blcr/0.8.5/test
make 
make check

============================================================================
Testsuite summary for cr_mpirun 295
============================================================================
# TOTAL: 3
# PASS:  3
# SKIP:  0
# XFAIL: 0
# FAIL:  0
# XPASS: 0
# ERROR: 0
============================================================================
make[1]: Leaving directory `/home/apps/src/petaltail6/cr_mpirun-295'

# I copied cr_mpirun into /share/apps/CENTOS6/openmpi/1.6.5.cr/bin/
# cr_mpirun needs access to all these in $PATH
# mpirun cr_mpirun ompi-checkpoint ompi-restart cr_checkpoint cr_restart

# next compile your parallel software using mpicc/mpicxx from the 1.6.5 distro
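Since cr_mpirun needs all six tools in $PATH, a quick check after setting the environment can save a failed submission. This is a minimal sketch; the function name missing_tools is made up for this illustration.

```shell
#!/bin/bash
# Hypothetical helper: print every named command that is NOT found in $PATH.
missing_tools() {
    local tool
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# The six tools listed in the notes above.
missing=$(missing_tools mpirun cr_mpirun ompi-checkpoint ompi-restart cr_checkpoint cr_restart)
if [ -n "$missing" ]; then
    echo "not in PATH:" $missing
fi
```

No output means all six commands were found.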

Here is what a sample run using the Openlava scheduler looks like

# submit as usual after editing the top of the file, see comments in that wrapper file
[hmeij@cottontail lammps]$ bsub < blcr_wrapper_parallel

Job <681> is submitted to queue <test>.

# cr_mpirun job
[hmeij@cottontail lammps]$ bjobs
JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
681     hmeij   RUN   test       cottontail  petaltail   test       Mar 29 13:23
                                             petaltail
                                             petaltail
                                             petaltail

# wrapper stores BLCR checkpoint file (chk.PID) in this location
# and it calls the openmpi snapshot tools and stores that in
# ompi_global_snapshot_SOME-PID.ckpt, also in same location

[hmeij@cottontail lammps]$ ll /sanscratch/checkpoints/681
total 30572                                              
-rw------- 1 hmeij its     8704 Mar 29 13:28 1459272204.681.err
-rw------- 1 hmeij its     5686 Mar 29 13:28 1459272204.681.out
-rw-r--r-- 1 hmeij its     2652 Mar 29 13:28 au.inp
-rw-r--r-- 1 hmeij its        0 Mar 29 13:28 auout
-rw-r--r-- 1 hmeij its    38310 Mar 29 13:28 auu3
-r-------- 1 hmeij its   289714 Mar 29 13:28 chk.9127
-rw-r--r-- 1 hmeij its 21342187 Mar 29 13:28 data.Big11AuSAMInitial
-rw-r--r-- 1 hmeij its  9598629 Mar 29 13:28 henz.dump
drwx------ 3 hmeij its       46 Mar 29 13:28 ompi_global_snapshot_9134.ckpt
-rw-r--r-- 1 hmeij its       16 Mar 29 13:23 pwd.9127

# the processes running
[hmeij@cottontail lammps]$ ssh petaltail ps -u hmeij
  PID TTY          TIME CMD
 5762 ?        00:00:00 sshd
 5763 pts/1    00:00:00 bash
 9104 ?        00:00:00 res
 9110 ?        00:00:00 1459272204.681
 9113 ?        00:00:00 1459272204.681.
 9127 ?        00:00:00 cr_mpirun
 9128 ?        00:00:00 blcr_watcher <--- the watcher, will terminate 9113 if 9127 is gone (ie done or crashed)
 9133 ?        00:00:00 cr_mpirun
 9134 ?        00:00:00 mpirun
 9135 ?        00:00:00 sleep
 9136 ?        00:05:55 lmp_mpi
 9137 ?        00:06:07 lmp_mpi
 9138 ?        00:05:52 lmp_mpi
 9139 ?        00:05:53 lmp_mpi
 9347 ?        00:00:00 sleep
 9369 ?        00:00:00 sshd
 9370 ?        00:00:00 ps
18559 pts/2    00:00:00 bash

# how far did the job progress?
[hmeij@cottontail lammps]$ tail ~/.lsbatch/1459272204.681.out
     190    5007151.9   -4440573.2 1.1211925e+08 2.8818913e+08    22081.756
     200    5279740.3     -4440516 1.0229219e+08 2.8818909e+08    23386.523
     210    5527023.5   -4440257.5     93377214 2.8818906e+08    24569.109
     220    5747387.7   -4439813.3     85432510 2.8818904e+08    25621.734
     230    5939773.3   -4439214.4     78496309 2.8818904e+08    26539.282
     240    6103647.2   -4438507.6     72587871 2.8818905e+08    27319.145
     250    6238961.8   -4437755.5     67708974 2.8818907e+08    27961.064
     260    6346104.5   -4437033.7     63845731 2.881891e+08    28466.852
     270    6425840.1   -4436426.7     60970646 2.8818913e+08    28840.129
     280    6479251.9   -4436021.1     59044759 2.8818917e+08    29086.075
     
# simulate crash
[hmeij@cottontail lammps]$ ssh petaltail kill 9133

# edit the file and prep for a restart, submit again
[hmeij@cottontail lammps]$ bsub < blcr_wrapper
Job <684> is submitted to queue <test>.       

# so job 684 is restarting job 681, wrapper preps files
[hmeij@cottontail lammps]$ ll ../.lsbatch/                                      
total 172                                                                       
-rw------- 1 hmeij its 8589 Mar 29 13:48 1459272204.681.err                     
-rw------- 1 hmeij its 5686 Mar 29 13:48 1459272204.681.out                     
-rwx------ 1 hmeij its 4609 Mar 29 13:48 1459273700.684                         
-rw------- 1 hmeij its 9054 Mar 29 13:48 1459273700.684.err                     
-rw------- 1 hmeij its   53 Mar 29 13:48 1459273700.684.out                     
-rwxr--r-- 1 hmeij its 4270 Mar 29 13:48 1459273700.684.shell                   
-rwxr--r-- 1 hmeij its   33 Mar 29 13:48 hostfile.681 
-rw-r--r-- 1 hmeij its   40 Mar 29 13:48 hostfile.684                                     
-rw-r--r-- 1 hmeij its   40 Mar 29 13:48 hostfile.tmp.684                                 

[hmeij@cottontail lammps]$ ssh petaltail ps -u hmeij
  PID TTY          TIME CMD                         
 5762 ?        00:00:00 sshd                        
 5763 pts/1    00:00:00 bash                        
 9127 ?        00:00:00 cr_mpirun                   
 9136 ?        00:00:34 lmp_mpi                     
 9137 ?        00:00:34 lmp_mpi                     
 9138 ?        00:00:34 lmp_mpi                     
 9139 ?        00:00:34 lmp_mpi                     
 9994 ?        00:00:00 res                         
10002 ?        00:00:00 1459273700.684              
10005 ?        00:00:00 1459273700.684.             
10039 ?        00:00:00 cr_restart    <------ started everything back up              
10051 ?        00:00:00 cr_mpirun                   
10052 ?        00:00:00 mpirun                      
10053 ?        00:00:00 blcr_watcher                
10054 ?        00:00:00 sleep                       
10055 ?        00:00:00 sleep                       
10056 ?        00:00:01 cr_restart                  
10057 ?        00:00:01 cr_restart                  
10058 ?        00:00:02 cr_restart                  
10059 ?        00:00:02 cr_restart                  
10151 ?        00:00:00 sshd                        
10152 ?        00:00:00 ps                          
18559 pts/2    00:00:00 bash                        

# and now you can watch the output picking up from the last checkpoint file
[hmeij@cottontail lammps]$ tail -20 ../.lsbatch/1459272204.681.out
     210    5527023.5   -4440257.5     93377214 2.8818906e+08    24569.109
     220    5747387.7   -4439813.3     85432510 2.8818904e+08    25621.734
     230    5939773.3   -4439214.4     78496309 2.8818904e+08    26539.282
     240    6103647.2   -4438507.6     72587871 2.8818905e+08    27319.145
     250    6238961.8   -4437755.5     67708974 2.8818907e+08    27961.064
     260    6346104.5   -4437033.7     63845731 2.881891e+08    28466.852
     270    6425840.1   -4436426.7     60970646 2.8818913e+08    28840.129
     280    6479251.9   -4436021.1     59044759 2.8818917e+08    29086.075
     290      6507681   -4435898.2     58019799 2.8818922e+08    29211.089
     300      6512669   -4436124.7     57840251 2.8818927e+08    29222.575
     310    6495904.7   -4436745.3     58445285 2.8818932e+08    29128.647
     320    6459174.9   -4437776.1     59770495 2.8818937e+08     28937.93
     330    6404322.4   -4439201.5     61749434 2.8818942e+08    28659.348
     340      6333209   -4440973.8     64314930 2.8818947e+08    28301.927
     350    6247685.4   -4443016.1     67400192 2.8818951e+08    27874.684
     360    6149565.9   -4445228.2     70939709 2.8818956e+08    27386.465
     370    6040609.2   -4447492.8     74869965 2.8818961e+08    26845.871
     380    5922503.2   -4449683.5     79129981 2.8818965e+08    26261.166
     390    5796854.1   -4451671.6     83661722 2.8818969e+08    25640.235
     400    5665179.3     -4453332     88410367 2.8818972e+08    24990.519

# let job finish
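The wrapper's restart block expects both a pwd.* and a chk.* file in the checkpoint directory of the original job. The sketch below checks for them before resubmitting; the function name restart_ready is hypothetical, and the path assumes the wrapper's default checkpoints location.

```shell
#!/bin/bash
# Hypothetical pre-flight check, not part of the wrapper.
# Succeeds if DIR holds at least one pwd.* file and one chk.* file.
restart_ready() {
    local dir=$1
    # ls fails when the glob matches nothing
    ls "$dir"/pwd.* >/dev/null 2>&1 && ls "$dir"/chk.* >/dev/null 2>&1
}

# Example with the job ID from the run above.
if restart_ready /sanscratch/checkpoints/681; then
    echo "job 681 can be restarted"
else
    echo "job 681: checkpoint files missing"
fi
```

If the check passes, comment out the start block at the top of the wrapper, uncomment the restart block, set orgjobpid to the original job ID (681 in the run above) and submit again.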

Parallel Wrapper

#!/bin/bash 
rm -f err out
# work dir and cwd
export MYSANSCRATCH=/sanscratch/$LSB_JOBID
cd $MYSANSCRATCH

# at job finish, all content in /sanscratch/JOBPID
# will be copied to /sanscratch/checkpoints/JOBPID
# content older than 3 months will be removed

# SCHEDULER 
#BSUB -q test
#BSUB -J test
#BSUB -n 4
#BSUB -o out
#BSUB -e err
# next required for mpirun checkpoint to work
# restarts must use same node (not sure why)
#BSUB -R "span[hosts=1]"

# CHECK POINT TIME INTERVAL: 10m (debug) 6h 12h 18h 1d 
cpti=1d

# COPY APPLICATION TO WORK DIR $MYSANSCRATCH (cwd)
# always stage the application (and data if needed)
# if mpirun save_exec="n" (default)
save_exec="n"
pre_cmd=" scp -r  $HOME/lammps/au.inp $HOME/lammps/auu3 
 $HOME/lammps/henz.dump  $HOME/lammps/data.Big11AuSAMInitial  . "
post_cmd=" scp auout $HOME/lammps/auout.$LSB_JOBID "

# IF START OF JOB, UNCOMMENT
# use either the start block or the restart block, not both
mode=start
queue=test
cmd=" lmp_mpi -c off -var GPUIDX 0 -in au.inp -l auout "

# IF RESTART OF JOB, UNCOMMENT
# you must have pwd.JOBPID and chk.JOBPID in $orgjobpid/
#mode=restart
#queue=test
#orgjobpid=691

# user environment
export PATH=/share/apps/CENTOS6/blcr_soft/lammps/16Feb16/:$PATH
export LD_LIBRARY_PATH=/share/apps/CENTOS6/blcr_soft/lammps/16Feb16/lib:$LD_LIBRARY_PATH
#which lmp_mpi



############### NOTHING TO EDIT BELOW THIS LINE ##################



# checkpoints
checkpoints=/sanscratch/checkpoints

# kernel modules
mods=`/sbin/lsmod | grep ^blcr | wc -l`
if [ $mods -ne 2 ]; then
        echo "Error: BLCR modules not loaded on `hostname`"
        kill $$
fi

# blcr setup
restore_options=""
#restore_options="--no-restore-pid --no-restore-pgid --no-restore-sid"
if [ $save_exec == "n" ]; then
        #save_options="--save-private --save-shared"
        save_options="--save-none"
else
        save_options="--save-all"
fi

# environment 
export PATH=/share/apps/CENTOS6/openmpi/1.6.5.cr/bin:$PATH
export LD_LIBRARY_PATH=/share/apps/CENTOS6/openmpi/1.6.5.cr/lib:$LD_LIBRARY_PATH

export PATH=/share/apps/blcr/0.8.5/${queue}/bin:$PATH
export LD_LIBRARY_PATH=/share/apps/blcr/0.8.5/${queue}/lib:$LD_LIBRARY_PATH

#which mpirun cr_mpirun ompi-checkpoint ompi-restart cr_checkpoint cr_restart

# setup checkpoints dir
if [ ! -d $checkpoints/$LSB_JOBID ]; then
        mkdir -p $checkpoints/$LSB_JOBID 
else
        echo "Error: $checkpoints/$LSB_JOBID already exists, exiting"
        kill $$
fi

# save process id and path and start application
if [ "$mode" == "start" ];  then
        # hostfile
        echo "${LSB_HOSTS}" > $HOME/.lsbatch/hostfile.tmp.$LSB_JOBID
        tr '\/ ' '\r\n' < $HOME/.lsbatch/hostfile.tmp.$LSB_JOBID > $HOME/.lsbatch/hostfile.$LSB_JOBID
        $pre_cmd
        cr_mpirun -am ft-enable-cr --gmca snapc_base_global_snapshot_dir $checkpoints/$LSB_JOBID \
                  --hostfile $HOME/.lsbatch/hostfile.$LSB_JOBID $cmd &
        pid=$!
        pwd > $checkpoints/$LSB_JOBID/pwd.$pid
        orgjobpid=0

# otherwise restart the job
elif [ "$mode" == "restart" ]; then
        orgpid=`ls $checkpoints/$orgjobpid/pwd.* | awk -F\. '{print $2}'`
        orgpwd=`cat $checkpoints/$orgjobpid/pwd.$orgpid`
        # cleanup old
        rm -rf /sanscratch/$orgjobpid $HOME/.lsbatch/*.$orgjobpid
        # stage old
        scp $checkpoints/$orgjobpid/*.$orgjobpid.err $checkpoints/$orgjobpid/*.$orgjobpid.out $HOME/.lsbatch/
        scp -r $checkpoints/$orgjobpid/* $MYSANSCRATCH
        ln -s $MYSANSCRATCH /sanscratch/$orgjobpid
        scp $checkpoints/$orgjobpid/hostfile.$orgjobpid $HOME/.lsbatch/
        cr_restart --kmsg-warning $restore_options --relocate $orgpwd=$MYSANSCRATCH $MYSANSCRATCH/chk.$orgpid &
        pid=$!
        started=` ps -u $(id -un) | awk '{print $1}' | grep -x $pid | wc -l`
        if [ $started -ne 1 ]; then
                echo "Error: cr_restart failed, check error log"
                kill $$
        fi
        pwd > $checkpoints/$LSB_JOBID/pwd.$pid

# obviously
else
        echo "Error: startup mode not defined correctly"
        kill $$
fi

# if $cmd disappears during $cpti, terminate wrapper
export POST_CMD="$post_cmd"
blcr_watcher $pid $$ $LSB_JOBID $orgjobpid &

# always run this block
while [ true ]; do
        # checkpoint time interval
        sleep $cpti
        # checkpoint file outside of sanscratch
        scp -r $MYSANSCRATCH/* $checkpoints/$LSB_JOBID/
        cr_checkpoint --tree $save_options -f $checkpoints/$LSB_JOBID/chk.$pid $pid
        scp $HOME/.lsbatch/*.$LSB_JOBID.err $HOME/.lsbatch/*.$LSB_JOBID.out $checkpoints/$LSB_JOBID/
        scp $HOME/.lsbatch/hostfile.$LSB_JOBID $checkpoints/$LSB_JOBID/
done
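
The wrapper calls blcr_watcher, which is not listed on this page. Based only on its description in the process listing above (it terminates the wrapper once cr_mpirun is gone), a watcher could look roughly like the sketch below. This is a guess at its behavior, not the actual script: the function name, the polling interval and the handling of POST_CMD are all assumptions.

```shell
#!/bin/bash
# Hypothetical sketch of a watcher; NOT the site's actual blcr_watcher.
# usage: watch_and_terminate APP_PID WRAPPER_PID [POLL_SECONDS]
watch_and_terminate() {
    local app_pid=$1 wrapper_pid=$2 poll=${3:-60}
    # kill -0 sends no signal; it only tests whether the process still exists
    while kill -0 "$app_pid" 2>/dev/null; do
        sleep "$poll"
    done
    # application is gone (finished or crashed): run the post command, if any
    if [ -n "$POST_CMD" ]; then
        $POST_CMD
    fi
    # stop the wrapper so its checkpoint loop does not sleep forever
    kill "$wrapper_pid" 2>/dev/null
    return 0
}
```

Polling with kill -0 keeps the sketch free of any BLCR dependency. The real watcher also receives the job ID and the original job ID as arguments, which this sketch ignores.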



cluster/148.1459363761.txt.gz · Last modified: 2016/03/30 14:49 (external edit)