Deprecated since the OS update.
We will replace it with DMTCP.
— Henk 2020/01/14 14:31
Checkpointing a parallel job is a bit more complex than checkpointing a serial job. The MPI workers (the -n processes) are spawned by rank 0 of mpirun,
and all workers may open files and perform socket-to-socket communication. A restart therefore needs to restore all the file descriptors, process IDs, etc., and a job may fail if one of those process IDs is already in use on the node. A restarted job also behaves as if the old JOBPID were still running and writes its results to the old STDERR and STDOUT files.
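If a restart fails because a saved process ID is already in use on the node, BLCR's ''cr_restart'' can be told not to restore process, process group, and session IDs. The wrappers below carry these flags in a commented-out ''restore_options'' line; a minimal sketch of enabling them by hand (the chk.9127 file name comes from the sample run later on this page, the workdir paths are placeholders):

export PATH=/share/apps/blcr/0.8.5/test/bin:$PATH
restore_options="--no-restore-pid --no-restore-pgid --no-restore-sid"
cr_restart --kmsg-warning $restore_options --relocate /old/workdir=/new/workdir chk.9127

Note that the wrappers leave ''restore_options'' empty by default and instead require restarts on the same node.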
The ''blcr_wrapper_parallel'' below will manage all of this for you. As with the serial wrapper, edit only the top of the file and provide the necessary information. But first, your software needs to be compiled against a special, older version of OpenMPI: MPI checkpointing support has been removed from later OpenMPI releases (the version needs to be below 1.7).
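To check whether an OpenMPI installation was built with checkpoint/restart support, you can ask ompi_info about it. A quick sketch, assuming the 1.6.5.cr install path used in the admin notes below:

export PATH=/share/apps/CENTOS6/openmpi/1.6.5.cr/bin:$PATH
mpirun --version           # should report Open MPI 1.6.x
ompi_info | grep -i blcr   # a blcr entry under the crs framework indicates C/R support was built in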
Here is the admin stuff.
# from eric at lbl, configure openmpi, I chose 1.6.5 (version needs to be < 1.7)
./configure \
  --enable-ft-thread \
  --with-ft=cr \
  --enable-opal-multi-threads \
  --with-blcr=/share/apps/blcr/0.8.5/test \
  --without-tm \
  --prefix=/share/apps/CENTOS6/openmpi/1.6.5.cr
make
make install

# next download cr_mpirun from LBL
# https://upc-bugs.lbl.gov/blcr-dist/cr_mpirun/cr_mpirun-295.tar.gz

# configure and test cr_mpirun
export PATH=/share/apps/CENTOS6/openmpi/1.6.5.cr/bin:$PATH
export LD_LIBRARY_PATH=/share/apps/CENTOS6/openmpi/1.6.5.cr/lib:$LD_LIBRARY_PATH
./configure --with-blcr=/share/apps/blcr/0.8.5/test
make
make check

============================================================================
Testsuite summary for cr_mpirun 295
============================================================================
# TOTAL: 3
# PASS:  3
# SKIP:  0
# XFAIL: 0
# FAIL:  0
# XPASS: 0
# ERROR: 0
============================================================================
make[1]: Leaving directory `/home/apps/src/petaltail6/cr_mpirun-295'

# I copied cr_mpirun into /share/apps/CENTOS6/openmpi/1.6.5.cr/bin/
# cr_mpirun needs access to all these in $PATH
# mpirun cr_mpirun ompi-checkpoint ompi-restart cr_checkpoint cr_restart

# next compile your parallel software using mpicc/mpicxx from the 1.6.5 distro
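With that in place, users compile their parallel code with the mpicc/mpicxx from the checkpoint-enabled install. A minimal sketch (the source and binary names are placeholders):

export PATH=/share/apps/CENTOS6/openmpi/1.6.5.cr/bin:$PATH
export LD_LIBRARY_PATH=/share/apps/CENTOS6/openmpi/1.6.5.cr/lib:$LD_LIBRARY_PATH
which mpicc                              # confirm it resolves to the 1.6.5.cr install
mpicc -O2 -o my_mpi_app my_mpi_app.c     # placeholder source/binary names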
Here is what a sample run using the OpenLava scheduler looks like:
# submit as usual after editing the top of the file, see comments in that wrapper file
[hmeij@cottontail lammps]$ bsub < blcr_wrapper_parallel
Job <681> is submitted to queue <test>.

# cr_mpirun job
[hmeij@cottontail lammps]$ bjobs
JOBID   USER    STAT  QUEUE    FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
681     hmeij   RUN   test     cottontail  petaltail   test       Mar 29 13:23
                                           petaltail
                                           petaltail
                                           petaltail

# wrapper stores BLCR checkpoint file (chk.PID) in this location
# and it calls the openmpi snapshot tools and stores that in
# ompi_global_snapshot_SOME-PID.ckpt, also in same location
[hmeij@cottontail lammps]$ ll /sanscratch/checkpoints/681
total 30572
-rw------- 1 hmeij its     8704 Mar 29 13:28 1459272204.681.err
-rw------- 1 hmeij its     5686 Mar 29 13:28 1459272204.681.out
-rw-r--r-- 1 hmeij its     2652 Mar 29 13:28 au.inp
-rw-r--r-- 1 hmeij its        0 Mar 29 13:28 auout
-rw-r--r-- 1 hmeij its    38310 Mar 29 13:28 auu3
-r-------- 1 hmeij its   289714 Mar 29 13:28 chk.9127
-rw-r--r-- 1 hmeij its 21342187 Mar 29 13:28 data.Big11AuSAMInitial
-rw-r--r-- 1 hmeij its  9598629 Mar 29 13:28 henz.dump
drwx------ 3 hmeij its       46 Mar 29 13:28 ompi_global_snapshot_9134.ckpt
-rw-r--r-- 1 hmeij its       16 Mar 29 13:23 pwd.9127

# the processes running
[hmeij@cottontail lammps]$ ssh petaltail ps -u hmeij
  PID TTY          TIME CMD
 5762 ?        00:00:00 sshd
 5763 pts/1    00:00:00 bash
 9104 ?        00:00:00 res
 9110 ?        00:00:00 1459272204.681
 9113 ?        00:00:00 1459272204.681.
 9127 ?        00:00:00 cr_mpirun
 9128 ?        00:00:00 blcr_watcher  <--- the watcher, will terminate 9113 if 9127 is gone (ie done or crashed)
 9133 ?        00:00:00 cr_mpirun
 9134 ?        00:00:00 mpirun
 9135 ?        00:00:00 sleep
 9136 ?        00:05:55 lmp_mpi
 9137 ?        00:06:07 lmp_mpi
 9138 ?        00:05:52 lmp_mpi
 9139 ?        00:05:53 lmp_mpi
 9347 ?        00:00:00 sleep
 9369 ?        00:00:00 sshd
 9370 ?        00:00:00 ps
18559 pts/2    00:00:00 bash

# how far did the job progress?
[hmeij@cottontail lammps]$ tail ~/.lsbatch/1459272204.681.out
190 5007151.9 -4440573.2 1.1211925e+08 2.8818913e+08 22081.756
200 5279740.3 -4440516 1.0229219e+08 2.8818909e+08 23386.523
210 5527023.5 -4440257.5 93377214 2.8818906e+08 24569.109
220 5747387.7 -4439813.3 85432510 2.8818904e+08 25621.734
230 5939773.3 -4439214.4 78496309 2.8818904e+08 26539.282
240 6103647.2 -4438507.6 72587871 2.8818905e+08 27319.145
250 6238961.8 -4437755.5 67708974 2.8818907e+08 27961.064
260 6346104.5 -4437033.7 63845731 2.881891e+08 28466.852
270 6425840.1 -4436426.7 60970646 2.8818913e+08 28840.129
280 6479251.9 -4436021.1 59044759 2.8818917e+08 29086.075

# simulate crash
[hmeij@cottontail lammps]$ ssh petaltail kill 9133

# edit the file and prep for a restart, submit again
[hmeij@cottontail lammps]$ bsub < blcr_wrapper
Job <684> is submitted to queue <test>.

# so job 684 is restarting job 681, wrapper preps files
[hmeij@cottontail lammps]$ ll ../.lsbatch/
total 172
-rw------- 1 hmeij its 8589 Mar 29 13:48 1459272204.681.err
-rw------- 1 hmeij its 5686 Mar 29 13:48 1459272204.681.out
-rwx------ 1 hmeij its 4609 Mar 29 13:48 1459273700.684
-rw------- 1 hmeij its 9054 Mar 29 13:48 1459273700.684.err
-rw------- 1 hmeij its   53 Mar 29 13:48 1459273700.684.out
-rwxr--r-- 1 hmeij its 4270 Mar 29 13:48 1459273700.684.shell
-rwxr--r-- 1 hmeij its   33 Mar 29 13:48 hostfile.681
-rw-r--r-- 1 hmeij its   40 Mar 29 13:48 hostfile.684
-rw-r--r-- 1 hmeij its   40 Mar 29 13:48 hostfile.tmp.684

[hmeij@cottontail lammps]$ ssh petaltail ps -u hmeij
  PID TTY          TIME CMD
 5762 ?        00:00:00 sshd
 5763 pts/1    00:00:00 bash
 9127 ?        00:00:00 cr_mpirun
 9136 ?        00:00:34 lmp_mpi
 9137 ?        00:00:34 lmp_mpi
 9138 ?        00:00:34 lmp_mpi
 9139 ?        00:00:34 lmp_mpi
 9994 ?        00:00:00 res
10002 ?        00:00:00 1459273700.684
10005 ?        00:00:00 1459273700.684.
10039 ?        00:00:00 cr_restart  <------ started everything back up
10051 ?        00:00:00 cr_mpirun
10052 ?        00:00:00 mpirun
10053 ?        00:00:00 blcr_watcher
10054 ?        00:00:00 sleep
10055 ?        00:00:00 sleep
10056 ?        00:00:01 cr_restart
10057 ?        00:00:01 cr_restart
10058 ?        00:00:02 cr_restart
10059 ?        00:00:02 cr_restart
10151 ?        00:00:00 sshd
10152 ?        00:00:00 ps
18559 pts/2    00:00:00 bash

# and now you can watch the output picking up from the last checkpoint file
[hmeij@cottontail lammps]$ tail -20 ../.lsbatch/1459272204.681.out
210 5527023.5 -4440257.5 93377214 2.8818906e+08 24569.109
220 5747387.7 -4439813.3 85432510 2.8818904e+08 25621.734
230 5939773.3 -4439214.4 78496309 2.8818904e+08 26539.282
240 6103647.2 -4438507.6 72587871 2.8818905e+08 27319.145
250 6238961.8 -4437755.5 67708974 2.8818907e+08 27961.064
260 6346104.5 -4437033.7 63845731 2.881891e+08 28466.852
270 6425840.1 -4436426.7 60970646 2.8818913e+08 28840.129
280 6479251.9 -4436021.1 59044759 2.8818917e+08 29086.075
290 6507681 -4435898.2 58019799 2.8818922e+08 29211.089
300 6512669 -4436124.7 57840251 2.8818927e+08 29222.575
310 6495904.7 -4436745.3 58445285 2.8818932e+08 29128.647
320 6459174.9 -4437776.1 59770495 2.8818937e+08 28937.93
330 6404322.4 -4439201.5 61749434 2.8818942e+08 28659.348
340 6333209 -4440973.8 64314930 2.8818947e+08 28301.927
350 6247685.4 -4443016.1 67400192 2.8818951e+08 27874.684
360 6149565.9 -4445228.2 70939709 2.8818956e+08 27386.465
370 6040609.2 -4447492.8 74869965 2.8818961e+08 26845.871
380 5922503.2 -4449683.5 79129981 2.8818965e+08 26261.166
390 5796854.1 -4451671.6 83661722 2.8818969e+08 25640.235
400 5665179.3 -4453332 88410367 2.8818972e+08 24990.519

# let job finish
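You can also force a checkpoint by hand in between the wrapper's scheduled intervals by pointing cr_checkpoint at the cr_mpirun process, which is exactly what the wrapper's loop does. A sketch using the PIDs and paths from the transcript above:

# on the execution host; 9127 is the cr_mpirun PID from the ps listing above
cr_checkpoint -v --tree --cont --save-none -f /sanscratch/checkpoints/681/chk.9127 9127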
This version of the wrapper is a bit more verbose and adds error handling. Either the blcr_watcher or the cr_checkpoint loop can now terminate the job.
#!/bin/bash -x
rm -f err out

# work dir and cwd
export MYSANSCRATCH=/sanscratch/$LSB_JOBID
cd $MYSANSCRATCH

# at job finish, all content in /sanscratch/JOBPID
# will be copied to /sanscratch/checkpoints/JOBPID
# content older than 3 months will be removed

# SCHEDULER set queue name in next TWO lines
queue=hp12
#BSUB -q hp12
#BSUB -n 6
#BSUB -J test
#BSUB -o out
#BSUB -e err
# next required for mpirun checkpoint to work
# restarts must use same node (not sure why)
#BSUB -R "span[hosts=1]"
#BSUB -m n5

# CHECK POINT TIME INTERVAL: 10m (debug) 6h 12h 18h 1d
cpti=15m

# COPY APPLICATION TO WORK DIR $MYSANSCRATCH (cwd)
# always stage the application (and data if needed)
# if mpirun save_exec="n" (default)
save_exec="n"
pre_cmd=" scp -r $HOME/python/kflaherty/data/HD163296.CO32.regridded.cen15.vis $HOME/python/kflaherty/data/HD163296.CO32.regridded.cen15.vis.fits $HOME/python/kflaherty/data/lowres_ALMA_weights_calc.sav $HOME/python/kflaherty/co.dat $HOME/python/kflaherty/disk_other.py $HOME/python/kflaherty/disk.py $HOME/python/kflaherty/mol_dat.py $HOME/python/kflaherty/mpi_run_models.py $HOME/python/kflaherty/sample_co32.sh $HOME/python/kflaherty/single_model.py . "
post_cmd=" scp $MYSANSCRATCH/chain*.dat $HOME/tmp/"

# IF START OF JOB, UNCOMMENT
# its either start or restart block
#mode=start
#cmd=" python mpi_run_models.py /sanscratch/$LSB_JOBID > /sanscratch/$LSB_JOBID/test2.out "

# IF RESTART OF JOB, UNCOMMENT, MUST BE RUN ON SAME NODE
# you must have pwd.JOBPID and chk.JOBPID in $orgjobpid/
mode=restart
orgjobpid=636341

# user environment
export PYTHONHOME=/share/apps/CENTOS6/blcr_soft/python/2.7.10
export PYTHONPATH=/home/apps/CENTOS6/blcr_soft/python/2.7.10/lib/python2.7/site-packages
export PATH=$PYTHONHOME/bin:$PATH
. /home/apps/miriad/MIRRC.sh
export PATH=$MIRBIN:$PATH
which python

############### NOTHING TO EDIT BELOW THIS LINE ##################

# checkpoints
checkpoints=/sanscratch/checkpoints

# kernel modules
mods=`/sbin/lsmod | grep ^blcr | wc -l`
if [ $mods -ne 2 ]; then
    echo "Error: BLCR modules not loaded on `hostname`"
    kill $$
fi

# blcr setup
restore_options=""
#restore_options="--no-restore-pid --no-restore-pgid --no-restore-sid"
if [ $save_exec == "n" ]; then
    #save_options="--save-private --save-shared"
    save_options="--save-none"
else
    save_options="--save-all"
fi

# environment
export PATH=/share/apps/CENTOS6/openmpi/1.6.5.cr/bin:$PATH
export LD_LIBRARY_PATH=/share/apps/CENTOS6/openmpi/1.6.5.cr/lib:$LD_LIBRARY_PATH
export PATH=/share/apps/blcr/0.8.5/${queue}/bin:$PATH
export LD_LIBRARY_PATH=/share/apps/blcr/0.8.5/${queue}/lib:$LD_LIBRARY_PATH
which mpirun cr_mpirun ompi-checkpoint ompi-restart cr_checkpoint cr_restart

# setup checkpoints dir
if [ ! -d $checkpoints/$LSB_JOBID ]; then
    mkdir -p $checkpoints/$LSB_JOBID
else
    echo "Error: $checkpoints/$LSB_JOBID already exists, exiting"
    kill $$
fi

# save process id and path and start application
if [ "$mode" == "start" ]; then

    # hostfile
    echo "${LSB_HOSTS}" > $HOME/.lsbatch/hostfile.tmp.$LSB_JOBID
    tr '\/ ' '\r\n' < $HOME/.lsbatch/hostfile.tmp.$LSB_JOBID > $HOME/.lsbatch/hostfile.$LSB_JOBID
    c=`wc -l $HOME/.lsbatch/hostfile.$LSB_JOBID | awk '{print $1}'`
    for i in `seq 1 $c`; do echo '127.0.0.1' >> $HOME/.lsbatch/localhost.$LSB_JOBID; done

    $pre_cmd
    # why rm -f /tmp/tmp??????

    cr_mpirun -v -am ft-enable-cr --gmca snapc_base_global_snapshot_dir $checkpoints/$LSB_JOBID \
    -x LD_LIBRARY_PATH --hostfile $HOME/.lsbatch/localhost.$LSB_JOBID $cmd 2>>$checkpoints/$LSB_JOBID/cr_mpirun.err &
    pid=$!
    pwd > $checkpoints/$LSB_JOBID/pwd.$pid
    orgjobpid=0

# otherwise restart the job
elif [ "$mode" == "restart" ]; then

    orgpid=`ls $checkpoints/$orgjobpid/pwd.* | awk -F\. '{print $2}'`
    orgpwd=`cat $checkpoints/$orgjobpid/pwd.$orgpid`
    if [ "X$orgpwd" == "X" ]; then
        echo "Error: orgpwd problem, check error log"
        exit
    fi

    # cleanup old if present
    rm -rf /sanscratch/$orgjobpid /localscratch/$orgjobpid
    rm -f $HOME/.lsbatch/*.$orgjobpid
    # why rm -f /tmp/tmp??????

    # stage old
    scp $checkpoints/$orgjobpid/*.$orgjobpid.err $checkpoints/$orgjobpid/*.$orgjobpid.out $HOME/.lsbatch/
    scp -r $checkpoints/$orgjobpid/* $MYSANSCRATCH
    ln -s $MYSANSCRATCH /sanscratch/$orgjobpid
    scp $checkpoints/$orgjobpid/hostfile.$orgjobpid $HOME/.lsbatch/
    scp -r $checkpoints/$orgjobpid/$orgjobpid/* /localscratch/$LSB_JOBID
    # why scp $checkpoints/$orgjobpid/$orgjobpid/tmp?????? /tmp/
    ln -s /localscratch/$LSB_JOBID /localscratch/$orgjobpid
    c=`wc -l $HOME/.lsbatch/hostfile.$orgjobpid | awk '{print $1}'`
    for i in `seq 1 $c`; do echo '127.0.0.1' >> $HOME/.lsbatch/localhost.$orgjobpid; done

    cr_restart --kmsg-warning $restore_options --relocate $orgpwd=$MYSANSCRATCH --cont \
    $MYSANSCRATCH/chk.$orgpid 2>>$checkpoints/$LSB_JOBID/cr_restart.err &
    pid=$!
    started=`ps -u $USER | awk '{print $1}' | grep $pid | wc -l`
    if [ $started -ne 1 ]; then
        echo "Error: cr_restart failed, check error log"
        kill $$
    fi
    pwd > $checkpoints/$LSB_JOBID/pwd.$pid

# obviously
else
    echo "Error: startup mode not defined correctly"
    kill $$
fi

# if $cmd disappears during $cpti, terminate wrapper
export POST_CMD="$post_cmd"
blcr_watcher $pid $$ $LSB_JOBID $orgjobpid &
bw_pid=$!

# always run this block
while [ true ]; do

    # checkpoint time interval
    sleep $cpti

    # finished?
    no_pid=`ps -u $USER | grep $pid | awk '{print $1}'`
    if [ "${no_pid}x" == 'x' ]; then
        # save output
        scp -rp $MYSANSCRATCH/* $checkpoints/$LSB_JOBID/
        $POST_CMD
        kill $bw_pid
        rm -f $HOME/.lsbatch/*${orgjobpid}*
        exit
    fi

    # checkpoint file outside of sanscratch
    scp -r $MYSANSCRATCH/* $checkpoints/$LSB_JOBID/
    scp -r /localscratch/$LSB_JOBID $checkpoints/$LSB_JOBID/
    chmod u+w $checkpoints/$LSB_JOBID/chk.* /sanscratch/$LSB_JOBID/chk.*
    # why scp /tmp/tmp?????? $checkpoints/$LSB_JOBID/$LSB_JOBID/
    cr_checkpoint -v --tree --cont $save_options -f $checkpoints/$LSB_JOBID/chk.$pid $pid \
    2>>$checkpoints/$LSB_JOBID/cr_checkpoint.err
    scp $HOME/.lsbatch/*.$LSB_JOBID.err $HOME/.lsbatch/*.$LSB_JOBID.out $checkpoints/$LSB_JOBID/
    scp $HOME/.lsbatch/hostfile.$LSB_JOBID $checkpoints/$LSB_JOBID/
    scp -r /localscratch/$LSB_JOBID $checkpoints/$LSB_JOBID/
    # why scp /tmp/tmp?????? $checkpoints/$LSB_JOBID/$LSB_JOBID/
    date >> $checkpoints/$LSB_JOBID/cr_checkpoint.err

done
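The ''blcr_watcher'' helper called above is a small site script that is not listed on this page; it terminates the wrapper once the application process is gone (as seen in the ps listing of the sample run). A hypothetical minimal sketch of such a watcher, based only on how the wrapper invokes it, not the site's actual script:

#!/bin/bash
# hypothetical blcr_watcher sketch -- not the site's actual script
# invoked by the wrapper as: blcr_watcher $pid $$ $LSB_JOBID $orgjobpid &
app_pid=$1        # PID of cr_mpirun (or cr_restart) started by the wrapper
wrapper_pid=$2    # PID of the wrapper itself ($$)
jobid=$3          # current LSB_JOBID
orgjobpid=$4      # original job id on a restart, 0 on a fresh start
while true; do
    sleep 60
    # if the application has vanished, run the post command and stop the wrapper
    if ! ps -p $app_pid > /dev/null 2>&1; then
        [ -n "$POST_CMD" ] && $POST_CMD
        kill $wrapper_pid 2>/dev/null
        exit 0
    fi
done

Below is the original, simpler ''blcr_wrapper_parallel'' that was used in the sample run above.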
#!/bin/bash
rm -f err out

# work dir and cwd
export MYSANSCRATCH=/sanscratch/$LSB_JOBID
cd $MYSANSCRATCH

# at job finish, all content in /sanscratch/JOBPID
# will be copied to /sanscratch/checkpoints/JOBPID
# content older than 3 months will be removed

# SCHEDULER
#BSUB -q test
#BSUB -J test
#BSUB -n 4
#BSUB -o out
#BSUB -e err
# next required for mpirun checkpoint to work
# restarts must use same node in test queue (not sure why, others can restart anywhere)
#BSUB -R "span[hosts=1]"

# CHECK POINT TIME INTERVAL: 10m (debug) 6h 12h 18h 1d
cpti=1d

# COPY APPLICATION TO WORK DIR $MYSANSCRATCH (cwd)
# always stage the application (and data if needed)
# if mpirun save_exec="n" (default)
save_exec="n"
pre_cmd=" scp -r $HOME/lammps/au.inp $HOME/lammps/auu3 $HOME/lammps/henz.dump $HOME/lammps/data.Big11AuSAMInitial . "
post_cmd=" scp auout $HOME/lammps/auout.$LSB_JOBID "

# IF START OF JOB, UNCOMMENT
# its either start or restart block
mode=start
queue=test
cmd=" lmp_mpi -c off -var GPUIDX 0 -in au.inp -l auout "

# IF RESTART OF JOB, UNCOMMENT
# you must have pwd.JOBPID and chk.JOBPID in $orgjobpid/
#mode=restart
#queue=test
#orgjobpid=691

# user environment
export PATH=/share/apps/CENTOS6/blcr_soft/lammps/16Feb16/:$PATH
export LD_LIBRARY_PATH=/share/apps/CENTOS6/blcr_soft/lammps/16Feb16/lib:$LD_LIBRARY_PATH
#which lmp_mpi

############### NOTHING TO EDIT BELOW THIS LINE ##################

# checkpoints
checkpoints=/sanscratch/checkpoints

# kernel modules
mods=`/sbin/lsmod | grep ^blcr | wc -l`
if [ $mods -ne 2 ]; then
    echo "Error: BLCR modules not loaded on `hostname`"
    kill $$
fi

# blcr setup
restore_options=""
#restore_options="--no-restore-pid --no-restore-pgid --no-restore-sid"
if [ $save_exec == "n" ]; then
    #save_options="--save-private --save-shared"
    save_options="--save-none"
else
    save_options="--save-all"
fi

# environment
export PATH=/share/apps/CENTOS6/openmpi/1.6.5.cr/bin:$PATH
export LD_LIBRARY_PATH=/share/apps/CENTOS6/openmpi/1.6.5.cr/lib:$LD_LIBRARY_PATH
export PATH=/share/apps/blcr/0.8.5/${queue}/bin:$PATH
export LD_LIBRARY_PATH=/share/apps/blcr/0.8.5/${queue}/lib:$LD_LIBRARY_PATH
#which mpirun cr_mpirun ompi-checkpoint ompi-restart cr_checkpoint cr_restart

# setup checkpoints dir
if [ ! -d $checkpoints/$LSB_JOBID ]; then
    mkdir -p $checkpoints/$LSB_JOBID
else
    echo "Error: $checkpoints/$LSB_JOBID already exists, exiting"
    kill $$
fi

# save process id and path and start application
if [ "$mode" == "start" ]; then

    # hostfile
    echo "${LSB_HOSTS}" > $HOME/.lsbatch/hostfile.tmp.$LSB_JOBID
    tr '\/ ' '\r\n' < $HOME/.lsbatch/hostfile.tmp.$LSB_JOBID > $HOME/.lsbatch/hostfile.$LSB_JOBID

    $pre_cmd
    cr_mpirun -am ft-enable-cr --gmca snapc_base_global_snapshot_dir $checkpoints/$LSB_JOBID \
    --hostfile $HOME/.lsbatch/hostfile.$LSB_JOBID $cmd &
    pid=$!
    pwd > $checkpoints/$LSB_JOBID/pwd.$pid
    orgjobpid=0

# otherwise restart the job
elif [ "$mode" == "restart" ]; then

    orgpid=`ls $checkpoints/$orgjobpid/pwd.* | awk -F\. '{print $2}'`
    orgpwd=`cat $checkpoints/$orgjobpid/pwd.$orgpid`

    # cleanup old
    rm -rf /sanscratch/$orgjobpid $HOME/.lsbatch/*.$orgjobpid

    # stage old
    scp $checkpoints/$orgjobpid/*.$orgjobpid.err $checkpoints/$orgjobpid/*.$orgjobpid.out $HOME/.lsbatch/
    scp -r $checkpoints/$orgjobpid/* $MYSANSCRATCH
    ln -s $MYSANSCRATCH /sanscratch/$orgjobpid
    scp $checkpoints/$orgjobpid/hostfile.$orgjobpid $HOME/.lsbatch/

    cr_restart --kmsg-warning $restore_options --relocate $orgpwd=$MYSANSCRATCH $MYSANSCRATCH/chk.$orgpid &
    pid=$!
    started=`ps -u hmeij | awk '{print $1}' | grep $pid | wc -l`
    if [ $started -ne 1 ]; then
        echo "Error: cr_restart failed, check error log"
        kill $$
    fi
    pwd > $checkpoints/$LSB_JOBID/pwd.$pid

# obviously
else
    echo "Error: startup mode not defined correctly"
    kill $$
fi

# if $cmd disappears during $cpti, terminate wrapper
export POST_CMD="$post_cmd"
blcr_watcher $pid $$ $LSB_JOBID $orgjobpid &

# always run this block
while [ true ]; do

    # checkpoint time interval
    sleep $cpti

    # checkpoint file outside of sanscratch
    scp -r $MYSANSCRATCH/* $checkpoints/$LSB_JOBID/
    cr_checkpoint --tree $save_options -f $checkpoints/$LSB_JOBID/chk.$pid $pid
    scp $HOME/.lsbatch/*.$LSB_JOBID.err $HOME/.lsbatch/*.$LSB_JOBID.out $checkpoints/$LSB_JOBID/
    scp $HOME/.lsbatch/hostfile.$LSB_JOBID $checkpoints/$LSB_JOBID/

done
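To restart a job with this wrapper, only the start/restart block at the top changes. For example, to restart the crashed job 681 from the transcript above, the edited top of the file would look like this, after which you resubmit with bsub as usual:

# IF START OF JOB, UNCOMMENT
# its either start or restart block
#mode=start
#queue=test
#cmd=" lmp_mpi -c off -var GPUIDX 0 -in au.inp -l auout "

# IF RESTART OF JOB, UNCOMMENT
# you must have pwd.JOBPID and chk.JOBPID in $orgjobpid/
mode=restart
queue=test
orgjobpid=681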