\\
**[[cluster:0|Back]]**
==== BLCR Checkpoint in OL3 ====
**Deprecated since we did [[cluster:185|OS Update]] \\
We will replace it with [[cluster:190|DMTCP]] ** \\
--- //[[hmeij@wesleyan.edu|Henk]] 2020/01/14 14:31//
* This page concerns PARALLEL mpirun jobs only; there are some restrictions
* all MPI threads need to be confined to one node
* restarted jobs must use the same node (not sure why)
* For SERIAL jobs go here [[cluster:147|BLCR Checkpoint in OL3]]
* Installation and what it does [[cluster:124|BLCR]]
* Users Guide [[https://upc-bugs.lbl.gov/blcr/doc/html/BLCR_Users_Guide.html]]
Checkpointing a parallel job is a bit more complex than checkpointing a serial job. The MPI workers (the ''-n'' processes) are fired off by worker 0 of ''mpirun'', and all workers may open files and perform socket-to-socket communications. A restart therefore needs to restore all file IDs, process IDs, etc., and a job may fail if a required process ID is already in use. Restarted jobs also behave as if the old JOBPID were still running and will write results to the old STDERR and STDOUT files.
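That last point can be tested up front: if the original process ID is already taken on the node, ''cr_restart'' is likely to fail. A minimal sketch of such a check (the PID 9127 is only illustrative, taken from the sample run further down):
# minimal sketch: verify the original process ID is free before attempting a restart
orgpid=9127
if ps -p $orgpid > /dev/null 2>&1; then
  echo "PID $orgpid already in use on `hostname`; restarting this checkpoint would likely fail"
fi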
The ''blcr_wrapper_parallel'' below will manage all this for you. As with the serial wrapper, only edit the top of the file and provide the necessary information. But first, your software needs to be compiled against a specific "older" version of OpenMPI; MPI checkpointing support has been removed from later OpenMPI versions.
Here is the admin stuff.
# from eric at lbl, configure openmpi, I chose 1.6.5 (version needs to be < 1.7)
./configure \
--enable-ft-thread \
--with-ft=cr \
--enable-opal-multi-threads \
--with-blcr=/share/apps/blcr/0.8.5/test \
--without-tm \
--prefix=/share/apps/CENTOS6/openmpi/1.6.5.cr
make
make install
# next download cr_mpirun from LBL
https://upc-bugs.lbl.gov/blcr-dist/cr_mpirun/cr_mpirun-295.tar.gz
# configure and test cr_mpirun
export PATH=/share/apps/CENTOS6/openmpi/1.6.5.cr/bin:$PATH
export LD_LIBRARY_PATH=/share/apps/CENTOS6/openmpi/1.6.5.cr/lib:$LD_LIBRARY_PATH
./configure --with-blcr=/share/apps/blcr/0.8.5/test
make
make check
============================================================================
Testsuite summary for cr_mpirun 295
============================================================================
# TOTAL: 3
# PASS: 3
# SKIP: 0
# XFAIL: 0
# FAIL: 0
# XPASS: 0
# ERROR: 0
============================================================================
make[1]: Leaving directory `/home/apps/src/petaltail6/cr_mpirun-295'
# I copied cr_mpirun into /share/apps/CENTOS6/openmpi/1.6.5.cr/bin/
# cr_mpirun needs access to all these in $PATH
# mpirun cr_mpirun ompi-checkpoint ompi-restart cr_checkpoint cr_restart
# next compile your parallel software using mpicc/mpicxx from the 1.6.5 distro
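For a user that looks roughly like the following (a minimal sketch; ''hello_mpi.c'' is a hypothetical source file, the paths are the ones set up above):
# put the checkpoint-enabled OpenMPI first in the environment
export PATH=/share/apps/CENTOS6/openmpi/1.6.5.cr/bin:$PATH
export LD_LIBRARY_PATH=/share/apps/CENTOS6/openmpi/1.6.5.cr/lib:$LD_LIBRARY_PATH
which mpicc   # should resolve to /share/apps/CENTOS6/openmpi/1.6.5.cr/bin/mpicc
# hello_mpi.c is a hypothetical example; compile your own sources the same way
mpicc -o hello_mpi hello_mpi.c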
Here is what a sample run using the Openlava scheduler looks like.
# submit as usual after editing the top of the file, see comments in that wrapper file
[hmeij@cottontail lammps]$ bsub < blcr_wrapper_parallel
Job <681> is submitted to queue .
# cr_mpirun job
[hmeij@cottontail lammps]$ bjobs
JOBID USER STAT QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
681 hmeij RUN test cottontail petaltail test Mar 29 13:23
petaltail
petaltail
petaltail
# wrapper stores BLCR checkpoint file (chk.PID) in this location
# and it calls the openmpi snapshot tools and stores that in
# ompi_global_snapshot_SOME-PID.ckpt, also in same location
[hmeij@cottontail lammps]$ ll /sanscratch/checkpoints/681
total 30572
-rw------- 1 hmeij its 8704 Mar 29 13:28 1459272204.681.err
-rw------- 1 hmeij its 5686 Mar 29 13:28 1459272204.681.out
-rw-r--r-- 1 hmeij its 2652 Mar 29 13:28 au.inp
-rw-r--r-- 1 hmeij its 0 Mar 29 13:28 auout
-rw-r--r-- 1 hmeij its 38310 Mar 29 13:28 auu3
-r-------- 1 hmeij its 289714 Mar 29 13:28 chk.9127
-rw-r--r-- 1 hmeij its 21342187 Mar 29 13:28 data.Big11AuSAMInitial
-rw-r--r-- 1 hmeij its 9598629 Mar 29 13:28 henz.dump
drwx------ 3 hmeij its 46 Mar 29 13:28 ompi_global_snapshot_9134.ckpt
-rw-r--r-- 1 hmeij its 16 Mar 29 13:23 pwd.9127
# the processes running
[hmeij@cottontail lammps]$ ssh petaltail ps -u hmeij
PID TTY TIME CMD
5762 ? 00:00:00 sshd
5763 pts/1 00:00:00 bash
9104 ? 00:00:00 res
9110 ? 00:00:00 1459272204.681
9113 ? 00:00:00 1459272204.681.
9127 ? 00:00:00 cr_mpirun
9128 ? 00:00:00 blcr_watcher <--- the watcher, will terminate 9113 if 9127 is gone (ie done or crashed)
9133 ? 00:00:00 cr_mpirun
9134 ? 00:00:00 mpirun
9135 ? 00:00:00 sleep
9136 ? 00:05:55 lmp_mpi
9137 ? 00:06:07 lmp_mpi
9138 ? 00:05:52 lmp_mpi
9139 ? 00:05:53 lmp_mpi
9347 ? 00:00:00 sleep
9369 ? 00:00:00 sshd
9370 ? 00:00:00 ps
18559 pts/2 00:00:00 bash
# how far did the job progress?
[hmeij@cottontail lammps]$ tail ~/.lsbatch/1459272204.681.out
190 5007151.9 -4440573.2 1.1211925e+08 2.8818913e+08 22081.756
200 5279740.3 -4440516 1.0229219e+08 2.8818909e+08 23386.523
210 5527023.5 -4440257.5 93377214 2.8818906e+08 24569.109
220 5747387.7 -4439813.3 85432510 2.8818904e+08 25621.734
230 5939773.3 -4439214.4 78496309 2.8818904e+08 26539.282
240 6103647.2 -4438507.6 72587871 2.8818905e+08 27319.145
250 6238961.8 -4437755.5 67708974 2.8818907e+08 27961.064
260 6346104.5 -4437033.7 63845731 2.881891e+08 28466.852
270 6425840.1 -4436426.7 60970646 2.8818913e+08 28840.129
280 6479251.9 -4436021.1 59044759 2.8818917e+08 29086.075
# simulate crash
[hmeij@cottontail lammps]$ ssh petaltail kill 9133
# edit the file and prep for a restart, submit again
[hmeij@cottontail lammps]$ bsub < blcr_wrapper
Job <684> is submitted to queue .
# so job 684 is restarting job 681, the wrapper preps the files
[hmeij@cottontail lammps]$ ll ../.lsbatch/
total 172
-rw------- 1 hmeij its 8589 Mar 29 13:48 1459272204.681.err
-rw------- 1 hmeij its 5686 Mar 29 13:48 1459272204.681.out
-rwx------ 1 hmeij its 4609 Mar 29 13:48 1459273700.684
-rw------- 1 hmeij its 9054 Mar 29 13:48 1459273700.684.err
-rw------- 1 hmeij its 53 Mar 29 13:48 1459273700.684.out
-rwxr--r-- 1 hmeij its 4270 Mar 29 13:48 1459273700.684.shell
-rwxr--r-- 1 hmeij its 33 Mar 29 13:48 hostfile.681
-rw-r--r-- 1 hmeij its 40 Mar 29 13:48 hostfile.684
-rw-r--r-- 1 hmeij its 40 Mar 29 13:48 hostfile.tmp.684
[hmeij@cottontail lammps]$ ssh petaltail ps -u hmeij
PID TTY TIME CMD
5762 ? 00:00:00 sshd
5763 pts/1 00:00:00 bash
9127 ? 00:00:00 cr_mpirun
9136 ? 00:00:34 lmp_mpi
9137 ? 00:00:34 lmp_mpi
9138 ? 00:00:34 lmp_mpi
9139 ? 00:00:34 lmp_mpi
9994 ? 00:00:00 res
10002 ? 00:00:00 1459273700.684
10005 ? 00:00:00 1459273700.684.
10039 ? 00:00:00 cr_restart <------ started everything back up
10051 ? 00:00:00 cr_mpirun
10052 ? 00:00:00 mpirun
10053 ? 00:00:00 blcr_watcher
10054 ? 00:00:00 sleep
10055 ? 00:00:00 sleep
10056 ? 00:00:01 cr_restart
10057 ? 00:00:01 cr_restart
10058 ? 00:00:02 cr_restart
10059 ? 00:00:02 cr_restart
10151 ? 00:00:00 sshd
10152 ? 00:00:00 ps
18559 pts/2 00:00:00 bash
# and now you can watch the output picking up from the last checkpoint file
[hmeij@cottontail lammps]$ tail -20 ../.lsbatch/1459272204.681.out
210 5527023.5 -4440257.5 93377214 2.8818906e+08 24569.109
220 5747387.7 -4439813.3 85432510 2.8818904e+08 25621.734
230 5939773.3 -4439214.4 78496309 2.8818904e+08 26539.282
240 6103647.2 -4438507.6 72587871 2.8818905e+08 27319.145
250 6238961.8 -4437755.5 67708974 2.8818907e+08 27961.064
260 6346104.5 -4437033.7 63845731 2.881891e+08 28466.852
270 6425840.1 -4436426.7 60970646 2.8818913e+08 28840.129
280 6479251.9 -4436021.1 59044759 2.8818917e+08 29086.075
290 6507681 -4435898.2 58019799 2.8818922e+08 29211.089
300 6512669 -4436124.7 57840251 2.8818927e+08 29222.575
310 6495904.7 -4436745.3 58445285 2.8818932e+08 29128.647
320 6459174.9 -4437776.1 59770495 2.8818937e+08 28937.93
330 6404322.4 -4439201.5 61749434 2.8818942e+08 28659.348
340 6333209 -4440973.8 64314930 2.8818947e+08 28301.927
350 6247685.4 -4443016.1 67400192 2.8818951e+08 27874.684
360 6149565.9 -4445228.2 70939709 2.8818956e+08 27386.465
370 6040609.2 -4447492.8 74869965 2.8818961e+08 26845.871
380 5922503.2 -4449683.5 79129981 2.8818965e+08 26261.166
390 5796854.1 -4451671.6 83661722 2.8818969e+08 25640.235
400 5665179.3 -4453332 88410367 2.8818972e+08 24990.519
# let job finish
==== Parallel Wrapper v2 ====
A bit more verbose, with added error handling. Also, the wrapper itself or the cr_checkpoint loop code can now terminate the job.
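The relevant addition is the finish check inside the checkpoint loop: when the application's process disappears, the loop itself saves the results, runs the post command, stops the watcher and exits. A sketch of that check, lifted from the full wrapper below:
# finish detection inside the checkpoint loop (see the full wrapper below)
no_pid=`ps -u $USER | grep $pid | awk '{print $1}'`
if [ "${no_pid}x" == 'x' ]; then
  scp -rp $MYSANSCRATCH/* $checkpoints/$LSB_JOBID/   # save output
  $POST_CMD                                          # user supplied post command
  kill $bw_pid                                       # stop blcr_watcher
  exit
fi
The full wrapper: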
#!/bin/bash -x
rm -f err out
# work dir and cwd
export MYSANSCRATCH=/sanscratch/$LSB_JOBID
cd $MYSANSCRATCH
# at job finish, all content in /sanscratch/JOBPID
# will be copied to /sanscratch/checkpoints/JOBPID
# content older than 3 months will be removed
# SCHEDULER set queue name in next TWO lines
queue=hp12
#BSUB -q hp12
#BSUB -n 6
#BSUB -J test
#BSUB -o out
#BSUB -e err
# next required for mpirun checkpoint to work
# restarts must use same node (not sure why)
#BSUB -R "span[hosts=1]"
#BSUB -m n5
# CHECK POINT TIME INTERVAL: 10m (debug) 6h 12h 18h 1d
cpti=15m
# COPY APPLICATION TO WORK DIR $MYSANSCRATCH (cwd)
# always stage the application (and data if needed)
# if mpirun save_exec="n" (default)
save_exec="n"
pre_cmd=" scp -r
$HOME/python/kflaherty/data/HD163296.CO32.regridded.cen15.vis
$HOME/python/kflaherty/data/HD163296.CO32.regridded.cen15.vis.fits
$HOME/python/kflaherty/data/lowres_ALMA_weights_calc.sav
$HOME/python/kflaherty/co.dat
$HOME/python/kflaherty/disk_other.py
$HOME/python/kflaherty/disk.py
$HOME/python/kflaherty/mol_dat.py
$HOME/python/kflaherty/mpi_run_models.py
$HOME/python/kflaherty/sample_co32.sh
$HOME/python/kflaherty/single_model.py . "
post_cmd=" scp $MYSANSCRATCH/chain*.dat $HOME/tmp/"
# IF START OF JOB, UNCOMMENT
# its either start or restart block
#mode=start
#cmd=" python mpi_run_models.py /sanscratch/$LSB_JOBID > /sanscratch/$LSB_JOBID/test2.out "
# IF RESTART OF JOB, UNCOMMENT, MUST BE RUN ON SAME NODE
# you must have pwd.JOBPID and chk.JOBPID in $orgjobpid/
mode=restart
orgjobpid=636341
# user environment
export PYTHONHOME=/share/apps/CENTOS6/blcr_soft/python/2.7.10
export PYTHONPATH=/home/apps/CENTOS6/blcr_soft/python/2.7.10/lib/python2.7/site-packages
export PATH=$PYTHONHOME/bin:$PATH
. /home/apps/miriad/MIRRC.sh
export PATH=$MIRBIN:$PATH
which python
############### NOTHING TO EDIT BELOW THIS LINE ##################
# checkpoints
checkpoints=/sanscratch/checkpoints
# kernel modules
mods=`/sbin/lsmod | grep ^blcr | wc -l`
if [ $mods -ne 2 ]; then
echo "Error: BLCR modules not loaded on `hostname`"
kill $$
fi
# blcr setup
restore_options=""
#restore_options="--no-restore-pid --no-restore-pgid --no-restore-sid"
if [ $save_exec == "n" ]; then
#save_options="--save-private --save-shared"
save_options="--save-none"
else
save_options="--save-all"
fi
# environment
export PATH=/share/apps/CENTOS6/openmpi/1.6.5.cr/bin:$PATH
export LD_LIBRARY_PATH=/share/apps/CENTOS6/openmpi/1.6.5.cr/lib:$LD_LIBRARY_PATH
export PATH=/share/apps/blcr/0.8.5/${queue}/bin:$PATH
export LD_LIBRARY_PATH=/share/apps/blcr/0.8.5/${queue}/lib:$LD_LIBRARY_PATH
which mpirun cr_mpirun ompi-checkpoint ompi-restart cr_checkpoint cr_restart
# setup checkpoints dir
if [ ! -d $checkpoints/$LSB_JOBID ]; then
mkdir -p $checkpoints/$LSB_JOBID
else
echo "Error: $checkpoints/$LSB_JOBID already exists, exciting"
kill $$
fi
# save process id and path and start application
if [ "$mode" == "start" ]; then
# hostfile
echo "${LSB_HOSTS}" > $HOME/.lsbatch/hostfile.tmp.$LSB_JOBID
tr '\/ ' '\r\n' < $HOME/.lsbatch/hostfile.tmp.$LSB_JOBID > $HOME/.lsbatch/hostfile.$LSB_JOBID
c=`wc -l $HOME/.lsbatch/hostfile.$LSB_JOBID | awk '{print $1}'`
for i in `seq 1 $c`; do echo '127.0.0.1' >> $HOME/.lsbatch/localhost.$LSB_JOBID; done
$pre_cmd
# why
rm -f /tmp/tmp??????
cr_mpirun -v -am ft-enable-cr --gmca snapc_base_global_snapshot_dir $checkpoints/$LSB_JOBID \
-x LD_LIBRARY_PATH --hostfile $HOME/.lsbatch/localhost.$LSB_JOBID $cmd 2>>$checkpoints/$LSB_JOBID/cr_mpirun.err &
pid=$!
pwd > $checkpoints/$LSB_JOBID/pwd.$pid
orgjobpid=0
# otherwise restart the job
elif [ "$mode" == "restart" ]; then
orgpid=`ls $checkpoints/$orgjobpid/pwd.* | awk -F\. '{print $2}'`
orgpwd=`cat $checkpoints/$orgjobpid/pwd.$orgpid`
if [ "X$orgpwd" == "X" ]; then
echo "Error: orgpwd problem, check error log"
exit
fi
# cleanup old if present
rm -rf /sanscratch/$orgjobpid /localscratch/$orgjobpid
rm -f $HOME/.lsbatch/*.$orgjobpid
# why
rm -f /tmp/tmp??????
# stage old
scp $checkpoints/$orgjobpid/*.$orgjobpid.err $checkpoints/$orgjobpid/*.$orgjobpid.out $HOME/.lsbatch/
scp -r $checkpoints/$orgjobpid/* $MYSANSCRATCH
ln -s $MYSANSCRATCH /sanscratch/$orgjobpid
scp $checkpoints/$orgjobpid/hostfile.$orgjobpid $HOME/.lsbatch/
scp -r $checkpoints/$orgjobpid/$orgjobpid/* /localscratch/$LSB_JOBID
# why
scp $checkpoints/$orgjobpid/$orgjobpid/tmp?????? /tmp/
ln -s /localscratch/$LSB_JOBID /localscratch/$orgjobpid
c=`wc -l $HOME/.lsbatch/hostfile.$orgjobpid | awk '{print $1}'`
for i in `seq 1 $c`; do echo '127.0.0.1' >> $HOME/.lsbatch/localhost.$orgjobpid; done
cr_restart --kmsg-warning $restore_options --relocate $orgpwd=$MYSANSCRATCH --cont \
$MYSANSCRATCH/chk.$orgpid 2>>$checkpoints/$LSB_JOBID/cr_restart.err &
pid=$!
started=`ps -u $USER | awk '{print $1}' | grep $pid | wc -l`
if [ $started -ne 1 ]; then
echo "Error: cr_restart failed, check error log"
kill $$
fi
pwd > $checkpoints/$LSB_JOBID/pwd.$pid
# obviously
else
echo "Error: startup mode not defined correctly"
kill $$
fi
# if $cmd disappears during $cpti, terminate wrapper
export POST_CMD="$post_cmd"
blcr_watcher $pid $$ $LSB_JOBID $orgjobpid &
bw_pid=$!
# always run this block
while [ true ]; do
# checkpoint time interval
sleep $cpti
# finished?
no_pid=`ps -u $USER | grep $pid | awk '{print $1}'`
if [ "${no_pid}x" == 'x' ]; then
# save output
scp -rp $MYSANSCRATCH/* $checkpoints/$LSB_JOBID/
$POST_CMD
kill $bw_pid
rm -f $HOME/.lsbatch/*${orgjobpid}*
exit
fi
# checkpoint file outside of sanscratch
scp -r $MYSANSCRATCH/* $checkpoints/$LSB_JOBID/
scp -r /localscratch/$LSB_JOBID $checkpoints/$LSB_JOBID/
chmod u+w $checkpoints/$LSB_JOBID/chk.* /sanscratch/$LSB_JOBID/chk.*
# why
scp /tmp/tmp?????? $checkpoints/$LSB_JOBID/$LSB_JOBID/
cr_checkpoint -v --tree --cont $save_options -f $checkpoints/$LSB_JOBID/chk.$pid $pid \
2>>$checkpoints/$LSB_JOBID/cr_checkpoint.err
scp $HOME/.lsbatch/*.$LSB_JOBID.err $HOME/.lsbatch/*.$LSB_JOBID.out $checkpoints/$LSB_JOBID/
scp $HOME/.lsbatch/hostfile.$LSB_JOBID $checkpoints/$LSB_JOBID/
scp -r /localscratch/$LSB_JOBID $checkpoints/$LSB_JOBID/
# why
scp /tmp/tmp?????? $checkpoints/$LSB_JOBID/$LSB_JOBID/
date >> $checkpoints/$LSB_JOBID/cr_checkpoint.err
done
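The wrapper comments state that content under /sanscratch/checkpoints older than 3 months will be removed. A minimal sketch of such a cleanup, assuming it runs periodically from the administrator's cron (the 90 day cutoff and the cron context are assumptions, not part of the wrapper):
# hypothetical periodic cleanup matching the "older than 3 months" policy
find /sanscratch/checkpoints -maxdepth 1 -mindepth 1 -type d -mtime +90 -exec rm -rf {} \;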
==== Parallel Wrapper v1 ====
#!/bin/bash
rm -f err out
# work dir and cwd
export MYSANSCRATCH=/sanscratch/$LSB_JOBID
cd $MYSANSCRATCH
# at job finish, all content in /sanscratch/JOBPID
# will be copied to /sanscratch/checkpoints/JOBPID
# content older than 3 months will be removed
# SCHEDULER
#BSUB -q test
#BSUB -J test
#BSUB -n 4
#BSUB -o out
#BSUB -e err
# next required for mpirun checkpoint to work
# restarts must use same node in test queue (not sure why, others can restart anywhere)
#BSUB -R "span[hosts=1]"
# CHECK POINT TIME INTERVAL: 10m (debug) 6h 12h 18h 1d
cpti=1d
# COPY APPLICATION TO WORK DIR $MYSANSCRATCH (cwd)
# always stage the application (and data if needed)
# if mpirun save_exec="n" (default)
save_exec="n"
pre_cmd=" scp -r $HOME/lammps/au.inp $HOME/lammps/auu3
$HOME/lammps/henz.dump $HOME/lammps/data.Big11AuSAMInitial . "
post_cmd=" scp auout $HOME/lammps/auout.$LSB_JOBID "
# IF START OF JOB, UNCOMMENT
# its either start or restart block
mode=start
queue=test
cmd=" lmp_mpi -c off -var GPUIDX 0 -in au.inp -l auout "
# IF RESTART OF JOB, UNCOMMENT
# you must have pwd.JOBPID and chk.JOBPID in $orgjobpid/
#mode=restart
#queue=test
#orgjobpid=691
# user environment
export PATH=/share/apps/CENTOS6/blcr_soft/lammps/16Feb16/:$PATH
export LD_LIBRARY_PATH=/share/apps/CENTOS6/blcr_soft/lammps/16Feb16/lib:$LD_LIBRARY_PATH
#which lmp_mpi
############### NOTHING TO EDIT BELOW THIS LINE ##################
# checkpoints
checkpoints=/sanscratch/checkpoints
# kernel modules
mods=`/sbin/lsmod | grep ^blcr | wc -l`
if [ $mods -ne 2 ]; then
echo "Error: BLCR modules not loaded on `hostname`"
kill $$
fi
# blcr setup
restore_options=""
#restore_options="--no-restore-pid --no-restore-pgid --no-restore-sid"
if [ $save_exec == "n" ]; then
#save_options="--save-private --save-shared"
save_options="--save-none"
else
save_options="--save-all"
fi
# environment
export PATH=/share/apps/CENTOS6/openmpi/1.6.5.cr/bin:$PATH
export LD_LIBRARY_PATH=/share/apps/CENTOS6/openmpi/1.6.5.cr/lib:$LD_LIBRARY_PATH
export PATH=/share/apps/blcr/0.8.5/${queue}/bin:$PATH
export LD_LIBRARY_PATH=/share/apps/blcr/0.8.5/${queue}/lib:$LD_LIBRARY_PATH
#which mpirun cr_mpirun ompi-checkpoint ompi-restart cr_checkpoint cr_restart
# setup checkpoints dir
if [ ! -d $checkpoints/$LSB_JOBID ]; then
mkdir -p $checkpoints/$LSB_JOBID
else
echo "Error: $checkpoints/$LSB_JOBID already exists, exciting"
kill $$
fi
# save process id and path and start application
if [ "$mode" == "start" ]; then
# hostfile
echo "${LSB_HOSTS}" > $HOME/.lsbatch/hostfile.tmp.$LSB_JOBID
tr '\/ ' '\r\n' < $HOME/.lsbatch/hostfile.tmp.$LSB_JOBID > $HOME/.lsbatch/hostfile.$LSB_JOBID
$pre_cmd
cr_mpirun -am ft-enable-cr --gmca snapc_base_global_snapshot_dir $checkpoints/$LSB_JOBID \
--hostfile $HOME/.lsbatch/hostfile.$LSB_JOBID $cmd &
pid=$!
pwd > $checkpoints/$LSB_JOBID/pwd.$pid
orgjobpid=0
# otherwise restart the job
elif [ "$mode" == "restart" ]; then
orgpid=`ls $checkpoints/$orgjobpid/pwd.* | awk -F\. '{print $2}'`
orgpwd=`cat $checkpoints/$orgjobpid/pwd.$orgpid`
# cleanup old
rm -rf /sanscratch/$orgjobpid $HOME/.lsbatch/*.$orgjobpid
# stage old
scp $checkpoints/$orgjobpid/*.$orgjobpid.err $checkpoints/$orgjobpid/*.$orgjobpid.out $HOME/.lsbatch/
scp -r $checkpoints/$orgjobpid/* $MYSANSCRATCH
ln -s $MYSANSCRATCH /sanscratch/$orgjobpid
scp $checkpoints/$orgjobpid/hostfile.$orgjobpid $HOME/.lsbatch/
cr_restart --kmsg-warning $restore_options --relocate $orgpwd=$MYSANSCRATCH $MYSANSCRATCH/chk.$orgpid &
pid=$!
started=` ps -u hmeij | awk '{print $1}' | grep $pid | wc -l`
if [ $started -ne 1 ]; then
echo "Error: cr_restart failed, check error log"
kill $$
fi
pwd > $checkpoints/$LSB_JOBID/pwd.$pid
# obviously
else
echo "Error: startup mode not defined correctly"
kill $$
fi
# if $cmd disappears during $cpti, terminate wrapper
export POST_CMD="$post_cmd"
blcr_watcher $pid $$ $LSB_JOBID $orgjobpid &
# always run this block
while [ true ]; do
# checkpoint time interval
sleep $cpti
# checkpoint file outside of sanscratch
scp -r $MYSANSCRATCH/* $checkpoints/$LSB_JOBID/
cr_checkpoint --tree $save_options -f $checkpoints/$LSB_JOBID/chk.$pid $pid
scp $HOME/.lsbatch/*.$LSB_JOBID.err $HOME/.lsbatch/*.$LSB_JOBID.out $checkpoints/$LSB_JOBID/
scp $HOME/.lsbatch/hostfile.$LSB_JOBID $checkpoints/$LSB_JOBID/
done
\\
**[[cluster:0|Back]]**