This shows you the differences between two versions of the page.
cluster:147 [2016/03/18 18:13] hmeij07 |
cluster:147 [2020/02/27 18:06] (current) hmeij07 |
||
---|---|---|---|
Line 3: | Line 3: | ||
==== BLCR Checkpoint in OL3 ==== | ==== BLCR Checkpoint in OL3 ==== | ||
+ | |||
+ | **Deprecated since we did OS upgrades [[cluster: | ||
+ | We will install DMTCP as a replacement...[[cluster: | ||
+ | --- // | ||
+ | |||
+ | * This page concerns SERIAL jobs only; SERIAL jobs can restart on any node | ||
* Installation and what it does [[cluster: | * Installation and what it does [[cluster: | ||
Line 8: | Line 14: | ||
* Users Guide [[https:// | * Users Guide [[https:// | ||
- | When we move to Openlava 3.x all queues will support checkpointing, | + | All queues will support checkpointing, |
- | Checkpointing is an expensive operation, so do not checkpoint more often than every 6 hours. For example, if your job runs for a month, checkpoint once a day; if your job runs for a week, checkpoint every 12 hours. From this point on I expect all users to checkpoint. Some software does this internally (Amber, Gaussian). For applications or home-grown code you can use BLCR. (Too bad it does not work out of the box within Openlava). | + | Checkpointing is an expensive operation, so do not checkpoint more often than every 6 hours. For example, if your job runs for a month, checkpoint once a day; if your job runs for a week, checkpoint every 12 hours. From this point on I expect all users to checkpoint. Some software does this internally (Amber, Gaussian). For applications or home-grown code you can use BLCR. |
- | You need to test out checkpointing before you rely on it. I've noticed that when some local code opens files for output, BLCR does not notice it. The code below has such an example (file fid.txt). Hopefully future versions of BLCR will fix this. Or maybe we shuold | + | You need to test out checkpointing before you rely on it. I've noticed that when some local code opens files for output, BLCR does not notice it. The code below has such an example (file fid.txt). Hopefully future versions of BLCR will fix this. Or maybe we should |
- | BLCR, Berkeley Lab Checkpoint and Restart, remembers file paths and process ids. The code stages the necessary STDOUT and STDERR files Openlava | + | BLCR, Berkeley Lab Checkpoint and Restart, remembers file paths and process ids. The code stages the necessary STDOUT and STDERR files scheduler |
- | At the bottom of this page is the v0.1 version of '' | + | At the bottom of this page is the current |
* Here is an interactive simple sample run. | * Here is an interactive simple sample run. | ||
- | | + | |
< | < | ||
Line 43: | Line 49: | ||
Connection to petaltail lost. | Connection to petaltail lost. | ||
- | # that was not too clever, log back in, restart application in another directory | + | # ooops, |
[hmeij@petaltail 187]$ cd .. | [hmeij@petaltail 187]$ cd .. | ||
[hmeij@petaltail sanscratch]$ mkdir 188 | [hmeij@petaltail sanscratch]$ mkdir 188 | ||
Line 88: | Line 94: | ||
==== Putting it all Together ==== | ==== Putting it all Together ==== | ||
- | The '' | + | The '' |
You edit the top part of '' | You edit the top part of '' | ||
Line 96: | Line 102: | ||
< | < | ||
- | [hmeij@cottontail ~/ynam]$ bsub < blcr_wrapper | + | [hmeij@cottontail ~/ynam]$ bsub < blcr_wrapper_serial |
</ | </ | ||
- | ==== Files v0.1 ==== | + | ==== Files v0.2 ==== |
- | * '' | + | * '' |
< | < | ||
- | + | # | |
- | #!/bin/bash | + | # work dir and cwd |
+ | export MYSANSCRATCH=/ | ||
+ | cd $MYSANSCRATCH | ||
# at job finish, all content in / | # at job finish, all content in / | ||
# will be copied to / | # will be copied to / | ||
- | # clean this area up once in a while | + | # content older than 3 months will be removed |
- | # content older than 6 months will be removed | + | |
# SCHEDULER | # SCHEDULER | ||
Line 122: | Line 129: | ||
#BSUB -e err | #BSUB -e err | ||
- | # CHECK POINT TIME INTERVAL: 6h 12h 18h 1d | + | # CHECK POINT TIME INTERVAL: |
- | cpti=6h | + | cpti=10m |
# COPY APPLICATION TO WORK DIR $MYSANSCRATCH (cwd) | # COPY APPLICATION TO WORK DIR $MYSANSCRATCH (cwd) | ||
# always stage the application (and data if needed) | # always stage the application (and data if needed) | ||
# if application is large and static save_exec=" | # if application is large and static save_exec=" | ||
+ | save_exec=" | ||
pre_cmd=" | pre_cmd=" | ||
+ | post_cmd=" | ||
# IF START OF JOB, UNCOMMENT | # IF START OF JOB, UNCOMMENT | ||
- | # it's either start or restart | + | # it's either start or restart |
mode=start | mode=start | ||
queue=test | queue=test | ||
cmd=" | cmd=" | ||
- | save_exec=" | ||
# IF RESTART OF JOB, UNCOMMENT | # IF RESTART OF JOB, UNCOMMENT | ||
Line 141: | Line 149: | ||
# | # | ||
#queue=test | #queue=test | ||
- | #orgjobpid=188 | + | #orgjobpid=250 |
+ | |||
+ | # buglines: if your group/lab is mentioned set value to " | ||
+ | # " | ||
+ | do_bug1a_cmd=" | ||
+ | do_bug1b_cmd=" | ||
Line 152: | Line 165: | ||
checkpoints=/ | checkpoints=/ | ||
- | # work directory, avoid / | + | # bug commands |
- | MYSANSCRATCH=/sanscratch/ | + | bug1a_cmd=" |
- | MYLOCALSCRATCH=/localscratch/$LSB_JOBID | + | bug1b_cmd="scp $checkpoints/$orgjobpid/fid.txt |
- | export MYSANSCRATCH MYLOCALSCRATCH | + | |
- | cd $MYSANSCRATCH | + | |
# kernel modules | # kernel modules | ||
Line 162: | Line 173: | ||
if [ $mods -ne 2 ]; then | if [ $mods -ne 2 ]; then | ||
echo " | echo " | ||
- | | + | |
fi | fi | ||
Line 180: | Line 191: | ||
else | else | ||
echo " | echo " | ||
- | | + | |
fi | fi | ||
Line 190: | Line 201: | ||
pwd > $checkpoints/ | pwd > $checkpoints/ | ||
orgjobpid=0 | orgjobpid=0 | ||
+ | if [ " | ||
+ | $bug1a_cmd | ||
+ | fi | ||
# otherwise restart the job | # otherwise restart the job | ||
Line 195: | Line 209: | ||
orgpid=`ls $checkpoints/ | orgpid=`ls $checkpoints/ | ||
orgpwd=`cat $checkpoints/ | orgpwd=`cat $checkpoints/ | ||
+ | #if [ " | ||
+ | # echo " | ||
+ | # kill $$ | ||
+ | #fi | ||
scp $checkpoints/ | scp $checkpoints/ | ||
if [ $save_exec == " | if [ $save_exec == " | ||
$pre_cmd | $pre_cmd | ||
fi | fi | ||
- | | + | |
- | | + | $bug1b_cmd |
+ | fi | ||
cr_restart $restore_options --relocate $orgpwd=$MYSANSCRATCH $checkpoints/ | cr_restart $restore_options --relocate $orgpwd=$MYSANSCRATCH $checkpoints/ | ||
pid=$! | pid=$! | ||
Line 208: | Line 227: | ||
else | else | ||
echo " | echo " | ||
- | | + | |
fi | fi | ||
# if $cmd disappears during $cpti, terminate wrapper
+ | export POST_CMD=" | ||
blcr_watcher $pid $$ $LSB_JOBID $orgjobpid & | blcr_watcher $pid $$ $LSB_JOBID $orgjobpid & | ||
Line 221: | Line 241: | ||
cr_checkpoint $save_options -f $checkpoints/ | cr_checkpoint $save_options -f $checkpoints/ | ||
scp $HOME/ | scp $HOME/ | ||
- | | + | |
- | | + | $bug1a_cmd |
+ | fi | ||
done | done | ||
Line 260: | Line 281: | ||
</ | </ | ||
+ | |||
+ | |||
+ | ==== Matlab ==== | ||
+ | |||
+ | * https:// | ||
+ | |||
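
Returning to the checkpoint interval guidance near the top of this page (no more often than every 6 hours, once a day for a month-long job, every 12 hours for a week-long job), the rule of thumb can be sketched as a small bash helper. This is only an illustration: ''suggest_cpti'' is a hypothetical function that is not part of the wrapper, and the 720 hour and 168 hour thresholds are assumptions standing in for "a month" and "a week". It prints a value in the same form the wrapper's ''cpti'' setting uses.

```shell
#!/bin/bash
# Hypothetical helper, NOT part of blcr_wrapper: suggest a value for cpti
# from the expected run time of the job, given in whole hours.
# Thresholds (720h ~ a month, 168h ~ a week) are assumptions.
suggest_cpti() {
    local hours=$1
    if [ "$hours" -ge 720 ]; then
        echo "1d"      # month-long job: checkpoint once a day
    elif [ "$hours" -ge 168 ]; then
        echo "12h"     # week-long job: checkpoint every 12 hours
    else
        echo "6h"      # shorter jobs: stay at the 6 hour floor
    fi
}

# Example: pick an interval for a job expected to run about a week
cpti=$(suggest_cpti 168)
echo "cpti=$cpti"
```

The resulting value would be pasted into the ''cpti='' line in the top part of the wrapper; jobs much shorter than 6 hours may not benefit from checkpointing at all.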