This shows you the differences between two versions of the page.
cluster:147 [2016/03/18 18:28] hmeij07 |
cluster:147 [2016/03/30 17:45] hmeij07 |
||
---|---|---|---|
Line 3: | Line 3: | ||
==== BLCR Checkpoint in OL3 ==== | ==== BLCR Checkpoint in OL3 ==== | ||
+ | |||
+ | * This page concerns SERIAL jobs only; SERIAL jobs can restart on any node | ||
* Installation and what it does [[cluster: | * Installation and what it does [[cluster: | ||
Line 12: | Line 14: | ||
Checkpointing is an expensive operation, so do not checkpoint more often than every 6 hours. For example, if your job runs for a month, checkpoint once a day; if your job runs for a week, checkpoint every 12 hours. From this point on I expect all users to checkpoint. Some software does this internally (Amber, Gaussian). For other applications or home-grown code you can use BLCR. (Too bad it does not work out of the box within Openlava). | Checkpointing is an expensive operation, so do not checkpoint more often than every 6 hours. For example, if your job runs for a month, checkpoint once a day; if your job runs for a week, checkpoint every 12 hours. From this point on I expect all users to checkpoint. Some software does this internally (Amber, Gaussian). For other applications or home-grown code you can use BLCR. (Too bad it does not work out of the box within Openlava). | ||
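The interval guidance above can be sketched as a small shell helper. This is an illustration only: ''suggest_interval'' is a hypothetical function that is not part of the wrapper, and the cutoffs between the stated examples (month, week, 6 hour minimum) are assumptions.

```shell
#!/bin/bash
# Hypothetical helper, NOT part of the wrapper script on this page.
# Suggests a checkpoint interval from the expected run time in hours.
# Only "month -> once a day", "week -> every 12 hours" and the 6 hour
# minimum come from the text; the cutoffs in between are assumptions.
suggest_interval() {
    local hours=$1
    if [ "$hours" -ge 720 ]; then       # roughly a month or longer
        echo "1d"
    elif [ "$hours" -ge 168 ]; then     # a week or longer
        echo "12h"
    else
        echo "6h"                       # never checkpoint more often than this
    fi
}

suggest_interval 720    # month-long job
suggest_interval 168    # week-long job
suggest_interval 48     # two-day job
```

The returned values match the interval labels listed in the wrapper's "CHECK POINT TIME INTERVAL" comment.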
- | You need to test out checkpointing before you rely on it. I've noticed that when some local code opens files for output, BLCR does not notice it. The code below has such an example (file fid.txt). Hopefully future versions of BLCR will fix this. Or maybe we shuold | + | You need to test out checkpointing before you rely on it. I've noticed that when some local code opens files for output, BLCR does not notice it. The code below has such an example (file fid.txt). Hopefully future versions of BLCR will fix this. Or maybe we should |
BLCR, Berkeley Lab Checkpoint and Restart, remembers file paths and process IDs. The code stages the necessary STDOUT and STDERR files Openlava generates and invokes the relocation feature while ignoring old process IDs. If an application is large and static, it is advisable not to save the application inside the checkpoint file. | BLCR, Berkeley Lab Checkpoint and Restart, remembers file paths and process IDs. The code stages the necessary STDOUT and STDERR files Openlava generates and invokes the relocation feature while ignoring old process IDs. If an application is large and static, it is advisable not to save the application inside the checkpoint file. | ||
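A minimal dry-run sketch of what the relocation and process ID handling described above might look like on the command line. It only prints the commands, it does not run BLCR. The function names, directories, pid and checkpoint file name are hypothetical, and the flags (''--save-exe'', ''--no-restore-pid'', ''--relocate'') are quoted from memory of the BLCR man pages, so verify them with ''man cr_checkpoint'' and ''man cr_restart'' on your installation.

```shell
#!/bin/bash
# Dry-run sketch: builds and prints BLCR command lines, runs nothing.
# All names, paths and ids below are hypothetical placeholders.

# args: pid, checkpoint file, save_exec (y|n)
build_checkpoint_cmd() {
    local opts="-f $2"
    # assumption: --save-exe stores the executable inside the checkpoint
    # file; skip it when the application is large and static
    if [ "$3" = "y" ]; then
        opts="$opts --save-exe"
    fi
    echo "cr_checkpoint $opts $1"
}

# args: old directory, new directory, checkpoint file
build_restart_cmd() {
    # assumption: --relocate maps remembered file paths to the new
    # directory, --no-restore-pid lets the process take a new pid
    echo "cr_restart --no-restore-pid --relocate $1=$2 $3"
}

build_checkpoint_cmd 12345 chk.12345 n
build_restart_cmd /sanscratch/187 /sanscratch/188 chk.12345
```

Printing the commands first is a cheap way to check the path mapping before a real restart is attempted.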
- | At the bottom of this page is the current version of '' | + | At the bottom of this page is the current version of '' |
* Here is an interactive simple sample run. | * Here is an interactive simple sample run. | ||
Line 43: | Line 45: | ||
Connection to petaltail lost. | Connection to petaltail lost. | ||
- | # that was not too clever, log back in, restart application in another directory | + | # ooops, |
[hmeij@petaltail 187]$ cd .. | [hmeij@petaltail 187]$ cd .. | ||
[hmeij@petaltail sanscratch]$ mkdir 188 | [hmeij@petaltail sanscratch]$ mkdir 188 | ||
Line 96: | Line 98: | ||
< | < | ||
- | [hmeij@cottontail ~/ynam]$ bsub < blcr_wrapper | + | [hmeij@cottontail ~/ynam]$ bsub < blcr_wrapper_serial |
</ | </ | ||
Line 103: | Line 105: | ||
==== Files v0.2 ==== | ==== Files v0.2 ==== | ||
- | * '' | + | * '' |
< | < | ||
Line 118: | Line 120: | ||
# SCHEDULER | # SCHEDULER | ||
#BSUB -q test | #BSUB -q test | ||
+ | #BSUB -n 1 | ||
#BSUB -J test | #BSUB -J test | ||
- | #BSUB -n 6 | ||
#BSUB -o out | #BSUB -o out | ||
#BSUB -e err | #BSUB -e err | ||
- | #BSUB -R " | ||
- | |||
# CHECK POINT TIME INTERVAL: 10m (debug) 6h 12h 18h 1d | # CHECK POINT TIME INTERVAL: 10m (debug) 6h 12h 18h 1d | ||
Line 131: | Line 131: | ||
# always stage the application (and data if needed) | # always stage the application (and data if needed) | ||
# if application is large and static save_exec=" | # if application is large and static save_exec=" | ||
- | save_exec=" | + | save_exec=" |
- | #pre_cmd=" | + | pre_cmd=" |
- | #post_cmd=" | + | post_cmd=" |
- | pre_cmd=" | + | |
- | $HOME/ | + | |
- | $HOME/ | + | |
- | $HOME/ | + | |
- | $HOME/ | + | |
- | $HOME/ | + | |
- | $HOME/ | + | |
- | $HOME/ | + | |
- | $HOME/ | + | |
- | $HOME/ | + | |
- | $HOME/ | + | |
- | post_cmd=" | + | |
# IF START OF JOB, UNCOMMENT | # IF START OF JOB, UNCOMMENT | ||
# it's either the start or the restart block | # it's either the start or the restart block | ||
- | #mode=start | + | mode=start |
- | #queue=test | + | queue=test |
- | #cmd=" | + | cmd=" |
- | # | + | |
# IF RESTART OF JOB, UNCOMMENT | # IF RESTART OF JOB, UNCOMMENT | ||
# you must have pwd.JOBPID and chk.JOBPID in $orgjobpid/ | # you must have pwd.JOBPID and chk.JOBPID in $orgjobpid/ | ||
- | mode=restart | + | #mode=restart |
- | queue=test | + | #queue=test |
- | orgjobpid=254 | + | #orgjobpid=250 |
# buglines: if your group/lab is mentioned set value to " | # buglines: if your group/lab is mentioned set value to " | ||
# " | # " | ||
- | do_bug1a_cmd=" | + | do_bug1a_cmd=" |
- | do_bug1b_cmd=" | + | do_bug1b_cmd=" |
- | + | ||
- | # environment mhughes/ | + | |
- | export PYTHONHOME=/ | + | |
- | export PYTHONPATH=/ | + | |
- | export PATH=$PYTHONHOME/ | + | |
- | . / | + | |
- | export PATH=$MIRBIN: | + | |
- | export PATH=/ | + | |
- | export LD_LIBRARY_PATH=/ | + | |
- | + | ||
Line 193: | Line 169: | ||
if [ $mods -ne 2 ]; then | if [ $mods -ne 2 ]; then | ||
echo " | echo " | ||
- | | + | |
fi | fi | ||
Line 211: | Line 187: | ||
else | else | ||
echo " | echo " | ||
- | | + | |
fi | fi | ||
Line 231: | Line 207: | ||
#if [ " | #if [ " | ||
# echo " | # echo " | ||
- | # exit | + | # kill $$ |
#fi | #fi | ||
scp $checkpoints/ | scp $checkpoints/ | ||
Line 247: | Line 223: | ||
else | else | ||
echo " | echo " | ||
- | | + | |
fi | fi | ||
Line 265: | Line 241: | ||
fi | fi | ||
done | done | ||
- | |||
</ | </ |