This shows you the differences between two versions of the page.
cluster:147 [2016/03/17 18:57] hmeij07 [BLCR Checkpoint in OL3] |
cluster:147 [2020/02/27 18:06] (current) hmeij07 |
||
---|---|---|---|
Line 3: | Line 3: | ||
==== BLCR Checkpoint in OL3 ==== | ==== BLCR Checkpoint in OL3 ==== | ||
+ | |||
+ | **Deprecated since we did OS upgrades [[cluster: | ||
+ | We will install DMTCP as a replacement...[[cluster: | ||
+ | --- // | ||
+ | |||
+ | * This page concerns SERIAL jobs only; SERIAL jobs can restart on any node | ||
* Installation and what it does [[cluster: | * Installation and what it does [[cluster: | ||
Line 8: | Line 14: | ||
* Users Guide [[https:// | * Users Guide [[https:// | ||
- | When we move to Openlava 3.x all queues will support checkpointing, | + | All queues will support checkpointing, |
- | Checkpointing is an expensive operation, so do not checkpoint more often than every 6 hours. For example, if your job runs for a month, checkpoint once a day; if your job runs for a week, checkpoint every 12 hours. From this point on I expect all users to checkpoint. Some software does this internally (Amber, Gaussian). For applications or home-grown code you can use BLCR. (Too bad it does not work out of box within Openlava). | + | Checkpointing is an expensive operation, so do not checkpoint more often than every 6 hours. For example, if your job runs for a month, checkpoint once a day; if your job runs for a week, checkpoint every 12 hours. From this point on I expect all users to checkpoint. Some software does this internally (Amber, Gaussian). For applications or home-grown code you can use BLCR. |
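The interval guidance above can be sketched as a small shell helper. This is only an illustration: `suggest_cpti` is a hypothetical name, and the cut-offs (30 days and 7 days) are an assumption extrapolated from the two examples in the text; the returned values are ones the wrapper's `cpti` setting lists as options.

```shell
#!/bin/bash
# Hypothetical helper (not part of blcr_wrapper): suggest a checkpoint
# interval from the expected runtime in days.
# ASSUMPTION: the 30-day and 7-day thresholds are extrapolated from the
# examples "a month -> once a day" and "a week -> every 12 hours".
suggest_cpti() {
    local days=$1
    if [ "$days" -ge 30 ]; then
        echo 1d
    elif [ "$days" -ge 7 ]; then
        echo 12h
    else
        # never checkpoint more often than every 6 hours
        echo 6h
    fi
}

suggest_cpti 30   # prints 1d
suggest_cpti 7    # prints 12h
```

The result could then be pasted into the `cpti=` line of the wrapper before submitting.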
- | You need to test out checkpointing before you rely on it. I've noticed that when some local code opens files for output, BLCR does not notice it. The code below has such an example (file fid.txt). Hopefully future versions of BLCR will fix this. Or maybe we shuold | + | You need to test out checkpointing before you rely on it. I've noticed that when some local code opens files for output, BLCR does not notice it. The code below has such an example (file fid.txt). Hopefully future versions of BLCR will fix this. Or maybe we should |
- | BLCR, Berkeley Lab Checkpoint and Restart, remembers file paths and process IDs. The code stages the necessary STDOUT and STDERR files Openlava | + | BLCR, Berkeley Lab Checkpoint and Restart, remembers file paths and process IDs. The code stages the necessary STDOUT and STDERR files scheduler |
- | Here is an interactive simple sample run. At the bottom of this page is the v0.1 version of '' | + | At the bottom of this page is the current |
+ | * Here is a simple interactive sample run. | ||
+ | |||
< | < | ||
+ | # location | ||
+ | [hmeij@petaltail 187]$ pwd | ||
+ | / | ||
+ | # start the application with BLCR | ||
+ | [hmeij@petaltail 187]$ cr_run ./a.out & | ||
+ | [1] 15559 | ||
+ | |||
+ | # the application opens a file | ||
+ | [hmeij@petaltail 187]$ ll | ||
+ | total 1104 | ||
+ | -rwxr--r-- 1 hmeij its 1126428 Mar 15 15:38 a.out | ||
+ | -rw-r--r-- 1 hmeij its 0 Mar 15 15:41 fid.txt | ||
+ | |||
+ | # wait 5 mins, make a checkpoint file, pid 15559 is a.out running | ||
+ | [hmeij@petaltail 187]$ cr_checkpoint --save-all -f / | ||
+ | |||
+ | # the application runs for an hour, after a few mins pull the rug from underneath it | ||
+ | [hmeij@petaltail 187]$ pkill -u hmeij | ||
+ | Connection to petaltail lost. | ||
+ | |||
+ | # oops, that was not too clever, log back in, restart application in another directory | ||
+ | [hmeij@petaltail 187]$ cd .. | ||
+ | [hmeij@petaltail sanscratch]$ mkdir 188 | ||
+ | [hmeij@petaltail sanscratch]$ cd 188 | ||
+ | |||
+ | # restart the application from checkpoint file in background | ||
+ | # note the relocate directive | ||
+ | [1]+ cr_restart --relocate / | ||
+ | --no-restore-pid / | ||
+ | |||
+ | # note that a.out is missing from directory | ||
+ | [hmeij@petaltail 188]$ ll | ||
+ | total 0 | ||
+ | -rw-r--r-- 1 hmeij its 0 Mar 15 15:45 fid.txt | ||
+ | |||
+ | # but a.out is running upon restart | ||
+ | # One of these sleep processes determines when the next checkpoint gets created (set in blcr_wrapper) | ||
+ | # the other sleep process determines when blcr_watcher next checks if application has finished (every 10 mins) | ||
+ | |||
+ | [hmeij@petaltail 188]$ ps | ||
+ | PID TTY TIME CMD | ||
+ | 24936 ? 00:00:00 res | ||
+ | 24942 ? 00:00:00 1458238185.218 | ||
+ | 24945 ? 00:00:00 1458238185.218. | ||
+ | 24960 ? 00:00:00 cr_restart | ||
+ | 24961 ? 00:00:00 blcr_watcher | ||
+ | 24962 ? 00:00:00 sleep | ||
+ | 24964 ? 00:00:09 a.out | ||
+ | 24965 ? 00:00:00 sleep | ||
+ | 24983 ? 00:00:00 sshd | ||
+ | 24984 ? 00:00:00 ps | ||
+ | |||
+ | # after an hour | ||
+ | [1]+ Done cr_restart --relocate / | ||
+ | | ||
+ | |||
+ | # the result | ||
+ | [hmeij@petaltail 188]$ cat fid.txt | ||
+ | 0.999577738717693 | ||
</ | </ | ||
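The checkpoint and restart steps of the session above can be summarized in a dry-run sketch. It only prints the two BLCR command lines rather than executing them, so it runs on a machine without BLCR. The function name `blcr_cycle` and all paths are hypothetical placeholders; the options (`--save-all`, `-f`, `--relocate OLD=NEW`, `--no-restore-pid`) are the ones that appear on this page.

```shell
#!/bin/bash
# Dry-run sketch: print the BLCR commands for one checkpoint/restart cycle.
# ASSUMPTION: blcr_cycle and every path below are illustrative placeholders,
# not the values used in the session above.
blcr_cycle() {
    local pid=$1 ckpt=$2 old_dir=$3 new_dir=$4
    # checkpoint the running process into a context file
    echo "cr_checkpoint --save-all -f $ckpt $pid"
    # restart elsewhere: --relocate maps the old working directory to the
    # new one, --no-restore-pid lets the restarted process take a fresh pid
    echo "cr_restart --relocate $old_dir=$new_dir --no-restore-pid $ckpt"
}

blcr_cycle 15559 /tmp/chk.15559 /tmp/187 /tmp/188
```

Printing the commands first is a cheap way to check the relocate mapping before a real restart.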
- | + | ||
+ | |||
+ | ==== Putting it all Together ==== | ||
+ | |||
+ | The '' | ||
+ | |||
+ | You edit the top part of '' | ||
+ | |||
+ | Then submit to the scheduler as usual: | ||
+ | |||
+ | < | ||
+ | |||
+ | [hmeij@cottontail ~/ynam]$ bsub < blcr_wrapper_serial | ||
+ | |||
+ | </ | ||
+ | |||
+ | |||
+ | ==== Files v0.2 ==== | ||
+ | |||
+ | * '' | ||
+ | |||
+ | < | ||
+ | |||
+ | #!/bin/bash | ||
+ | # work dir and cwd | ||
+ | export MYSANSCRATCH=/ | ||
+ | cd $MYSANSCRATCH | ||
+ | |||
+ | # at job finish, all content in / | ||
+ | # will be copied to / | ||
+ | # content older than 3 months will be removed | ||
+ | |||
+ | # SCHEDULER | ||
+ | #BSUB -q test | ||
+ | #BSUB -n 1 | ||
+ | #BSUB -J test | ||
+ | #BSUB -o out | ||
+ | #BSUB -e err | ||
+ | |||
+ | # CHECK POINT TIME INTERVAL: 10m (debug) 6h 12h 18h 1d | ||
+ | cpti=10m | ||
+ | |||
+ | # COPY APPLICATION TO WORK DIR $MYSANSCRATCH (cwd) | ||
+ | # always stage the application (and data if needed) | ||
+ | # if application is large and static save_exec=" | ||
+ | save_exec=" | ||
+ | pre_cmd=" | ||
+ | post_cmd=" | ||
+ | |||
+ | # IF START OF JOB, UNCOMMENT | ||
+ | # it's either the start block or the restart block, never both | ||
+ | mode=start | ||
+ | queue=test | ||
+ | cmd=" | ||
+ | |||
+ | # IF RESTART OF JOB, UNCOMMENT | ||
+ | # you must have pwd.JOBPID and chk.JOBPID in $orgjobpid/ | ||
+ | # | ||
+ | # | ||
+ | # | ||
+ | |||
+ | # buglines: if your group/lab is mentioned set value to " | ||
+ | # " | ||
+ | do_bug1a_cmd=" | ||
+ | do_bug1b_cmd=" | ||
+ | |||
+ | |||
+ | |||
+ | ############### | ||
+ | |||
+ | |||
+ | |||
+ | # checkpoints | ||
+ | checkpoints=/ | ||
+ | |||
+ | # bug commands | ||
+ | bug1a_cmd=" | ||
+ | bug1b_cmd=" | ||
+ | |||
+ | # kernel modules | ||
+ | mods=`lsmod | grep ^blcr | wc -l` | ||
+ | if [ $mods -ne 2 ]; then | ||
+ | echo " | ||
+ | kill $$ | ||
+ | fi | ||
+ | |||
+ | # blcr setup | ||
+ | restore_options=" | ||
+ | if [ $save_exec == " | ||
+ | save_options=" | ||
+ | else | ||
+ | save_options=" | ||
+ | fi | ||
+ | export PATH=/ | ||
+ | export LD_LIBRARY_PATH=/ | ||
+ | |||
+ | # setup checkpoints dir | ||
+ | if [ ! -d $checkpoints/ | ||
+ | mkdir -p $checkpoints/ | ||
+ | else | ||
+ | echo " | ||
+ | kill $$ | ||
+ | fi | ||
+ | |||
+ | # save process id and path and start application | ||
+ | if [ " | ||
+ | $pre_cmd | ||
+ | cr_run $cmd & | ||
+ | pid=$! | ||
+ | pwd > $checkpoints/ | ||
+ | orgjobpid=0 | ||
+ | if [ " | ||
+ | $bug1a_cmd | ||
+ | fi | ||
+ | |||
+ | # otherwise restart the job | ||
+ | elif [ " | ||
+ | orgpid=`ls $checkpoints/ | ||
+ | orgpwd=`cat $checkpoints/ | ||
+ | #if [ " | ||
+ | # echo " | ||
+ | # kill $$ | ||
+ | #fi | ||
+ | scp $checkpoints/ | ||
+ | if [ $save_exec == " | ||
+ | $pre_cmd | ||
+ | fi | ||
+ | if [ " | ||
+ | $bug1b_cmd | ||
+ | fi | ||
+ | cr_restart $restore_options --relocate $orgpwd=$MYSANSCRATCH $checkpoints/ | ||
+ | pid=$! | ||
+ | pwd > $checkpoints/ | ||
+ | |||
+ | # obviously | ||
+ | else | ||
+ | echo " | ||
+ | kill $$ | ||
+ | fi | ||
+ | |||
+ | # if $cmd disappears during $cpti, terminate wrapper | ||
+ | export POST_CMD=" | ||
+ | blcr_watcher $pid $$ $LSB_JOBID $orgjobpid & | ||
+ | |||
+ | # always run this block | ||
+ | while true; do | ||
+ | # checkpoint time interval | ||
+ | sleep $cpti | ||
+ | # checkpoint file outside of sanscratch | ||
+ | cr_checkpoint $save_options -f $checkpoints/ | ||
+ | scp $HOME/ | ||
+ | if [ " | ||
+ | $bug1a_cmd | ||
+ | fi | ||
+ | done | ||
+ | |||
+ | </ | ||
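The wrapper's final loop sleeps for the checkpoint interval, writes a checkpoint, and repeats until the watcher kills it. A minimal sketch of that pattern follows, with two deliberate differences that are assumptions for illustration: the loop stops by itself once the application exits (instead of relying on a separate watcher), and `do_checkpoint` is a hypothetical stand-in that writes a log line where the real script would call `cr_checkpoint`.

```shell
#!/bin/bash
# Sketch: checkpoint a process at a fixed interval until it exits.
# ASSUMPTION: do_checkpoint is a stub; the real wrapper calls cr_checkpoint.
do_checkpoint() {
    echo "checkpoint of pid $1 at $(date +%s)" >> "$2"
}

periodic_checkpoint() {
    local pid=$1 interval=$2 log=$3
    # kill -0 sends no signal; it only tests whether the pid still exists
    while kill -0 "$pid" 2>/dev/null; do
        sleep "$interval"
        # the application may have finished while we slept
        kill -0 "$pid" 2>/dev/null || break
        do_checkpoint "$pid" "$log"
    done
}

# usage: a 3-second stand-in application, checkpointed every second
log=$(mktemp)
sleep 3 &
periodic_checkpoint $! 1 "$log"
cat "$log"
```

Folding the liveness test into the loop keeps the sketch short; the wrapper on this page splits the two jobs so that the watcher can also copy results back and run the post command.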
+ | |||
+ | * '' | ||
+ | |||
+ | < | ||
+ | |||
+ | # | ||
+ | |||
+ | # watch a process during check point time interval | ||
+ | # if it disappears, terminate the blcr_wrapper | ||
+ | |||
+ | checkpoints=/ | ||
+ | |||
+ | watch_pid=$1 | ||
+ | watch_wrapper=$2 | ||
+ | jobpid=$3 | ||
+ | orgjobpid=$4 | ||
+ | |||
+ | while [ $watch_pid -gt 0 ]; do | ||
+ | sleep 600 | ||
+ | nopid=`ps -u $USER | grep $watch_pid | awk ' | ||
+ | if [ " | ||
+ | # save output | ||
+ | scp -rp $MYSANSCRATCH/ | ||
+ | if [ $orgjobpid -gt 0 ]; then | ||
+ | rm -f $HOME/ | ||
+ | fi | ||
+ | $POST_CMD | ||
+ | kill $watch_wrapper | ||
+ | exit; | ||
+ | fi | ||
+ | done | ||
+ | |||
+ | |||
+ | </ | ||
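The watcher decides whether the application is still running by searching the `ps` listing for its pid. Depending on how the match is filtered afterwards, a plain `grep` on a pid can also match a longer pid that contains it (watching 1555 while 15559 exists). A hedged alternative is to test the exact pid with `kill -0`, which sends no signal and only reports whether the process exists. `is_alive` is a hypothetical helper name, not part of `blcr_watcher`.

```shell
#!/bin/bash
# Sketch: test for one exact pid instead of grepping the ps listing.
# ASSUMPTION: is_alive is an illustrative helper, not taken from blcr_watcher.
is_alive() {
    kill -0 "$1" 2>/dev/null
}

# stand-in application
sleep 30 &
app=$!

is_alive "$app" && echo "pid $app is running"

# simulate the application disappearing
kill "$app"
wait "$app" 2>/dev/null || true

is_alive "$app" || echo "pid $app has exited"
```

Note that `kill -0` also fails for processes owned by another user, which is not a concern here because the watcher only tracks the user's own job.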
+ | |||
+ | |||
+ | ==== Matlab ==== | ||
+ | |||
+ | * https:// | ||