This shows you the differences between two versions of the page.
cluster:124 [2013/10/31 18:20] hmeij |
cluster:124 [2016/03/11 20:14] (current) hmeij07 |
||
---|---|---|---|
Line 1: | Line 1: | ||
\\ | \\ | ||
**[[cluster: | **[[cluster: | ||
+ | |||
+ | Queue '' | ||
+ | --- // | ||
+ | |||
+ | Adjust your PATH and LD_LIBRARY_PATH accordingly. | ||
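One way to make that adjustment without piling up duplicate entries is a small helper that prepends a directory only when it is not already listed. This is a minimal sketch: `BLCR_HOME` and its default `/opt/blcr` are hypothetical placeholders, not the install location used on this cluster, and `prepend_path` is an illustrative name.

```shell
# Hypothetical install prefix (an assumption); substitute the real BLCR location.
BLCR_HOME="${BLCR_HOME:-/opt/blcr}"

# prepend_path VAR DIR
# Prepend DIR to the colon-separated variable VAR unless it is already listed.
prepend_path() {
    eval "_cur=\${$1:-}"                                  # current value of VAR, empty if unset
    case ":$_cur:" in
        *":$2:"*) ;;                                      # already present, nothing to do
        *) eval "export $1=\"\$2\${_cur:+:\$_cur}\"" ;;   # DIR first, then the old value
    esac
}

prepend_path PATH "$BLCR_HOME/bin"
prepend_path LD_LIBRARY_PATH "$BLCR_HOME/lib"
```

Calling the helper twice with the same directory leaves the variable unchanged, so it is safe to source from both a login profile and a batch script.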
==== BLCR ==== | ==== BLCR ==== | ||
Line 73: | Line 78: | ||
[hmeij@n33 blcr]$ mv context context.save | [hmeij@n33 blcr]$ mv context context.save | ||
- | < | + | </code> |
Ok. Next we use '' | Ok. Next we use '' | ||
Line 127: | Line 132: | ||
+ | Now we can write a batch script for the scheduler. | ||
- | [hmeij@n33 blcr]$ ./run.serial& | + | * The job will always end up in /sanscratch/ |
- | [1] 2082 | + | |
- | [hmeij@n33 blcr]$ process_id=2084 | + | * The time interval for checkpointing should be large enough that checkpointing does not slow the job down |
- | sleep 140; kill 2084 | + | * for example, set it to 12 or even 24 hours |
- | [hmeij@n33 blcr]$ ./ | + | * the short intervals in the script are just for testing |
- | ./t-20001030-01 > context 2>&1 | + | |
- | Checkpoint failed: no processes checkpointed | + | * One to invoke '' |
- | ll | + | |
- | total 344 | + | |
- | -r-------- 1 hmeij its 180798 Oct 31 10:22 checkpoint.2084 | + | * Create a link from old working directory to new working directory (saved in the pwd text file) |
- | -rw-r--r-- 1 hmeij its 12560 Oct 31 10:23 context | + | * Then edit the script: change the comment blocks and update the process_id |
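The linking step in the list above can be sketched as a small function. It reads the old working directory from the saved pwd text file and makes that path a symlink to the new working directory. The function name `relink_workdir` and the guard against replacing a real directory are illustrative additions, not part of the original script.

```shell
# relink_workdir PWD_FILE NEW_DIR
# Make the old working directory (path stored in PWD_FILE) a symlink to NEW_DIR.
relink_workdir() {
    pwd_file="$1"; new_dir="$2"
    old_dir=$(cat "$pwd_file") || return 1
    [ -n "$old_dir" ] || return 1
    # Refuse to clobber a real file or directory left behind by the earlier run.
    if [ -e "$old_dir" ] && [ ! -L "$old_dir" ]; then
        echo "refusing to replace existing $old_dir" >&2
        return 1
    fi
    rm -f "$old_dir"              # drop a stale link, if any
    ln -s "$new_dir" "$old_dir"
}
```

In the batch script this would be called with the saved pwd file and `$MYSANSCRATCH` before invoking `cr_restart`, so that paths recorded in the checkpoint still resolve.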
- | -rw-r--r-- 1 hmeij its 5643 Oct 31 10:18 info.txt | + | |
- | -rw-r--r-- 1 hmeij its 2867 Oct 30 14:27 lsf_readme.txt | + | |
- | -rwxr--r-- 1 hmeij its 657 Oct 31 10:08 run.serial | + | |
- | -rwxr-xr-x 1 hmeij its 7298 Oct 17 14:16 t-20001030-01 | + | |
- | [1]+ Done ./ | + | |
- | [hmeij@n33 blcr]$ tail -1 context | + | |
- | ************************************************************************************************************************************************************* | + | |
- | [hmeij@n33 blcr]$ tail -1 context | wc -c | + | |
- | 158 | + | |
- | [hmeij@sharptail ~]$ ll / | + | After you have restarted, you can observe the tool starting from the checkpoint file you are pointing to. To simulate a crash, while your first submission is running with '' |
- | total 16 | + | |
- | -rwx------ 1 hmeij its 1796 Oct 31 11:06 1383231850.62322 | + | |
- | -rw------- 1 hmeij its 0 Oct 31 11:06 1383231850.62322.err | + | |
- | -rw------- 1 hmeij its 0 Oct 31 11:06 1383231850.62322.out | + | |
- | -rwxr--r-- 1 hmeij its 1457 Oct 31 11:07 1383231850.62322.shell | + | |
- | -rw-r--r-- 1 hmeij its 0 Oct 31 11:07 context | + | |
- | -rwxr-xr-x 1 hmeij its 7298 Oct 17 14:16 t-20001030-01 | + | |
- | [hmeij@sharptail ~]$ ll ~/.ls | + | |
- | ls: cannot access / | + | |
- | [hmeij@sharptail ~]$ ll ~/ | + | |
- | total 0 | + | |
- | lrwxrwxrwx 1 hmeij its 34 Oct 31 11:06 1383231850.62322 -> | + | |
- | / | + | |
- | lrwxrwxrwx 1 hmeij its 38 Oct 31 11:06 1383231850.62322.err -> | + | |
- | / | + | |
- | lrwxrwxrwx 1 hmeij its 38 Oct 31 11:06 1383231850.62322.out -> | + | |
- | / | + | |
- | lrwxrwxrwx 1 hmeij its 40 Oct 31 11:06 1383231850.62322.shell -> | + | |
- | / | + | |
- | / | + | It would be even sweeter if the scheduler could be told to do all the checkpointing at intervals. | ||
+ | |||
+ | |||
+ | **run.serial** | ||
+ | |||
+ | < | ||
+ | |||
+ | # | ||
+ | # submit via 'bsub < run.serial' | ||
+ | rm -f *err *out *shell | ||
+ | #BSUB -q test | ||
+ | #BSUB -n 1 | ||
+ | #BSUB -J test | ||
+ | #BSUB -o out | ||
+ | #BSUB -e err | ||
+ | |||
+ | export PATH=/ | ||
+ | export LD_LIBRARY_PATH=/ | ||
+ | |||
+ | # checkpoint file is defined in while loop | ||
+ | MYSANSCRATCH=/ | ||
+ | MYLOCALSCRATCH=/ | ||
+ | export MYSANSCRATCH MYLOCALSCRATCH | ||
+ | cd $MYSANSCRATCH | ||
+ | |||
+ | # stage the application (plus data if needed) | ||
+ | cp -rp ~/ | ||
+ | |||
+ | # on first start of application, | ||
+ | # save some stuff for checking later and restart | ||
+ | #cr_run ./ | ||
+ | #sleep 60 | ||
+ | # | ||
+ | #pwd > pwd.$process_id | ||
+ | #cp -p pwd* *.shell *.out *.err ~/blcr/ | ||
+ | |||
+ | # on restart, give cr_restart some time to set up | ||
+ | # WARNING: it will overwrite the checkpoint file, save it | ||
+ | # you need to find the process_id and supply it | ||
+ | process_id=4711 | ||
+ | cp -p ~/ | ||
+ | mv ~/ | ||
+ | ln -s $MYSANSCRATCH `cat ~/ | ||
+ | cr_restart ~/ | ||
+ | sleep 60 | ||
+ | |||
+ | # always uncommented | ||
+ | echo " | ||
+ | while [ $process_id -gt 0 ]; do | ||
+ | # checkpoint time interval, make it very large (small for testing) | ||
+ | sleep 120 | ||
+ | # save the checkpoint file outside of sanscratch | ||
+ | cr_checkpoint -f ~/ | ||
+ | cp -p context ~/blcr/ | ||
+ | # if the application has crashed, or finished, exit | ||
+ | process_id=`ps -u hmeij | grep t-20001030-01 | grep -v grep | awk ' | ||
+ | if [ " | ||
+ | rm -f `cat ~/ | ||
+ | exit; | ||
+ | fi | ||
+ | done | ||
+ | |||
+ | |||
+ | |||
+ | </ | ||
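The checkpoint loop at the end of the script can also be written as a reusable function. The sketch below swaps the `ps | grep` lookup for `kill -0`, which only tests whether a process with that id still exists, so it assumes the application runs under the same user as the batch script. The function name `checkpoint_loop` and its argument order are illustrative, and the checkpoint command is passed in as arguments rather than hard-coded.

```shell
# checkpoint_loop PID INTERVAL CMD...
# Every INTERVAL seconds, run CMD with PID appended, for as long as PID is alive.
checkpoint_loop() {
    pid="$1"; interval="$2"; shift 2
    while kill -0 "$pid" 2>/dev/null; do
        sleep "$interval"
        # The process may have finished or crashed during the sleep; stop then.
        kill -0 "$pid" 2>/dev/null || break
        "$@" "$pid" || echo "checkpoint of $pid failed" >&2
    done
}
```

With a 12 hour interval, a call might look like `checkpoint_loop "$process_id" 43200 cr_checkpoint -f "$CKPT_FILE"`, where `CKPT_FILE` is a hypothetical variable naming a checkpoint file outside of sanscratch. Copying the application output and cleaning up the link would still follow the loop, as in the script above.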
\\ | \\ | ||
**[[cluster: | **[[cluster: | ||