cluster:124
Differences
This shows you the differences between two versions of the page.
| Next revision | Previous revision | ||
| cluster:124 [2013/10/31 17:31] – created hmeij | cluster:124 [2016/03/11 20:14] (current) – hmeij07 | ||
|---|---|---|---|
| Line 1: | Line 1: | ||
| - | BLCR | + | \\ |
| + | **[[cluster: | ||
| - | are modules loaded (done via /etc/rc.local) | + | Queue '' |
| + | | ||
| + | Adjust your PATH and LD_LIBRARY_PATH accordingly. | ||
| + | |||
| + | ==== BLCR ==== | ||
| + | |||
| + | So we need a day of downtime to switch file server functionality from greentail to sharptail. It would be nice if nobody lost any computational progress. | ||
| + | |||
| + | I've decided to support one checkpoint/ | ||
| + | |||
| + | BLCR consists of two kernel modules, some user-level libraries, and several command-line executables. No kernel patching is required. Modules are loaded upon boot via / | ||
| + | |||
| + | * [[https:// | ||
| + | * [[https:// | ||
| + | |||
| + | First, let's test on a node to grasp the concept. | ||
| + | |||
| + | < | ||
| + | |||
| + | # are modules loaded | ||
| [hmeij@n33 blcr]$ lsmod | grep blcr | [hmeij@n33 blcr]$ lsmod | grep blcr | ||
| blcr 115529 | blcr 115529 | ||
| blcr_imports | blcr_imports | ||
| - | set env | ||
| - | | ||
| - | | ||
| + | # set env | ||
| + | export PATH=/ | ||
| + | export LD_LIBRARY_PATH=/ | ||
| + | # is it all working | ||
| [hmeij@n33 blcr]$ cr_checkpoint --help | [hmeij@n33 blcr]$ cr_checkpoint --help | ||
| Usage: cr_checkpoint [options] ID | Usage: cr_checkpoint [options] ID | ||
| Line 23: | Line 44: | ||
| ... | ... | ||
| + | # and here is our application and output (one extra character per second) | ||
| + | [hmeij@n33 blcr]$ ./ | ||
| + | * | ||
| + | ** | ||
| + | *** | ||
| + | **** | ||
| + | ***** | ||
| + | ****** | ||
| + | ... | ||
| + | </ | ||
| + | |||
| + | So now let's run this under BLCR and observe what happens. | ||
| + | |||
| + | < | ||
| + | |||
| + | # start application | ||
| [hmeij@n33 blcr]$ cr_run ./ | [hmeij@n33 blcr]$ cr_run ./ | ||
| [1] 12789 | [1] 12789 | ||
| + | # observe PID | ||
| [hmeij@n33 blcr]$ ps | [hmeij@n33 blcr]$ ps | ||
| PID TTY TIME CMD | PID TTY TIME CMD | ||
| Line 32: | Line 70: | ||
| 28257 pts/ | 28257 pts/ | ||
| + | # wait, then checkpoint and terminate process | ||
| [hmeij@n33 blcr]$ sleep 30 | [hmeij@n33 blcr]$ sleep 30 | ||
| [hmeij@n33 blcr]$ cr_checkpoint --term 12789 | [hmeij@n33 blcr]$ cr_checkpoint --term 12789 | ||
| [1]+ Terminated | [1]+ Terminated | ||
| + | # save the output | ||
| [hmeij@n33 blcr]$ mv context context.save | [hmeij@n33 blcr]$ mv context context.save | ||
| + | </ | ||
| + | |||
| + | Ok. Next we use '' | ||
| + | |||
| + | < | ||
| + | |||
| + | # restart in background | ||
| [hmeij@n33 blcr]$ cr_restart ./ | [hmeij@n33 blcr]$ cr_restart ./ | ||
| [1] 13579 | [1] 13579 | ||
| + | |||
| + | # wait and terminate the restart | ||
| [hmeij@n33 blcr]$ sleep 30 | [hmeij@n33 blcr]$ sleep 30 | ||
| [hmeij@n33 blcr]$ kill %1 | [hmeij@n33 blcr]$ kill %1 | ||
| [1]+ Terminated | [1]+ Terminated | ||
| + | </ | ||
| + | |||
| + | So what we're interested in is the boundary between first termination and subsequent restart. | ||
| + | |||
| + | < | ||
| [hmeij@n33 blcr]$ tail context.save | [hmeij@n33 blcr]$ tail context.save | ||
| Line 68: | Line 122: | ||
| ************************************************************ | ************************************************************ | ||
| - | ^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@ | + | # pretty nifty! |
| - | ^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@ | + | # but be forewarned that there are binary characters lurking at this boundary |
| - | ^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@ | + | # you can strip them out with '' |
| - | ^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@ | + | # it looks like this |
| - | ^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@ | + | |
| - | ^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@ | + | ^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@*************************************************** |
| - | ^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@ | + | |
| - | ^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@ | + | </ |
| - | ^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@ | + | |
| - | ^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@ | + | |
| - | ^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@ | + | Now we can write a batch script for the scheduler. |
| - | ^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@ | + | |
| - | ^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@ | + | * The job will always end up in / |
| - | ^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@ | + | * The checkpoint file should be written to a safe place, like /home |
| - | ^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@ | + | * The time interval for checkpointing should be sufficiently large to not slow the job down |
| - | ^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@ | + | * for example set it to 12 hours or 24 hours even |
| - | ^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@ | + | * the small interval times in the script are just for testing | ||
| - | ^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@ | + | * Then there are 2 blocks of lines to (un)comment | ||
| - | ^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@ | + | * One to invoke '' |
| - | ^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@ | + | * One to invoke '' |
| - | ^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@ | + | * For a restart we need two things | ||
| - | ^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@*************************************************** | + | * Create a link from old working directory to new working directory (saved in the pwd text file) |
| + | * Edit the script: swap the comment blocks and set the process_id | ||
| + | * The restart job may end up on another node but will have the same process_id | ||
| + | |||
| + | After you have restarted, you can observe the tool starting from the checkpoint file you are pointing to. To simulate a crash, while your first submission is running with '' | ||
| + | |||
| + | It would be even sweeter if the scheduler could be told to do all the checkpointing at intervals. | ||
| + | |||
| + | |||
| + | **run.serial** | ||
| + | |||
| + | < | ||
| + | |||
| + | #!/bin/bash | ||
| + | # submit via 'bsub < run.serial' | ||
| + | rm -f *err *out *shell | ||
| + | #BSUB -q test | ||
| + | #BSUB -n 1 | ||
| + | #BSUB -J test | ||
| + | #BSUB -o out | ||
| + | #BSUB -e err | ||
| + | |||
| + | export PATH=/ | ||
| + | export LD_LIBRARY_PATH=/ | ||
| + | |||
| + | # checkpoint file is defined in while loop | ||
| + | MYSANSCRATCH=/ | ||
| + | MYLOCALSCRATCH=/ | ||
| + | export MYSANSCRATCH MYLOCALSCRATCH | ||
| + | cd $MYSANSCRATCH | ||
| + | |||
| + | # stage the application (plus data if needed) | ||
| + | cp -rp ~/ | ||
| + | |||
| + | # on first start of application, | ||
| + | # save some stuff for checking later and restart | ||
| + | #cr_run ./ | ||
| + | #sleep 60 | ||
| + | # | ||
| + | #pwd > pwd.$process_id | ||
| + | #cp -p pwd* *.shell *.out *.err ~/blcr/ | ||
| + | |||
| + | # on restart, give cr_restart some time to set up | ||
| + | # WARNING: it will overwrite the checkpoint file, save it | ||
| + | # you need to find the process_id and supply it | ||
| + | process_id=4711 | ||
| + | cp -p ~/ | ||
| + | mv ~/ | ||
| + | ln -s $MYSANSCRATCH `cat ~/ | ||
| + | cr_restart ~/ | ||
| + | sleep 60 | ||
| + | |||
| + | # always uncommented | ||
| + | echo " | ||
| + | while [ $process_id -gt 0 ]; do | ||
| + | # checkpoint time interval, make it very large (small for testing) | ||
| + | sleep 120 | ||
| + | # save the checkpoint file outside of sanscratch | ||
| + | cr_checkpoint -f ~/ | ||
| + | cp -p context ~/blcr/ | ||
| + | # if the application has crashed, or finished, exit | ||
| + | process_id=`ps -u hmeij | grep t-20001030-01 | grep -v grep | awk ' | ||
| + | if [ " | ||
| + | rm -f `cat ~/ | ||
| + | exit; | ||
| + | fi | ||
| + | done | ||
| - | [hmeij@n33 blcr]$ ./run.serial& | + | </code> |
| - | [1] 2082 | + | |
| - | [hmeij@n33 blcr]$ process_id=2084 | + | |
| - | sleep 140; kill 2084 | + | |
| - | [hmeij@n33 blcr]$ ./ | + | |
| - | ./ | + | |
| - | Checkpoint failed: no processes checkpointed | + | |
| - | ll | + | |
| - | total 344 | + | |
| - | -r-------- 1 hmeij its 180798 Oct 31 10:22 checkpoint.2084 | + | |
| - | -rw-r--r-- 1 hmeij its 12560 Oct 31 10:23 context | + | |
| - | -rw-r--r-- 1 hmeij its 5643 Oct 31 10:18 info.txt | + | |
| - | -rw-r--r-- 1 hmeij its 2867 Oct 30 14:27 lsf_readme.txt | + | |
| - | -rwxr--r-- 1 hmeij its 657 Oct 31 10:08 run.serial | + | |
| - | -rwxr-xr-x 1 hmeij its 7298 Oct 17 14:16 t-20001030-01 | + | |
| - | [1]+ Done ./ | + | |
| - | [hmeij@n33 blcr]$ tail -1 context | + | |
| - | ************************************************************************************************************************************************************* | + | |
| - | [hmeij@n33 blcr]$ tail -1 context | wc -c | + | |
| - | 158 | + | |
| - | [hmeij@sharptail ~]$ ll / | + | \\ |
| - | total 16 | + | **[[cluster:0|Back]]** |
| - | -rwx------ 1 hmeij its 1796 Oct 31 11:06 1383231850.62322 | + | |
| - | -rw------- 1 hmeij its | + | |
| - | -rw------- 1 hmeij its 0 Oct 31 11:06 1383231850.62322.out | + | |
| - | -rwxr--r-- 1 hmeij its 1457 Oct 31 11:07 1383231850.62322.shell | + | |
| - | -rw-r--r-- 1 hmeij its 0 Oct 31 11:07 context | + | |
| - | -rwxr-xr-x 1 hmeij its 7298 Oct 17 14:16 t-20001030-01 | + | |
| - | [hmeij@sharptail ~]$ ll ~/.ls | + | |
| - | ls: cannot access / | + | |
| - | [hmeij@sharptail ~]$ ll ~/ | + | |
| - | total 0 | + | |
| - | lrwxrwxrwx 1 hmeij its 34 Oct 31 11:06 1383231850.62322 -> | + | |
| - | / | + | |
| - | lrwxrwxrwx 1 hmeij its 38 Oct 31 11:06 1383231850.62322.err -> | + | |
| - | / | + | |
| - | lrwxrwxrwx 1 hmeij its 38 Oct 31 11:06 1383231850.62322.out -> | + | |
| - | / | + | |
| - | lrwxrwxrwx 1 hmeij its 40 Oct 31 11:06 1383231850.62322.shell -> | + | |
| - | / | + | |
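The while loop in run.serial decides whether the application is still alive by grepping ''ps'' output for the binary name. A minimal alternative sketch, assuming the process_id is already known (as it is after ''cr_run'' or ''cr_restart''), is to probe the PID directly with ''kill -0'', which sends no signal and only reports whether the process exists. The function name ''checkpoint_loop'' is hypothetical, and the checkpoint command is passed in as arguments rather than hard-coded because the exact ''cr_checkpoint'' command line is not shown in full above.

```shell
#!/bin/bash
# Hypothetical helper (not part of BLCR or LSF):
#   checkpoint_loop PID INTERVAL COMMAND [ARGS...]
# Runs COMMAND every INTERVAL seconds for as long as process PID exists.
checkpoint_loop() {
    ckpt_pid=$1
    ckpt_interval=$2
    shift 2
    # kill -0 delivers no signal; it succeeds only if the PID exists
    # and belongs to a process we are allowed to signal
    while kill -0 "$ckpt_pid" 2>/dev/null; do
        sleep "$ckpt_interval"
        # re-check after the sleep so we never checkpoint a dead process
        if kill -0 "$ckpt_pid" 2>/dev/null; then
            "$@"
        fi
    done
}

# Stand-in demo: 'sleep 3' plays the application and 'echo' plays the
# checkpoint command; in a real job the command would be cr_checkpoint
# with whatever options the batch script uses.
sleep 3 &
checkpoint_loop "$!" 1 echo "checkpoint would run here"
```

Because the loop tests the PID rather than the process name, it does not depend on a user name or binary name being spelled into the script, and it should work unchanged for a restarted process that keeps its original process_id.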
cluster/124.1383240699.txt.gz · Last modified: by hmeij
