cluster:124

Differences

This shows you the differences between two versions of the page.

cluster:124 [2013/10/31 13:50]
hmeij
cluster:124 [2016/03/11 15:14] (current)
hmeij07
Line 1: Line 1:
 \\ \\
 **[[cluster:0|Back]]** **[[cluster:0|Back]]**
 +
 +Queue ''tinymem'' supports BLCR
 + --- //[[hmeij@wesleyan.edu|Henk]] 2016/03/03 13:57//
 +
 +Adjust your PATH and LD_LIBRARY_PATH accordingly.
  
 ==== BLCR ==== ==== BLCR ====
Line 12: Line 17:
   * [[https://upc-bugs.lbl.gov/blcr/doc/html/BLCR_Admin_Guide.html|BLCR_Admin_Guide.html]]   * [[https://upc-bugs.lbl.gov/blcr/doc/html/BLCR_Admin_Guide.html|BLCR_Admin_Guide.html]]
   * [[https://upc-bugs.lbl.gov/blcr/doc/html/BLCR_Users_Guide.html|BLCR_Users_Guide.html]]    * [[https://upc-bugs.lbl.gov/blcr/doc/html/BLCR_Users_Guide.html|BLCR_Users_Guide.html]] 
 +
 +First, let's test on a node to grasp the concept.
  
 <code> <code>
Line 35: Line 42:
   -?, --help             print this message and exit.                 -?, --help             print this message and exit.              
       --version          print version information and exit.         --version          print version information and exit.  
 +...
 +
 +# and here is our application and output (one extra character per second)
 +[hmeij@n33 blcr]$ ./t-20001030-01
 +*
 +**
 +***
 +****
 +*****
 +******
 ... ...
 </code> </code>
  
 +So now let's run this under BLCR and observe what happens.  After we define the proper environment, we use ''cr_run'' to launch our application. Standard output and error are written into the output file ''context''. We then note the PID of our process and use ''cr_checkpoint'' to write a checkpoint file and immediately terminate the process.
 +
 +<code>
 +
 +# start application
 [hmeij@n33 blcr]$ cr_run ./t-20001030-01 > context 2>&1 & [hmeij@n33 blcr]$ cr_run ./t-20001030-01 > context 2>&1 &
 [1] 12789 [1] 12789
  
 +# observe PID
 [hmeij@n33 blcr]$ ps [hmeij@n33 blcr]$ ps
   PID TTY          TIME CMD   PID TTY          TIME CMD
Line 47: Line 70:
 28257 pts/29   00:00:00 bash 28257 pts/29   00:00:00 bash
  
 +# wait, then checkpoint and terminate process
 [hmeij@n33 blcr]$ sleep 30 [hmeij@n33 blcr]$ sleep 30
 [hmeij@n33 blcr]$ cr_checkpoint --term 12789 [hmeij@n33 blcr]$ cr_checkpoint --term 12789
 [1]+  Terminated              cr_run ./t-20001030-01 > context 2>&1 [1]+  Terminated              cr_run ./t-20001030-01 > context 2>&1
  
 +# save the output
 [hmeij@n33 blcr]$ mv context context.save [hmeij@n33 blcr]$ mv context context.save
  
 +</code>
 +
 +Ok.  Next we use ''cr_restart'' to restart our application by pointing it to the checkpoint file generated.  Then we'll wait a bit and terminate the restart.
 +
 +<code>
 +
 +# restart in background
 [hmeij@n33 blcr]$ cr_restart ./context.12789 > context 2>&1 & [hmeij@n33 blcr]$ cr_restart ./context.12789 > context 2>&1 &
 [1] 13579 [1] 13579
 +
 +# wait and terminate the restart
 [hmeij@n33 blcr]$ sleep 30 [hmeij@n33 blcr]$ sleep 30
 [hmeij@n33 blcr]$ kill %1 [hmeij@n33 blcr]$ kill %1
 [1]+  Terminated              cr_restart ./context.12789 > context 2>&1 [1]+  Terminated              cr_restart ./context.12789 > context 2>&1
  
 +</code>
 +
 +So what we're interested in is the boundary between first termination and subsequent restart.  It looks like this:
 +
 +<code>
  
 [hmeij@n33 blcr]$ tail context.save [hmeij@n33 blcr]$ tail context.save
Line 83: Line 122:
 ************************************************************ ************************************************************
  
-[21 removed lines consisting entirely of ^@ (NUL) characters]
 +# pretty nifty!
 +# but be forewarned that there are binary characters lurking at this boundary
 +# you can strip them out with ''sed'' or ''tr''
 +# it looks like this
 +
-^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@***************************************************+
  
 +^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@***************************************************
  
-[hmeij@n33 blcr]$ ./run.serial& +</code>
-[1] 2082 +
-[hmeij@n33 blcr]$ process_id=2084 +
-sleep 140; kill 2084 +
-[hmeij@n33 blcr]$ ./run.serial: line 24:  2084 Terminated              cr_run +
-./t-20001030-01 context 2>&+
-Checkpoint failed: no processes checkpointed +
-ll +
-total 344 +
--r-------- 1 hmeij its 180798 Oct 31 10:22 checkpoint.2084 +
--rw-r--r-- 1 hmeij its  12560 Oct 31 10:23 context +
--rw-r--r-- 1 hmeij its   5643 Oct 31 10:18 info.txt +
--rw-r--r-- 1 hmeij its   2867 Oct 30 14:27 lsf_readme.txt +
--rwxr--r-- 1 hmeij its    657 Oct 31 10:08 run.serial +
--rwxr-xr-x 1 hmeij its   7298 Oct 17 14:16 t-20001030-01 +
-[1]+  Done                    ./run.serial +
-[hmeij@n33 blcr]$ tail -1 context +
-************************************************************************************************************************************************************* +
-[hmeij@n33 blcr]$ tail -1 context | wc -c +
-158+
  
-[hmeij@sharptail ~]$ ll /sanscratch/62322 
-total 16 
--rwx------ 1 hmeij its 1796 Oct 31 11:06 1383231850.62322 
--rw------- 1 hmeij its    0 Oct 31 11:06 1383231850.62322.err 
--rw------- 1 hmeij its    0 Oct 31 11:06 1383231850.62322.out 
--rwxr--r-- 1 hmeij its 1457 Oct 31 11:07 1383231850.62322.shell 
--rw-r--r-- 1 hmeij its    0 Oct 31 11:07 context 
--rwxr-xr-x 1 hmeij its 7298 Oct 17 14:16 t-20001030-01 
-[hmeij@sharptail ~]$ ll ~/.ls 
-ls: cannot access /home/hmeij/.ls: No such file or directory 
-[hmeij@sharptail ~]$ ll ~/.lsbatch/ 
-total 0 
-lrwxrwxrwx 1 hmeij its 34 Oct 31 11:06 1383231850.62322 -> 
-/sanscratch/62322/1383231850.62322 
-lrwxrwxrwx 1 hmeij its 38 Oct 31 11:06 1383231850.62322.err -> 
-/sanscratch/62322/1383231850.62322.err 
-lrwxrwxrwx 1 hmeij its 38 Oct 31 11:06 1383231850.62322.out -> 
-/sanscratch/62322/1383231850.62322.out 
-lrwxrwxrwx 1 hmeij its 40 Oct 31 11:06 1383231850.62322.shell -> 
-/sanscratch/62322/1383231850.62322.shell 
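The NUL characters (shown as ^@) at the checkpoint/restart boundary can be removed with ''tr''. What follows is only a minimal sketch: the function name ''strip_nuls'' and the demo file names are made up for illustration, and the demo writes a throwaway file rather than touching a real ''context'' file.

```shell
#!/bin/bash
# Sketch: remove NUL bytes (displayed as ^@) from a saved output file.
# strip_nuls and the demo file names are hypothetical, not part of BLCR.
strip_nuls() {
    # tr -d '\000' deletes every NUL byte and passes all other bytes through
    tr -d '\000' < "$1" > "$2"
}

# Demo on a throwaway file: three NUL bytes followed by a row of asterisks
demo_dir=$(mktemp -d)
printf '\000\000\000*****\n' > "$demo_dir/context.demo"
strip_nuls "$demo_dir/context.demo" "$demo_dir/context.clean"
cat "$demo_dir/context.clean"
```

For a real run you would point the function at the output file written after the restart, for example ''strip_nuls context context.clean''.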
  
 +Now we can write a batch script for the scheduler.  We need to do several things:
 +
 +  * The job will always end up in /sanscratch/JOBPID so we need to stage and save our data
 +  * The checkpoint file should be written to a safe place, like /home
 +  * The time interval for checkpointing should be sufficiently large to not slow the job down
 +    * for example set it to 12 hours or even 24 hours
 +    * the small interval times in the script are just for testing
 +  * Then there are 2 blocks of lines to (un)comment
 +    * One to invoke ''cr_run''
 +    * One to invoke ''cr_restart''
 +  * For a restart we need two things
 +    * Create a link from old working directory to new working directory (saved in the pwd text file)
 +    * And edit the script and change the comment blocks and edit the process_id
 +      * The restart job may end up on another node but will have the same process_id
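The "link from old working directory to new working directory" step can be tried without the scheduler. The sketch below is only an illustration: temporary directories stand in for the old and new /sanscratch job directories, and ''4711'' is the placeholder process ID also used in the script on this page.

```shell
#!/bin/bash
# Sketch of the restart link step. The temp directories are hypothetical
# stand-ins for the old and new /sanscratch job directories.
base=$(mktemp -d)
old_dir="$base/old_job"    # working directory of the first run (no longer exists)
new_dir="$base/new_job"    # working directory the restart job lands in
mkdir "$new_dir"

# This is what 'pwd > pwd.$process_id' saved during the first run
echo "$old_dir" > "$base/pwd.4711"

# Make the old path resolve to the new directory
ln -s "$new_dir" "$(cat "$base/pwd.4711")"

# A file addressed through the old path now lands in the new directory
touch "$old_dir/context"
ls "$new_dir"
```

The restarted process still refers to its original working directory, which is why the old path has to resolve to something on the node where the restart job runs.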
 +
 +After you have restarted, you can observe the tool starting from the checkpoint file you are pointing to.  To simulate a crash while your first submission is running with ''cr_run'', find the node it is running on and the process ID (in the file *out), then issue the command ''ssh node_name kill process_id'' and wait for the next while iteration to terminate the program.  The scheduler will think the job terminated fine (job status DONE). Or just issue a ''bkill'' command; you should be able to recover from that too.
 +
 +It would be even sweeter if the scheduler could be told to do all the checkpointing at intervals.  I'm investigating that but in the meantime you can do it manually.
 +
 +
 +** run.serial**
 +
 +<code>
 +
 +#!/bin/bash 
 +# submit via 'bsub < run.serial'
 +rm -f *err *out *shell
 +#BSUB -q test
 +#BSUB -n 1
 +#BSUB -J test
 +#BSUB -o out
 +#BSUB -e err
 +
 +export PATH=/share/apps/blcr/0.8.5/test/bin:$PATH
 +export LD_LIBRARY_PATH=/share/apps/blcr/0.8.5/test/lib:$LD_LIBRARY_PATH
 +
 +# checkpoint file is defined in while loop
 +MYSANSCRATCH=/sanscratch/$LSB_JOBID
 +MYLOCALSCRATCH=/localscratch/$LSB_JOBID
 +export MYSANSCRATCH MYLOCALSCRATCH
 +cd $MYSANSCRATCH
 +
 +# stage the application (plus data if needed)
 +cp -rp ~/blcr/t-20001030-01 .
 +
 +# on first start of application, remember the working directory
 +# save some stuff for checking later and restart
 +#cr_run ./t-20001030-01 > context 2>&1 &
 +#sleep 60
 +#process_id=`ps -u hmeij | grep t-20001030-01 | grep -v grep | awk '{print $1}'`
 +#pwd > pwd.$process_id
 +#cp -p pwd* *.shell *.out *.err ~/blcr/
 +
 +# on restart, give cr_restart some time to set up
 +# WARNING: it will overwrite the checkpoint file, save it
 +# you need to find the process_id and supply it
 +process_id=4711
 +cp -p ~/blcr/checkpoint.$process_id ~/blcr/checkpoint.$process_id.saved
 +mv ~/blcr/context  ~/blcr/context.save
 +ln -s $MYSANSCRATCH `cat ~/blcr/pwd.$process_id`
 +cr_restart ~/blcr/checkpoint.$process_id > context 2>&1 &
 +sleep 60
 +
 +# always uncommented
 +echo "process_id=$process_id"
 +while [ $process_id -gt 0 ]; do
 +        # checkpoint time interval, make it very large (small for testing)
 +        sleep 120
 +        # save the checkpoint file outside of sanscratch
 +        cr_checkpoint -f ~/blcr/checkpoint.$process_id $process_id
 +        cp -p context ~/blcr/
 +        # if the application has crashed, or finished, exit
 +        # (look it up in a separate variable so $process_id stays set for the cleanup below)
 +        running=`ps -u hmeij | grep t-20001030-01 | grep -v grep | awk '{print $1}'`
 +        if [ "${running}x" = 'x' ]; then
 +                rm -f `cat ~/blcr/pwd.$process_id`
 +                exit;
 +        fi 
 +done
 +
 +
 +
 +</code>
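On the first start, the shell's ''$!'' variable may be a simpler way to get the process ID than parsing ''ps'' output. This is a hedged sketch, not part of the script above: it relies on the interactive transcript on this page, where the job number printed for ''cr_run ... &'' (12789) is the same PID later passed to ''cr_checkpoint''. It does not apply to the restart case, where the background PID belongs to ''cr_restart''. Here ''sleep 30'' is a stand-in for the real application so the example runs anywhere.

```shell
#!/bin/bash
# Sketch: capture the PID of a background job with $! instead of parsing ps.
# 'sleep 30' is a stand-in for 'cr_run ./t-20001030-01 > context 2>&1'.
sleep 30 &
process_id=$!

# kill -0 sends no signal; it only reports whether the PID still exists
if kill -0 "$process_id" 2>/dev/null; then
    status="running"
else
    status="gone"
fi
echo "process_id=$process_id status=$status"

# Clean up the stand-in process
kill "$process_id"
wait "$process_id" 2>/dev/null || true
```

The same ''kill -0'' check could replace the ''ps | grep'' test inside the while loop, assuming the script runs on the same node as the application.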
  
 \\ \\
 **[[cluster:0|Back]]** **[[cluster:0|Back]]**
  
cluster/124.1383241813.txt.gz · Last modified: 2013/10/31 13:50 by hmeij