cluster:124

Differences

This shows you the differences between two versions of the page.

cluster:124 [2013/10/31 17:56]
hmeij
cluster:124 [2016/03/11 20:14] (current)
hmeij07
Line 1: Line 1:
 \\ \\
 **[[cluster:0|Back]]** **[[cluster:0|Back]]**
 +
 +Queue ''tinymem'' supports BLCR
 + --- //[[hmeij@wesleyan.edu|Henk]] 2016/03/03 13:57//
 +
 +Adjust your PATH and LD_LIBRARY_PATH accordingly.
  
 ==== BLCR ==== ==== BLCR ====
Line 39: Line 44:
 ... ...
  
-# and here is our application and output+# and here is our application and output (one extra character per second)
 [hmeij@n33 blcr]$ ./t-20001030-01 [hmeij@n33 blcr]$ ./t-20001030-01
 * *
Line 50: Line 55:
 </code> </code>
  
 +So now let's run this under BLCR and observe what happens.  After we define the proper environment, we use ''cr_run'' to launch our application. Standard output and error are written into the output file ''context''. We then observe the PID of our process and use ''cr_checkpoint'' to write a checkpoint file and immediately terminate the process.
 +
 +<code>
 +
 +# start application
 [hmeij@n33 blcr]$ cr_run ./t-20001030-01 > context 2>&1 & [hmeij@n33 blcr]$ cr_run ./t-20001030-01 > context 2>&1 &
 [1] 12789 [1] 12789
  
 +# observe PID
 [hmeij@n33 blcr]$ ps [hmeij@n33 blcr]$ ps
   PID TTY          TIME CMD   PID TTY          TIME CMD
Line 59: Line 70:
 28257 pts/29   00:00:00 bash 28257 pts/29   00:00:00 bash
  
 +# wait, then checkpoint and terminate process
 [hmeij@n33 blcr]$ sleep 30 [hmeij@n33 blcr]$ sleep 30
 [hmeij@n33 blcr]$ cr_checkpoint --term 12789 [hmeij@n33 blcr]$ cr_checkpoint --term 12789
 [1]+  Terminated              cr_run ./t-20001030-01 > context 2>&1 [1]+  Terminated              cr_run ./t-20001030-01 > context 2>&1
  
 +# save the output
 [hmeij@n33 blcr]$ mv context context.save [hmeij@n33 blcr]$ mv context context.save
  
 +</code>
 +
 +Ok.  Next we use ''cr_restart'' to restart our application by pointing it to the checkpoint file generated.  Then we'll wait a bit and terminate the restart.
 +
 +<code>
 +
 +# restart in background
 [hmeij@n33 blcr]$ cr_restart ./context.12789 > context 2>&1 & [hmeij@n33 blcr]$ cr_restart ./context.12789 > context 2>&1 &
 [1] 13579 [1] 13579
 +
 +# wait and terminate the restart
 [hmeij@n33 blcr]$ sleep 30 [hmeij@n33 blcr]$ sleep 30
 [hmeij@n33 blcr]$ kill %1 [hmeij@n33 blcr]$ kill %1
 [1]+  Terminated              cr_restart ./context.12789 > context 2>&1 [1]+  Terminated              cr_restart ./context.12789 > context 2>&1
  
 +</code>
 +
 +So what we're interested in is the boundary between the first termination and the subsequent restart.  It looks like this:
 +
 +<code>
  
 [hmeij@n33 blcr]$ tail context.save [hmeij@n33 blcr]$ tail context.save
Line 95: Line 122:
 ************************************************************ ************************************************************
  
-^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@ +# pretty nifty! 
-^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@ +# but be forewarned that there are binary characters lurking at this boundary 
-^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@ +# you can strip them out with ''sed'' or ''tr'' 
-^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@ +# it looks like this
-^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@ +
-^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@ +
-^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@ +
-^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@ +
-^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@ +
-^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@ +
-^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@ +
-^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@ +
-^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@ +
-^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@ +
-^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@ +
-^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@ +
-^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@ +
-^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@ +
-^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@ +
-^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@ +
-^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@ +
-^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@***************************************************+
  
 +^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@***************************************************
  
-[hmeij@n33 blcr]$ ./run.serial& +</code>
-[1] 2082 +
-[hmeij@n33 blcr]$ process_id=2084 +
-sleep 140; kill 2084 +
-[hmeij@n33 blcr]$ ./run.serial: line 24:  2084 Terminated              cr_run +
-./t-20001030-01 context 2>&+
-Checkpoint failed: no processes checkpointed +
-ll +
-total 344 +
--r-------- 1 hmeij its 180798 Oct 31 10:22 checkpoint.2084 +
--rw-r--r-- 1 hmeij its  12560 Oct 31 10:23 context +
--rw-r--r-- 1 hmeij its   5643 Oct 31 10:18 info.txt +
--rw-r--r-- 1 hmeij its   2867 Oct 30 14:27 lsf_readme.txt +
--rwxr--r-- 1 hmeij its    657 Oct 31 10:08 run.serial +
--rwxr-xr-x 1 hmeij its   7298 Oct 17 14:16 t-20001030-01 +
-[1]+  Done                    ./run.serial +
-[hmeij@n33 blcr]$ tail -1 context +
-************************************************************************************************************************************************************* +
-[hmeij@n33 blcr]$ tail -1 context | wc -c +
-158+
  
-[hmeij@sharptail ~]$ ll /sanscratch/62322 
-total 16 
--rwx------ 1 hmeij its 1796 Oct 31 11:06 1383231850.62322 
--rw------- 1 hmeij its    0 Oct 31 11:06 1383231850.62322.err 
--rw------- 1 hmeij its    0 Oct 31 11:06 1383231850.62322.out 
--rwxr--r-- 1 hmeij its 1457 Oct 31 11:07 1383231850.62322.shell 
--rw-r--r-- 1 hmeij its    0 Oct 31 11:07 context 
--rwxr-xr-x 1 hmeij its 7298 Oct 17 14:16 t-20001030-01 
-[hmeij@sharptail ~]$ ll ~/.ls 
-ls: cannot access /home/hmeij/.ls: No such file or directory 
-[hmeij@sharptail ~]$ ll ~/.lsbatch/ 
-total 0 
-lrwxrwxrwx 1 hmeij its 34 Oct 31 11:06 1383231850.62322 -> 
-/sanscratch/62322/1383231850.62322 
-lrwxrwxrwx 1 hmeij its 38 Oct 31 11:06 1383231850.62322.err -> 
-/sanscratch/62322/1383231850.62322.err 
-lrwxrwxrwx 1 hmeij its 38 Oct 31 11:06 1383231850.62322.out -> 
-/sanscratch/62322/1383231850.62322.out 
-lrwxrwxrwx 1 hmeij its 40 Oct 31 11:06 1383231850.62322.shell -> 
-/sanscratch/62322/1383231850.62322.shell 
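
The NUL bytes (displayed as ''^@'') at the checkpoint/restart boundary in ''context'' can be removed with ''tr'', as the note in the listing above suggests. This is a minimal sketch only: the function name ''strip_nuls'' and the output file name ''context.clean'' are illustrative and not part of the original page.

```shell
#!/bin/bash
# strip_nuls: write a file to stdout with all NUL (\000) bytes removed.
# The function name is hypothetical; only 'tr -d' does the actual work.
strip_nuls() {
    tr -d '\000' < "$1"
}

# Example usage (context.clean is an illustrative file name):
#   strip_nuls context > context.clean
```

All other bytes pass through unchanged, so the rows of asterisks on either side of the boundary should join up.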
  
 +Now we can write a batch script for the scheduler.  We need to do several things:
 +
 +  * The job will always end up in /sanscratch/JOBPID so we need to stage and save our data
 +  * The checkpoint file should be written to a safe place, like /home
 +  * The time interval for checkpointing should be sufficiently large to not slow the job down
 +    * for example set it to 12 hours or 24 hours even
 +    * the small interval times in the script are just for testing
 +  * Then there are 2 blocks of lines to (un)comment
 +    * One to invoke ''cr_run''
 +    * One to invoke ''cr_restart''
 +  * For a restart we need two things
 +    * Create a link from the old working directory to the new working directory (saved in the pwd text file)
 +    * And edit the script: change the comment blocks and edit the process_id
 +      * The restart job may end up on another node but will have the same process_id
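
The link step in the list above can be sketched as a small helper. This is an illustration under assumptions, not the author's method: the function name ''link_old_workdir'' is hypothetical, and it assumes the old working directory was saved as a single line written by ''pwd''. It refuses to replace a path that already exists.

```shell
#!/bin/bash
# link_old_workdir: make the working directory recorded at first start
# resolve to the new job directory, so paths stored in the checkpoint
# still exist on restart. (Hypothetical helper name.)
#   $1 = file holding the old working directory (one line, from 'pwd')
#   $2 = new working directory
link_old_workdir() {
    local old_dir
    old_dir=$(cat "$1") || return 1
    # refuse to clobber anything that is already there
    if [ -e "$old_dir" ] || [ -L "$old_dir" ]; then
        echo "refusing to replace existing $old_dir" >&2
        return 1
    fi
    ln -s "$2" "$old_dir"
}

# Example usage with the names from run.serial:
#   link_old_workdir ~/blcr/pwd.$process_id "$MYSANSCRATCH"
```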
 +
 +After you have restarted, you can observe the tool starting from the checkpoint file you are pointing to.  To simulate a crash while your first submission is running with ''cr_run'', find the node it is running on and the process ID (in the ''out'' file), then issue the command ''ssh node_name kill process_id'' and wait for the next iteration of the while loop to end the job.  The scheduler will think the job terminated fine (job status DONE). Or just issue a ''bkill'' command; you should be able to recover from that too.
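
To look up the process ID for the crash test, the ''process_id=...'' line that run.serial echoes into the job output can be parsed. A minimal sketch, assuming the output file contains that line exactly as echoed; the function name ''pid_from_out'' is hypothetical and ''node_name'' remains a placeholder for the execution host.

```shell
#!/bin/bash
# pid_from_out: print the last process ID recorded in a job output file.
# Assumes lines of the form 'process_id=12345', as written by the
# echo statement in run.serial. (Hypothetical helper name.)
pid_from_out() {
    sed -n 's/^process_id=\([0-9][0-9]*\)$/\1/p' "$1" | tail -n 1
}

# Example usage (node_name is a placeholder):
#   pid=$(pid_from_out out)
#   ssh node_name kill "$pid"
```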
 +
 +It would be even sweeter if the scheduler could be told to do all the checkpointing at intervals.  I'm investigating that, but in the meantime you can do it manually.
 +
 +
 +**run.serial**
 +
 +<code>
 +
 +#!/bin/bash 
 +# submit via 'bsub < run.serial'
 +rm -f *err *out *shell
 +#BSUB -q test
 +#BSUB -n 1
 +#BSUB -J test
 +#BSUB -o out
 +#BSUB -e err
 +
 +export PATH=/share/apps/blcr/0.8.5/test/bin:$PATH
 +export LD_LIBRARY_PATH=/share/apps/blcr/0.8.5/test/lib:$LD_LIBRARY_PATH
 +
 +# checkpoint file is defined in while loop
 +MYSANSCRATCH=/sanscratch/$LSB_JOBID
 +MYLOCALSCRATCH=/localscratch/$LSB_JOBID
 +export MYSANSCRATCH MYLOCALSCRATCH
 +cd $MYSANSCRATCH
 +
 +# stage the application (plus data if needed)
 +cp -rp ~/blcr/t-20001030-01 .
 +
 +# on first start of application, remember the working directory
 +# save some stuff for checking later and restart
 +#cr_run ./t-20001030-01 > context 2>&1 &
 +#sleep 60
 +#process_id=`ps -u hmeij | grep t-20001030-01 | grep -v grep | awk '{print $1}'`
 +#pwd > pwd.$process_id
 +#cp -p pwd* *.shell *.out *.err ~/blcr/
 +
 +# on restart, give cr_restart some time to set up
 +# WARNING: it will overwrite the checkpoint file, save it
 +# you need to find the process_id and supply it
 +process_id=4711
 +cp -p ~/blcr/checkpoint.$process_id ~/blcr/checkpoint.$process_id.saved
 +mv ~/blcr/context  ~/blcr/context.save
 +ln -s $MYSANSCRATCH `cat ~/blcr/pwd.$process_id`
 +cr_restart ~/blcr/checkpoint.$process_id > context 2>&1 &
 +sleep 60
 +
 +# always uncommented
 +echo "process_id=$process_id"
 +while [ $process_id -gt 0 ]; do
 +        # checkpoint time interval, make it very large (small for testing)
 +        sleep 120
 +        # save the checkpoint file outside of sanscratch
 +        cr_checkpoint -f ~/blcr/checkpoint.$process_id $process_id
 +        cp -p context ~/blcr/
 +        # if the application has crashed, or finished, exit
 +        # (use a separate variable so $process_id still names the pwd file)
 +        running_id=`ps -u hmeij | grep t-20001030-01 | grep -v grep | awk '{print $1}'`
 +        if [ "${running_id}x" = 'x' ]; then
 +                rm -f `cat ~/blcr/pwd.$process_id`
 +                exit;
 +        fi 
 +done
 +
 +
 +
 +</code>
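
As a possible variation on the monitoring loop in run.serial, the liveness check can test the known PID directly with ''kill -0'' instead of parsing ''ps'' output. This is a sketch under assumptions, not the script the page uses: the function name ''checkpoint_loop'' is hypothetical, and the checkpoint command is passed in as arguments. Note that ''kill -0'' would also succeed if the PID were reused by an unrelated process of the same user.

```shell
#!/bin/bash
# checkpoint_loop: run a command every INTERVAL seconds for as long as
# process PID is alive, then return. (Hypothetical helper name.)
#   $1 = PID to watch, $2 = interval in seconds, remaining args = command
checkpoint_loop() {
    local pid=$1 interval=$2
    shift 2
    # kill -0 sends no signal; it only tests whether the PID exists
    while kill -0 "$pid" 2>/dev/null; do
        sleep "$interval"
        # the process may have exited during the sleep
        kill -0 "$pid" 2>/dev/null || break
        "$@" || echo "checkpoint command failed" >&2
    done
}

# Example usage with the names from run.serial (43200 s = 12 hours):
#   checkpoint_loop "$process_id" 43200 \
#       cr_checkpoint -f ~/blcr/checkpoint.$process_id "$process_id"
```

Passing the command as arguments keeps the loop independent of BLCR, so it can be tried out with a harmless command before pointing it at ''cr_checkpoint''.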
  
 \\ \\
 **[[cluster:0|Back]]** **[[cluster:0|Back]]**
  
cluster/124.1383242203.txt.gz · Last modified: 2013/10/31 17:56 by hmeij