cluster:124

Differences

This shows you the differences between two versions of the page.

cluster:124 [2013/10/31 18:20]
hmeij
cluster:124 [2016/03/11 20:14] (current)
hmeij07
Line 1: Line 1:
 \\ \\
 **[[cluster:0|Back]]** **[[cluster:0|Back]]**
 +
 +Queue ''tinymem'' supports BLCR
 + --- //[[hmeij@wesleyan.edu|Henk]] 2016/03/03 13:57//
 +
 +Adjust your PATH and LD_LIBRARY_PATH accordingly.
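As a sketch of that adjustment, the snippet below prepends a BLCR install to both variables. The prefix is the one ''run.serial'' on this page uses; whether the ''tinymem'' nodes use the same location is an assumption, so check the actual path first. ''BLCR_HOME'' is a name made up for this example.

```shell
# Assumption: BLCR lives under this prefix (the path run.serial uses).
# BLCR_HOME is a hypothetical variable name; override it per queue if needed.
BLCR_HOME=${BLCR_HOME:-/share/apps/blcr/0.8.5/test}

# Prepend the BLCR bin and lib directories. The ${VAR:+...} form avoids a
# trailing colon (an empty entry) when LD_LIBRARY_PATH was empty before.
export PATH="$BLCR_HOME/bin:$PATH"
export LD_LIBRARY_PATH="$BLCR_HOME/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"

# Quick check that the BLCR tools now resolve
command -v cr_run || echo "cr_run not found on PATH"
```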
  
 ==== BLCR ==== ==== BLCR ====
Line 73: Line 78:
 [hmeij@n33 blcr]$ mv context context.save [hmeij@n33 blcr]$ mv context context.save
  
-<code>+</code>
  
 Ok.  Next we use ''cr_restart'' to restart our application by pointing it to the checkpoint file generated.  Then we'll wait a bit and terminate the restart. Ok.  Next we use ''cr_restart'' to restart our application by pointing it to the checkpoint file generated.  Then we'll wait a bit and terminate the restart.
Line 127: Line 132:
  
  
 +Now we can write a batch script for the scheduler.  We need to do several things:
  
-[hmeij@n33 blcr]$ ./run.serial& +  * The job will always end up in /sanscratch/JOBPID so we need to stage and save our data 
-[1] 2082 +  * The checkpoint file should be written to a safe place, like /home 
-[hmeij@n33 blcr]$ process_id=2084 +  * The time interval for checkpointing should be large enough not to slow the job down 
-sleep 140; kill 2084 +    * for example, set it to 12 or even 24 hours 
-[hmeij@n33 blcr]$ ./run.serial: line 24:  2084 Terminated              cr_run +    * the short intervals in the script are just for testing 
-./t-20001030-01 > context 2>&1 +  * Then there are 2 blocks of lines to (un)comment 
-Checkpoint failed: no processes checkpointed +    * One to invoke ''cr_run'' 
-ll +    * One to invoke ''cr_restart'' 
-total 344 +  * For a restart we need two things 
--r-------- 1 hmeij its 180798 Oct 31 10:22 checkpoint.2084 +    * Create a link from the old working directory to the new working directory (saved in the pwd text file) 
--rw-r--r-- 1 hmeij its  12560 Oct 31 10:23 context +    * Edit the script: swap the comment blocks and set the process_id 
--rw-r--r-- 1 hmeij its   5643 Oct 31 10:18 info.txt +      * The restart job may end up on another node but will have the same process_id
--rw-r--r-- 1 hmeij its   2867 Oct 30 14:27 lsf_readme.txt +
--rwxr--r-- 1 hmeij its    657 Oct 31 10:08 run.serial +
--rwxr-xr-x 1 hmeij its   7298 Oct 17 14:16 t-20001030-01 +
-[1]+  Done                    ./run.serial +
-[hmeij@n33 blcr]$ tail -1 context +
-************************************************************************************************************************************************************* +
-[hmeij@n33 blcr]$ tail -1 context | wc -c +
-158+
  
-[hmeij@sharptail ~]$ ll /sanscratch/62322 +After you have restarted, you can observe the tool starting from the checkpoint file you are pointing to.  To simulate a crash while your first submission is running with ''cr_run'', find the node it is running on and the process ID (in the file *out), then issue the command ''ssh node_name kill process_id'' and wait for the next iteration of the while loop to end the job.  The scheduler will think the job terminated fine (job status DONE).  Or just issue a ''bkill'' command; you should be able to recover from it too.
-total 16 +
--rwx------ 1 hmeij its 1796 Oct 31 11:06 1383231850.62322 +
--rw------- 1 hmeij its    0 Oct 31 11:06 1383231850.62322.err +
--rw------- 1 hmeij its    0 Oct 31 11:06 1383231850.62322.out +
--rwxr--r-- 1 hmeij its 1457 Oct 31 11:07 1383231850.62322.shell +
--rw-r--r-- 1 hmeij its    0 Oct 31 11:07 context +
--rwxr-xr-x 1 hmeij its 7298 Oct 17 14:16 t-20001030-01 +
-[hmeij@sharptail ~]$ ll ~/.ls +
-ls: cannot access /home/hmeij/.ls: No such file or directory +
-[hmeij@sharptail ~]$ ll ~/.lsbatch/ +
-total 0 +
-lrwxrwxrwx 1 hmeij its 34 Oct 31 11:06 1383231850.62322 -> +
-/sanscratch/62322/1383231850.62322 +
-lrwxrwxrwx 1 hmeij its 38 Oct 31 11:06 1383231850.62322.err -> +
-/sanscratch/62322/1383231850.62322.err +
-lrwxrwxrwx 1 hmeij its 38 Oct 31 11:06 1383231850.62322.out -> +
-/sanscratch/62322/1383231850.62322.out +
-lrwxrwxrwx 1 hmeij its 40 Oct 31 11:06 1383231850.62322.shell -> +
-/sanscratch/62322/1383231850.62322.shell+
  
 +It would be even sweeter if the scheduler could be told to do all the checkpointing at intervals.  I'm investigating that, but in the meantime you can do it manually.
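The manual restart preparation listed above (save the checkpoint file, then link the old working directory to the new one) could be wrapped in a small function, roughly like this. ''prepare_restart'' and its arguments are hypothetical names for illustration; the file names ''checkpoint.PID'' and ''pwd.PID'' follow the ''run.serial'' script on this page.

```shell
# prepare_restart SAVE_DIR PROCESS_ID NEW_WORKDIR
# Hypothetical helper: backs up the checkpoint file and links the old
# working directory (recorded in pwd.PROCESS_ID) to the new one.
prepare_restart() {
    local save_dir=$1 pid=$2 new_workdir=$3
    local ckpt="$save_dir/checkpoint.$pid"
    local pwd_file="$save_dir/pwd.$pid"

    # Refuse to continue if either file from the first run is missing
    [ -f "$ckpt" ] || { echo "missing $ckpt" >&2; return 1; }
    [ -f "$pwd_file" ] || { echo "missing $pwd_file" >&2; return 1; }

    # Later checkpoints overwrite this file, so keep a copy first
    cp -p "$ckpt" "$ckpt.saved" || return 1

    # Recreate the old path as a symlink to the new working directory
    local old_workdir
    old_workdir=$(cat "$pwd_file")
    if [ "$old_workdir" != "$new_workdir" ]; then
        ln -sfn "$new_workdir" "$old_workdir" || return 1
    fi
}
```

In the batch script this would be called as ''prepare_restart ~/blcr 4711 "$MYSANSCRATCH"'' before ''cr_restart'', with 4711 replaced by your own process ID. Note that if the old path still exists as a real directory, ''ln'' places the link inside it rather than replacing it.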
 +
 +
 +**run.serial**
 +
 +<code>
 +
 +#!/bin/bash 
 +# submit via 'bsub < run.serial'
 +rm -f *err *out *shell
 +#BSUB -q test
 +#BSUB -n 1
 +#BSUB -J test
 +#BSUB -o out
 +#BSUB -e err
 +
 +export PATH=/share/apps/blcr/0.8.5/test/bin:$PATH
 +export LD_LIBRARY_PATH=/share/apps/blcr/0.8.5/test/lib:$LD_LIBRARY_PATH
 +
 +# checkpoint file is defined in while loop
 +MYSANSCRATCH=/sanscratch/$LSB_JOBID
 +MYLOCALSCRATCH=/localscratch/$LSB_JOBID
 +export MYSANSCRATCH MYLOCALSCRATCH
 +cd $MYSANSCRATCH
 +
 +# stage the application (plus data if needed)
 +cp -rp ~/blcr/t-20001030-01 .
 +
 +# on first start of application, remember the working directory
 +# save some stuff for checking later and restart
 +#cr_run ./t-20001030-01 > context 2>&1 &
 +#sleep 60
 +#process_id=`ps -u hmeij | grep t-20001030-01 | grep -v grep | awk '{print $1}'`
 +#pwd > pwd.$process_id
 +#cp -p pwd* *.shell *.out *.err ~/blcr/
 +
 +# on restart, give cr_restart some time to set up
 +# WARNING: it will overwrite the checkpoint file, save it
 +# you need to find the process_id and supply it
 +process_id=4711
 +cp -p ~/blcr/checkpoint.$process_id ~/blcr/checkpoint.$process_id.saved
 +mv ~/blcr/context  ~/blcr/context.save
 +ln -s $MYSANSCRATCH `cat ~/blcr/pwd.$process_id`
 +cr_restart ~/blcr/checkpoint.$process_id > context 2>&1 &
 +sleep 60
 +
 +# always uncommented
 +echo "process_id=$process_id"
 +while [ $process_id -gt 0 ]; do
 +        # checkpoint time interval, make it very large (small for testing)
 +        sleep 120
 +        # save the checkpoint file outside of sanscratch
 +        cr_checkpoint -f ~/blcr/checkpoint.$process_id $process_id
 +        cp -p context ~/blcr/
 +        # if the application has crashed, or finished, exit
 +        # (use a separate variable so pwd.$process_id still resolves)
 +        running_id=`ps -u hmeij | grep t-20001030-01 | grep -v grep | awk '{print $1}'`
 +        if [ "${running_id}x" = 'x' ]; then
 +                rm -f `cat ~/blcr/pwd.$process_id`
 +                exit;
 +        fi 
 +done
 +
 +
 +
 +</code>
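The loop in the script finds the application by grepping ''ps'' output for its name, which can match more than one process if two copies run under the same user. A possible alternative, sketched here under the assumption that the process ID is already known (as it is after the first start and after a restart), is to test that PID directly. ''is_running'' is a hypothetical helper name.

```shell
# is_running PID
# Succeeds if a process with this PID exists and we are allowed to signal it.
# Signal 0 performs only the existence/permission check; nothing is delivered.
is_running() {
    kill -0 "$1" 2>/dev/null
}

# Example: the current shell is certainly alive
if is_running $$; then
    echo "process $$ is running"
fi
```

Two caveats: the check only works for your own processes (or as root), and it also succeeds for a finished child that has not been reaped yet, so treat it as a sketch rather than a drop-in replacement.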
  
 \\ \\
 **[[cluster:0|Back]]** **[[cluster:0|Back]]**
  
cluster/124.1383243618.txt.gz · Last modified: 2013/10/31 18:20 by hmeij