
cluster:124

Differences

This shows you the differences between two versions of the page.

cluster:124 [2013/10/31 18:20]
hmeij
cluster:124 [2013/10/31 18:32]
hmeij [BLCR]
Line 73: Line 73:
 [hmeij@n33 blcr]$ mv context context.save
  
-<code>
+</code>
  
 Ok.  Next we use ''cr_restart'' to restart our application by pointing it at the checkpoint file that was generated.  Then we wait a bit and terminate the restarted process.
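The restart step can be wrapped in a small function. This is a minimal sketch, not the author's procedure: the function name ''restart_from_checkpoint'' and the ''RESTART_CMD'' override are hypothetical and exist only so the logic can be exercised without BLCR installed. The only BLCR call assumed is ''cr_restart FILE'', as used on this page.

```shell
#!/bin/bash
# Hypothetical helper, not part of the original page.
# RESTART_CMD is an assumed override for dry runs; it defaults to BLCR's cr_restart.
RESTART_CMD=${RESTART_CMD:-cr_restart}

# usage: restart_from_checkpoint CHECKPOINT_FILE LOG_FILE
restart_from_checkpoint() {
    local ckpt=$1 log=$2
    [ -r "$ckpt" ] || { echo "cannot read $ckpt" >&2; return 1; }
    # keep a copy, since a later checkpoint may overwrite the file
    cp -pf "$ckpt" "$ckpt.saved" || return 1
    # preserve the output of the earlier run before it is overwritten
    [ -f "$log" ] && mv "$log" "$log.save"
    # restart in the background and report the pid of the background job
    $RESTART_CMD "$ckpt" > "$log" 2>&1 &
    echo $!
}
```

Called as ''restart_from_checkpoint ~/blcr/checkpoint.2084 context'', it would save the checkpoint file, rename the old ''context'' output, and start the restart in the background.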
Line 127: Line 127:
  
  
 +Now we can write a batch script for the scheduler.  We need to do several things:
  
-[hmeij@n33 blcr]$ ./run.serial&
-[1] 2082
-[hmeij@n33 blcr]$ process_id=2084
-sleep 140; kill 2084
-[hmeij@n33 blcr]$ ./run.serial: line 24:  2084 Terminated              cr_run
-./t-20001030-01 > context 2>&
-Checkpoint failed: no processes checkpointed
-ll
-total 344
--r-------- 1 hmeij its 180798 Oct 31 10:22 checkpoint.2084
--rw-r--r-- 1 hmeij its  12560 Oct 31 10:23 context
--rw-r--r-- 1 hmeij its   5643 Oct 31 10:18 info.txt
--rw-r--r-- 1 hmeij its   2867 Oct 30 14:27 lsf_readme.txt
--rwxr--r-- 1 hmeij its    657 Oct 31 10:08 run.serial
--rwxr-xr-x 1 hmeij its   7298 Oct 17 14:16 t-20001030-01
-[1]+  Done                    ./run.serial
-[hmeij@n33 blcr]$ tail -1 context
-*************************************************************************************************************************************************************
-[hmeij@n33 blcr]$ tail -1 context | wc -c
-158
+  * 
  
-[hmeij@sharptail ~]$ ll /sanscratch/62322 
-total 16 
--rwx------ 1 hmeij its 1796 Oct 31 11:06 1383231850.62322 
--rw------- 1 hmeij its    0 Oct 31 11:06 1383231850.62322.err 
--rw------- 1 hmeij its    0 Oct 31 11:06 1383231850.62322.out 
--rwxr--r-- 1 hmeij its 1457 Oct 31 11:07 1383231850.62322.shell 
--rw-r--r-- 1 hmeij its    0 Oct 31 11:07 context 
--rwxr-xr-x 1 hmeij its 7298 Oct 17 14:16 t-20001030-01 
-[hmeij@sharptail ~]$ ll ~/.ls 
-ls: cannot access /home/hmeij/.ls: No such file or directory 
-[hmeij@sharptail ~]$ ll ~/.lsbatch/ 
-total 0 
-lrwxrwxrwx 1 hmeij its 34 Oct 31 11:06 1383231850.62322 -> 
-/sanscratch/62322/1383231850.62322 
-lrwxrwxrwx 1 hmeij its 38 Oct 31 11:06 1383231850.62322.err -> 
-/sanscratch/62322/1383231850.62322.err 
-lrwxrwxrwx 1 hmeij its 38 Oct 31 11:06 1383231850.62322.out -> 
-/sanscratch/62322/1383231850.62322.out 
-lrwxrwxrwx 1 hmeij its 40 Oct 31 11:06 1383231850.62322.shell -> 
-/sanscratch/62322/1383231850.62322.shell 
  
 +** run.serial**
 +
 +<code>
 +
 +#!/bin/bash 
 +# submit via 'bsub < run.serial'
 +rm -f *err *out *shell
 +#BSUB -q mw256chkpnt
 +#BSUB -n 1
 +#BSUB -J test
 +#BSUB -o out
 +#BSUB -e err
 +
 +export PATH=/share/apps/blcr/0.8.5/mw256/bin:$PATH
 +export LD_LIBRARY_PATH=/share/apps/blcr/0.8.5/mw256/lib:$LD_LIBRARY_PATH
 +
 +# working directory is /sanscratch/JOBID
 +MYSANSCRATCH=/sanscratch/$LSB_JOBID
 +MYLOCALSCRATCH=/localscratch/$LSB_JOBID
 +export MYSANSCRATCH MYLOCALSCRATCH
 +cd $MYSANSCRATCH
 +
 +# stage the application (plus data if needed)
 +cp -rp ~/blcr/t-20001030-01 .
 +
 +# start the application and remember the working directory
 +cr_run ./t-20001030-01 > context 2>&1 &
 +process_id=`ps -u hmeij | grep t-20001030-01 | grep -v grep | awk '{print $1}'`
 +pwd > pwd.$process_id
 +
 +# on restart, uncomment the lines below and give cr_restart some time to set up
 +# WARNING: the next checkpoint will overwrite the checkpoint file, so save a copy first
 +# you need to find the process_id of the original run and supply it
 +#process_id=9089
 +#cp -p ~/blcr/checkpoint.$process_id ~/blcr/checkpoint.$process_id.saved
 +#mv ~/blcr/context  ~/blcr/context.save
 +#ln -s $MYSANSCRATCH `cat ~/blcr/pwd.$process_id`
 +#cr_restart ~/blcr/checkpoint.$process_id > context 2>&1 &
 +#sleep 60
 +
 +echo "process_id=$process_id"
 +while [ $process_id -gt 0 ]; do
 +        # checkpoint time interval, make it an hour or larger (small for testing)
 +        sleep 120
 +        # save the checkpoint file outside of sanscratch
 +        cr_checkpoint -f ~/blcr/checkpoint.$process_id $process_id
 +        # if the application has exited or crashed, exit
 +        # (use a separate variable so process_id still names the pwd file below)
 +        running=`ps -u hmeij | grep t-20001030-01 | grep -v grep | awk '{print $1}'`
 +        if [ "${running}x" = 'x' ]; then
 +                # save some stuff for checking
 +                cp -p pwd* *.shell *.out *.err context ~/blcr/
 +                rm -f `cat ~/blcr/pwd.$process_id`
 +                exit;
 +        fi 
 +done
 +
 +</code>
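The checkpoint loop can also be written so that the liveness test is kept separate from the process id being checkpointed. The sketch below is an illustration under stated assumptions, not the script used on the cluster: the function name ''checkpoint_loop'' and the ''CHECKPOINT_CMD'' override are hypothetical, while ''cr_checkpoint -f FILE PID'' is the call used in ''run.serial''. It tests the process with ''kill -0'', which sends no signal and only reports whether the pid exists, instead of parsing ''ps'' output.

```shell
#!/bin/bash
# Hypothetical helper, not part of the original script.
# CHECKPOINT_CMD is an assumed override for dry runs; it defaults to BLCR's cr_checkpoint.
CHECKPOINT_CMD=${CHECKPOINT_CMD:-cr_checkpoint}

# usage: checkpoint_loop PID INTERVAL_SECONDS CHECKPOINT_FILE
checkpoint_loop() {
    local pid=$1 interval=$2 ckpt=$3
    # kill -0 sends no signal; it only tests whether the process exists
    while kill -0 "$pid" 2>/dev/null; do
        sleep "$interval"
        # skip the checkpoint if the process ended during the sleep
        kill -0 "$pid" 2>/dev/null || break
        # write the checkpoint file; report a failure but keep looping
        $CHECKPOINT_CMD -f "$ckpt" "$pid" || echo "checkpoint of $pid failed" >&2
    done
    echo "process $pid has ended"
}
```

In a batch script this would be called after starting the application, for example ''checkpoint_loop $process_id 3600 ~/blcr/checkpoint.$process_id''. The pid must belong to a child of the calling shell, or to a process that is reaped elsewhere, because a zombie still answers ''kill -0''.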
  
 \\ \\
 **[[cluster:0|Back]]**
  
cluster/124.txt · Last modified: 2016/03/11 20:14 by hmeij07