Differences

This shows you the differences between two versions of the page.

--- cluster:124 [2013/10/31 18:20]
hmeij
+++ cluster:124 [2016/03/11 20:14] (current)
hmeij07
@@ Line 1: / Line 1: @@
 \\
 **[[cluster:0|Back]]**
+Queue ''tinymem'' supports BLCR
+ --- //[[hmeij@wesleyan.edu|Henk]] 2016/03/03 13:57//
+Adjust your PATH and LD_LIBRARY_PATH accordingly.
 ==== BLCR ====
@@ Line 73: / Line 78: @@
 [hmeij@n33 blcr]$ mv context context.save
-<code>
+</code>
 Ok.  Next we use ''cr_restart'' to restart our application by pointing it to the checkpoint file generated.  Then we'll wait a bit and terminate the restart.
@@ Line 127: / Line 132: @@
+Now we can write a batch script for the scheduler.  We need to do several things:
-[hmeij@n33 blcr]$ ./run.serial&
+  * The job will always end up in /sanscratch/JOBPID so we need to stage and save our data
-[1] 2082
+  * The checkpoint file should be written to a safe place, like /home
-[hmeij@n33 blcr]$ process_id=2084
+  * The time interval for checkpointing should be sufficiently large to not slow the job down
-sleep 140; kill 2084
+    * for example set it to 12 hours or 24 hours even
-[hmeij@n33 blcr]$ ./run.serial: line 24:  2084 Terminated              cr_run
+    * the small interval times in the script are just for testing
-./t-20001030-01 > context 2>&1
+  * Then there are 2 blocks of lines to (un)comment
-Checkpoint failed: no processes checkpointed
+    * One to invoke ''cr_run''
-ll
+    * One to invoke ''cr_restart''
-total 344
+  * For a restart we need two things
--r-------- 1 hmeij its 180798 Oct 31 10:22 checkpoint.2084
+    * Create a link from old working directory to new working directory (saved in the pwd text file)
--rw-r--r-- 1 hmeij its  12560 Oct 31 10:23 context
+    * Edit the script: swap the comment blocks and set the process_id
--rw-r--r-- 1 hmeij its   5643 Oct 31 10:18 info.txt
+      * The restart job may end up on another node but will have the same process_id
--rw-r--r-- 1 hmeij its   2867 Oct 30 14:27 lsf_readme.txt
--rwxr--r-- 1 hmeij its    657 Oct 31 10:08 run.serial
--rwxr-xr-x 1 hmeij its   7298 Oct 17 14:16 t-20001030-01
-[1]+  Done                    ./run.serial
-[hmeij@n33 blcr]$ tail -1 context
-*************************************************************************************************************************************************************
-[hmeij@n33 blcr]$ tail -1 context | wc -c
-[hmeij@sharptail ~]$ ll /sanscratch/62322
+After you have restarted, you can observe the tool starting from the checkpoint file you are pointing to.  To simulate a crash while your first submission is running with ''cr_run'', find the node it is running on and the process ID (in the file *out), then issue the command ''ssh node_name kill process_id'' and wait for the next iteration of the while loop to end the job.  The scheduler will think the job terminated fine (job status DONE).  Or just issue a ''bkill'' command; you should be able to recover from that too.
-total 16
--rwx------ 1 hmeij its 1796 Oct 31 11:06 1383231850.62322
--rw------- 1 hmeij its    0 Oct 31 11:06 1383231850.62322.err
--rw------- 1 hmeij its    0 Oct 31 11:06 1383231850.62322.out
--rwxr--r-- 1 hmeij its 1457 Oct 31 11:07 1383231850.62322.shell
--rw-r--r-- 1 hmeij its    0 Oct 31 11:07 context
--rwxr-xr-x 1 hmeij its 7298 Oct 17 14:16 t-20001030-01
-[hmeij@sharptail ~]$ ll ~/.ls
-ls: cannot access /home/hmeij/.ls: No such file or directory
-[hmeij@sharptail ~]$ ll ~/.lsbatch/
-total 0
-lrwxrwxrwx 1 hmeij its 34 Oct 31 11:06 1383231850.62322 ->
-/sanscratch/62322/1383231850.62322
-lrwxrwxrwx 1 hmeij its 38 Oct 31 11:06 1383231850.62322.err ->
-/sanscratch/62322/1383231850.62322.err
-lrwxrwxrwx 1 hmeij its 38 Oct 31 11:06 1383231850.62322.out ->
-/sanscratch/62322/1383231850.62322.out
-lrwxrwxrwx 1 hmeij its 40 Oct 31 11:06 1383231850.62322.shell ->
-/sanscratch/62322/1383231850.62322.shell
+It would be even sweeter if the scheduler could be told to do all the checkpointing at intervals.  I'm investigating that, but in the meantime you can do it manually.
+**run.serial**
+<code>
+#!/bin/bash
+# submit via 'bsub < run.serial'
+rm -f *err *out *shell
+#BSUB -q test
+#BSUB -n 1
+#BSUB -J test
+#BSUB -o out
+#BSUB -e err
+export PATH=/share/apps/blcr/0.8.5/test/bin:$PATH
+export LD_LIBRARY_PATH=/share/apps/blcr/0.8.5/test/lib:$LD_LIBRARY_PATH
+# checkpoint file is defined in while loop
+MYSANSCRATCH=/sanscratch/$LSB_JOBID
+MYLOCALSCRATCH=/localscratch/$LSB_JOBID
+export MYSANSCRATCH MYLOCALSCRATCH
+cd $MYSANSCRATCH
+# stage the application (plus data if needed)
+cp -rp ~/blcr/t-20001030-01 .
+# on first start of application, remember the working directory
+# save some stuff for checking later and restart
+#cr_run ./t-20001030-01 > context 2>&1 &
+#sleep 60
+#process_id=`ps -u hmeij | grep t-20001030-01 | grep -v grep | awk '{print $1}'`
+#pwd > pwd.$process_id
+#cp -p pwd* *.shell *.out *.err ~/blcr/
+# on restart, give cr_restart some time to set up
+# WARNING: it will overwrite the checkpoint file, save it
+# you need to find the process_id and supply it
+process_id=4711
+cp -p ~/blcr/checkpoint.$process_id ~/blcr/checkpoint.$process_id.saved
+mv ~/blcr/context  ~/blcr/context.save
+ln -s $MYSANSCRATCH `cat ~/blcr/pwd.$process_id`
+cr_restart ~/blcr/checkpoint.$process_id > context 2>&1 &
+sleep 60
+# always uncommented
+echo "process_id=$process_id"
+while [ $process_id -gt 0 ]; do
+        # checkpoint time interval, make it very large (small for testing)
+        sleep 120
+        # save the checkpoint file outside of sanscratch
+        cr_checkpoint -f ~/blcr/checkpoint.$process_id $process_id
+        cp -p context ~/blcr/
+        # if the application has crashed, or finished, exit
+        # remember the id first: the pwd file is named after it
+        old_process_id=$process_id
+        process_id=`ps -u hmeij | grep t-20001030-01 | grep -v grep | awk '{print $1}'`
+        if [ "${process_id}x" = 'x' ]; then
+                rm -f `cat ~/blcr/pwd.$old_process_id`
+                exit;
+        fi
+done
+</code>
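The checkpoint loop in the script above can also be written as a small reusable function. The following is only a sketch under stated assumptions, not part of the tested script: the names ''CKPT_INTERVAL'', ''CKPT_DIR'', ''is_alive'' and ''checkpoint_loop'' are hypothetical, and the liveness test uses ''kill -0'' on the process ID instead of ''ps | grep'', so it only checks that the ID exists, not that it still belongs to the application.

```shell
#!/bin/bash
# Sketch only: CKPT_INTERVAL, CKPT_DIR, is_alive and checkpoint_loop
# are hypothetical names, not taken from the original run.serial.

# checkpoint interval in seconds; default 12 hours, set it small for testing
CKPT_INTERVAL=${CKPT_INTERVAL:-43200}
# where checkpoint files are kept, outside of /sanscratch
CKPT_DIR=${CKPT_DIR:-$HOME/blcr}

# return 0 if a process with this id exists and we are allowed to signal it
is_alive() {
    kill -0 "$1" 2>/dev/null
}

# checkpoint the process at every interval until it is gone
# $1 = process id, $2 = checkpoint command (defaults to cr_checkpoint)
checkpoint_loop() {
    local pid=$1
    local ckpt_cmd=${2:-cr_checkpoint}
    while is_alive "$pid"; do
        sleep "$CKPT_INTERVAL"
        # the process may have finished or crashed while we slept
        is_alive "$pid" || break
        # same call as in run.serial: write the checkpoint file to a safe place
        "$ckpt_cmd" -f "$CKPT_DIR/checkpoint.$pid" "$pid"
    done
    echo "process $pid is gone, leaving checkpoint loop"
}
```

In a job script this might be called as ''checkpoint_loop $process_id'' after the ''cr_run'' or ''cr_restart'' block. Because the process ID is kept in a local variable and never overwritten, it is still available afterwards for cleaning up the ''pwd.$process_id'' link.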
 \\
 **[[cluster:0|Back]]**
