Differences

This shows you the differences between two versions of the page.

--- cluster:124 [2013/10/31 14:32]
hmeij [BLCR]
+++ cluster:124 [2016/03/11 15:14] (current)
hmeij07
@@ Line 1: / Line 1: @@
 \\
 **[[cluster:0|Back]]**
+Queue ''tinymem'' supports BLCR
+ --- //[[hmeij@wesleyan.edu|Henk]] 2016/03/03 13:57//
+Adjust your PATH and LD_LIBRARY_PATH accordingly.
 ==== BLCR ====
@@ Line 129: / Line 134: @@
 Now we can write a batch script for the scheduler.  We need to do several things
-  *
+  * The job will always end up in /sanscratch/JOBPID so we need to stage and save our data
+  * The checkpoint file should be written to a safe place, like /home
+  * The time interval for checkpointing should be sufficiently large to not slow the job down
+    * for example, set it to 12 or even 24 hours
+    * the small intervals in the script are just for testing
+  * Then there are 2 blocks of lines to (un)comment
+    * One to invoke ''cr_run''
+    * One to invoke ''cr_restart''
+  * For a restart we need two things
+    * Create a link from the old working directory to the new working directory (the old path is saved in the pwd text file)
+    * Edit the script: swap the comment blocks and set the process_id
+      * The restarted job may end up on another node but will have the same process_id
+After you have restarted, you can observe the tool starting from the checkpoint file you are pointing to.  To simulate a crash while your first submission is running with ''cr_run'', find the node it is running on and the process ID (in the file *out), then issue the command ''ssh node_name kill process_id'' and wait for the next while iteration to terminate the program.  The scheduler will think the job terminated fine (job status DONE). Or just issue a ''bkill'' command; you should be able to recover from that too.
+It would be even sweeter if the scheduler could be told to do all the checkpointing at intervals.  I'm investigating that, but in the meantime you can do it manually.
@@ Line 139: / Line 159: @@
 # submit via 'bsub < run.serial'
 rm -f *err *out *shell
-#BSUB -q mw256chkpnt
+#BSUB -q test
 #BSUB -n 1
 #BSUB -J test
@@ Line 145: / Line 165: @@
 #BSUB -e err
-export PATH=/share/apps/blcr/0.8.5/mw256/bin:$PATH
+export PATH=/share/apps/blcr/0.8.5/test/bin:$PATH
-export LD_LIBRARY_PATH=/share/apps/blcr/0.8.5/mw256/lib:$LD_LIBRARY_PATH
+export LD_LIBRARY_PATH=/share/apps/blcr/0.8.5/test/lib:$LD_LIBRARY_PATH
-# checkpoint directory is /sanscratch/JOBPID
+# checkpoint file is defined in while loop
 MYSANSCRATCH=/sanscratch/$LSB_JOBID
 MYLOCALSCRATCH=/localscratch/$LSB_JOBID
@@ Line 157: / Line 177: @@
 cp -rp ~/blcr/t-20001030-01 .
-# start the application and remeber the working directory
+# on first start of application, remember the working directory
-cr_run ./t-20001030-01 > context 2>&1 &
+# save some stuff for checking later and restart
-process_id=`ps -u hmeij | grep t-20001030-01 | grep -v grep | awk '{print $1}'`
+#cr_run ./t-20001030-01 > context 2>&1 &
-pwd > pwd.$process_id
+#sleep 60
+#process_id=`ps -u hmeij | grep t-20001030-01 | grep -v grep | awk '{print $1}'`
+#pwd > pwd.$process_id
+#cp -p pwd* *.shell *.out *.err ~/blcr/
 # on restart, give cr_restart some time to set up
 # WARNING: it will overwrite the checkpoint file, save it
 # you need to find the process_id and supply it
-#process_id=9089
+process_id=4711
-#cp -p ~/blcr/checkpoint.$process_id ~/blcr/checkpoint.$process_id.saved
+cp -p ~/blcr/checkpoint.$process_id ~/blcr/checkpoint.$process_id.saved
-#mv ~/blcr/context  ~/blcr/context.save
+mv ~/blcr/context  ~/blcr/context.save
-#ln -s $MYSANSCRATCH `cat ~/blcr/pwd.$process_id`
+ln -s $MYSANSCRATCH `cat ~/blcr/pwd.$process_id`
-#cr_restart ~/blcr/checkpoint.$process_id > context 2>&1 &
+cr_restart ~/blcr/checkpoint.$process_id > context 2>&1 &
-#sleep 60
+sleep 60
+# always uncommented
 echo "process_id=$process_id"
 while [ $process_id -gt 0 ]; do
-        # checkpoint time interval, make it an hour or larger (small for testing)
+        # checkpoint time interval, make it very large (small for testing)
         sleep 120
         # save the checkpoint file outside of sanscratch
         cr_checkpoint -f ~/blcr/checkpoint.$process_id $process_id
-        # if the application has crashed, exit
+        cp -p context ~/blcr/
+        # if the application has crashed, or finished, exit
         process_id=`ps -u hmeij | grep t-20001030-01 | grep -v grep | awk '{print $1}'`
         if [ "${process_id}x" = 'x' ]; then
-                # save some stuff for checking
-                cp -p pwd* *.shell *.out *.err context ~/blcr/
                 rm -f `cat ~/blcr/pwd.$process_id`
                 exit;
         fi
 done
 </code>
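
The script above switches between first start and restart by (un)commenting two blocks and editing ''process_id'' by hand. As a rough sketch only, the same decision and the working-directory link could be wrapped in two small shell functions. The function names ''choose_mode'' and ''link_old_workdir'' are hypothetical and not part of the page's script; the sketch assumes the ''checkpoint.<pid>'' and ''pwd.<pid>'' file naming used in the script above and does not call any BLCR command itself.

```shell
# Hypothetical helpers, not part of the original batch script.
# Assumed layout (taken from the script above): the checkpoint directory
# holds checkpoint.<pid> and pwd.<pid> files written by an earlier run.

# Print "restart" if a checkpoint exists for the given process ID,
# otherwise print "start".
# $1 = checkpoint directory, $2 = process ID of the earlier run (may be empty)
choose_mode() {
    if [ -n "$2" ] && [ -f "$1/checkpoint.$2" ]; then
        echo restart
    else
        echo start
    fi
}

# Link the old working directory (path stored in pwd.<pid>) to the new
# working directory, so a restarted process finds its files at the old path.
# $1 = checkpoint directory, $2 = old process ID, $3 = new working directory
link_old_workdir() {
    [ -f "$1/pwd.$2" ] || return 1
    ln -s "$3" "$(cat "$1/pwd.$2")"
}
```

In a batch script this might be used as ''mode=$(choose_mode ~/blcr "$process_id")'', followed by ''link_old_workdir ~/blcr "$process_id" "$MYSANSCRATCH"'' when the mode is ''restart''. Whether this is preferable to the comment blocks is a judgment call; the manual version keeps every step visible in the script.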
