Differences

This shows you the differences between two versions of the page.

--- cluster:124 [2013/10/31 14:53]
hmeij
+++ cluster:124 [2013/10/31 15:22]
hmeij
@@ Line 142: / Line 142: @@
       * The restart job may end up on another node but will have the same process_id
-After you have restarted, you can observe the tool starting from the checkpoint file you are pointing to.  To simulate a crash, while your first submission is running with ''cr_run'' you can simply find the node it is running on and the process ID (in the file *out) then issue the command ''ssh node_name kill process_id'' and wait for the next while iteration to terminate the program.  The scheduler will think the job terminate fine (job status DONE).
+After you have restarted, you can observe the tool starting from the checkpoint file you are pointing to.  To simulate a crash while your first submission is running with ''cr_run'', find the node it is running on and the process ID (in the file *out), then issue the command ''ssh node_name kill process_id'' and wait for the next while iteration to terminate the program.  The scheduler will think the job terminated fine (job status DONE). Or just issue a ''bkill'' command; you should be able to recover from that too.
 It would be even sweeter if the scheduler could be told to do all the checkpointing at intervals.  I'm investigating that, but in the meantime you can do it manually.
@@ Line 172: / Line 172: @@
 cp -rp ~/blcr/t-20001030-01 .
-# start the application and remeber the working directory
+# start the application and remember the working directory
+# save some stuff for checking later and restart
 cr_run ./t-20001030-01 > context 2>&1 &
 process_id=`ps -u hmeij | grep t-20001030-01 | grep -v grep | awk '{print $1}'`
 pwd > pwd.$process_id
+cp -p pwd* *.shell *.out *.err ~/blcr/
 # on restart, give cr_restart some time to set up
@@ Line 191: / Line 193: @@
         # checkpoint time interval, make it very large (small for testing)
         sleep 120
-        # save the checkpoint file outside of /sanscratch
+        # save the checkpoint file outside of sanscratch
         cr_checkpoint -f ~/blcr/checkpoint.$process_id $process_id
+        cp -p context ~/blcr/
         # if the application has crashed, exit
         process_id=`ps -u hmeij | grep t-20001030-01 | grep -v grep | awk '{print $1}'`
         if [ "${process_id}x" = 'x' ]; then
-                # save some stuff for checking later
-                cp -p pwd* *.shell *.out *.err context ~/blcr/
                 rm -f `cat ~/blcr/pwd.$process_id`
                 exit;
         fi
 done
 </code>
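
The manual checkpoint loop in the script above can also be sketched as a small reusable function. This is only an illustration, not the page author's script: ''watch_and_checkpoint'' and the ''CHECKPOINT_CMD'' override are hypothetical names introduced here, and the liveness test uses ''kill -0'' instead of the ''ps | grep'' pipeline, which assumes the watcher runs on the same node and as the same user as the application.

```shell
# Sketch only: watch_and_checkpoint is a hypothetical helper, not part of BLCR.
# CHECKPOINT_CMD is an assumed override so the loop can be dry-run without BLCR;
# it defaults to cr_checkpoint, called the same way as in the script above.
watch_and_checkpoint() {
    pid=$1        # process ID of the application started with cr_run
    interval=$2   # seconds between checkpoints (large in production, small for testing)
    dest=$3       # directory outside of scratch space for checkpoint files
    ckpt=${CHECKPOINT_CMD:-cr_checkpoint}

    # kill -0 sends no signal; it only reports whether the process still exists
    while kill -0 "$pid" 2>/dev/null; do
        sleep "$interval"
        # stop looping if the checkpoint fails, e.g. the application has exited
        "$ckpt" -f "$dest/checkpoint.$pid" "$pid" || break
    done
}
```

In a job script this might be called as ''watch_and_checkpoint $process_id 120 ~/blcr'' after the ''cr_run'' line. Cleanup of the working directory would still have to follow the loop, as in the original script.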
