cluster:124


Differences

This shows you the differences between two versions of the page.

cluster:124 [2013/10/31 14:32]
hmeij [BLCR]
cluster:124 [2016/03/11 15:14] (current)
hmeij07
Line 1: Line 1:
 \\ \\
 **[[cluster:0|Back]]** **[[cluster:0|Back]]**
 +
 +Queue ''tinymem'' supports BLCR
 + --- //[[hmeij@wesleyan.edu|Henk]] 2016/03/03 13:57//
 +
 +Adjust your PATH and LD_LIBRARY_PATH accordingly.
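
A minimal sketch of that adjustment, assuming a BLCR install prefix laid out like the ones in the batch script on this page (''bin'' and ''lib'' below the prefix). The helper name ''use_blcr'' is hypothetical, and the prefix shown is the one from the script, not a confirmed location for ''tinymem''; substitute the directory that matches your queue.

```shell
# Hypothetical helper: prepend a BLCR install prefix to PATH and LD_LIBRARY_PATH.
use_blcr() {
    blcr_prefix=$1
    PATH="$blcr_prefix/bin:$PATH"
    # avoid a trailing colon when LD_LIBRARY_PATH was previously unset
    LD_LIBRARY_PATH="$blcr_prefix/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
    export PATH LD_LIBRARY_PATH
}

# Assumed prefix, copied from the batch script on this page; adjust per queue.
use_blcr /share/apps/blcr/0.8.5/test
```

Prepending rather than appending means the BLCR binaries for the chosen queue are found before any other copy that may be on the path.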
  
 ==== BLCR ==== ==== BLCR ====
Line 129: Line 134:
 Now we can write a batch script for the scheduler.  We need to do several things Now we can write a batch script for the scheduler.  We need to do several things
  
-  * +  * The job will always end up in /sanscratch/JOBPID so we need to stage and save our data 
 +  * The checkpoint file should be written to a safe place, like /home 
 +  * The time interval for checkpointing should be sufficiently large to not slow the job down 
 +    * for example, set it to 12 or even 24 hours 
 +    * the small interval in the script is just for testing 
 +  * Then there are 2 blocks of lines to (un)comment 
 +    * One to invoke ''cr_run'' 
 +    * One to invoke ''cr_restart'' 
 +  * For a restart we need two things 
 +    * Create a link from the old working directory to the new working directory (the old path is saved in the pwd text file) 
 +    * Edit the script: swap the comment blocks and set the process_id 
 +      * The restarted job may end up on another node but will have the same process_id 
 + 
 +After you have restarted, you can observe the tool starting from the checkpoint file you are pointing to.  To simulate a crash while your first submission is running with ''cr_run'', find the node it is running on and the process ID (in the file *out), then issue the command ''ssh node_name kill process_id'' and wait for the next iteration of the while loop to end the job.  The scheduler will think the job terminated fine (job status DONE). Or just issue a ''bkill'' command; you should be able to recover from that too. 
 + 
 +It would be even sweeter if the scheduler could be told to do all the checkpointing at intervals.  I'm investigating that, but in the meantime you can do it manually.
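
The two manual restart steps listed above (keep a copy of the checkpoint file, then link the old working directory to the new one) can be sketched as a small function. This is an illustration only: ''prepare_restart'' and its three arguments are hypothetical names, it assumes the ''checkpoint.<pid>'' and ''pwd.<pid>'' files written by the script on this page, and it deliberately stops short of calling ''cr_restart''.

```shell
# Sketch of the restart preparation. prepare_restart is a hypothetical helper,
# not part of BLCR; it only stages files and does not call cr_restart.
prepare_restart() {
    blcr_dir=$1      # safe place holding checkpoint.<pid> and pwd.<pid>
    process_id=$2    # process ID recorded by the first submission
    new_workdir=$3   # working directory of the restarted job

    # the checkpoint file will be overwritten later, so keep a copy
    cp -p "$blcr_dir/checkpoint.$process_id" \
          "$blcr_dir/checkpoint.$process_id.saved" || return 1

    # the first run stored its working directory in pwd.<pid>
    old_workdir=$(cat "$blcr_dir/pwd.$process_id") || return 1

    # make the old path resolve to the new working directory
    ln -s "$new_workdir" "$old_workdir"
}
```

In the batch script this would correspond to something like ''prepare_restart ~/blcr 4711 $MYSANSCRATCH'', followed by the ''cr_restart'' line.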
  
  
Line 139: Line 159:
 # submit via 'bsub < run.serial' # submit via 'bsub < run.serial'
 rm -f *err *out *shell rm -f *err *out *shell
-#BSUB -q mw256chkpnt+#BSUB -q test
 #BSUB -n 1 #BSUB -n 1
 #BSUB -J test #BSUB -J test
Line 145: Line 165:
 #BSUB -e err #BSUB -e err
  
-export PATH=/share/apps/blcr/0.8.5/mw256/bin:$PATH +export PATH=/share/apps/blcr/0.8.5/test/bin:$PATH 
-export LD_LIBRARY_PATH=/share/apps/blcr/0.8.5/mw256/lib:$LD_LIBRARY_PATH+export LD_LIBRARY_PATH=/share/apps/blcr/0.8.5/test/lib:$LD_LIBRARY_PATH
  
-# checkpoint directory is /sanscratch/JOBPID+# checkpoint file is defined in while loop
 MYSANSCRATCH=/sanscratch/$LSB_JOBID MYSANSCRATCH=/sanscratch/$LSB_JOBID
 MYLOCALSCRATCH=/localscratch/$LSB_JOBID MYLOCALSCRATCH=/localscratch/$LSB_JOBID
Line 157: Line 177:
 cp -rp ~/blcr/t-20001030-01 . cp -rp ~/blcr/t-20001030-01 .
  
-# start the application and remeber the working directory +# on first start of application, remember the working directory 
-cr_run ./t-20001030-01 > context 2>&1 & +# save some stuff for checking later and restart 
-process_id=`ps -u hmeij | grep t-20001030-01 | grep -v grep | awk '{print $1}'` +#cr_run ./t-20001030-01 > context 2>&1 &
-pwd > pwd.$process_id +#sleep 60
 +#process_id=`ps -u hmeij | grep t-20001030-01 | grep -v grep | awk '{print $1}'`
 +#pwd > pwd.$process_id 
 +#cp -p pwd* *.shell *.out *.err ~/blcr/
  
 # on restart, give cr_restart some time to set up # on restart, give cr_restart some time to set up
 # WARNING: it will overwrite the checkpoint file, save it # WARNING: it will overwrite the checkpoint file, save it
 # you need to find the process_id and supply it # you need to find the process_id and supply it
-#process_id=9089 +process_id=4711 
-#cp -p ~/blcr/checkpoint.$process_id ~/blcr/checkpoint.$process_id.saved +cp -p ~/blcr/checkpoint.$process_id ~/blcr/checkpoint.$process_id.saved 
-#mv ~/blcr/context  ~/blcr/context.save +mv ~/blcr/context  ~/blcr/context.save 
-#ln -s $MYSANSCRATCH `cat ~/blcr/pwd.$process_id` +ln -s $MYSANSCRATCH `cat ~/blcr/pwd.$process_id` 
-#cr_restart ~/blcr/checkpoint.$process_id > context 2>&1 & +cr_restart ~/blcr/checkpoint.$process_id > context 2>&1 & 
-#sleep 60+sleep 60
  
 +# always uncommented
 echo "process_id=$process_id" echo "process_id=$process_id"
 while [ $process_id -gt 0 ]; do while [ $process_id -gt 0 ]; do
-        # checkpoint time interval, make it an hour or larger (small for testing)+        # checkpoint time interval, make it very large (small for testing)
         sleep 120         sleep 120
         # save the checkpoint file outside of sanscratch         # save the checkpoint file outside of sanscratch
         cr_checkpoint -f ~/blcr/checkpoint.$process_id $process_id         cr_checkpoint -f ~/blcr/checkpoint.$process_id $process_id
-        # if the application has crashed, exit+        cp -p context ~/blcr/ 
 +        # if the application has crashed, or finished, exit
         process_id=`ps -u hmeij | grep t-20001030-01 | grep -v grep | awk '{print $1}'`         process_id=`ps -u hmeij | grep t-20001030-01 | grep -v grep | awk '{print $1}'`
         if [ "${process_id}x" = 'x' ]; then         if [ "${process_id}x" = 'x' ]; then
-                # save some stuff for checking 
-                cp -p pwd* *.shell *.out *.err context ~/blcr/ 
                 rm -f `cat ~/blcr/pwd.$process_id`                 rm -f `cat ~/blcr/pwd.$process_id`
                 exit;                 exit;
         fi          fi 
 done done
 +
 +
  
 </code> </code>
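
The checkpoint loop in the script above can also be written with the interval as a parameter, so the short test value and a production value of 12 or 24 hours share one code path. The sketch below is a variation, not the script used on this page: ''checkpoint_loop'' and ''CHECKPOINT_CMD'' are hypothetical names, and it swaps the ''ps | grep | awk'' lookup for ''kill -0'', which only tests whether a process ID still exists and only works for processes owned by the same user.

```shell
# Sketch of the checkpoint loop with a configurable interval.
# checkpoint_loop and CHECKPOINT_CMD are hypothetical names for this example.
CHECKPOINT_CMD=${CHECKPOINT_CMD:-cr_checkpoint}

checkpoint_loop() {
    process_id=$1
    interval=$2          # seconds between checkpoints
    checkpoint_file=$3   # keep this outside of sanscratch, e.g. under /home

    # kill -0 sends no signal; it only reports whether the process still exists
    while kill -0 "$process_id" 2>/dev/null; do
        sleep "$interval"
        # stop if the application ended during the sleep
        kill -0 "$process_id" 2>/dev/null || break
        "$CHECKPOINT_CMD" -f "$checkpoint_file" "$process_id"
    done
}
```

A call such as ''checkpoint_loop $process_id 43200 ~/blcr/checkpoint.$process_id'' would checkpoint every 12 hours.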
cluster/124.1383244345.txt.gz · Last modified: 2013/10/31 14:32 by hmeij