cluster:147 [DokuWiki]


Differences

This shows you the differences between two versions of the page.

cluster:147 [2016/03/24 11:16]
hmeij07
cluster:147 [2017/05/24 14:32]
hmeij07
Line 4: Line 4:
 ==== BLCR Checkpoint in OL3 ==== ==== BLCR Checkpoint in OL3 ====
  
-  * This page concerns SERIAL jobs only+  * This page concerns SERIAL jobs only; SERIAL jobs can restart on any node
  
   * Installation and what it does [[cluster:124|BLCR]]   * Installation and what it does [[cluster:124|BLCR]]
Line 14: Line 14:
 Checkpointing is an expensive operation so do not checkpoint under 6 hours. For example, if your job runs for a month checkpoint once a day, if your job runs for a week checkpoint every 12 hours. From this point on I expect all users to checkpoint. Some software does this internally (Amber, Gaussian). For applications or home grown code you can use BLCR. (Too bad it does not work out of box within Openlava). Checkpointing is an expensive operation so do not checkpoint under 6 hours. For example, if your job runs for a month checkpoint once a day, if your job runs for a week checkpoint every 12 hours. From this point on I expect all users to checkpoint. Some software does this internally (Amber, Gaussian). For applications or home grown code you can use BLCR. (Too bad it does not work out of box within Openlava).
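
The interval guidance above can be sketched as a small shell helper. This is only an illustration: the function name `suggest_interval` and the hour thresholds (720 for a month, 168 for a week) are assumptions and not part of ''blcr_wrapper''. Falling back to the 6 hour floor for shorter jobs is also an assumption, since the text only says not to checkpoint more often than that.

```shell
#!/bin/bash
# Hypothetical helper (not part of blcr_wrapper): suggest a checkpoint
# interval in hours, given the expected run time of the job in hours.
suggest_interval() {
    local runtime_hours=$1
    if [ "$runtime_hours" -ge 720 ]; then      # about a month or longer: once a day
        echo 24
    elif [ "$runtime_hours" -ge 168 ]; then    # about a week or longer: every 12 hours
        echo 12
    else
        echo 6                                 # assumed floor: never under 6 hours
    fi
}

suggest_interval 720   # month-long job
suggest_interval 168   # week-long job
suggest_interval 48    # two-day job
```

The three calls print one suggested interval per line, so the result can be fed into whatever variable a wrapper uses for its checkpoint period.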
  
-You need to test out checkpointing before you rely on it. I've notice that some local code, when opening files for output, BLCR does not notice it. The code below has such an example (file fid.txt). Hopefully future versions of BLCR will fix this. Or maybe we shuold open files differently, this needs investigating further.+You need to test out checkpointing before you rely on it. I've notice that some local code, when opening files for output, BLCR does not notice it. The code below has such an example (file fid.txt). Hopefully future versions of BLCR will fix this. Or maybe we should open files differently, this needs investigating further.
  
 BLCR, Berkeley Lab Checkpoint and Restart, remembers file paths and process ids. The code stages the necessary STDOUT and STDERR files Openlava generates and invokes the relocation feature while ignoring old process ids. If an application is large, and static, it is advisable to not save the application inside the checkpoint file. BLCR, Berkeley Lab Checkpoint and Restart, remembers file paths and process ids. The code stages the necessary STDOUT and STDERR files Openlava generates and invokes the relocation feature while ignoring old process ids. If an application is large, and static, it is advisable to not save the application inside the checkpoint file.
  
-At the bottom of this page is the current version of ''blcr_wrapper'' program which will hide the complexity for you. ''blcr_watcher'' is a program that is in your PATH already and will terminate the wrapper if the application done inside of a check point time interval. I will work with any group interested to customize your ''blcr_wrapper'' for your lab/group.+At the bottom of this page is the current version of ''blcr_wrapper'' program which will hide the complexity for you. ''blcr_watcher'' is a program that is in your PATH already and will terminate the wrapper if the application finishes inside of a check point time interval. I will work with any group interested to customize your ''blcr_wrapper'' for your lab/group.
  
    * Here is an interactive simple sample run.    * Here is an interactive simple sample run.
Line 45: Line 45:
 Connection to petaltail lost. Connection to petaltail lost.
  
-# that was not too clever, log back in, restart application in another directory+ooops, that was not too clever, log back in, restart application in another directory
 [hmeij@petaltail 187]$ cd .. [hmeij@petaltail 187]$ cd ..
 [hmeij@petaltail sanscratch]$ mkdir 188 [hmeij@petaltail sanscratch]$ mkdir 188
Line 98: Line 98:
 <code> <code>
  
-[hmeij@cottontail ~/ynam]$ bsub < blcr_wrapper+[hmeij@cottontail ~/ynam]$ bsub < blcr_wrapper_serial
  
 </code> </code>
Line 105: Line 105:
 ==== Files v0.2 ==== ==== Files v0.2 ====
  
-  * ''blcr_wrapper'' at v01/home/hmeij/jobs/blcr/blcr_wrapper for non MPI jobs+  * ''blcr_wrapper_serial'' at /home/hmeij/jobs/blcr/ for non MPI jobs
  
 <code> <code>
Line 169: Line 169:
 if [ $mods -ne 2 ]; then if [ $mods -ne 2 ]; then
         echo "Error: BLCR modules not loaded on `hostname`"         echo "Error: BLCR modules not loaded on `hostname`"
-        exit+        kill $$
 fi fi
  
Line 187: Line 187:
 else else
         echo "Error: $checkpoints/$LSB_JOBID already exists, exiting"         echo "Error: $checkpoints/$LSB_JOBID already exists, exiting"
-        exit+        kill $$
 fi fi
  
Line 207: Line 207:
         #if [ "X$orgpid" == "X" -o "X$orgpwd" == "X" ]; then         #if [ "X$orgpid" == "X" -o "X$orgpwd" == "X" ]; then
         #       echo "Error: problem with missing orgpid or orgpwd values"         #       echo "Error: problem with missing orgpid or orgpwd values"
-        #       exit+        #       kill $$
         #fi         #fi
         scp $checkpoints/$orgjobpid/*.$orgjobpid.err $checkpoints/$orgjobpid/*.$orgjobpid.out $HOME/.lsbatch/         scp $checkpoints/$orgjobpid/*.$orgjobpid.err $checkpoints/$orgjobpid/*.$orgjobpid.out $HOME/.lsbatch/
Line 223: Line 223:
 else else
         echo "Error: startup mode not defined correctly"         echo "Error: startup mode not defined correctly"
-        exit+        kill $$
 fi fi
  
Line 277: Line 277:
  
 </code> </code>
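
The comparison replaces ''exit'' with ''kill $$'' in the wrapper's error paths but does not say why. One behavioral difference that may be relevant, offered here as an assumption and not as the author's stated reason: in bash, ''exit'' inside a subshell ends only that subshell, while ''$$'' still expands to the PID of the top-level shell, so ''kill $$'' stops the whole script. A minimal sketch of that difference (the variable names are made up for the illustration):

```shell
#!/bin/bash
# Case 1: exit inside a subshell ends only that subshell;
# the surrounding script carries on with the next command.
out_exit=$(bash -c '( exit 1 ); echo "still running"')

# Case 2: kill $$ inside a subshell signals the top-level shell,
# because $$ keeps the PID of the original shell even in subshells.
# The sleep gives the signal time to arrive before the echo.
out_kill=$(bash -c '( kill $$ ); sleep 1; echo "still running"' 2>/dev/null)

echo "after exit in subshell: '${out_exit}'"
echo "after kill \$\$ in subshell: '${out_kill}'"
```

With bash, the first variable should hold the text still running and the second should be empty, since the inner script was terminated before reaching its echo.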
 +
 +
 +==== Matlab ====
 +
 +  * https://www.bu.edu/tech/support/research/software-and-programming/common-languages/matlab/matlab-batch/checkpointing/
 +
  
  
cluster/147.txt · Last modified: 2020/02/27 13:06 by hmeij07