cluster:124

Differences

This shows you the differences between two versions of the page.

cluster:124 [2013/10/31 17:31]
hmeij created
cluster:124 [2013/10/31 17:50]
hmeij
Line 1: Line 1:
-BLCR
+\\
 +**[[cluster:0|Back]]**
  
-are modules loaded (done via /etc/rc.local)
+==== BLCR ====
  
 +We need a day of downtime to switch file server functionality from greentail to sharptail. It would be nice if nobody lost any computational progress. To do that, we need to learn to checkpoint at the application level. If a node crashes or power is lost, those applications can then restart the job from the last checkpoint.
 +
 +I've decided to support one checkpoint/restart utility: Berkeley Lab Checkpoint/Restart (BLCR). Hence this page.
 +
 +BLCR consists of two kernel modules, some user-level libraries, and several command-line executables. No kernel patching is required. The modules are loaded at boot via /etc/rc.local. They depend on the kernel source against which they were compiled, so for our first supported BLCR modules I've chosen the mw256 queue and nodes. Here is some documentation on BLCR:
 +
 +  * [[https://upc-bugs.lbl.gov/blcr/doc/html/BLCR_Admin_Guide.html|BLCR_Admin_Guide.html]]
 +  * [[https://upc-bugs.lbl.gov/blcr/doc/html/BLCR_Users_Guide.html|BLCR_Users_Guide.html]] 
 +
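Because the modules are tied to the kernel they were compiled against, the load step in /etc/rc.local can key the module directory on the running kernel release. The following is a minimal sketch, not the actual rc.local used on the cluster. It assumes the modules are installed under `<prefix>/lib/blcr/<kernel-release>/`; the helper name `blcr_insmod_cmds` is hypothetical. It only prints the commands, so it can be checked without root.

```shell
#!/bin/sh
# Hypothetical helper: print the insmod commands for a given kernel release.
# ASSUMPTION: modules live in <prefix>/lib/blcr/<kernel-release>/.
blcr_insmod_cmds() {
    prefix=$1
    kver=${2:-$(uname -r)}          # default to the running kernel
    moddir="$prefix/lib/blcr/$kver"
    # blcr_imports goes first because blcr depends on it
    echo "/sbin/insmod $moddir/blcr_imports.ko"
    echo "/sbin/insmod $moddir/blcr.ko"
}

# Print the commands for this node; the prefix is the one used on this page.
blcr_insmod_cmds /share/apps/blcr/0.8.5/mw256
```

If the printed paths exist on a node, the two lines could be placed in /etc/rc.local as they are.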
 +<code>
 +
 +# are modules loaded
 [hmeij@n33 blcr]$ lsmod | grep blcr
 blcr                  115529  0
 blcr_imports           10715  1 blcr
  
-set env 
- 1011  export PATH=/share/apps/blcr/0.8.5/mw256/bin:$PATH 
- 1012  export LD_LIBRARY_PATH=/share/apps/blcr/0.8.5/mw256/lib:$LD_LIBRARY_PATH 
  
 +# set env
 +export PATH=/share/apps/blcr/0.8.5/mw256/bin:$PATH
 +export LD_LIBRARY_PATH=/share/apps/blcr/0.8.5/mw256/lib:$LD_LIBRARY_PATH
  
 +# is it all working
 [hmeij@n33 blcr]$ cr_checkpoint --help
 Usage: cr_checkpoint [options] ID
Line 22: Line 36:
       --version          print version information and exit.
 ...
 +</code>
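The environment setup shown in the block above can be made safe to source more than once. This is a sketch under stated assumptions: the helper `prepend_path` and the variable `BLCR_HOME` are hypothetical names introduced for illustration; only the install path comes from this page.

```shell
#!/bin/sh
# Prepend a directory to a colon-separated path list unless already present.
# $1 = current value (may be empty), $2 = directory to add
prepend_path() {
    case ":$1:" in
        *":$2:"*) printf '%s\n' "$1" ;;            # already there, unchanged
        *)        printf '%s\n' "$2${1:+:$1}" ;;   # add in front
    esac
}

BLCR_HOME=/share/apps/blcr/0.8.5/mw256   # hypothetical variable, path from this page
PATH=$(prepend_path "$PATH" "$BLCR_HOME/bin")
LD_LIBRARY_PATH=$(prepend_path "${LD_LIBRARY_PATH:-}" "$BLCR_HOME/lib")
export PATH LD_LIBRARY_PATH

# Report whether the BLCR command-line tools are reachable
for tool in cr_run cr_checkpoint cr_restart; do
    if command -v "$tool" >/dev/null 2>&1; then
        echo "$tool: found"
    else
        echo "$tool: not found"
    fi
done
```

On a node outside the mw256 queue the tools may be reported as not found, which is expected given the kernel dependency described earlier.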
  
 [hmeij@n33 blcr]$ cr_run ./t-20001030-01 > context 2>&1 &
Line 133: Line 148:
 lrwxrwxrwx 1 hmeij its 40 Oct 31 11:06 1383231850.62322.shell ->
 /sanscratch/62322/1383231850.62322.shell
 +
 +
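The pieces on this page combine into one checkpoint/restart cycle, which can be sketched roughly as follows. This is an illustration, not the cluster's job wrapper: the function name `checkpoint_cycle`, the output file `app.out`, and the 60-second wait are hypothetical. It assumes `cr_run`, `cr_checkpoint`, and `cr_restart` are in PATH as set up earlier.

```shell
#!/bin/sh
# Sketch of a single checkpoint/restart cycle with BLCR.
# $1 = executable to run (a placeholder for your own application)
checkpoint_cycle() {
    app=$1
    if ! command -v cr_run >/dev/null 2>&1; then
        echo "BLCR tools not in PATH" >&2
        return 1
    fi
    # Start the application under BLCR control, in the background
    cr_run "$app" > app.out 2>&1 &
    pid=$!
    sleep 60                         # hypothetical interval: let it compute a while
    # Write the context file, then terminate the process
    cr_checkpoint --term -f "context.$pid" "$pid" || return 2
    # Resume from the saved context file
    cr_restart "context.$pid"
}

# Usage (not run automatically):
#   checkpoint_cycle ./my_app
```

A real job would checkpoint periodically rather than once, and would keep the context file on storage that survives a node crash.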
 +\\
 +**[[cluster:0|Back]]**
  
cluster/124.txt · Last modified: 2016/03/11 20:14 by hmeij07