==== BLCR Checkpoint in OL3 ====
  
  * This page concerns PARALLEL mpirun jobs only; there are some restrictions
    * all MPI threads need to be confined to one node (see the snippet after these links)
    * restarted jobs must use the same node (not sure why)

  * For SERIAL jobs go here [[cluster:147|BLCR Checkpoint in OL3]]
  
  * Users Guide [[https://upc-bugs.lbl.gov/blcr/doc/html/BLCR_Users_Guide.html]]
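
Confining all workers to one node can be expressed in the job submission itself. A minimal sketch, assuming an OpenLava/LSF-style script (the wrapper may already set these for you):

<code>
#BSUB -n 8                  # request 8 job slots for the MPI workers
#BSUB -R "span[hosts=1]"    # place all slots on a single node
</code>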

Checkpointing a parallel job is a bit more complex than a serial one. MPI jobs are fired off by worker 0 of ''mpirun'', and all workers may open files and perform socket-to-socket communications. A restart therefore needs to restore all file IDs, process IDs, etc., and may fail if a required process ID is already in use. Restarted jobs also behave as if the old JOBPID were still running: they write results to the old STDERR and STDOUT files, and use the old hostfile.
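
For illustration only (the wrapper normally handles this for you), a manual checkpoint/restart cycle with these tools looks roughly like the sketch below; the PID and the exact flags shown are assumptions:

<code>
# start the job under checkpoint control (flags are illustrative)
cr_mpirun -am ft-enable-cr -np 8 -machinefile ./hostfile ./my_app &

# checkpoint the whole process tree; 12345 stands for the cr_mpirun PID
cr_checkpoint --tree --term 12345      # writes context.12345

# later: restart from the context file (same node, old PIDs must be free)
cr_restart context.12345
</code>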

The ''blcr_wrapper_parallel'' script manages all of this for you. As with the serial wrapper, you only edit the top of the file and provide the necessary information. But first, your software needs to be compiled with a special "older" version of OpenMPI, because MPI checkpointing support has been removed in later versions of OpenMPI. Here is the admin stuff.

<code>

# from eric at lbl
./configure \
            --enable-ft-thread \
            --with-ft=cr \
            --enable-opal-multi-threads \
            --with-blcr=/share/apps/blcr/0.8.5/test \
            --without-tm \
            --prefix=/share/apps/CENTOS6/openmpi/1.6.5.cr

# next download cr_mpirun
https://upc-bugs.lbl.gov/blcr-dist/cr_mpirun/cr_mpirun-295.tar.gz

# configure and test

export PATH=/share/apps/CENTOS6/openmpi/1.6.5.cr/bin:$PATH
export LD_LIBRARY_PATH=/share/apps/CENTOS6/openmpi/1.6.5.cr/lib:$LD_LIBRARY_PATH

./configure --with-blcr=/share/apps/blcr/0.8.5/test

============================================================================
Testsuite summary for cr_mpirun 295
============================================================================
# TOTAL: 3
# PASS:  3
# SKIP:  0
# XFAIL: 0
# FAIL:  0
# XPASS: 0
# ERROR: 0
============================================================================
make[1]: Leaving directory `/home/apps/src/petaltail6/cr_mpirun-295'

# I copied cr_mpirun into /share/apps/CENTOS6/openmpi/1.6.5.cr/bin
# cr_mpirun needs access to all these in $PATH:
# mpirun cr_mpirun ompi-checkpoint ompi-restart cr_checkpoint cr_restart

# next compile your parallel software using mpicc/mpicxx from the 1.6.5 distro

</code>
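
For example, a user-side compile against this build might look like the sketch below; ''my_app.c'' is a placeholder source file, and the ''ompi_info'' check is a suggested sanity test:

<code>
export PATH=/share/apps/CENTOS6/openmpi/1.6.5.cr/bin:$PATH
export LD_LIBRARY_PATH=/share/apps/CENTOS6/openmpi/1.6.5.cr/lib:$LD_LIBRARY_PATH

mpicc -o my_app my_app.c         # or mpicxx for C++ sources

# should report FT checkpoint support in this build
ompi_info | grep -i checkpoint
</code>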
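
The actual contents of ''blcr_wrapper_parallel'' are not reproduced here; purely as a sketch of the kind of information its user-editable top asks for, with all variable names and values hypothetical:

<code>
# hypothetical header of blcr_wrapper_parallel, illustrative only
queue=mw256                        # queue to submit to
nprocs=8                           # MPI workers, all on ONE node
checkpoint_every=3600              # seconds between checkpoints
cmd="./my_app input.dat"           # program and arguments to run
</code>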
  