cluster:148

Differences

This shows you the differences between two versions of the page.

cluster:148 [2016/03/30 18:49] hmeij07
cluster:148 [2016/04/07 15:27] hmeij07
Line 14: Line 14:
   * Users Guide [[https://upc-bugs.lbl.gov/blcr/doc/html/BLCR_Users_Guide.html]]
  
-Checkpointing parallel jobs is a bit more complex than a serial job. MPI workers (the -n) are fired off by worker 0 of ''mpirun'' and all workers may open files and perform socket to socket communications. Therefore a restart will need to restore all file IDs, process IDs, etc. A job may thus fail if a certain process ID is already running. Restarted files also behave as if the old JOBPID is running and will write results to the old STDERR and STDOUT files. And use the old hostfile.
+Checkpointing parallel jobs is a bit more complex than a serial job. MPI workers (the -n) are fired off by worker 0 of ''mpirun'' and all workers may open files and perform socket to socket communications. Therefore a restart will need to restore all file IDs, process IDs, etc. A job may thus fail if a certain process ID is already running. Restarted files also behave as if the old JOBPID is running and will write results to the old STDERR and STDOUT files.
  
 The ''blcr_wrapper_parallel'' below will manage all this for you. Like the serial wrapper, only edit the top of the file and provide the information necessary. But first, your software needs to be compiled with a special "older" version of OpenMPI. MPI checkpointing support has been removed in later versions of OpenMPI.
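
The process ID conflict mentioned in this hunk can be illustrated with a small pre-restart check. This is only a sketch under stated assumptions: the function names ''pid_in_use'' and ''check_restart_pids'' are hypothetical and are not part of ''blcr_wrapper_parallel'', and the check relies on the Linux ''/proc'' filesystem. How the wrapper itself handles this case is not shown in this diff.

```shell
#!/bin/bash
# Hypothetical helpers, NOT taken from blcr_wrapper_parallel.
# Assumption: a Linux host where /proc/<pid> exists for every live process.

# Return 0 if a process with the given numeric PID currently exists.
pid_in_use() {
    local pid="$1"
    [[ "$pid" =~ ^[0-9]+$ ]] && [ -d "/proc/${pid}" ]
}

# Return 1 (and report on stderr) if any PID recorded at checkpoint
# time is already taken on this host, since a restart needs those
# same PIDs to be free.
check_restart_pids() {
    local pid conflict=0
    for pid in "$@"; do
        if pid_in_use "$pid"; then
            echo "PID ${pid} is already in use on this host" >&2
            conflict=1
        fi
    done
    return "$conflict"
}
```

A wrapper could call ''check_restart_pids'' with the PIDs it saved at checkpoint time and skip or postpone the restart when the function returns non-zero.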
Line 228: Line 228:
 #BSUB -e err
 # next required for mpirun checkpoint to work
-# restarts must use same node (not sure why)
+# restarts must use same node in test queue (not sure why, others can restart anywhere)
 #BSUB -R "span[hosts=1]"
  
cluster/148.txt · Last modified: 2020/01/24 18:36 by hmeij07