cluster:124

Differences

This shows you the differences between two versions of the page.

cluster:124 [2013/10/31 18:37]
hmeij
cluster:124 [2013/10/31 18:44]
hmeij
Line 137: Line 137:
     * One to invoke ''cr_run''
     * One to invoke ''cr_restart''
-  * +  * For a restart we need two things:
 +    * Create a link from the old working directory to the new working directory (saved in the pwd text file)
 +    * Edit the script: change the comment blocks and edit the process_id
 +      * The restart job may end up on another node but will have the same process_id
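
The linking step can be sketched as a small shell helper. This is only an illustration under assumptions: the function name ''link_old_workdir'', the file name ''pwd.txt'', and the reading that the pwd text file holds the old working directory path on a single line are all hypothetical and not taken from the page's script.

```shell
# Hypothetical helper (not from the original script):
#   link_old_workdir PWD_FILE NEW_DIR
# Assumes PWD_FILE contains the old working directory path on one line.
link_old_workdir() {
    pwd_file=$1
    new_dir=$2

    # Read the old working directory recorded by the first submission.
    old_dir=$(cat "$pwd_file") || return 1

    # Nothing to link if the restart landed in the same directory.
    [ "$old_dir" = "$new_dir" ] && return 0

    # Refuse to replace a real file or directory at the old path.
    if [ -e "$old_dir" ] && [ ! -L "$old_dir" ]; then
        echo "refusing to replace existing path: $old_dir" >&2
        return 1
    fi

    # Make the old path resolve to the new working directory.
    mkdir -p "$(dirname "$old_dir")" || return 1
    ln -sfn "$new_dir" "$old_dir"
}
```

A restart script might call ''link_old_workdir pwd.txt "$PWD"'' before invoking ''cr_restart'', so that paths recorded in the checkpoint still resolve on the new node.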
 + 
 +After you have restarted, you can observe the tool starting from the checkpoint file you are pointing to.  To simulate a crash while your first submission is running with ''cr_run'', find the node it is running on and its process ID (in the file *out), then issue the command ''ssh node_name kill process_id'' and wait for the next while iteration to terminate the program.  The scheduler will think the job terminated fine (job status DONE). 
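
The crash simulation above could be wrapped as follows. This is a minimal sketch: the function name ''simulate_crash'' and the ''DRY_RUN'' variable are hypothetical, and the node name and process ID are passed in by hand because the layout of the *out file is not shown here.

```shell
# Hypothetical wrapper around "ssh node_name kill process_id".
#   simulate_crash NODE_NAME PROCESS_ID
# Set DRY_RUN to any non-empty value to print the command instead of running it.
simulate_crash() {
    node=$1
    pid=$2

    # Reject an empty or non-numeric process ID before touching the node.
    case $pid in
        ''|*[!0-9]*)
            echo "process_id must be numeric: '$pid'" >&2
            return 2
            ;;
    esac

    if [ -z "$node" ]; then
        echo "node_name is required" >&2
        return 2
    fi

    if [ -n "${DRY_RUN:-}" ]; then
        echo "ssh $node kill $pid"
    else
        ssh "$node" kill "$pid"
    fi
}
```

For example, ''DRY_RUN=1 simulate_crash n37 12345'' (both values made up) only prints the command, which is a cheap way to double-check the target before killing anything.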
 + 
 +It would be even sweeter if the scheduler could be told to do all the checkpointing at intervals.  I'm investigating that, but in the meantime you can do it manually.
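
One way to do the manual checkpointing at intervals is a small loop run on the same node as the job. This is a sketch under assumptions, not the scheduler integration being investigated: the function name ''checkpoint_loop'' and the ''CKPT_CMD'' override are hypothetical, and it assumes that ''cr_checkpoint process_id'' leaves the process running after the checkpoint. Check the ''cr_checkpoint'' man page for where the checkpoint file is written and how an existing file is handled.

```shell
# Hypothetical helper:
#   checkpoint_loop PROCESS_ID INTERVAL_SECONDS [MAX_CHECKPOINTS]
# Checkpoints the process every INTERVAL_SECONDS while it is alive.
# MAX_CHECKPOINTS of 0 (the default) means no limit.
# CKPT_CMD can override the checkpoint command (assumed default: cr_checkpoint).
checkpoint_loop() {
    pid=$1
    interval=$2
    max=${3:-0}
    ckpt=${CKPT_CMD:-cr_checkpoint}
    taken=0

    # kill -0 only tests whether the process exists; it sends no signal.
    while kill -0 "$pid" 2>/dev/null; do
        sleep "$interval"
        # The job may have finished while we slept.
        kill -0 "$pid" 2>/dev/null || break
        if $ckpt "$pid"; then
            taken=$((taken + 1))
        else
            echo "checkpoint of $pid failed" >&2
        fi
        [ "$max" -gt 0 ] && [ "$taken" -ge "$max" ] && break
    done

    # Report how many checkpoints succeeded.
    echo "$taken"
}
```

Something like ''checkpoint_loop process_id 3600'' would attempt one checkpoint per hour until the job exits.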
  
  
cluster/124.txt · Last modified: 2016/03/11 20:14 by hmeij07