Differences

This shows you the differences between two versions of the page.

--- cluster:124 [2013/10/31 18:37]
hmeij
+++ cluster:124 [2013/10/31 18:44]
hmeij
@@ Line 137: / Line 137: @@
     * One to invoke ''cr_run''
     * One to invoke ''cr_restart''
-  *
+  * For a restart we need two things
+    * Create a link from old working directory to new working directory (saved in the pwd text file)
+    * Edit the script: change the comment blocks and edit the process_id
+      * The restart job may end up on another node but will have the same process_id
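The link step could be sketched as below. This is only a sketch: the function name ''link_old_to_new'' and the file name ''pwd.txt'' are made up for illustration, and it assumes the pwd text file holds the old working directory path on its first line.

```shell
#!/bin/bash
# Sketch only: link_old_to_new and pwd.txt are hypothetical names, and the
# pwd text file is assumed to hold the old working directory on line 1.
# Usage: link_old_to_new PWD_FILE NEW_DIR
link_old_to_new() {
    local pwd_file="$1" new_dir="$2" old_dir
    # Read the old working directory path saved by the first submission.
    old_dir="$(head -n 1 "$pwd_file")" || return 1
    if [ -z "$old_dir" ]; then
        echo "no path found in $pwd_file" >&2
        return 1
    fi
    # Refuse to overwrite anything that already sits at the old path.
    if [ -e "$old_dir" ] || [ -L "$old_dir" ]; then
        echo "$old_dir already exists, not linking" >&2
        return 1
    fi
    # The old path becomes a symbolic link to the new working directory.
    ln -s "$new_dir" "$old_dir"
}
```

For example, ''link_old_to_new pwd.txt "$PWD"'' run from the new working directory would make the old path resolve to it.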
+After you have restarted, you can observe the tool starting from the checkpoint file you are pointing to.  To simulate a crash while your first submission is running with ''cr_run'', find the node it is running on and the process ID (in the file *out), then issue the command ''ssh node_name kill process_id'' and wait for the next iteration of the while loop to terminate the program.  The scheduler will think the job terminated fine (job status DONE).
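The crash simulation could be wrapped in a small helper such as the one below. It is a sketch, not part of the page's script: ''simulate_crash'' and the ''DRY_RUN'' variable are hypothetical names, and the node name and process ID are passed in by hand because the layout of the *out file is not assumed here.

```shell
#!/bin/bash
# Sketch only: simulate_crash and DRY_RUN are hypothetical names.
# Usage: simulate_crash NODE PID
simulate_crash() {
    local node="$1" pid="$2"
    if [ -z "$node" ] || [ -z "$pid" ]; then
        echo "usage: simulate_crash NODE PID" >&2
        return 2
    fi
    # Reject anything that is not a plain number before it reaches kill.
    case "$pid" in
        *[!0-9]*) echo "PID must be numeric: $pid" >&2; return 2 ;;
    esac
    if [ "${DRY_RUN:-0}" = "1" ]; then
        # Print the command instead of running it.
        echo "ssh $node kill $pid"
    else
        ssh "$node" kill "$pid"
    fi
}
```

Setting ''DRY_RUN=1'' prints the ''ssh node_name kill process_id'' command instead of running it, which is a cheap way to check the node and process ID before killing anything.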
+It would be even sweeter if the scheduler could be told to do all the checkpointing at intervals.  I'm investigating that, but in the meantime you can do it manually.
