cluster:190

Differences

This shows you the differences between two versions of the page.

cluster:190 [2020/01/17 10:23]
hmeij07 [DMTCP]
cluster:190 [2020/09/28 07:38] (current)
hmeij07
Line 7: Line 7:
   * DMTCP (Distributed MultiThreaded Checkpointing)   * DMTCP (Distributed MultiThreaded Checkpointing)
  
-DMTCP Checkpoint/Restart allows one to transparently checkpoint to disk +DMTCP Checkpoint/Restart allows one to transparently checkpoint to disk a distributed computation.  It works under Linux, with no modifications to the Linux kernel nor to the application binaries.  It can be used by 
-a distributed computation.  It works under Linux, with no modifications +unprivileged users (no root privilege needed).  One can later restart from a checkpoint, or even migrate the processes by moving the checkpoint files to another host prior to restarting.
-to the Linux kernel nor to the application binaries.  It can be used by +
-unprivileged users (no root privilege needed).  One can later restart +
-from a checkpoint, or even migrate the processes by moving the checkpoint +
-files to another host prior to restarting.+
  
-This is a replacement for THE BLCR methods we used to use.+This replaces the BLCR methods we used previously ([[cluster:147|BLCR Checkpoint in OL3 -serial]] and [[cluster:148|BLCR Checkpoint in OL3 - parallel]]). BLCR is no longer being developed, and today's brief power outage removed the BLCR kernel module HPCC-wide, so learn DMTCP. I have not provided wrappers, but you can follow the same logic we used with BLCR. 
 + 
 +Write your checkpoint files to ''/sanscratch/checkpoints/JOBPID'' so they do not count against your quota.  The scheduler will not create this directory for you; you must create it in your submit script.  Directories are automatically deleted once they are 120 days old.
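Because the scheduler does not create the checkpoint directory, the submit script has to. A minimal sketch of that step follows; the function name ''make_ckpt_dir'' and its two arguments are hypothetical and are not part of DMTCP or the scheduler.

```shell
# make_ckpt_dir ROOT JOBID
# Create ROOT/JOBID (including missing parents) and print the resulting path.
# Hypothetical helper for illustration only.
make_ckpt_dir() {
  dir="$1/$2"
  mkdir -p "$dir" || return 1
  printf '%s\n' "$dir"
}

# In a submit script one might call it like this (job id 111 as in the sample session):
#   CKPT_DIR=$(make_ckpt_dir /sanscratch/checkpoints 111) || exit 1
```

The printed path can then be passed to ''--ckptdir'' so the launch command and the directory creation cannot drift apart.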
  
 <code> <code>
Line 20: Line 18:
 # sample steps to put in your scripts # sample steps to put in your scripts
  
-# make a directory (done by scheduler) +# make a directory (first one done by scheduler, second one done by your script)
-mkdir -p /sanscratch/111 /snascratch/checkpoints/111+mkdir -p /sanscratch/111 /sanscratch/checkpoints/111
 cd /sanscratch/111 cd /sanscratch/111
 +
 +# invoke sample program, which generates one line of output
 time ./a.out time ./a.out
  
Line 36: Line 36:
 [hmeij@cottontail2 111]$ cat fid.txt [hmeij@cottontail2 111]$ cat fid.txt
   0.999578495067887       7.978578277139264E-005   0.999578495067887       7.978578277139264E-005
 +
 +# for example, with 24 hours checkpoint interval
 +# launch new coordinator on random port, log port, 24 hour checkpoints 
 +# make sure you create destination dir for checkpoints
 +dmtcp_launch --new-coordinator \
 +  --coord-port 0 --port-file port.txt --interval 86400 \
 +  --ckptdir /sanscratch/checkpoints/111 \
 +  time ./a.out
 +
      
-# run command below with interval 300 (every 5 mins)+# run command above  with interval 300 (every 5 mins) 
 [hmeij@cottontail2 111]$ ps [hmeij@cottontail2 111]$ ps
   PID TTY          TIME CMD   PID TTY          TIME CMD
Line 49: Line 59:
 1 S hmeij    20008      0  80   0 -  4665 ep_pol 07:46 pts/1    00:00:00 \\ 1 S hmeij    20008      0  80   0 -  4665 ep_pol 07:46 pts/1    00:00:00 \\
 /usr/bin/dmtcp_coordinator --quiet --exit-on-last --daemon /usr/bin/dmtcp_coordinator --quiet --exit-on-last --daemon
 +
 # the random port (in case somebody else is also checkpointing on this host) # the random port (in case somebody else is also checkpointing on this host)
 [hmeij@cottontail2 111]$ cat port.txt [hmeij@cottontail2 111]$ cat port.txt
Line 59: Line 70:
 -rwxr--r-- 1 hmeij its   12440 Jan 16 13:53 dmtcp_restart_script_24945f6ae3823bbf-40000-fb2d11ea62fa0.sh\\ -rwxr--r-- 1 hmeij its   12440 Jan 16 13:53 dmtcp_restart_script_24945f6ae3823bbf-40000-fb2d11ea62fa0.sh\\
 lrwxrwxrwx 1 hmeij its      60 Jan 16 13:53 dmtcp_restart_script.sh -> dmtcp_restart_script_24945f6ae3823bbf-40000-fb2d11ea62fa0.sh\\ lrwxrwxrwx 1 hmeij its      60 Jan 16 13:53 dmtcp_restart_script.sh -> dmtcp_restart_script_24945f6ae3823bbf-40000-fb2d11ea62fa0.sh\\
 +
 [hmeij@cottontail2 111]$ [hmeij@cottontail2 111]$
 real    63m57.287s real    63m57.287s
Line 69: Line 81:
 -i 300 --ckptdir /sanscratch/checkpoints/111  ./a.out & -i 300 --ckptdir /sanscratch/checkpoints/111  ./a.out &
 [1] 29201 [1] 29201
 +
 [hmeij@cottontail2 111]$ ps [hmeij@cottontail2 111]$ ps
   PID TTY          TIME CMD   PID TTY          TIME CMD
Line 76: Line 89:
 29210 pts/1    00:00:00 dmtcp_coordinat 29210 pts/1    00:00:00 dmtcp_coordinat
 29212 pts/1    00:00:00 ps 29212 pts/1    00:00:00 ps
 +
 +# terminate half way through
 [hmeij@cottontail2 111]$ sleep 32m; kill -9 29202 29210 [hmeij@cottontail2 111]$ sleep 32m; kill -9 29202 29210
  
Line 81: Line 96:
 cd /sanscratch/checkpoints/111 cd /sanscratch/checkpoints/111
 ./dmtcp_restart_script.sh ./dmtcp_restart_script.sh
 +
 # ps  # ps 
 0 S hmeij    20891 20890  -bash 0 S hmeij    20891 20890  -bash
Line 86: Line 102:
 1 S hmeij    31201      /usr/bin/dmtcp_coordinator --quiet --exit-on-last --daemon 1 S hmeij    31201      /usr/bin/dmtcp_coordinator --quiet --exit-on-last --daemon
  
 +# You must make sure the old directory and file exist, otherwise
 +[40000] ERROR at fileconnection.cpp:737 in refill; 
 +REASON='JASSERT(jalib::Filesystem::FileExists(_path)) failed'
 +     _path = /sanscratch/111/fid.txt
 +Message: File not found.
 +a.out (40000): Terminating...
 +# that is because at checkpoint time that file was opened by a.out
  
  
-launch new coordinator on random port, log port, 24 hour checkpoints  +The process will pick up from last checkpoint 
-make sure you create destination dir for checkpoints+and write output to original work directory
  
-dmtcp_launch --new-coordinator \ +</code>
-  --coord-port 0 --port-file port.txt --interval 86400 \ +
-  --ckptdir /sanscratch/checkpoints/111 \ +
-  time ./a.out+
  
- 
-</code> 
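The restart fails with the ''File not found'' assertion shown above when a file that was open at checkpoint time is missing. One way to guard against that is to check the prerequisites before calling the restart script. This is a sketch only; ''check_restart_prereqs'' and its arguments are hypothetical names, and the list of files to check depends on what your program had open.

```shell
# check_restart_prereqs CKPTDIR OPENFILE
# Return 0 only if CKPTDIR contains dmtcp_restart_script.sh and OPENFILE,
# a file the job had open at checkpoint time, still exists.
# Hypothetical helper for illustration only.
check_restart_prereqs() {
  ckptdir="$1"
  openfile="$2"
  if [ ! -e "$ckptdir/dmtcp_restart_script.sh" ]; then
    echo "no restart script in $ckptdir" >&2
    return 1
  fi
  if [ ! -e "$openfile" ]; then
    echo "missing $openfile; it must exist before restarting" >&2
    return 1
  fi
  return 0
}

# Possible use, with the paths from the sample session:
#   check_restart_prereqs /sanscratch/checkpoints/111 /sanscratch/111/fid.txt \
#     && cd /sanscratch/checkpoints/111 && ./dmtcp_restart_script.sh
```

Failing early with a readable message is cheaper than letting the restarted process terminate on the assertion.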
 ==== Quick-Start Guide ==== ==== Quick-Start Guide ====
  
cluster/190.1579274628.txt.gz · Last modified: 2020/01/17 10:23 by hmeij07