cluster:190 [DokuWiki]


Differences

This shows you the differences between two versions of the page.

cluster:190 [2020/01/15 07:53]
hmeij07 [DMTCP]
cluster:190 [2020/01/21 14:37]
hmeij07
Line 7: Line 7:
   * DMTCP (Distributed MultiThreaded Checkpointing)   * DMTCP (Distributed MultiThreaded Checkpointing)
  
-DMTCP Checkpoint/Restart allows one to transparently checkpoint to disk +DMTCP Checkpoint/Restart allows one to transparently checkpoint to disk a distributed computation.  It works under Linux, with no modifications to the Linux kernel nor to the application binaries.  It can be used by 
-a distributed computation.  It works under Linux, with no modifications +unprivileged users (no root privilege needed).  One can later restart from a checkpoint, or even migrate the processes by moving the checkpoint files to another host prior to restarting.
-to the Linux kernel nor to the application binaries.  It can be used by +
-unprivileged users (no root privilege needed).  One can later restart +
-from a checkpoint, or even migrate the processes by moving the checkpoint +
-files to another host prior to restarting.+
  
-This is a replacement for THE BLCR methods we used to use.+This replaces the BLCR methods we used to use: [[cluster:147|BLCR Checkpoint in OL3 - serial]] and [[cluster:148|BLCR Checkpoint in OL3 - parallel]]. BLCR is no longer being developed, and today's brief power outage removed the BLCR kernel module HPCC-wide, so learn DMTCP. I have not provided wrappers, but you can follow the same logic we used with BLCR.
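
Since no wrappers are provided, here is a minimal sketch of what a launch helper might look like. The function name `dmtcp_launch_cmd` and its arguments are hypothetical and not an HPCC-provided tool. The `dmtcp_launch` flags and the `/sanscratch/checkpoints/<jobid>` layout are the ones used in the sample session on this page. The function only prints the command line so it can be inspected before use.

```shell
# Hypothetical helper (illustrative only): print the dmtcp_launch command
# for a job instead of running it.
#   $1    = job id, used to name the checkpoint directory
#   $2    = checkpoint interval in seconds
#   $3... = the application and its arguments
dmtcp_launch_cmd() {
    jobid="$1"
    interval="$2"
    shift 2
    # Assumed layout: checkpoints go to /sanscratch/checkpoints/<jobid>
    ckptdir="/sanscratch/checkpoints/${jobid}"
    echo dmtcp_launch --new-coordinator --coord-port 0 \
        --port-file port.txt -i "$interval" --ckptdir "$ckptdir" "$@"
}
```

For example, `dmtcp_launch_cmd 111 300 ./a.out` prints the launch line for job 111 with a 5 minute interval. To actually launch, a script would drop the `echo` and create the checkpoint directory first.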
  
 <code> <code>
Line 20: Line 16:
 # sample steps to put in your scripts # sample steps to put in your scripts
  
-# make a directory (done by scheduler)+# make a directory (first one done by scheduler, second one done by your script)
 mkdir -p /sanscratch/111 /sanscratch/checkpoints/111 mkdir -p /sanscratch/111 /sanscratch/checkpoints/111
 cd /sanscratch/111 cd /sanscratch/111
-time ./a.out | tee -a a.out.time+time ./a.out
  
-[1]+  Done                    time ./a.out | tee -a a.out.time+[1]+  Done                    time ./a.out 
 real    64m0.513s real    64m0.513s
 user    64m0.813s user    64m0.813s
Line 32: Line 28:
 total 1108 total 1108
 -rwxr--r-- 1 hmeij its 1126428 Jan 14 14:36 a.out -rwxr--r-- 1 hmeij its 1126428 Jan 14 14:36 a.out
--rw-r--r-- 1 hmeij its       0 Jan 14 14:37 a.out.time 
 -rw-r--r-- 1 hmeij its      49 Jan 14 15:41 fid.txt -rw-r--r-- 1 hmeij its      49 Jan 14 15:41 fid.txt
  
Line 38: Line 33:
   0.999578495067887       7.978578277139264E-005   0.999578495067887       7.978578277139264E-005
      
-# run command below with interval 36000 (10 mins)+# run command below with interval 300 (every 5 mins)
 [hmeij@cottontail2 111]$ ps [hmeij@cottontail2 111]$ ps
   PID TTY          TIME CMD   PID TTY          TIME CMD
Line 48: Line 43:
  
 # the coordinator to which application process was attached # the coordinator to which application process was attached
-1 S hmeij    20008      0  80   0 -  4665 ep_pol 07:46 pts/1    00:00:00 /usr/bin/dmtcp_coordinator --quiet --exit-on-last --daemon+1 S hmeij    20008      0  80   0 -  4665 ep_pol 07:46 pts/1    00:00:00 \\ 
 +/usr/bin/dmtcp_coordinator --quiet --exit-on-last --daemon
 # the random port (in case somebody else is also checkpointing on this host) # the random port (in case somebody else is also checkpointing on this host)
 [hmeij@cottontail2 111]$ cat port.txt [hmeij@cottontail2 111]$ cat port.txt
 38913 38913
 +
 +# 5 mins checkpoints, little impact
 +[hmeij@cottontail2 111]$ sleep 6m; ll /sanscratch/checkpoints/111/
 +total 2960
 +-rw------- 1 hmeij its 3011726 Jan 16 13:53 ckpt_a.out_24945f6ae3823bbf-40000-fb2d136b9b544.dmtcp\\
 +-rwxr--r-- 1 hmeij its   12440 Jan 16 13:53 dmtcp_restart_script_24945f6ae3823bbf-40000-fb2d11ea62fa0.sh\\
 +lrwxrwxrwx 1 hmeij its      60 Jan 16 13:53 dmtcp_restart_script.sh -> dmtcp_restart_script_24945f6ae3823bbf-40000-fb2d11ea62fa0.sh\\
 +[hmeij@cottontail2 111]$
 +real    63m57.287s
 +user    63m54.703s
 +sys     0m0.444s
 +
 +# let's try terminating the processes and restarting
 +[hmeij@cottontail2 111]$ time dmtcp_launch \\
 +--new-coordinator --coord-port 0 --port-file port.txt \\
 +-i 300 --ckptdir /sanscratch/checkpoints/111  ./a.out &
 +[1] 29201
 +[hmeij@cottontail2 111]$ ps
 +  PID TTY          TIME CMD
 +20891 pts/1    00:00:00 bash
 +29201 pts/1    00:00:00 bash
 +29202 pts/1    00:00:01 a.out
 +29210 pts/1    00:00:00 dmtcp_coordinat
 +29212 pts/1    00:00:00 ps
 +[hmeij@cottontail2 111]$ sleep 32m; kill -9 29202 29210
 +
 +# restart from checkpoints directory (or copy files into desired location)
 +cd /sanscratch/checkpoints/111
 +./dmtcp_restart_script.sh
 +# ps 
 +0 S hmeij    20891 20890  -bash
 +0 R hmeij    31136 20891 [DMTCP:a.out]
 +1 S hmeij    31201      /usr/bin/dmtcp_coordinator --quiet --exit-on-last --daemon
 +
 +# you must make sure the old directory and file exist, otherwise the restart fails:
 +[40000] ERROR at fileconnection.cpp:737 in refill; REASON='JASSERT(jalib::Filesystem::FileExists(_path)) failed'
 +     _path = /sanscratch/111/fid.txt
 +Message: File not found.
 +a.out (40000): Terminating...
 +# that is because at checkpoint time that file was opened by a.out
  
  
cluster/190.txt · Last modified: 2020/09/28 07:38 by hmeij07