cluster:190

Differences

This shows you the differences between two versions of the page.

cluster:190 [2020/01/14 19:50]
hmeij07 created
cluster:190 [2020/01/21 19:34]
hmeij07
Line 4:
 ==== DMTCP ====
  
 +  * https://sourceforge.net/projects/dmtcp/
 +  * DMTCP (Distributed MultiThreaded Checkpointing)
 +
 +DMTCP Checkpoint/Restart allows one to transparently checkpoint to disk
 +a distributed computation.  It works under Linux, with no modifications
 +to the Linux kernel nor to the application binaries.  It can be used by
 +unprivileged users (no root privilege needed).  One can later restart
 +from a checkpoint, or even migrate the processes by moving the checkpoint
 +files to another host prior to restarting.
 +
 +This replaces the BLCR methods we used to use ([[cluster:147|BLCR Checkpoint in OL3 -serial]] and [[cluster:148|BLCR Checkpoint in OL3 - parallel]]). BLCR is no longer being developed, and today's brief power outage removed the BLCR kernel module HPCC-wide. So learn DMTCP. I have not provided wrappers, but you can follow the same logic we used with BLCR.
 +
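Since no wrappers are provided, here is a minimal sketch of what one could look like, assuming the directory layout used in the sample steps on this page. The function name ''run_job'' and the ''DRY_RUN'' switch are hypothetical, not part of DMTCP or of any site script; only the ''dmtcp_launch'' options and the ''dmtcp_restart_script.sh'' name come from this page. It restarts from an existing checkpoint when one is found and otherwise launches the program under a new coordinator.

```shell
# Hypothetical wrapper sketch (not a site-provided script).
# Usage: run_job CKPTDIR INTERVAL_SECONDS PROGRAM [ARGS...]
# With DRY_RUN=1 (the default here) the chosen command is only printed;
# set DRY_RUN=0 to actually execute it.
run_job() {
    ckptdir=$1
    interval=$2
    shift 2

    if [ -x "$ckptdir/dmtcp_restart_script.sh" ]; then
        # A checkpoint exists: resume from it instead of starting over.
        set -- "$ckptdir/dmtcp_restart_script.sh"
    else
        # No checkpoint yet: start the program under a new coordinator,
        # on a random port that is logged to port.txt.
        set -- dmtcp_launch --new-coordinator --coord-port 0 \
            --port-file port.txt --interval "$interval" \
            --ckptdir "$ckptdir" "$@"
    fi

    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}
```

For example, ''run_job /sanscratch/checkpoints/111 300 ./a.out'' prints the command it would run; create the checkpoint directory first, as noted in the steps on this page.
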
 +<code>
 +
 +# sample steps to put in your scripts
 +
 +# make a directory (done by scheduler)
 +mkdir -p /sanscratch/111 /sanscratch/checkpoints/111
 +cd /sanscratch/111
 +time ./a.out
 +
 +[1]+  Done                    time ./a.out 
 +real    64m0.513s
 +user    64m0.813s
 +sys     0m0.022s
 +
 +total 1108
 +-rwxr--r-- 1 hmeij its 1126428 Jan 14 14:36 a.out
 +-rw-r--r-- 1 hmeij its      49 Jan 14 15:41 fid.txt
 +
 +[hmeij@cottontail2 111]$ cat fid.txt
 +  0.999578495067887       7.978578277139264E-005
 +  
 +# run command below with interval 300 (every 5 mins)
 +[hmeij@cottontail2 111]$ ps
 +  PID TTY          TIME CMD
 +20004 pts/1    00:00:00 time
 +20008 pts/1    00:00:00 dmtcp_coordinat
 +20010 pts/1    00:03:02 a.out
 +20060 pts/1    00:00:00 ps
 +20891 pts/1    00:00:00 bash
 +
 +# the coordinator to which the application process is attached
 +1 S hmeij    20008      0  80   0 -  4665 ep_pol 07:46 pts/1    00:00:00 \\
 +/usr/bin/dmtcp_coordinator --quiet --exit-on-last --daemon
 +# the random port (in case somebody else is also checkpointing on this host)
 +[hmeij@cottontail2 111]$ cat port.txt
 +38913
 +
 +# checkpointing every 5 minutes has little impact on run time
 +[hmeij@cottontail2 111]$ sleep 6m; ll /sanscratch/checkpoints/111/
 +total 2960
 +-rw------- 1 hmeij its 3011726 Jan 16 13:53 ckpt_a.out_24945f6ae3823bbf-40000-fb2d136b9b544.dmtcp
 +-rwxr--r-- 1 hmeij its   12440 Jan 16 13:53 dmtcp_restart_script_24945f6ae3823bbf-40000-fb2d11ea62fa0.sh
 +lrwxrwxrwx 1 hmeij its      60 Jan 16 13:53 dmtcp_restart_script.sh -> dmtcp_restart_script_24945f6ae3823bbf-40000-fb2d11ea62fa0.sh
 +[hmeij@cottontail2 111]$
 +real    63m57.287s
 +user    63m54.703s
 +sys     0m0.444s
 +
 +# let's try terminating the processes and restarting
 +[hmeij@cottontail2 111]$ time dmtcp_launch \
 +--new-coordinator --coord-port 0 --port-file port.txt \
 +-i 300 --ckptdir /sanscratch/checkpoints/111  ./a.out &
 +[1] 29201
 +[hmeij@cottontail2 111]$ ps
 +  PID TTY          TIME CMD
 +20891 pts/1    00:00:00 bash
 +29201 pts/1    00:00:00 bash
 +29202 pts/1    00:00:01 a.out
 +29210 pts/1    00:00:00 dmtcp_coordinat
 +29212 pts/1    00:00:00 ps
 +[hmeij@cottontail2 111]$ sleep 32m; kill -9 29202 29210
 +
 +# restart from checkpoints directory (or copy files into desired location)
 +cd /sanscratch/checkpoints/111
 +./dmtcp_restart_script.sh
 +# ps 
 +0 S hmeij    20891 20890  -bash
 +0 R hmeij    31136 20891 [DMTCP:a.out]
 +1 S hmeij    31201      /usr/bin/dmtcp_coordinator --quiet --exit-on-last --daemon
 +
 +# You must make sure the old directory and files exist, otherwise the restart fails:
 +[40000] ERROR at fileconnection.cpp:737 in refill; REASON='JASSERT(jalib::Filesystem::FileExists(_path)) failed'
 +     _path = /sanscratch/111/fid.txt
 +Message: File not found.
 +a.out (40000): Terminating...
 +# that is because at checkpoint time that file was opened by a.out
 +
 +
 +# launch new coordinator on random port, log port, 24 hour checkpoints 
 +# make sure you create destination dir for checkpoints
 +
 +dmtcp_launch --new-coordinator \
 +  --coord-port 0 --port-file port.txt --interval 86400 \
 +  --ckptdir /sanscratch/checkpoints/111 \
 +  time ./a.out
 +
 +
 +</code>
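
The "File not found" failure shown in the listing can be caught before restarting. The following is a rough sketch under stated assumptions: the function name ''check_restart_files'' is hypothetical, and you have to supply the list of files yourself, since which files were open at checkpoint time depends on your application. It recreates the original working directory and reports any listed file that is missing.

```shell
# Hypothetical pre-restart check (not part of DMTCP).
# Usage: check_restart_files WORKDIR FILE...
# Recreates WORKDIR if needed, prints each missing file to stderr,
# and returns non-zero if any listed file is absent.
check_restart_files() {
    workdir=$1
    shift
    mkdir -p "$workdir" || return 1

    missing=0
    for f in "$@"; do
        if [ ! -e "$workdir/$f" ]; then
            echo "missing: $workdir/$f" >&2
            missing=$((missing + 1))
        fi
    done

    [ "$missing" -eq 0 ]
}
```

With the example job on this page, one might run ''check_restart_files /sanscratch/111 fid.txt'' and only call ''./dmtcp_restart_script.sh'' from the checkpoint directory if it succeeds.
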
 ==== Quick-Start Guide ====
  
cluster/190.txt · Last modified: 2020/09/28 11:38 by hmeij07