This is a new DMTCP (https://github.com/dmtcp/dmtcp.git) plugin to checkpoint-restart CUDA applications with a novel split-process architecture.
CRAC consists of this plugin running on top of DMTCP.
This software is built and run in place, in its original directory.
Compilation needs gcc version 8 or later (9.2.0 was used here, on CentOS 7, compute node n79).
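The gcc requirement can be checked before building. The following is a minimal sketch, not part of the CRAC or DMTCP sources: `gcc_major_ok` is a hypothetical helper name, and the check assumes `gcc -dumpversion` prints a version whose first dot-separated field is the major version.

```shell
# Hypothetical helper (not part of DMTCP or CRAC):
# succeeds if the major part of the given version string is 8 or later.
gcc_major_ok() {
    major=${1%%.*}          # strip everything after the first dot
    [ "$major" -ge 8 ]
}

# gcc -dumpversion prints the compiler version, e.g. 9.2.0 (or just 9 on some builds).
if command -v gcc >/dev/null 2>&1 && gcc_major_ok "$(gcc -dumpversion)"; then
    echo "gcc is new enough"
else
    echo "gcc 8 or later is required"
fi
```

If the system gcc is too old, a newer one has to come first in PATH before running ./configure.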
# env on node n79 CRAC-early-developmennt-master.zip
export PATH=/share/apps/CENTOS7/openmpi/4.0.4/bin:$PATH
export LD_LIBRARY_PATH=/share/apps/CENTOS7/openmpi/4.0.4/lib:$LD_LIBRARY_PATH
export PATH=/share/apps/CENTOS7/python/3.8.3/bin:$PATH
export LD_LIBRARY_PATH=/share/apps/CENTOS7/python/3.8.3/lib:$LD_LIBRARY_PATH
export CUDA_HOME=/usr/local/cuda
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib:$LD_LIBRARY_PATH

# make in place
cd /share/apps/CENTOS7/dmtcp/3.0.0.b/
./configure
make # no errors

$ ls bin
dmtcp_command      dmtcp_discover_rm  dmtcp_nocheckpoint  dmtcp_rm_loclaunch  dmtcp_ssh   mtcp_restart
dmtcp_coordinator  dmtcp_launch       dmtcp_restart       dmtcp_srun_helper   dmtcp_sshd

make check # all failed, msg: checkpoint error ???
make check2
make check3