cluster:211




Back

DMTCP CRAC

This is a new DMTCP (https://github.com/dmtcp/dmtcp.git) plugin to checkpoint-restart CUDA applications with a novel split-process architecture.

CRAC consists of this plugin on top of DMTCP.
The software is built and run in place, from its original directory.

Compilation needs gcc version 8 or later (using 9.2.0 on CentOS 7, compute node n79)
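
A quick way to confirm the compiler requirement before building is sketched below. The helper name `version_ge` is hypothetical (not part of DMTCP or CRAC), and it assumes GNU `sort -V` is available, as it is on CentOS 7.

```shell
# Hypothetical helper: succeed if version $1 >= version $2 (relies on GNU sort -V).
version_ge() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Compare the gcc found first in PATH against the minimum of 8.
if command -v gcc >/dev/null 2>&1; then
    gcc_version=$(gcc -dumpversion)
    if version_ge "$gcc_version" 8; then
        echo "gcc $gcc_version is new enough"
    else
        echo "gcc $gcc_version is too old, need 8 or later"
    fi
else
    echo "no gcc in PATH"
fi
```

Run this after setting PATH so that it tests the compiler the build will actually pick up.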

# env on node n79, source: CRAC-early-development-master.zip

 export PATH=/share/apps/CENTOS7/openmpi/4.0.4/bin:$PATH
 export LD_LIBRARY_PATH=/share/apps/CENTOS7/openmpi/4.0.4/lib:$LD_LIBRARY_PATH
 export PATH=/share/apps/CENTOS7/python/3.8.3/bin:$PATH
 export LD_LIBRARY_PATH=/share/apps/CENTOS7/python/3.8.3/lib:$LD_LIBRARY_PATH

 export CUDA_HOME=/usr/local/cuda
 export PATH=/usr/local/cuda/bin:$PATH
 export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
 export LD_LIBRARY_PATH=/usr/local/cuda/lib:$LD_LIBRARY_PATH


# make in place
cd /share/apps/CENTOS7/dmtcp/3.0.0.b/
./configure
make  # no errors
$ ls bin
dmtcp_command      dmtcp_discover_rm  dmtcp_nocheckpoint  dmtcp_rm_loclaunch  dmtcp_ssh   mtcp_restart
dmtcp_coordinator  dmtcp_launch       dmtcp_restart       dmtcp_srun_helper   dmtcp_sshd

make check  # all failed, msg: checkpoint error ???
make check2 # /bin/sh: -c: line 11: syntax error near unexpected token `&'
make check3 # /bin/sh: -c: line 11: syntax error near unexpected token `&'

cd contrib/split-cuda
# edit Makefile: set the compilers to the gcc/g++ found in PATH
make # no errors, but a library is missing at run time (see ldd below)

$ ls             
libdmtcp_split-cuda.so
kernel-loader.exe  
libcuda_wrappers.so 

# link flags: -lcuda -lcusparse -lcusolver -lcublas
# my CUDA 10.2 toolkit does not provide cublas v11,
# so link against the lowest version available in hpc_sdk

# seems to have worked
$ ldd kernel-loader.exe 
# before
	libcublas.so.11 => not found
# after
	libcublas.so.11 => /usr/local/cuda/lib64/libcublas.so.11 (0x00007fc3b877a000)
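
The `ldd` check above can be scripted to flag any shared library the loader still cannot resolve. This is a minimal sketch: the function name `missing_libs` is hypothetical, and it only parses the standard `=> not found` lines that `ldd` prints.

```shell
# Hypothetical helper: read ldd output on stdin and print each library
# name that the dynamic loader reports as "not found".
missing_libs() {
    awk '/=> not found/ { print $1 }'
}

# Check the three files built in contrib/split-cuda, skipping any that are absent.
for f in kernel-loader.exe libdmtcp_split-cuda.so libcuda_wrappers.so; do
    [ -e "$f" ] || continue
    unresolved=$(ldd "$f" | missing_libs)
    if [ -n "$unresolved" ]; then
        echo "$f is missing: $unresolved"
    else
        echo "$f: all libraries resolved"
    fi
done
```

Run it from contrib/split-cuda with the same LD_LIBRARY_PATH that will be used at run time, since `ldd` resolves libraries through that variable.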

Next, cobble together a GPU program such as LAMMPS or Amber and test it on a GPU. Alternatively, wait for the new compute nodes to arrive with the latest toolkit and redo the build.
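
Because `make check` failed with a checkpoint error, it may be worth confirming that plain DMTCP can checkpoint and restart a trivial CPU program before moving on to a GPU code. The sketch below is an assumption-laden smoke test, not a CRAC test. The function name `dmtcp_smoke_test` is hypothetical, the DMTCP `bin` directory built above is assumed to be in PATH, and the 2-second waits are arbitrary. How a CUDA program is launched through the split-cuda plugin is not shown here, since these notes do not cover it.

```shell
# Hypothetical helper: checkpoint and restart a trivial CPU program with DMTCP.
# Assumes dmtcp_launch, dmtcp_command and dmtcp_restart are in PATH.
dmtcp_smoke_test() {
    command -v dmtcp_launch >/dev/null 2>&1 || {
        echo "dmtcp_launch not found in PATH" >&2
        return 1
    }
    workdir=$(mktemp -d) || return 1
    (
        cd "$workdir" || exit 1
        dmtcp_launch sleep 300 &        # run a trivial program under DMTCP
        sleep 2                         # arbitrary wait for it to reach the coordinator
        dmtcp_command --bcheckpoint     # checkpoint now and wait until complete
        dmtcp_command --kill            # stop the running computation
        ls ckpt_*.dmtcp || exit 1       # image(s) are written to the current directory
        dmtcp_restart ckpt_*.dmtcp &    # resume from the checkpoint image
        sleep 2
        dmtcp_command --status          # reports coordinator status and connected peers
        dmtcp_command --quit            # kill the computation and stop the coordinator
        wait
    )
}

# Usage:
#   dmtcp_smoke_test && echo "checkpoint/restart cycle completed"
```

If this cycle fails as well, the problem is likely in the base DMTCP build rather than in the CUDA plugin.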



cluster/211.1646071525.txt.gz · Last modified: 2022/02/28 18:05 by hmeij07