In 2013 we bought five servers, each with four K20 GPUs inside. Since then they have been used but not maintained. Now that we have newer GPUs (consult page GTX 1080 Ti), usage has dropped off somewhat, so I'm taking the opportunity to redo them using the latest Nvidia, CentOS and application software. After all, together they provide 23 teraflops of GPU compute capacity (dpfp).
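The 23 teraflops figure can be checked with a quick calculation. This is only a sketch: it assumes NVIDIA's published peak of 1.17 double precision teraflops per K20, a per-card number that is not stated on this page.

```shell
# Peak double precision capacity of the K20 pool.
# Assumption: 1.17 TFLOPS (dpfp) peak per K20, NVIDIA's published figure.
servers=5
gpus_per_server=4
tflops_per_gpu=1.17

total_tflops() {
    # awk does the floating point math that plain sh cannot
    awk -v s="$1" -v g="$2" -v t="$3" 'BEGIN { printf "%.1f\n", s * g * t }'
}

total_tflops "$servers" "$gpus_per_server" "$tflops_per_gpu"
```

With these inputs the result is 23.4, which rounds down to the 23 quoted above.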
We'll add the Nvidia development and tools packages to our golden image node, followed by compilations of Amber, Gromacs, Lammps and Namd, and then join the node to our scheduler environment. After that we'll build the node into a massively parallel database server using mapd (running sql on gpus, that should be fun).
Test it all out. Then decide when to image the other servers.
We will place all software in /usr/local, keeping it out of the CHROOT when the vnfs is built (which is CentOS 7.2 anyway). We will add the tarball unpacking to the post imaging script for installation. For background, consult the pages on how to build provision server OpenHPC 1.3.1 and how to build Warewulf Golden Image. Keep the packages synced (other than the kernel) between the 7.2 CHROOT and the 7.5 compute node.
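The tarball unpacking step in the post imaging script could look roughly like the sketch below. The function name `unpack_software` and the tarball path in the usage note are hypothetical; the real post imaging script is not shown on this page.

```shell
# Unpack a software tarball into a target directory (default /usr/local).
# Hypothetical helper; the tarball name and location are assumptions.
unpack_software() {
    tarball=$1
    dest=${2:-/usr/local}

    if [ ! -f "$tarball" ]; then
        echo "missing tarball: $tarball" >&2
        return 1
    fi
    mkdir -p "$dest" || return 1
    # -C switches to the target directory before extracting
    tar -xzf "$tarball" -C "$dest"
}
```

A call such as `unpack_software /path/to/usr-local-software.tar.gz` (a made-up file name) would then restore the software tree on a freshly imaged node without it ever living inside the vnfs.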
On a 7.2 image you need to
Installation
# ifdown eth1, hook up public network, ifup eth1
# route add default gw 129.133.52.1
systemctl start iptables

# 7.5
yum update kernel kernel-tools kernel-tools-libs
yum install kernel-devel kernel-headers (remove old headers after reboot)
yum install gcc gcc-devel gcc-gfortran gcc-c++

# download runfiles from https://developer.nvidia.com/cuda-downloads
sh cuda_name_of_runfile
sh cuda_name_of_runfile_patch

Install NVIDIA Accelerated Graphics Driver for Linux-x86_64 396.26?
(y)es/(n)o/(q)uit: n
Install the CUDA 9.2 Toolkit?
(y)es/(n)o/(q)uit: y
Enter Toolkit Location
 [ default is /usr/local/cuda-9.2 ]:
Do you want to install a symbolic link at /usr/local/cuda?
(y)es/(n)o/(q)uit: y
Install the CUDA 9.2 Samples?
(y)es/(n)o/(q)uit: n

#/etc/modprobe.d/blacklist-nouveau.conf, reboot before driver installation
blacklist nouveau
options nouveau modeset=0

reboot

# nvidia driver
./cuda_name_of_runfile -silent -driver

# backup
[root@n37 src]# rpm -qf /usr/lib/libGL.so
file /usr/lib/libGL.so is not owned by any package
cp /usr/lib/libGL.so /usr/lib/libGL.so-nvidia

[root@n37 src]# ls /etc/X11/xorg.conf
ls: cannot access /etc/X11/xorg.conf: No such file or directory
[root@n37 src]# find /usr/local/cuda-9.2 -name nvidia-xconfig*
[root@n37 src]#
[root@n37 src]# scp n78:/etc/X11/xorg.conf /etc/X11/

# Device files/dev/nvidia* exist with 0666 permissions?
# They were not
/usr/local/src/nvidia-modprobe.sh

# new kernel initramfs, load
dracut --force
reboot
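The question in the notes above, whether the /dev/nvidia* device files exist with 0666 permissions, can be answered with a small check. This is a sketch only; the function name `nvidia_devs_wrong_mode` is made up, and the directory is a parameter purely so the check can be tried against a scratch directory.

```shell
# List NVIDIA device files that are NOT mode 0666 (rw-rw-rw-).
# Prints nothing when every device file has the expected permissions.
# Hypothetical helper; the directory defaults to /dev.
nvidia_devs_wrong_mode() {
    devdir=${1:-/dev}
    for dev in "$devdir"/nvidia*; do
        [ -e "$dev" ] || continue           # glob matched nothing
        [ -d "$dev" ] && continue           # only look at files, not directories
        # first 10 characters of ls -l are file type plus permission bits
        perms=$(ls -ld "$dev" | cut -c1-10)
        case $perms in
            ?rw-rw-rw-) ;;                  # 0666, any file type
            *) echo "$dev $perms" ;;
        esac
    done
}
```

Running `nvidia_devs_wrong_mode` after the reboot and getting no output would suggest the modprobe script did its job. Note that it also prints nothing when no device files exist at all, so an `ls /dev/nvidia*` beforehand is still worthwhile.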
For the user environment
Verification
[root@n37 cuda-9.2]# /usr/local/cuda/extras/demo_suite/deviceQuery
/usr/local/cuda/extras/demo_suite/deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 4 CUDA Capable device(s)

Device 0: "Tesla K20m"
  CUDA Driver Version / Runtime Version          9.2 / 9.2
  CUDA Capability Major/Minor version number:    3.5
...
> Peer access from Tesla K20m (GPU0) -> Tesla K20m (GPU1) : Yes
> Peer access from Tesla K20m (GPU0) -> Tesla K20m (GPU2) : No
> Peer access from Tesla K20m (GPU0) -> Tesla K20m (GPU3) : No
> Peer access from Tesla K20m (GPU1) -> Tesla K20m (GPU0) : Yes
> Peer access from Tesla K20m (GPU1) -> Tesla K20m (GPU2) : No
> Peer access from Tesla K20m (GPU1) -> Tesla K20m (GPU3) : No
> Peer access from Tesla K20m (GPU2) -> Tesla K20m (GPU0) : No
> Peer access from Tesla K20m (GPU2) -> Tesla K20m (GPU1) : No
> Peer access from Tesla K20m (GPU2) -> Tesla K20m (GPU3) : Yes
> Peer access from Tesla K20m (GPU3) -> Tesla K20m (GPU0) : No
> Peer access from Tesla K20m (GPU3) -> Tesla K20m (GPU1) : No
> Peer access from Tesla K20m (GPU3) -> Tesla K20m (GPU2) : Yes

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 9.2, CUDA Runtime Version = 9.2, NumDevs = 4, Device0 = Tesla K20m, Device1 = Tesla K20m, Device2 = Tesla K20m, Device3 = Tesla K20m
Result = PASS
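The peer access lines are the interesting part of that output when deciding which GPUs to pair for multi-GPU jobs. A small filter can pull out just the pairs that answer Yes. This is a sketch; the function name `peer_pairs` is made up and the pattern assumes the line layout shown in the output above.

```shell
# Print the GPU pairs for which deviceQuery reports peer access.
# Reads deviceQuery output on stdin; prints lines such as "GPU0 GPU1".
# Assumed line layout (taken from the output above):
#   > Peer access from Tesla K20m (GPU0) -> Tesla K20m (GPU1) : Yes
peer_pairs() {
    sed -n 's/.*Peer access from .*(\(GPU[0-9]*\)) -> .*(\(GPU[0-9]*\)) : Yes.*/\1 \2/p'
}
```

Used as `/usr/local/cuda/extras/demo_suite/deviceQuery | peer_pairs`. For the output above this yields GPU0 GPU1, GPU1 GPU0, GPU2 GPU3 and GPU3 GPU2, so the four cards form two peer groups: 0 with 1, and 2 with 3.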
bandwidthTest
[root@n37 cuda-9.2]# /usr/local/cuda/extras/demo_suite/bandwidthTest
[CUDA Bandwidth Test] - Starting...
Running on...

 Device 0: Tesla K20m
 Quick Mode

 Host to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     6181.3

 Device to Host Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     6530.0

 Device to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     137200.1

Result = PASS
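When the same test is repeated on the other servers it helps to reduce the output to one line per direction so nodes can be compared side by side. A sketch, with a made-up function name `bandwidth_summary`, assuming the layout of the output above:

```shell
# Condense bandwidthTest output to "<direction>: <MB/s> MB/s" lines.
# Reads bandwidthTest output on stdin.
bandwidth_summary() {
    awk '
        / Bandwidth, / {
            # "Host to Device Bandwidth, 1 Device(s)" -> "Host to Device"
            label = $0
            sub(/^[ \t]*/, "", label)
            sub(/ Bandwidth,.*/, "", label)
        }
        # data rows have exactly two numeric fields: size and bandwidth
        NF == 2 && $1 ~ /^[0-9]+$/ && $2 ~ /^[0-9.]+$/ {
            printf "%s: %s MB/s\n", label, $2
        }
    '
}
```

Used as `/usr/local/cuda/extras/demo_suite/bandwidthTest | bandwidth_summary`.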
Finish
Requirements
# As root check requirements
rpm -qa | grep ^gcc
rpm -qa | grep ^g++
rpm -qa | grep ^flex
rpm -qa | grep ^tcsh
rpm -qa | grep ^zlib
rpm -qa | grep ^zlib-devel
rpm -qa | grep ^bzip2
rpm -qa | grep ^bzip2-devel
rpm -qa | grep ^bzip
rpm -qa | grep ^bzip-devel
rpm -qa | grep ^libXt
rpm -qa | grep ^libXext
rpm -qa | grep ^libXdmcp
rpm -qa | grep ^tkinter # weird one need python 2.6.6_something
rpm -qa | grep ^openmpi
rpm -qa | grep ^perl | egrep "^perl-5|^perl-ExtUtils-MakeMaker" # both
rpm -qa | grep ^patch
rpm -qa | grep ^bison

# As root install missing
yum install flex bzip2-devel libXdmcp zlib zlib-devel
yum install tkinter openmpi perl-ExtUtils-MakeMaker patch bison
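The list of grep checks above can be collapsed into one loop that prints only what is missing. This is a sketch rather than what was actually run: `missing_packages` and the `PKG_QUERY` override are made-up names, and `rpm -q` matches exact package names where the grep checks match prefixes. For the same reason the `grep ^g++` check finds nothing on CentOS, where the C++ compiler package is named gcc-c++.

```shell
# Print every package from the argument list that is not installed.
# PKG_QUERY is a hypothetical override hook (default: rpm -q) so the loop
# can be exercised on a machine without rpm.
missing_packages() {
    query=${PKG_QUERY:-rpm -q}
    for pkg in "$@"; do
        # rpm -q exits non-zero when the package is not installed
        $query "$pkg" >/dev/null 2>&1 || echo "$pkg"
    done
}
```

For example, `missing_packages gcc gcc-c++ flex tcsh zlib zlib-devel bzip2 bzip2-devel bison` lists the gaps, and the result can be handed to `yum install` as root.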
Compilations
# as regular user
# amber16 dir will be created
cd /usr/local
tar xvfj /share/apps/src/n33/AmberTools17.tar.bz2
tar xvfj /share/apps/src/n33/Amber16.tar.bz2

export AMBERHOME=/usr/local/amber16
cd $AMBERHOME

# to preserve existing work flows
export PATH=/share/apps/CENTOS6/python/2.7.9/bin:$PATH
export LD_LIBRARY_PATH=/share/apps/CENTOS6/python/2.7.9/lib:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=/share/apps/CENTOS6/openmpi/1.8.4/lib/:$LD_LIBRARY_PATH
export PATH=/share/apps/CENTOS6/openmpi/1.8.4/bin:$PATH

# use gnu, Y to patches, Y to miniconda
# bundled netcdf, fftw
# (2>&1 goes before the pipe so stderr lands in the log too)
./configure gnu 2>&1 | tee -a amber16-install.log
source /usr/local/amber16/amber.sh
make install 2>&1 | tee -a amber16-install.log
Installation of Amber16 (serial) is complete at Wed Aug 22 10:12:55 EDT 2018.

./configure -mpi gnu 2>&1 | tee -a amber16-install.log
source /usr/local/amber16/amber.sh
make install 2>&1 | tee -a amber16-install.log
Installation of Amber16 (parallel) is complete at Wed Aug 22 10:36:45 EDT 2018.

export PATH=/usr/local/cuda/bin:/usr/local/cuda/jre/bin:/usr/local/cuda/nvvm/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:/usr/local/cuda/jre/lib:$LD_LIBRARY_PATH

# $AMBERHOME/AmberTools/src/configure2
# edit and bypass cuda test for 9.0 -> 9.2 version
# please be sure to verify any results against known outcomes
export CUDA_HOME=/usr/local/cuda
./configure -cuda gnu 2>&1 | tee -a amber16-install.log
source /usr/local/amber16/amber.sh
make install 2>&1 | tee -a amber16-install.log
Installation of pmemd.cuda complete

./configure -mpi -cuda gnu 2>&1 | tee -a amber16-install.log
source /usr/local/amber16/amber.sh
make install 2>&1 | tee -a amber16-install.log
Installation of pmemd.cuda.MPI complete

[hmeij@n37 amber16]$ ls -l bin/pmemd*
-rwxr-xr-x 1 hmeij its  3097968 Aug 22 10:12 bin/pmemd
lrwxrwxrwx 1 hmeij its       15 Aug 22 15:19 bin/pmemd.cuda -> pmemd.cuda_SPFP
-rwxr-xr-x 1 hmeij its 38851928 Aug 22 15:25 bin/pmemd.cuda_DPFP
-rwxr-xr-x 1 hmeij its 39436704 Aug 22 16:04 bin/pmemd.cuda_DPFP.MPI
lrwxrwxrwx 1 hmeij its       19 Aug 22 15:57 bin/pmemd.cuda.MPI -> pmemd.cuda_SPFP.MPI
-rwxr-xr-x 1 hmeij its 32950848 Aug 22 15:19 bin/pmemd.cuda_SPFP
-rwxr-xr-x 1 hmeij its 33531456 Aug 22 15:57 bin/pmemd.cuda_SPFP.MPI
-rwxr-xr-x 1 hmeij its 33405504 Aug 22 15:31 bin/pmemd.cuda_SPXP
-rwxr-xr-x 1 hmeij its 33990208 Aug 22 16:10 bin/pmemd.cuda_SPXP.MPI
-rwxr-xr-x 1 hmeij its  3647784 Aug 22 10:36 bin/pmemd.MPI
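Once the other servers are imaged, the four build passes (serial, parallel, cuda, cuda plus mpi) can be spot checked by confirming that each entry point is present and executable. A sketch with a made-up function name `check_pmemd`; the binary names are taken from the listing above, and the symlinks count because the `-x` test follows them to their targets.

```shell
# Report which of the expected pmemd entry points are missing or not
# executable. Returns non-zero if any are missing.
check_pmemd() {
    bindir=$1
    rc=0
    for name in pmemd pmemd.MPI pmemd.cuda pmemd.cuda.MPI; do
        # -x follows symlinks, so a dangling pmemd.cuda link is caught too
        if [ ! -x "$bindir/$name" ]; then
            echo "missing: $bindir/$name"
            rc=1
        fi
    done
    return $rc
}
```

Used as `check_pmemd /usr/local/amber16/bin`. This only confirms the files exist; results still need to be verified against known outcomes, as the notes above say.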