K20 Redo

In 2013 we bought five servers, each with four K20 GPUs. Since then they have been used but not maintained, and since we acquired newer GPUs (consult page GTX 1080 Ti) usage has dropped off somewhat. So I'm taking the opportunity to redo them using the latest Nvidia, CentOS and application software. After all, together they provide 23 teraflops of double precision (dpfp) GPU compute capacity.

We'll add the Nvidia development and tools packages to our golden image node, followed by compilations of Amber, Gromacs, Lammps and Namd. Then we'll join the node to our scheduler environment and build it into a massively parallel database server using mapd (running sql on gpus, that should be fun).

Test it all out. Then decide when to image the other servers.

We will place all software in /usr/local, keeping it out of the CHROOT when the vnfs is built (which is CentOS 7.2 anyway). We will add the tarball unpacking to the post-imaging script for installation. See the pages on how to build the provision server (OpenHPC 1.3.1) and how to build the Warewulf Golden Image. Keep the packages synced (other than the kernel) between the 7.2 CHROOT and the 7.5 compute node.

Note

After deploying a 7.2 image you need to

  • copy passwd, shadow, group, hosts, fstab from global archive
  • check polkit user … screws up systemd-logind
  • connectX mlx4_0 IB interface breaks in CentOS 7.3+
  • unmount NFS mounts while installing nvidia as root
  • install other software as regular user
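The copy-from-archive step can be sketched as a small helper that keeps a timestamped backup of each file before replacing it. This is only a sketch: the function name `install_with_backup` and the `ARCHIVE` placeholder are hypothetical, since the page does not give the location of the global archive.

```shell
# Hypothetical helper: back up a system file, then replace it with the
# copy from the global archive. All paths passed in are assumptions.
install_with_backup() {
    src=$1
    dst=$2
    stamp=$(date +%Y%m%d%H%M%S)
    [ -f "$src" ] || { echo "missing source: $src" >&2; return 1; }
    # keep a timestamped copy of whatever is currently in place
    if [ -f "$dst" ]; then
        cp -p "$dst" "$dst.bak-$stamp" || return 1
    fi
    cp -p "$src" "$dst"
}

# Example (as root); ARCHIVE is a placeholder for the real archive location:
# ARCHIVE=/path/to/global/archive
# for f in passwd shadow group hosts fstab; do
#     install_with_backup "$ARCHIVE/$f" "/etc/$f"
# done
```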

Nvidia

Installation

# ifdown eth1, hook up public network, ifup eth1
# route add default gw 129.133.52.1
systemctl start iptables

# 7.5
yum update kernel kernel-tools kernel-tools-libs
yum install kernel-devel kernel-headers  # remove old headers after reboot
yum install gcc gcc-gfortran gcc-c++  # CHROOT done
yum install tcl tcl-devel # CHROOT done

# /etc/modprobe.d/blacklist-nouveau.conf (new file, per nvidia) # CHROOT done
cat > /etc/modprobe.d/blacklist-nouveau.conf <<'EOF'
blacklist nouveau
options nouveau modeset=0
EOF

# rebuild the initramfs without nouveau, then reboot before driver installation
dracut --force

reboot


# download runfiles from https://developer.nvidia.com/cuda-downloads
# files in /usr/local/src
sh cuda_9.2.148_396.37_linux.run


Install NVIDIA Accelerated Graphics Driver for Linux-x86_64 396.26?
(y)es/(n)o/(q)uit: n

Install the CUDA 9.2 Toolkit?
(y)es/(n)o/(q)uit: y

Enter Toolkit Location
 [ default is /usr/local/cuda-9.2 ]: 

Do you want to install a symbolic link at /usr/local/cuda?
(y)es/(n)o/(q)uit: y

Install the CUDA 9.2 Samples?
(y)es/(n)o/(q)uit: n

# nvidia driver
./cuda_name_of_runfile -silent -driver

# Do the device files /dev/nvidia* exist with 0666 permissions?
# They did not, so run the modprobe script
/usr/local/src/nvidia-modprobe.sh

# backup
[root@n37 src]# rpm -qf /usr/lib/libGL.so
file /usr/lib/libGL.so is not owned by any package
cp /usr/lib/libGL.so.1.7.0   /usr/lib/libGL.so.1.7.0-nvidia
cp /usr/lib64/libGL.so.1.7.0 /usr/lib64/libGL.so.1.7.0-nvidia

[root@n37 src]# ls /etc/X11/xorg.conf
ls: cannot access /etc/X11/xorg.conf: No such file or directory
[root@n37 src]# find /usr/local/cuda-9.2 -name nvidia-xconfig*
[root@n37 src]#
[root@n37 src]# scp n78:/etc/X11/xorg.conf /etc/X11/  # CHROOT done

# mapd needs graphics support enabled: GPU operation mode 0 (All On)
nvidia-smi --gom=0
# persistence and exclusivity have been left at their defaults for now

reboot
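The device file check in the listing above can be scripted. The sketch below is an assumption about how one might do it, not the contents of nvidia-modprobe.sh; the function name `check_dev_perms` is hypothetical, and it relies on GNU `stat -c`, which CentOS provides.

```shell
# Report any of the given device nodes that is not mode 666.
# Returns 0 only if at least one path exists and all have mode 666.
check_dev_perms() {
    bad=0
    seen=0
    for dev in "$@"; do
        [ -e "$dev" ] || continue
        seen=1
        mode=$(stat -c '%a' "$dev")
        if [ "$mode" != "666" ]; then
            echo "$dev has mode $mode, expected 666"
            bad=1
        fi
    done
    [ "$seen" -eq 1 ] && [ "$bad" -eq 0 ]
}

# Example:
# check_dev_perms /dev/nvidia* || /usr/local/src/nvidia-modprobe.sh
```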

For the user environment

  • export PATH=/usr/local/cuda/bin:$PATH
  • export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
  • export CUDA_HOME=/usr/local/cuda
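These exports could be collected in one shell snippet that users source from their login scripts. The sketch below is one possible arrangement; the helper `path_prepend` is hypothetical and only avoids adding the same directory twice when the snippet is sourced repeatedly.

```shell
# Prepend a directory to a colon-separated list unless it is already there.
# Prints the resulting list; does not modify the environment itself.
path_prepend() {
    dir=$1
    list=$2
    case ":$list:" in
        *":$dir:"*) printf '%s\n' "$list" ;;
        *)          printf '%s\n' "${dir}${list:+:$list}" ;;
    esac
}

export CUDA_HOME=/usr/local/cuda
PATH=$(path_prepend "$CUDA_HOME/bin" "$PATH")
LD_LIBRARY_PATH=$(path_prepend "$CUDA_HOME/lib64" "${LD_LIBRARY_PATH:-}")
export PATH LD_LIBRARY_PATH
```

Unlike a plain `export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH`, this does not leave a trailing colon when the variable was previously empty.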

Verification

[root@n37 cuda-9.2]# /usr/local/cuda/extras/demo_suite/deviceQuery 
/usr/local/cuda/extras/demo_suite/deviceQuery Starting...          

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 4 CUDA Capable device(s)

Device 0: "Tesla K20m"
  CUDA Driver Version / Runtime Version          9.2 / 9.2
  CUDA Capability Major/Minor version number:    3.5 
...
> Peer access from Tesla K20m (GPU0) -> Tesla K20m (GPU1) : Yes
> Peer access from Tesla K20m (GPU0) -> Tesla K20m (GPU2) : No
> Peer access from Tesla K20m (GPU0) -> Tesla K20m (GPU3) : No
> Peer access from Tesla K20m (GPU1) -> Tesla K20m (GPU0) : Yes
> Peer access from Tesla K20m (GPU1) -> Tesla K20m (GPU2) : No
> Peer access from Tesla K20m (GPU1) -> Tesla K20m (GPU3) : No
> Peer access from Tesla K20m (GPU2) -> Tesla K20m (GPU0) : No
> Peer access from Tesla K20m (GPU2) -> Tesla K20m (GPU1) : No
> Peer access from Tesla K20m (GPU2) -> Tesla K20m (GPU3) : Yes
> Peer access from Tesla K20m (GPU3) -> Tesla K20m (GPU0) : No
> Peer access from Tesla K20m (GPU3) -> Tesla K20m (GPU1) : No
> Peer access from Tesla K20m (GPU3) -> Tesla K20m (GPU2) : Yes

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 9.2, 
  CUDA Runtime Version = 9.2, NumDevs = 4, 
  Device0 = Tesla K20m, Device1 = Tesla K20m, 
  Device2 = Tesla K20m, Device3 = Tesla K20m
Result = PASS
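When imaging the other servers, the same verification could be automated rather than read by eye. A minimal sketch, assuming the output format shown above; the function name `check_device_query` is hypothetical.

```shell
# Succeed only if deviceQuery output (on stdin) reports the expected
# number of devices and contains a "Result = PASS" line.
check_device_query() {
    expected=$1
    awk -v want="$expected" '
        /^Detected [0-9]+ CUDA Capable device/ { found = $2 }
        /^Result = PASS/                        { pass = 1 }
        END { exit !(pass && found == want) }
    '
}

# Example:
# /usr/local/cuda/extras/demo_suite/deviceQuery | check_device_query 4 \
#     && echo "all four K20s detected"
```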

bandwidthTest

[root@n37 cuda-9.2]# /usr/local/cuda/extras/demo_suite/bandwidthTest
[CUDA Bandwidth Test] - Starting...
Running on...

 Device 0: Tesla K20m
 Quick Mode

 Host to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     6181.3

 Device to Host Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     6530.0

 Device to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     137200.1

Result = PASS
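To compare nodes after reimaging, the bandwidth figures can be reduced to one line per direction. This is a sketch that assumes the quick mode output layout shown above; `bandwidth_summary` is a hypothetical name.

```shell
# Print "<direction> <MB/s>" for each section of bandwidthTest output on stdin.
bandwidth_summary() {
    awk '
        # section header, e.g. " Host to Device Bandwidth, 1 Device(s)"
        / Bandwidth, / { dir = $1 "-to-" $3 }
        # first data row after a header: "<bytes> <MB/s>"
        dir != "" && NF == 2 && $1 ~ /^[0-9]+$/ { print dir, $2; dir = "" }
    '
}

# Example:
# /usr/local/cuda/extras/demo_suite/bandwidthTest | bandwidth_summary
```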

Finish

  • yum install freeglut-devel libX11-devel libXi-devel libXmu-devel make mesa-libGLU-devel # CHROOT done
  • yum install blas blas-devel lapack lapack-devel #CHROOT done
  • check for /usr/lib64/libvdpau_nvidia.so
  • [root@n37 /]# tar -cvf /tmp/n37.chroot.ul.tar usr/local
  • [root@n37 /]# scp /tmp/n37.chroot.ul.tar sms_server:/var/chroots/goldimages/
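The tarball step can be wrapped so that the archive is listed back before it is copied to the provisioning server. A sketch under the assumption that a readable listing is a sufficient sanity check; `make_ul_tarball` is a hypothetical name.

```shell
# Create a tarball of usr/local relative to a root directory, then list it
# back to confirm the archive is readable before copying it anywhere.
make_ul_tarball() {
    root=$1
    out=$2
    tar -C "$root" -cf "$out" usr/local || return 1
    tar -tf "$out" > /dev/null
}

# Example (as root):
# make_ul_tarball / /tmp/n37.chroot.ul.tar \
#     && scp /tmp/n37.chroot.ul.tar sms_server:/var/chroots/goldimages/
```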

Amber

Requirements

# As root check requirements # CHROOT done
rpm -qa | grep ^flex
rpm -qa | grep ^tcsh
rpm -qa | grep ^zlib
rpm -qa | grep ^zlib-devel
rpm -qa | grep ^bzip2
rpm -qa | grep ^bzip2-devel
rpm -qa | grep ^bzip
rpm -qa | grep ^bzip-devel
rpm -qa | grep ^libXt
rpm -qa | grep ^libXext
rpm -qa | grep ^libXdmcp
rpm -qa | grep ^tkinter # weird one need python 2.6.6_something
rpm -qa | grep ^openmpi
rpm -qa | grep ^perl | egrep "^perl-5|^perl-ExtUtils-MakeMaker" # both
rpm -qa | grep ^patch
rpm -qa | grep ^bison

# As root install whatever is missing # CHROOT done
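The requirement checks can be collapsed into a loop that prints only what is missing. A sketch; `missing_pkgs` is a hypothetical name, and note that `rpm -q` matches exact package names, unlike the `grep ^prefix` checks above.

```shell
# Print the names of packages from the argument list that rpm does not
# report as installed.
missing_pkgs() {
    for pkg in "$@"; do
        rpm -q "$pkg" > /dev/null 2>&1 || echo "$pkg"
    done
}

# Example, using package names from the checks above:
# missing_pkgs flex tcsh zlib zlib-devel bzip2 bzip2-devel libXt libXext \
#     libXdmcp tkinter openmpi perl perl-ExtUtils-MakeMaker patch bison
```

The printed list could then be handed to `yum install` as root.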

Compilations

# as regular user
# amber16 dir will be created
cd /usr/local
tar xvfj /share/apps/src/n33/AmberTools17.tar.bz2
tar xvfj /share/apps/src/n33/Amber16.tar.bz2 
export AMBERHOME=/usr/local/amber16
cd $AMBERHOME

# to preserve existing work flows
export PATH=/share/apps/CENTOS6/python/2.7.9/bin:$PATH
export LD_LIBRARY_PATH=/share/apps/CENTOS6/python/2.7.9/lib:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=/share/apps/CENTOS6/openmpi/1.8.4/lib/:$LD_LIBRARY_PATH
export PATH=/share/apps/CENTOS6/openmpi/1.8.4/bin:$PATH

# use gnu, Y to patches, Y to miniconda 
# bundled netcdf, fftw
# 2>&1 goes before the pipe so stderr ends up in the log too
./configure gnu 2>&1 | tee -a amber16-install.log
source /usr/local/amber16/amber.sh
make install 2>&1 | tee -a amber16-install.log
Installation of Amber16 (serial) is complete at Wed Aug 22 10:12:55 EDT 2018.

./configure -mpi gnu 2>&1 | tee -a amber16-install.log
source /usr/local/amber16/amber.sh
make install 2>&1 | tee -a amber16-install.log
Installation of Amber16 (parallel) is complete at Wed Aug 22 10:36:45 EDT 2018.

export PATH=/usr/local/cuda/bin:/usr/local/cuda/jre/bin:/usr/local/cuda/nvvm/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:/usr/local/cuda/jre/lib:$LD_LIBRARY_PATH
# edit $AMBERHOME/AmberTools/src/configure2
# and bypass the cuda version test (9.0 -> 9.2)
# please be sure to verify any results against known outcomes
export CUDA_HOME=/usr/local/cuda

./configure -cuda gnu 2>&1 | tee -a amber16-install.log
source /usr/local/amber16/amber.sh
make install 2>&1 | tee -a amber16-install.log
Installation of pmemd.cuda complete

./configure -mpi -cuda gnu 2>&1 | tee -a amber16-install.log
source /usr/local/amber16/amber.sh
make install 2>&1 | tee -a amber16-install.log
Installation of pmemd.cuda.MPI complete

[hmeij@n37 amber16]$ ls -l bin/pmemd*
-rwxr-xr-x 1 hmeij its  3097968 Aug 22 10:12 bin/pmemd
lrwxrwxrwx 1 hmeij its       15 Aug 22 15:19 bin/pmemd.cuda -> pmemd.cuda_SPFP
-rwxr-xr-x 1 hmeij its 38851928 Aug 22 15:25 bin/pmemd.cuda_DPFP
-rwxr-xr-x 1 hmeij its 39436704 Aug 22 16:04 bin/pmemd.cuda_DPFP.MPI
lrwxrwxrwx 1 hmeij its       19 Aug 22 15:57 bin/pmemd.cuda.MPI -> pmemd.cuda_SPFP.MPI
-rwxr-xr-x 1 hmeij its 32950848 Aug 22 15:19 bin/pmemd.cuda_SPFP
-rwxr-xr-x 1 hmeij its 33531456 Aug 22 15:57 bin/pmemd.cuda_SPFP.MPI
-rwxr-xr-x 1 hmeij its 33405504 Aug 22 15:31 bin/pmemd.cuda_SPXP
-rwxr-xr-x 1 hmeij its 33990208 Aug 22 16:10 bin/pmemd.cuda_SPXP.MPI
-rwxr-xr-x 1 hmeij its  3647784 Aug 22 10:36 bin/pmemd.MPI
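Because every build step pipes into tee, the shell reports tee's exit status rather than that of configure or make, so a failed step can go unnoticed until the binaries turn out to be missing. One portable way around this is sketched below; `run_logged` is a hypothetical helper, not part of Amber.

```shell
# Run a command, append its stdout and stderr to a log via tee, and
# return the command's own exit status instead of tee's.
run_logged() {
    rl_log=$1
    shift
    rl_status_file=$(mktemp) || return 1
    # the left side of the pipe records the real exit status in a temp file
    { "$@" 2>&1; echo $? > "$rl_status_file"; } | tee -a "$rl_log"
    rl_status=$(cat "$rl_status_file")
    rm -f "$rl_status_file"
    return "$rl_status"
}

# Example:
# run_logged amber16-install.log ./configure -cuda gnu \
#     && run_logged amber16-install.log make install
```

In bash, `set -o pipefail` would be a shorter alternative; the temp file variant is used here so the sketch also works in a plain POSIX shell.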

Tests

Although Amber compiled against CUDA 9.2 passed all tests, please double check your results.

export DO_PARALLEL="mpirun -np 8"
make test >> amber16-test.log 2>&1

Finish

  • [root@n37 /]# tar -cvf /tmp/n37.chroot.ul.tar usr/local
  • [root@n37 /]# scp /tmp/n37.chroot.ul.tar sms_server:/var/chroots/goldimages/

Gromacs

As root install

  • cmake, latest version (I never understand why the required version is so far ahead of the distro…)

Download and extract the source, using the same environment as for the Amber compilation.

 cd gromacs-2018/
 mkdir build
 cd build

 which mpicc mpicxx
/share/apps/CENTOS6/openmpi/1.8.4/bin/mpicc
/share/apps/CENTOS6/openmpi/1.8.4/bin/mpicxx

 CC=mpicc CXX=mpicxx \
   /share/apps/CENTOS7/cmake/3.12.1/bin/cmake .. \
  -DCMAKE_INSTALL_PREFIX=/usr/local/gromacs-2018 \
  -DGMX_BUILD_OWN_FFTW=ON -DGMX_MPI=ON -DGMX_GPU=ON
 CC=mpicc CXX=mpicxx make
 CC=mpicc CXX=mpicxx make install

Lammps

As root install

  • yum install libjpeg libjpeg-devel libjpeg-turbo libjpeg-turbo-devel # CHROOT done
  • yum install blas blas-devel lapack lapack-devel boost boost-devel # CHROOT done

For Lammps-22Aug18 I followed the top installation instructions at this page

The only difference in my approach was

  • to stay with openmpi-1.8.4 (not mpich3…)
  • consulting the ARCH web page I chose -arch=sm_35 (on n37 for K20)

It's a good thing we're doing this now, as future versions of CUDA will no longer support the K20s. In fact they are not mentioned on that web site, only the K40/K80 gpus, so we'll see what testing reveals. Please double check results against previous runs. Compile as regular user and stage the lmp_mpi binaries in /usr/local/lammps-22Aug18/

[hmeij@n37 src]$ ll /usr/local/lammps-22Aug18/
total 104356
-rwxr-xr-x 1 hmeij its 35739800 Aug 23 08:49 lmp_mpi-double-double-with-gpu
-rwxr-xr-x 1 hmeij its 35555672 Aug 23 09:11 lmp_mpi-single-double-with-gpu
-rwxr-xr-x 1 hmeij its 35559552 Aug 23 09:53 lmp_mpi-single-single-with-gpu
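The -arch value follows directly from the compute capability that deviceQuery reports (3.5 for the K20m above). A small sketch of that mapping; `cc_to_arch` is a hypothetical helper and only handles single-digit major and minor numbers.

```shell
# Convert a compute capability such as "3.5" into an nvcc -arch value.
cc_to_arch() {
    case $1 in
        [0-9].[0-9]) printf 'sm_%s\n' "$(printf '%s' "$1" | tr -d .)" ;;
        *) echo "unexpected compute capability: $1" >&2; return 1 ;;
    esac
}

# Example: cc_to_arch 3.5 prints sm_35
```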

mapd

useradd -U mapd

# mapd.repo
[mapd-ce-cuda]
name=mapd ce - cuda
baseurl=https://releases.mapd.com/ce/yum/stable/cuda
gpgcheck=1
gpgkey=https://releases.mapd.com/GPG-KEY-mapd

yum  install \
  copy-jdk-configs java-1.8.0-openjdk-headless \
  javapackages-tools libxslt \
  lksctp-tools python-javapackages \
  python-lxml tzdata-java  nfs-utils
  # CHROOT done

yum install mapd   # n37:/usr/local/src

# User specific aliases and functions
export MAPD_USER=mapd
export MAPD_GROUP=mapd
export MAPD_STORAGE=/var/lib/mapd
export MAPD_PATH=/opt/mapd
# The $MAPD_STORAGE directory must be dedicated to MapD

Finish

  • Make the final tar file for /usr/local and post with CHROOT # done
  • Install all the packages of this page in CHROOT # marked done

To do another node, the steps are

  • add node in deploy.txt of n36.chroot/ (centos 7.2)
  • ./deploy.txt `grep node_name deploy.txt`
  • umount -a
  • ONBOOT=no, ib0 ??? connectX mlx4_0 IB interface breaks in CentOS 7.3+
  • bootlocal=EXIT then reboot then check polkit user … screws up systemd-logind
  • hostnamectl set-hostname node_name (logout/login)
  • eth1 on 129.133
  • yum update
  • yum install kernel-headers kernel-devel epel-release
  • put n37 tarball in /, unpack
  • remove cuda-9.2
  • Nvidia install: files in /usr/local/src
    • remove nouveau
    • disable selinux
    • reboot
    • sh runfile
    • ./runfile -silent -driver
    • install all CHROOT done packages
    • yum clean all
    • reboot
  • custom fstab
  • mount on 10.10
  • authorized_keys
  • scp in place from global archive…make backups
  • passwd, shadow, group, hosts
  • openlava
  • reboot for polkit, check /etc/ssh/ssh_host* perms/owners
  • /share/apps/src/openlava3 install in centOS7
  • systemctl enable
  • eth1 on 10.10, mounts ok?
  • /etc/default/grub add “nomodeset” and GRUB_RECORDFAIL_TIMEOUT (grub2-mkconfig -o /boot/grub2/grub.cfg)
    • did not help the count down
    • did fix the text console
  • reboot



cluster/172.1538070191.txt.gz · Last modified: 2018/09/27 13:43 by hmeij07