
K20 Redo

In 2013 we bought five servers, each with four K20 GPUs inside. Since then they have been used but not maintained. Since we have newer GPUs (consult the GTX 1080 Ti page), usage has dropped off somewhat, so I'm taking the opportunity to redo them with the latest Nvidia, CentOS and application software. After all, they provide 23 teraflops of GPU compute capacity (dpfp, double-precision floating point).
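As a rough check, assuming the K20's nominal double-precision peak of about 1.17 teraflops (a figure not taken from this page):

5 servers x 4 K20s x ~1.17 teraflops ≈ 23.4 teraflops (dpfp)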

We'll add the Nvidia development and tools packages to our golden image node, followed by compilations of Amber, Gromacs, Lammps and Namd. We'll join the node to our scheduler environment. Then we'll build the node into a massively parallel database server using MapD (running SQL on GPUs, that should be fun).

Test it all out. Then decide when to image the other servers.

We will place all software in /usr/local, keeping it out of the CHROOT when the VNFS is built (which is CentOS 7.2 anyway). We will add the tarball unpacking to the post-imaging script for installation; a sketch of that step follows below. See the pages on how to build the OpenHPC 1.3.1 provision server and how to build the Warewulf Golden Image. Keep the packages (other than the kernel) synced between the 7.2 CHROOT and the 7.5 compute node.
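A minimal sketch of that post-imaging step, assuming a hypothetical tarball name and staging location (neither is specified on this page):

# hypothetical: restore locally built software after imaging
# tarball path and name are assumptions, adjust to the real post script
tar -xzf /path/to/staging/usr-local-k20.tar.gz -C /    # unpacks into /usr/local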

Note

Upon a 7.2 image you need to:

  • copy passwd, shadow, group, hosts, fstab from the global archive (a sketch follows this list)
  • check the polkit user … it screws up systemd-logind
  • ConnectX mlx4_0 IB interface breaks in CentOS 7.3+
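A minimal sketch of the first bullet, with the archive location left as a placeholder (it is not given on this page):

# hypothetical: pull identity and mount files from the global archive
for f in passwd shadow group hosts fstab; do
  cp /path/to/global/archive/$f /etc/$f
done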

Nvidia

# ifdown eth1, hook up public network, ifup eth1
# route add default gw 129.133.52.1
systemctl start iptables

# 7.5
yum update kernel kernel-tools kernel-tools-libs
yum install kernel-devel kernel-headers    # remove old headers after reboot
yum install gcc gcc-devel gcc-gfortran gcc-gfortran-devel

# download runfiles from https://developer.nvidia.com/cuda-downloads
sh cuda_name_of_runfile
sh cuda_name_of_runfile_patch

Install NVIDIA Accelerated Graphics Driver for Linux-x86_64 396.26?
(y)es/(n)o/(q)uit: n

Install the CUDA 9.2 Toolkit?
(y)es/(n)o/(q)uit: y

Enter Toolkit Location
 [ default is /usr/local/cuda-9.2 ]: 

Do you want to install a symbolic link at /usr/local/cuda?
(y)es/(n)o/(q)uit: y

Install the CUDA 9.2 Samples?
(y)es/(n)o/(q)uit: n
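The same choices can be made non-interactively, which is handy when re-imaging. The flag spelling below follows the -silent -driver form used further down, but should be verified against the runfile's --help output:

# unattended alternative to the prompts above: toolkit only, no driver, no samples
sh cuda_name_of_runfile -silent -toolkit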

# /etc/modprobe.d/blacklist-nouveau.conf, reboot before driver installation
blacklist nouveau
options nouveau modeset=0
reboot

# nvidia driver
./cuda_name_of_runfile -silent -driver

# backup
[root@n37 src]# rpm -qf /usr/lib/libGL.so
file /usr/lib/libGL.so is not owned by any package
cp /usr/lib/libGL.so /usr/lib/libGL.so-nvidia

[root@n37 src]# ls /etc/X11/xorg.conf
ls: cannot access /etc/X11/xorg.conf: No such file or directory
[root@n37 src]# find /usr/local/cuda-9.2 -name nvidia-xconfig*
[root@n37 src]#
[root@n37 src]# scp n78:/etc/X11/xorg.conf /etc/X11/

# Device files /dev/nvidia* exist with 0666 permissions?
# They were not 
/usr/local/src/nvidia-modprobe.sh
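The contents of /usr/local/src/nvidia-modprobe.sh are not shown on this page; a minimal sketch of what such a script typically does, modeled on Nvidia's documented device-node creation steps, looks like this:

#!/bin/bash
# load the nvidia module and create /dev/nvidia* with 0666 permissions
/sbin/modprobe nvidia
NVDEVS=$(lspci | grep -i NVIDIA)
N3D=$(echo "$NVDEVS" | grep -c "3D controller")
NVGA=$(echo "$NVDEVS" | grep -c "VGA compatible controller")
N=$((N3D + NVGA - 1))
for i in $(seq 0 $N); do
  mknod -m 666 /dev/nvidia$i c 195 $i
done
mknod -m 666 /dev/nvidiactl c 195 255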

# new kernel initramfs, load
dracut --force
reboot
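After the reboot, a quick sanity check confirms the driver and toolkit are in place:

nvidia-smi                            # should list the four Tesla K20m cards
/usr/local/cuda/bin/nvcc --version    # should report release 9.2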

For the user environment (one way to make this persistent is sketched below the list):

  • export PATH=/usr/local/cuda/bin:$PATH
  • export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
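A minimal sketch, assuming a profile.d script is the chosen mechanism (the filename is an assumption, this page does not say how the exports are delivered):

# /etc/profile.d/cuda.sh
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH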

