\\ **[[cluster:0|Back]]**

===== EXX96 =====

A page for me on how these 12 nodes were built up after they arrived, to make them "a la n37", which was the test node in redoing our K20 nodes; see [[cluster:172|K20 Redo]] and [[cluster:173|K20 Redo Usage]]. Page is best followed bottom to top if interested in the whole process. The Usage section below is for HPCC users wanting to use queue ''exx96''.

Debugging node n89, which keeps turning itself off ... grrhhh. Create a bootable USB stick with https://rufus.ie/ then unzip the BIOS and firmware zip files located in ''n89:/usr/local/src''.

<code>
[root@n89 ~]# ipmitool sel elist
   1 | 02/29/2020 | 16:57:33 | Memory #0xd1          | Uncorrectable ECC | Asserted
   2 | 03/02/2020 | 03:02:42 | Processor CPU_CATERR  | IERR              | Asserted
   3 | 03/11/2020 | 19:27:35 | Processor CPU_CATERR  | IERR              | Asserted
...[snip]...

[root@n89 ~]# ipmitool sdr elist
CPU1 Temperature | 31h | ok  |  3.0 | 43 degrees C
CPU2 Temperature | 32h | ok  |  0.0 | 40 degrees C
PSU1 Over Temp   | 92h | ok  |  0.0 | Transition to OK
PSU2 Over Temp   | 9Ah | ok  |  0.0 | Transition to OK
...[snip]...
DIMMM1_Temp      | E4h | ok  |  3.0 | 28 degrees C
CPU1_ECC1        | D1h | ok  |  0.0 | Presence Detected
CPU2_ECC1        | D3h | ok  |  0.0 | Presence Detected
...[snip]...
PMBPower1        | E1h | ok  |  3.0 | 88 Watts
PMBPower2        | E2h | ok  |  3.0 | 112 Watts
...[snip]...
FRNT_FAN1        | A2h | ok  |  0.0 | 3100 RPM
...[snip]...
PSU1 Slow FAN1   | 95h | ok  |  0.0 | Transition to OK
PSU2 Slow FAN1   | 9Dh | ok  |  0.0 | Transition to OK
...[snip]...

[root@n89 ~]# dmidecode -t0
# dmidecode 3.2
Getting SMBIOS data from sysfs.
SMBIOS 3.2 present.

Handle 0x0000, DMI type 0, 26 bytes
BIOS Information
        Vendor: American Megatrends Inc.
        Version: 5102
        Release Date: 02/11/2019
        Address: 0xF0000
        Runtime Size: 64 kB
        ROM Size: 32 MB
        Characteristics:
...[snip]...
                UEFI is supported
        BIOS Revision: 5.14

[root@n89 ~]# edac-util -s -v
edac-util: EDAC drivers are loaded. 4 MCs detected:
  mc0:Skylake Socket#0 IMC#0
  mc1:Skylake Socket#0 IMC#1
  mc2:Skylake Socket#1 IMC#0
  mc3:Skylake Socket#1 IMC#1
[root@n89 ~]# edac-util
edac-util: No errors to report.
</code>

Also keep an eye on ''syslog''.

==== Usage ====

The new queue ''exx96'' will be comprised of nodes ''n79-n90''. Each node holds 4x RTX 2080S GPUs, 2x Xeon Silver 4214 2.2 GHz 12-core CPUs, 96 GB memory and a 1 TB SSD. ''/localscratch'' is around 800 GB.

A new static resource is introduced for all nodes holding GPUs: ''n78'' in queue ''amber128'', ''n33-n37'' in queue ''mwgpu'', and the nodes mentioned above. The name of this resource is ''gpu4''. Moving forward please use it instead of ''gpu'' or ''gputest''.

The wrappers provided assume your cpu:gpu ratio is 1:1, hence in your submit script you will have ''#BSUB -n 1'' and ''gpu4=1'' in your resource allocation line. If your ratio is something else you can set CPU_GPU_REQUEST. For example, CPU_GPU_REQUEST=4:2 expects the lines ''#BSUB -n 4'' and ''gpu4=2'' in your submit script. Sample script at ''/home/hmeij/k20redo/run.rtx'' (a rough sketch appears at the end of this section).

The wrappers (n78.mpich3.wrapper for ''n78'', and n37.openmpi.wrapper for all others) are located in ''/usr/local/bin'' and will set up your environment and start any of these applications from ''/usr/local'': amber, lammps, gromacs, matlab and namd.

<code>
# command that shows gpu reservations
bhosts -l n79

            gputest  gpu4
 Total         0       3
 Reserved      0.0     1.0

# old way of doing that
lsload -l n79

HOST_NAME  status  r15s  r1m  r15m  ut   pg  io  ls  it     tmp   swp  mem  gpu
n79        ok       0.0  0.0   0.0  0%  0.0   0   0  2e+08  826G  10G  90G  3.0
</code>

Peer-to-peer communication is possible (via PCIe rather than NVLink) with this hardware. This will get rather messy to set up.
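As mentioned above, here is a minimal submit script sketch for queue ''exx96''. This is not the actual ''/home/hmeij/k20redo/run.rtx'' (consult that file for the real invocation); the wrapper arguments on the last line are an assumption for illustration only.

<code>
#!/bin/bash
# rough sketch of an exx96 submission, cpu:gpu ratio 1:1
#BSUB -q exx96
#BSUB -n 1
#BSUB -J rtx_test
#BSUB -o out.%J
#BSUB -e err.%J
# request one gpu via the new static resource
#BSUB -R "rusage[gpu4=1]"

# for another ratio, e.g. 4 cpus : 2 gpus,
# match with #BSUB -n 4 and gpu4=2 above
#export CPU_GPU_REQUEST=4:2

# the wrapper sets up the environment and launches the application
# (hypothetical arguments -- copy the real ones from run.rtx)
n37.openmpi.wrapper pmemd.cuda -O -i mdin -p prmtop -c inpcrd
</code>

As for peer-to-peer, the topology matrix is a quick first look at which GPU pairs can reach each other over PCIe (PIX means the pair shares a PCIe switch, SYS means traffic crosses the socket interconnect):

<code>
nvidia-smi topo -m
</code>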
Some quick off-the-cuff performance data reveals some impact, but generally, in our environment, the gains are not worth the effort. Using Amber and ''pmemd.cuda.MPI'':

<code>
                                                                      cpu:gpu
mdout.325288:|  Master Total CPU time:   982.60 seconds   0.27 hours    1:1
mdout.325289:|  Master Total CPU time:   611.08 seconds   0.17 hours    4:2
mdout.326208:|  Master Total CPU time:   537.97 seconds   0.15 hours   36:4
</code>

==== Miscellaneous ====

Install the scheduler RPM for CentOS 7, reconfigure (hosts, queue, static resource), set up the elim. Test it out with the old wrapper. Edit the n37.openmpi.wrapper for n33-n37 and n79-n90, and the one on n78, for the new static resource ''gpu4''. Add the nodes to ZenOSS hpcmon. Propagate the global ''known_hosts'' files in users' ''~/.ssh/'' dirs. Look at how accounting ties in with resource request ''gpu4='' versus ''gpu='' ...

<code>
# propagate global passwd, shadow, group, hosts file

# add to date_ctt2.sh script, get and set date
NOW=`/bin/date +%m%d%H%M%Y.%S`
for i in `seq 79 90`; do echo n$i; ssh n$i date $NOW; done

# crontab
# ionice gaussian
0,15,30,45 * * * * /share/apps/scripts/ionice_lexes.sh > /dev/null 2>&1
# cpu temps
40 * * * * /share/apps/scripts/lm_sensors.sh > /dev/null 2>&1

# rc.local, chmod o+x /etc/rc.d/rc.local, then add

# for mapd, 'All On' enable graphics rendering support
#/usr/bin/nvidia-smi --gom=0

# for amber16 -pm=1/ENABLED -c=1/EXCLUSIVE_PROCESS
#nvidia-smi --persistence-mode=1
#nvidia-smi --compute-mode=1

# for mwgpu/exx96 -pm=1/ENABLED -c=0/DEFAULT
# note: turned this off, running with defaults
# seems stable, maybe persistence later on
# lets see how docker interacts first...
#nvidia-smi --persistence-mode=1
#nvidia-smi --compute-mode=0

# turn ECC off (memory scrubbing)
#/usr/bin/nvidia-smi -e 0

# lm_sensors
modprobe coretemp
modprobe tmp401
#modprobe w83627ehf

reboot
</code>

==== Recipe ====

Steps, "a la n37" ... the RTX nodes are similar to the K20 nodes, so we can put the local software in place; see the [[cluster:172|K20 Redo]] page. First we add these packages and clean up.

<code>
# hook up DVI-D cable to GPU port (offboard video)
# login as root, check some things out...
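# (added sanity check, not part of the original notes: confirm all
#  four RTX 2080 SUPER cards enumerate before configuring anything)
lspci | grep -i nvidia
nvidia-smi -L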
free -g
nvidia-smi
docker images
docker ps

# set local time zone
mv /etc/localtime /etc/localtime.backup
ln -s /usr/share/zoneinfo/America/New_York /etc/localtime

# change passwords for root and vendor account
passwd
passwd exx

# set hostname
hostnamectl set-hostname n79

# configure private subnets and ping file server
cd /etc/sysconfig/network-scripts/
vi ifcfg-enp1s0f0
vi ifcfg-enp1s0f1
systemctl restart network
ping -c 3 192.168.102.42
ping -c 3 10.10.102.42

# make internet connection for yum
ifdown enp1s0f0
vi ifcfg-enp1s0f0
systemctl restart network
dig google.com

# centos7
yum install -y iptables-services
vi /etc/sysconfig/iptables
systemctl start iptables
iptables -L
systemctl stop firewalld
systemctl disable firewalld

# other configs
vi /etc/selinux/config (disabled)
mv /home /usr/local/
mkdir /home
vi /etc/passwd (exx, dockeruser $HOME)
mkdir /sanscratch /localscratch
chmod ugo+rwx /sanscratch /localscratch
chmod o+t /sanscratch /localscratch
ln -s /home /share
ssh-keygen -t rsa
scp 10.10.102.253:/root/.ssh/authorized_keys /root/.ssh/
vi /etc/ssh/sshd_config (PermitRootLogin)
echo "relayhost = 192.168.102.42" >> /etc/postfix/main.cf

# add packages and update
yum install epel-release -y
yum install flex flex-devel bison bison-devel -y
yum install tcl tcl-devel dmtcp -y
yum install net-snmp net-snmp-libs net-snmp-agent-libs net-tools net-snmp-utils -y
yum install freeglut-devel libXi-devel libXmu-devel \
    make mesa-libGLU-devel -y
yum install blas blas-devel lapack lapack-devel boost boost-devel -y
yum install tkinter lm_sensors lm_sensors-libs -y
yum install zlib-devel bzip2-devel bzip bzip-devel -y
yum install openmpi openmpi-devel perl-ExtUtils-MakeMaker -y
yum install cmake cmake-devel -y
yum install libjpeg libjpeg-devel libjpeg-turbo-devel -y

# amber
yum -y install tcsh make \
    gcc gcc-gfortran gcc-c++ \
    which flex bison patch bc \
    libXt-devel libXext-devel \
    perl perl-ExtUtils-MakeMaker util-linux wget \
    bzip2 bzip2-devel zlib-devel tar

yum update -y
yum clean all

# remove internet, bring private back up
ifdown enp1s0f0
vi ifcfg-enp1s0f0
ifup enp1s0f0

# passwd, shadow, group, hosts, fstab
mkdir /homeextra1 /homeextra2 /home33 /mindstore
cd /etc/
# backup files to -orig versions
scp 192.168.102.89:/etc/passwd /etc/passwd (and others)
scp 10.10.102.89:/etc/fstab /tmp
vi /etc/fstab
mount -a; df -h

# pick the kernel vendor used for now
grep ^menuentry /etc/grub2.cfg
grub2-set-default 1
ls -d /sys/firmware/efi && echo "EFI" || echo "Legacy"
grub2-mkconfig -o /boot/grub2/grub.cfg
#grub2-mkconfig -o /boot/efi/EFI/centos/grub.cfg

# old level 3
systemctl set-default multi-user.target
reboot

# switch to VGA
cd /usr/local/src/
tar zxf n37.chroot-keep.ul.tar.gz
cd usr/local/
mv amber16/ fsl-5.0.10/ gromacs-2018/ lammps-22Aug18/ /usr/local/
mv cuda-9.2/ /usr/local/n37-cuda-9.2/
cd /usr/local/bin/
rsync -vac 10.10.102.89:/usr/local/bin/ /usr/local/bin/

# test scripts gpu-free, gpu-info, gpu-process
gpu-free
0,1,2,3

gpu-info
id,name,temp.gpu,mem.used,mem.free,util.gpu,util.mem
0, GeForce RTX 2080 SUPER, 25, 126 MiB, 7855 MiB, 0 %, 0 %
1, GeForce RTX 2080 SUPER, 24, 11 MiB, 7971 MiB, 0 %, 0 %
2, GeForce RTX 2080 SUPER, 23, 11 MiB, 7971 MiB, 0 %, 0 %
3, GeForce RTX 2080 SUPER, 23, 11 MiB, 7971 MiB, 0 %, 0 %

gpu-process
gpu_name, gpu_bus_id, pid, process_name
GeForce RTX 2080 SUPER, 00000000:3B:00.0, 3109, python

# done
</code>

(The gpu-* scripts appear to wrap ''nvidia-smi --query-gpu=...'' queries, judging by the CSV output.)

==== What We Purchased ====

  * 12 nodes yielding a total of:
    * 24 cpus
    * 288 cpu cores
    * 1,152 GB cpu mem
    * ~20 Tflops (dpfp)
    * 48 gpus
    * 384 GB gpu mem
    * ~700 Tflops (mixed mode)
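Before the state-of-the-node listings below, a quick smoke test worth running against the preloaded NGC images. This is a sketch assuming the nvidia-docker2 runtime is what the ''nvidia-docker''/''nvidia-container-runtime'' repos further below provide; the image tag comes from the listing that follows.

<code>
# run nvidia-smi inside one of the preloaded CUDA images;
# with nvidia-docker2 the NVIDIA runtime is selected per container
docker run --runtime=nvidia --rm nvcr.io/nvidia/cuda:9.2-devel nvidia-smi
</code>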
<code>
# docker images
REPOSITORY                        TAG                           IMAGE ID      CREATED       SIZE
nvcr.io/nvidia/cuda               10.1-devel                    9e47e9dfcb9a  2 months ago  2.83GB
portainer/portainer               latest                        ff4ee4caaa23  2 months ago  81.6MB
nvidia/cuda                       9.2-devel                     1874839f75d5  2 months ago  2.35GB
nvcr.io/nvidia/cuda               9.2-devel                     1874839f75d5  2 months ago  2.35GB
nvcr.io/nvidia/cuda               10.0-devel                    f765411c4ae6  2 months ago  2.29GB
nvcr.io/nvidia/digits             19.09-tensorflow              b08982c9545c  4 months ago  8.85GB
nvcr.io/nvidia/tensorflow         19.09-py2                     b82bcb185286  4 months ago  7.88GB
nvcr.io/nvidia/pytorch            19.09-py3                     9d6f9ccfbe31  5 months ago  9.15GB
nvcr.io/nvidia/caffe              19.09-py2                     b52fbbef7e6b  5 months ago  5.15GB
nvcr.io/nvidia/rapidsai/rapidsai  0.9-cuda10.0-runtime-centos7  22b5dc2f7e84  5 months ago  5.84GB

# free -m
              total        used        free      shared  buff/cache   available
Mem:          95056        1919       85338          20        7798       92571
Swap:         10239           0       10239

# nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.31       Driver Version: 440.31       CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 208...  Off  | 00000000:3B:00.0 Off |                  N/A |
| 24%   24C    P8     8W / 250W |    275MiB /  7981MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce RTX 208...  Off  | 00000000:5E:00.0 Off |                  N/A |
| 25%   24C    P8    10W / 250W |     12MiB /  7982MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce RTX 208...  Off  | 00000000:AF:00.0 Off |                  N/A |
| 24%   23C    P8     4W / 250W |     12MiB /  7982MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  GeForce RTX 208...  Off  | 00000000:D8:00.0 Off |                  N/A |
| 25%   22C    P8    13W / 250W |     12MiB /  7982MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      3127      C   python                                     115MiB   |
|    0      5715      G   /usr/bin/X                                  84MiB   |
|    0      6307      G   /usr/bin/gnome-shell                        70MiB   |
+-----------------------------------------------------------------------------+

# df -h
Filesystem      Size  Used Avail Use% Mounted on
devtmpfs         47G     0   47G   0% /dev
tmpfs            47G     0   47G   0% /dev/shm
tmpfs            47G   13M   47G   1% /run
tmpfs            47G     0   47G   0% /sys/fs/cgroup
/dev/nvme0n1p3  929G   42G  840G   5% /
/dev/nvme0n1p1  477M  199M  249M  45% /boot
overlay         929G   42G  840G   5% /var/lib/docker/overlay2/6f7af00a8eb8b5ede68fd6bc9be5f7220525bdde21c14e6f1643a2a7debc454b/merged
overlay         929G   42G  840G   5% /var/lib/docker/overlay2/9cf895d8a17106f16ba997e6025f5912abb988a512779caa2c35e2da3e7d196a/merged
tmpfs           9.3G   28K  9.3G   1% /run/user/0

# yum repolist
repo id                     repo name                   status
base/7/x86_64               CentOS-7 - Base             10,097
docker-ce-stable/x86_64     Docker CE Stable - x86_64       63
extras/7/x86_64             CentOS-7 - Extras              323
libnvidia-container         libnvidia-container             65
nvidia-container-runtime    nvidia-container-runtime        54
nvidia-docker               nvidia-docker                   50
updates/7/x86_64            CentOS-7 - Updates           1,117
</code>

==== Pictures ====

{{:cluster:ssd_small.JPG?nolink&300|}} Yea, found the 1 TB SSD \\
{{:cluster:hdmi_small.JPG?nolink&300|}} Ports on the GPU \\
{{:cluster:gpu_small.JPG?nolink&300|}} GPU detail, blower model \\
{{:cluster:back_small.JPG?nolink&300|}} Back, GPUs stacked 2 on 2 \\
{{:cluster:front_small.JPG?nolink&300|}} Front, all drive bays empty \\
{{:cluster:rack_small.JPG?nolink&300|}} Racking \\
{{:cluster:boxes_small.JPG?nolink&300|}} Boxes \\

\\
**[[cluster:0|Back]]**