===== EXX96 =====
  
A page for me on how these 12 nodes were built up after they arrived. To make them "ala n37", which was the test node in redoing our K20 nodes, see [[cluster:172|K20 Redo]].
  
Page best followed bottom to top.

==== Usage ====

The new queue ''exx96'' consists of nodes ''n79-n90''. Each node holds 4x RTX2080S gpus, 2x Xeon Silver 4214 2.2 GHz 12-core cpus, 96 GB memory and a 1 TB SSD. ''/localscratch'' is around 800 GB.

A new static resource is introduced for all nodes holding gpus: ''n78'' in queue ''amber128'', ''n33-n37'' in queue ''mwgpu'', and the nodes mentioned above. The name of this resource is ''gpu4''. Moving forward, please use it instead of ''gpu'' or ''gputest''.

The wrappers provided assume your cpu:gpu ratio is 1:1, hence in your submit script you will have ''#BSUB -n 1'' and ''gpu4=1'' in your resource allocation line. If your ratio is something else you can set CPU_GPU_REQUEST; for example, CPU_GPU_REQUEST=4:2 expects the lines ''#BSUB -n 4'' and ''gpu4=2'' in your submit script. Sample script at ''/home/hmeij/k20redo/run.rtx'' (a sketch follows below).

The wrappers (78.mpich3.wrapper for ''n78'', and n37.openmpi.wrapper for all others) are located in ''/usr/local/bin'' and will set up your environment and start any of these applications: amber, lammps, gromacs, matlab and namd from ''/usr/local''.
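
A minimal sketch of a 4:2 submission follows; the queue, resource and wrapper names are taken from above, while the job name, output files and the exact wrapper invocation line are illustrative (consult the sample script for the real form).

<code>
#!/bin/bash
# illustrative sketch only -- see /home/hmeij/k20redo/run.rtx for the real sample
#BSUB -q exx96
#BSUB -n 4
#BSUB -J test-rtx
#BSUB -o stdout -e stderr
# reserve 2 gpus via the new static resource
#BSUB -R "rusage[gpu4=2]"

# tell the wrapper the cpu:gpu ratio (assumed to be read from the environment; default 1:1)
export CPU_GPU_REQUEST=4:2

# the wrapper sets up the environment and launches the application
n37.openmpi.wrapper pmemd.cuda.MPI -O -i mdin -o mdout -p prmtop -c inpcrd
</code>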

<code>

# command that shows gpu reservations
bhosts -l n79
             gputest gpu4
 Total                3
 Reserved        0.0  1.0

# old way of doing that
lsload -l n79

HOST_NAME               status  r15s   r1m  r15m   ut    pg    io  ls    it   tmp   swp   mem    gpu
n79                         ok   0.0   0.0   0.0   0%   0.0       0 2e+08  826G   10G   90G    3.0

</code>

Peer to peer communication is possible (via PCIe rather than NVLink) with this hardware. This gets rather messy to set up, and some quick off-the-cuff performance data reveals some impact, but generally in our environment the gains are not worth the effort. Using Amber and ''pmemd.cuda.MPI'':

<code>
                                                                              cpu:gpu
mdout.325288: Master Total CPU time:          982.60 seconds     0.27 hours   1:1
mdout.325289: Master Total CPU time:          611.08 seconds     0.17 hours   4:2
mdout.326208: Master Total CPU time:          537.97 seconds     0.15 hours  36:4

</code>
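
To see which gpu pairs could actually peer over PCIe on a node, the standard topology query from ''nvidia-smi'' can be run on the node itself:

<code>
# show how the 4 gpus connect to each other and the cpus
# PIX/PXB = pair shares a PCIe switch (peer capable), SYS = traffic crosses the UPI link between sockets
nvidia-smi topo -m
</code>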
==== Miscellaneous ====

Install the scheduler RPM for CentOS7, reconfigure (hosts, queue, static resource), and set up the elim. Test it out with the old wrapper.
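
For reference, an LSF elim for a static resource just keeps writing the resource name and value to stdout. A rough sketch, assuming a 30 second reporting interval (the real elim name, location and interval may differ):

<code>
#!/bin/bash
# rough elim sketch for the gpu4 static resource
# LSF expects "<number_of_resources> <name> <value>" repeatedly on stdout
while true; do
  echo "1 gpu4 4"
  sleep 30
done
</code>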

Edit the n37.openmpi.wrapper for n33-n37 and n79-n90, and the one on n78, for the new static resource ''gpu4''.

Add nodes to ZenOSS hpcmon.

Propagate the global ''known_hosts'' file into users' ~/.ssh/ dirs.
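
Something along these lines should do; the source path of the master copy is illustrative:

<code>
# copy the global known_hosts into every user's ~/.ssh/ dir (source path is illustrative)
for d in /home/*/.ssh; do
  cp /path/to/global/known_hosts "$d"/known_hosts
  chown --reference="$d" "$d"/known_hosts
done
</code>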

Look at how accounting ties in with resource request ''gpu4='' versus ''gpu='' ...

<code>

# propagate global passwd, shadow, group, hosts file
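# (sketch) push the global copies from this node to each new node, assuming they live in /etc here
for f in passwd shadow group hosts; do
  for i in `seq 79 90`; do scp /etc/$f n$i:/etc/$f; done
done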

# add to date_ctt2.sh script, get and set date

NOW=`/bin/date +%m%d%H%M%Y.%S`
for i in `seq 79 90`; do echo n$i; ssh n$i date $NOW; done

# crontab

# ionice gaussian
0,15,30,45 * * * * /share/apps/scripts/ionice_lexes.sh  > /dev/null 2>&1

# cpu temps
40 * * * * /share/apps/scripts/lm_sensors.sh > /dev/null 2>&1

# rc.local, chmod o+x /etc/rc.d/rc.local, then add

# for mapd, 'All On' enables graphics rendering support
#/usr/bin/nvidia-smi --gom=0

# for amber16 -pm=ENABLED -c=EXCLUSIVE_PROCESS
#nvidia-smi --persistence-mode=1
#nvidia-smi --compute-mode=1

# for mwgpu/exx96 -pm=ENABLED -c=DEFAULT
nvidia-smi --persistence-mode=1
nvidia-smi --compute-mode=0

# turn ECC off (memory scrubbing)
#/usr/bin/nvidia-smi -e 0

# lm_sensors
modprobe coretemp
modprobe tmp401
#modprobe w83627ehf

reboot

</code>

==== Recipe ====
  
 Steps. "Ala n37" ... so the RTX nodes are similar to the K20 nodes and we can put the local software in place. See [[cluster:172|K20 Redo]] page.  First we add these packages and clean up. Steps. "Ala n37" ... so the RTX nodes are similar to the K20 nodes and we can put the local software in place. See [[cluster:172|K20 Redo]] page.  First we add these packages and clean up.
<code>
  
# hook up DVI-D cable to GPU port (offboard video)
# login as root, check some things out...
free -g
nvidia-smi
docker images
docker ps
# set local time zone
mv /etc/localtime /etc/localtime.backup
ln -s /usr/share/zoneinfo/America/New_York /etc/localtime
# change passwords for root and vendor account
passwd
passwd exx
# set hostname
hostnamectl set-hostname n79
# configure private subnets and ping file server
cd /etc/sysconfig/network-scripts/
vi ifcfg-enp1s0f0
vi ifcfg-enp1s0f1
systemctl restart network
ping -c 3 192.168.102.42
ping -c 3 10.10.102.42
# make internet connection for yum
ifdown enp1s0f0
vi ifcfg-enp1s0f0
systemctl restart network
dig google.com
yum install -y iptables-services
vi /etc/sysconfig/iptables
systemctl start iptables
iptables -L
systemctl stop firewalld
systemctl disable firewalld
# other configs
vi /etc/selinux/config  # set SELINUX=disabled
mv /home /usr/local/
mkdir /home
vi /etc/passwd  # exx, dockeruser $HOME
mkdir /sanscratch /localscratch
chmod ugo+rwx /sanscratch /localscratch
chmod o+t /sanscratch /localscratch
ln -s /home /share
ssh-keygen -t rsa
scp 10.10.102.253:/root/.ssh/authorized_keys /root/.ssh/
vi /etc/ssh/sshd_config  # PermitRootLogin
echo "relayhost = 192.168.102.42" >> /etc/postfix/main.cf
# add packages and update
yum install epel-release -y
yum install tcl tcl-devel dmtcp -y
yum install freeglut-devel libXi-devel libXmu-devel make mesa-libGLU-devel -y
yum install blas blas-devel lapack lapack-devel boost boost-devel -y
yum install tkinter lm_sensors lm_sensors-libs -y
yum install zlib-devel bzip2-devel bzip bzip-devel -y
yum install openmpi openmpi-devel perl-ExtUtils-MakeMaker -y
yum install cmake cmake-devel -y
yum install libjpeg libjpeg-devel libjpeg-turbo-devel -y
yum update -y
yum clean all
# remove internet, bring private back up
ifdown enp1s0f0
vi ifcfg-enp1s0f0
ifup enp1s0f0
# passwd, shadow, group, hosts, fstab
mkdir /homeextra1 /homeextra2 /home33 /mindstore
cd /etc/
# backup files to -orig versions
scp 192.168.102.89:/etc/passwd /etc/passwd  # and shadow, group, hosts
scp 10.10.102.89:/etc/fstab /tmp
vi /etc/fstab
mount -a; df -h
# pick the kernel the vendor used, for now
grep ^menuentry /etc/grub2.cfg
grub2-set-default 1
ls -d /sys/firmware/efi && echo "EFI" || echo "Legacy"
grub2-mkconfig -o /boot/grub2/grub.cfg
#grub2-mkconfig -o /boot/efi/EFI/centos/grub.cfg
# old runlevel 3
systemctl set-default multi-user.target
reboot
# switch to VGA
cd /usr/local/src/
tar zxf n37.chroot-keep.ul.tar.gz
cd usr/local/
mv amber16/ fsl-5.0.10/ gromacs-2018/ lammps-22Aug18/ /usr/local/
mv cuda-9.2/ /usr/local/n37-cuda-9.2/
cd /usr/local/bin/
rsync -vac 10.10.102.89:/usr/local/bin/ /usr/local/bin/
# test scripts gpu-free, gpu-info, gpu-process
# gpu-free (ids of idle gpus)
0,1,2,3
# gpu-info
id,name,temp.gpu,mem.used,mem.free,util.gpu,util.mem
0, GeForce RTX 2080 SUPER, 25, 126 MiB, 7855 MiB, 0 %, 0 %
1, GeForce RTX 2080 SUPER, 24, 11 MiB, 7971 MiB, 0 %, 0 %
2, GeForce RTX 2080 SUPER, 23, 11 MiB, 7971 MiB, 0 %, 0 %
3, GeForce RTX 2080 SUPER, 23, 11 MiB, 7971 MiB, 0 %, 0 %
# gpu-process
gpu_name, gpu_bus_id, pid, process_name
GeForce RTX 2080 SUPER, 00000000:3B:00.0, 3109, python
# done
  
</code>
  
==== What We Purchased ====
  
  * 12 nodes yielding a total of
  
{{:cluster:ssd_small.JPG?nolink&300|}} Yea, found 1T SSD \\
{{:cluster:hdmi_small.JPG?nolink&300|}} Ports on gpu \\
{{:cluster:gpu_small.JPG?nolink&300|}} GPU detail, blower model \\
{{:cluster:back_small.JPG?nolink&300|}} Back, gpus stacked 2 on 2 \\