This shows you the differences between two versions of the page.
cluster:192 [2020/02/06 15:41] hmeij07 [WhatWeDo?] |
cluster:192 [2020/02/26 19:45] hmeij07 [Usage] |
||
---|---|---|---|
Line 6: | Line 6: | ||
A page for me on how these 12 nodes were built up after they arrived. To make them "ala n37", which was the test node in redoing our K20 nodes, see [[cluster: | A page for me on how these 12 nodes were built up after they arrived. To make them "ala n37", which was the test node in redoing our K20 nodes, see [[cluster: | ||
- | ==== WhatWeDo? | + | Page best followed bottom to top. |
+ | |||
+ | ==== Usage ==== | ||
+ | |||
+ | The new queue '' | ||
+ | |||
+ | A new static resource is introduced for all nodes holding gpus. '' | ||
+ | |||
+ | The wrappers provided assume your cpu:gpu ratio is 1:1 hence in your submit code you will have ''# | ||
+ | |||
+ | The wrappers (78.mpich3.wrapper for '' | ||
+ | |||
+ | |||
+ | < | ||
+ | |||
+ | # command that shows gpu reservations | ||
+ | bhosts -l n79 | ||
+ | | ||
+ | | ||
+ | | ||
+ | |||
+ | # old way of doing that | ||
+ | lsload -l n79 | ||
+ | |||
+ | HOST_NAME | ||
+ | n79 | ||
+ | |||
+ | </ | ||
+ | ==== Miscellaneous ==== | ||
+ | |||
+ | Install scheduler RPM for CentOS7, reconfigure (hosts, queue, static resource), elim. Test it out with old wrapper. | ||
+ | |||
+ | Edit the n37.openmpi.wrapper for n33-n37 and n79-90 and the one on n78 for the new static resource '' | ||
+ | |||
+ | Add nodes to ZenOSS hpcmon. | ||
+ | |||
+ | Propagate global '' | ||
+ | |||
+ | Look at how accounting ties in with resource request '' | ||
+ | |||
+ | < | ||
+ | |||
+ | # propagate global passwd, shadow, group, hosts file | ||
+ | |||
+ | # add to date_ctt2.sh script, get and set date | ||
+ | |||
+ | NOW=`/ | ||
+ | for i in `seq 79 90`; do echo n$i; ssh n$i date $NOW; done | ||
+ | |||
+ | # crontab | ||
+ | |||
+ | # ionice gaussian | ||
+ | 0,15,30,45 * * * * / | ||
+ | |||
+ | # cpu temps | ||
+ | 40 * * * * / | ||
+ | |||
+ | # rc.local, chmod o+x / | ||
+ | |||
+ | # for mapd, 'All On' enable graphics rendering support | ||
+ | #/ | ||
+ | |||
+ | # for amber16 -pm=ENABLED -c=EXCLUSIVE_PROCESS | ||
+ | #nvidia-smi --persistence-mode=1 | ||
+ | #nvidia-smi --compute-mode=1 | ||
+ | |||
+ | # for mwgpu/exx96 -pm=ENABLED -c=DEFAULT | ||
+ | nvidia-smi --persistence-mode=1 | ||
+ | nvidia-smi --compute-mode=0 | ||
+ | |||
+ | # turn ECC off (memory scrubbing) | ||
+ | #/ | ||
+ | |||
+ | # lm_sensor | ||
+ | modprobe coretemp | ||
+ | modprobe tmp401 | ||
+ | #modprobe w83627ehf | ||
+ | |||
+ | reboot | ||
+ | |||
+ | </ | ||
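The date loop above walks n79 through n90 over ssh. The same pattern could be used after the reboot to confirm that the persistence and compute modes set in rc.local took effect. This is a minimal sketch, not part of the original page: the function names `node_names` and `check_gpu_modes` are hypothetical, and it assumes passwordless ssh to the nodes and `nvidia-smi` on the remote PATH.

```shell
#!/bin/bash
# Hypothetical helpers, not part of the original page.

# Print node names n<first>..n<last>, one per line.
node_names() {
    local first=$1 last=$2 i
    for i in $(seq "$first" "$last"); do
        echo "n$i"
    done
}

# Report index, persistence mode and compute mode for every GPU on each node.
# Assumes passwordless ssh and nvidia-smi on the remote PATH.
check_gpu_modes() {
    local node
    for node in $(node_names "$1" "$2"); do
        echo "== $node"
        ssh "$node" nvidia-smi \
            --query-gpu=index,persistence_mode,compute_mode \
            --format=csv,noheader
    done
}

# Usage on the head node:
#   check_gpu_modes 79 90
```

With the mwgpu/exx96 settings shown above (persistence mode 1, compute mode 0), each GPU line would be expected to report persistence enabled and the default compute mode.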
+ | |||
+ | ==== Recipe | ||
Steps. "Ala n37" ... so the RTX nodes are similar to the K20 nodes and we can put the local software in place. See [[cluster: | Steps. "Ala n37" ... so the RTX nodes are similar to the K20 nodes and we can put the local software in place. See [[cluster: | ||
Line 12: | Line 94: | ||
< | < | ||
- | yum install tcl tcl-devel dmtcp | + | # hook up DVI-D cable to GPU port (offboard video) | ||
- | yum install freeglut-devel libXi-devel libXmu-devel \ make mesa-libGLU-devel | + | # log in as root, check some things out... | ||
- | yum install blas blas-devel lapack lapack-devel boost boost-devel | + | free -g |
- | yum install tkinter lm_sensors lm_sensors-libs | + | nvidia-smi |
- | yum install zlib-devel bzip2-devel bzip bzip-devel | + | docker images |
- | yum install openmpi openmpi-devel perl-ExtUtils-MakeMaker | + | docker ps |
- | yum install cmake cmake-devel | + | # set local time zone |
- | yum install libjpeg libjpeg-devel libjpeg-turbo-devel | + | mv / |
+ | ln -s / | ||
+ | # change passwords for root and vendor account | ||
+ | passwd | ||
+ | passwd exx | ||
+ | # set hostname | ||
+ | hostnamectl set-hostname n79 | ||
+ | # configure private subnets and ping file server | ||
+ | cd / | ||
+ | vi ifcfg-enp1s0f0 | ||
+ | vi ifcfg-enp1s0f1 | ||
+ | systemctl restart network | ||
+ | ping -c 3 192.168.102.42 | ||
+ | ping -c 3 10.10.102.42 | ||
+ | # make internet connection for yum | ||
+ | ifdown enp1s0f0 | ||
+ | vi ifcfg-enp1s0f0 | ||
+ | systemctl restart network | ||
+ | dig google.com | ||
+ | yum install -y iptables-services | ||
+ | vi / | ||
+ | systemctl start iptables | ||
+ | iptables -L | ||
+ | systemctl stop firewalld | ||
+ | systemctl disable firewalld | ||
+ | # other configs | ||
+ | vi / | ||
+ | mv /home / | ||
+ | mkdir /home | ||
+ | vi /etc/passwd # (exx, dockeruser $HOME) | ||
+ | mkdir /sanscratch / | ||
+ | chmod ugo+rwx /sanscratch / | ||
+ | chmod o+t /sanscratch / | ||
+ | ln -s /home /share | ||
+ | ssh-keygen -t rsa | ||
+ | scp 10.10.102.253:/ | ||
+ | / | ||
+ | echo " | ||
+ | # add packages and update | ||
+ | yum install epel-release -y | ||
+ | yum install tcl tcl-devel dmtcp -y | ||
+ | yum install freeglut-devel libXi-devel libXmu-devel make mesa-libGLU-devel | ||
+ | yum install blas blas-devel lapack lapack-devel boost boost-devel | ||
+ | yum install tkinter lm_sensors lm_sensors-libs | ||
+ | yum install zlib-devel bzip2-devel bzip bzip-devel | ||
+ | yum install openmpi openmpi-devel perl-ExtUtils-MakeMaker | ||
+ | yum install cmake cmake-devel | ||
+ | yum install libjpeg libjpeg-devel libjpeg-turbo-devel | ||
+ | yum update -y | ||
yum clean all | yum clean all | ||
+ | # remove internet, bring private back up | ||
+ | ifdown enp1s0f0 | ||
+ | vi ifcfg-enp1s0f0 | ||
+ | ifup enp1s0f0 | ||
+ | # passwd, shadow, group, hosts, fstab | ||
+ | mkdir /homeextra1 /homeextra2 /home33 /mindstore | ||
+ | cd /etc/ | ||
+ | # backup files to -orig versions | ||
+ | scp 192.168.102.89:/ | ||
+ | scp 10.10.102.89:/ | ||
+ | vi /etc/fstab | ||
+ | mount -a; df -h | ||
+ | # pick the kernel vendor used for now | ||
+ | grep ^menuentry / | ||
+ | grub2-set-default 1 | ||
+ | ls -d / | ||
+ | grub2-mkconfig -o / | ||
+ | # | ||
+ | # old level 3 | ||
+ | systemctl set-default multi-user.target | ||
+ | reboot | ||
+ | # switch to VGA | ||
+ | cd / | ||
+ | tar zxf n37.chroot-keep.ul.tar.gz | ||
+ | cd usr/local/ | ||
+ | mv amber16/ | ||
+ | mv cuda-9.2/ / | ||
+ | cd / | ||
+ | rsync -vac 10.10.102.89:/ | ||
+ | # test scripts gpu-free, gpu-info, gpu-process | ||
+ | 0,1,2,3 | ||
+ | id, | ||
+ | 0, GeForce RTX 2080 SUPER, 25, 126 MiB, 7855 MiB, 0 %, 0 % | ||
+ | 1, GeForce RTX 2080 SUPER, 24, 11 MiB, 7971 MiB, 0 %, 0 % | ||
+ | 2, GeForce RTX 2080 SUPER, 23, 11 MiB, 7971 MiB, 0 %, 0 % | ||
+ | 3, GeForce RTX 2080 SUPER, 23, 11 MiB, 7971 MiB, 0 %, 0 % | ||
+ | gpu_name, gpu_bus_id, pid, process_name | ||
+ | GeForce RTX 2080 SUPER, 00000000: | ||
+ | # done | ||
</ | </ | ||
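The recipe ends by running the gpu-free, gpu-info and gpu-process test scripts, whose contents are not shown on this page. As a rough illustration only, output of that shape could be produced with `nvidia-smi` queries. The sketch below is an assumption, not the site's actual scripts: the query fields were chosen to resemble the sample rows, and the function name `free_gpus` and the "0 % utilization means free" rule are hypothetical.

```shell
#!/bin/bash
# Hypothetical sketch; NOT the site's actual gpu-free/gpu-info/gpu-process scripts.

# Read "index, utilization" CSV lines on stdin and print the indexes of
# GPUs with 0 % utilization as one comma-separated list.
free_gpus() {
    awk -F', *' '$2 == 0 { ids = ids (ids == "" ? "" : ",") $1 }
                 END     { print ids }'
}

# On a GPU node (assumes nvidia-smi is on PATH):
#
#   # list of idle GPU ids
#   nvidia-smi --query-gpu=index,utilization.gpu \
#       --format=csv,noheader,nounits | free_gpus
#
#   # per-GPU status table
#   nvidia-smi --format=csv \
#       --query-gpu=index,name,temperature.gpu,memory.used,memory.free,utilization.gpu,utilization.memory
#
#   # processes currently on the GPUs
#   nvidia-smi --format=csv \
#       --query-compute-apps=gpu_name,gpu_bus_id,pid,process_name
```

Keeping the parsing in a function that reads stdin makes it easy to try out with canned input on a machine without GPUs.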
- | ==== WhatWeGot? | + | ==== What We Purchased |
* 12 nodes yielding a total of | * 12 nodes yielding a total of | ||
- | * +24 cpus | + | * 24 cpus |
- | * +288 cpu cores | + | * 288 cpu cores |
- | * +1,152 gb cpu mem | + | * 1,152 gb cpu mem |
- | * +48 gpus | + | * ~20 Tflops (dpfp) |
- | * +384 gpu mem | + | * 48 gpus |
- | * these rtx gpus will add 695 Tflops | + | * 384 gb gpu mem | ||
- | * blows me away | + | * ~700 Tflops |
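The totals above can be broken down per node with simple division. This is a small sketch that assumes all 12 nodes are identical and reads the gpu memory figure as gb; the helper name `share` is hypothetical.

```shell
#!/bin/bash
# Totals come from the list above; the even split across nodes is an assumption.
nodes=12
gpus=48

# Integer share of a total for one unit (node or gpu).
share() { echo $(( $1 / $2 )); }

echo "cpus per node:    $(share 24 "$nodes")"
echo "cores per node:   $(share 288 "$nodes")"
echo "cpu mem per node: $(share 1152 "$nodes") gb"
echo "gpus per node:    $(share "$gpus" "$nodes")"
echo "gpu mem per gpu:  $(share 384 "$gpus") gb"
```

The gpu memory result is consistent with the per-GPU used and free memory that the test scripts report in the recipe section.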
< | < | ||
Line 116: | Line 285: | ||
{{: | {{: | ||
- | {{: | + | {{: |
{{: | {{: | ||
{{: | {{: | ||
{{: | {{: | ||
+ | {{: | ||
+ | {{: | ||
\\ | \\ | ||
**[[cluster: | **[[cluster: |