A page for me on how these 12 nodes were built up after they arrived, to make them "ala n37", which was the test node in redoing our K20 nodes; see [[cluster:172|K20 Redo]].
  
Page best followed bottom to top.
  
==== Usage ====
  
The new queue ''exx96'' comprises nodes ''n79-n90''. Each node holds 4x RTX 2080S gpus, 2x Xeon Silver 4214 2.2 GHz cpus, 96 GB memory and a 1TB SSD. ''/localscratch'' is around 800 GB.

A new static resource is introduced for all nodes holding gpus: ''n78'' in queue ''amber128'' and ''n33-n37'' in queue ''mwgpu''. The name of this resource is ''gpu4''. Moving forward please use it instead of ''gpu'' or ''gputest''.

The wrappers provided assume your cpu:gpu ratio is 1:1, hence in your submit script you will have ''#BSUB -n 1'' and in your resource allocation line ''gpu4=1''. If your ratio is something else you can set CPU_GPU_REQUEST; for example, CPU_GPU_REQUEST=4:2 expects the lines ''#BSUB -n 4'' and ''gpu4=2'' in your submit script.

The wrappers (78.mpich3.wrapper for n78, and n37.openmpi.wrapper for all others) are located in ''/usr/local/bin'' and will set up the environment and start these applications: amber, lammps, gromacs, matlab and namd. An example submit script follows the sample output below.

<code>
bhosts -l n79
             gputest gpu4
 Total                3
 Reserved        0.0  0.1

lsload -l n79

HOST_NAME               status  r15s   r1m  r15m   ut    pg    io  ls    it   tmp   swp   mem    gpu
n79                         ok   0.0   0.0   0.0   0%   0.0       0 2e+08  826G   10G   90G    3.0

mdout.325288: Master Total CPU time:          982.60 seconds     0.27 hours  1:1
mdout.325289: Master Total CPU time:          611.08 seconds     0.17 hours  4:2
mdout.326208: Master Total CPU time:          537.97 seconds     0.15 hours 36:4

#BSUB -n 4
#BSUB -R "rusage[gpu4=2:mem=6288],span[hosts=1]"
export CPU_GPU_REQUEST=4:2

</code>
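
For the default 1:1 ratio a submit script might look like the sketch below. The queue, resource and wrapper names come from this page; the job name, output files and the amber ''pmemd.cuda'' invocation are placeholders, and the exact way the wrapper is invoked may differ.

<code>
#!/bin/bash
# minimal sketch: 1 cpu, 1 gpu on the exx96 queue
#BSUB -q exx96
#BSUB -n 1
#BSUB -R "rusage[gpu4=1:mem=6288],span[hosts=1]"
#BSUB -J test-gpu4
#BSUB -o out.%J
#BSUB -e err.%J

# at a 1:1 ratio CPU_GPU_REQUEST is not needed; for 4 cpus and 2 gpus
# you would use #BSUB -n 4, gpu4=2 and: export CPU_GPU_REQUEST=4:2

# the wrapper sets up the environment and starts the application
n37.openmpi.wrapper pmemd.cuda -O -i mdin -o mdout -p prmtop -c inpcrd
</code>

Submit with ''bsub < myjob.sub'' as usual.
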
==== Miscellaneous ====

Install the scheduler RPM for CentOS 7, reconfigure (hosts, queue, static resource), elim. Test it out with the old wrapper.
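
An LSF elim just writes ''<nresources> <name> <value>'' lines to stdout in a loop. Below is a hypothetical elim for ''gpu4''; the free-gpu counting method and the 60 second interval are assumptions, not taken from this page.

<code>
#!/bin/bash
# hypothetical elim.gpu4, dropped in $LSF_SERVERDIR
while true; do
  # gpus present minus gpus running a compute process (assumed method)
  TOTAL=`nvidia-smi --query-gpu=index --format=csv,noheader | wc -l`
  BUSY=`nvidia-smi --query-compute-apps=gpu_uuid --format=csv,noheader | sort -u | wc -l`
  echo "1 gpu4 $((TOTAL-BUSY))"
  sleep 60
done
</code>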

Edit the n37.openmpi.wrapper for n33-n37 and n79-n90, and the one on n78, for the new static resource ''gpu4''.

Add nodes to ZenOSS hpcmon.

Propagate global ''known_hosts'' files in users' ~/.ssh/ dirs.

Look at how accounting ties in with resource request ''gpu4='' versus ''gpu='' ...
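
One way to peek, sketched below with a job ID from the Usage section: ''bhist -l'' prints a job's requested resources, so the rusage string shows whether ''gpu4='' or ''gpu='' was requested, and ''bacct -l'' gives the accounting record (the grep pattern is just a guess at the relevant line).

<code>
bhist -l 325288 | grep -i rusage
bacct -l 325288
</code>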

<code>

# propagate global passwd, shadow, group, hosts file
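# one way to do it (a sketch; assumes root ssh keys are already in place):
for f in passwd shadow group hosts; do
  for i in `seq 79 90`; do scp /etc/$f n$i:/etc/$f; done
done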

# add to date_ctt2.sh script, get and set date

NOW=`/bin/date +%m%d%H%M%Y.%S`
for i in `seq 79 90`; do echo n$i; ssh n$i date $NOW; done

# crontab

# ionice gaussian
0,15,30,45 * * * * /share/apps/scripts/ionice_lexes.sh  > /dev/null 2>&1

# cpu temps
40 * * * * /share/apps/scripts/lm_sensors.sh > /dev/null 2>&1

# rc.local, chmod o+x /etc/rc.d/rc.local, then add

# for mapd, 'All On' enable graphics rendering support
#/usr/bin/nvidia-smi --gom=0

# for amber16 -pm=ENABLED -c=EXCLUSIVE_PROCESS
#nvidia-smi --persistence-mode=1
#nvidia-smi --compute-mode=1

# for mwgpu/exx96 -pm=ENABLED -c=DEFAULT
nvidia-smi --persistence-mode=1
nvidia-smi --compute-mode=0

# turn ECC off (memory scrubbing)
#/usr/bin/nvidia-smi -e 0

# lm_sensor
modprobe coretemp
modprobe tmp401
#modprobe w83627ehf

reboot

</code>

==== Recipe ====

Steps, "ala n37", so the RTX nodes are set up like the K20 nodes and we can put the local software in place; see the [[cluster:172|K20 Redo]] page. First we add these packages and clean up.

<code>

# hook up DVI-D cable to GPU port (offboard video)
# login as root, check some things out...
free -g
nvidia-smi
docker images
docker ps
# set local time zone
mv /etc/localtime /etc/localtime.backup
ln -s /usr/share/zoneinfo/America/New_York /etc/localtime
# change passwords for root and vendor account
passwd
passwd exx
# set hostname
hostnamectl set-hostname n79
# configure private subnets and ping file server
cd /etc/sysconfig/network-scripts/
vi ifcfg-enp1s0f0
vi ifcfg-enp1s0f1
systemctl restart network
ping -c 3 192.168.102.42
ping -c 3 10.10.102.42
# make internet connection for yum
ifdown enp1s0f0
vi ifcfg-enp1s0f0
systemctl restart network
dig google.com
yum install -y iptables-services
vi /etc/sysconfig/iptables
systemctl start iptables
iptables -L
systemctl stop firewalld
systemctl disable firewalld
# other configs
vi /etc/selinux/config (disabled)
mv /home /usr/local/
mkdir /home
vi /etc/passwd (exx, dockeruser $HOME)
mkdir /sanscratch /localscratch
chmod ugo+rwx /sanscratch /localscratch
chmod o+t /sanscratch /localscratch
ln -s /home /share
ssh-keygen -t rsa
scp 10.10.102.253:/root/.ssh/authorized_keys /root/.ssh/
vi /etc/ssh/sshd_config (PermitRootLogin)
echo "relayhost = 192.168.102.42" >> /etc/postfix/main.cf
# add packages and update
yum install epel-release -y
yum install tcl tcl-devel dmtcp -y
yum install freeglut-devel libXi-devel libXmu-devel make mesa-libGLU-devel -y
yum install blas blas-devel lapack lapack-devel boost boost-devel -y
yum install tkinter lm_sensors lm_sensors-libs -y
yum install zlib-devel bzip2 bzip2-devel -y
yum install openmpi openmpi-devel perl-ExtUtils-MakeMaker -y
yum install cmake cmake-devel -y
yum install libjpeg libjpeg-devel libjpeg-turbo-devel -y
yum update -y
yum clean all
# remove internet, bring private back up
ifdown enp1s0f0
vi ifcfg-enp1s0f0
ifup enp1s0f0
# passwd, shadow, group, hosts, fstab
mkdir /homeextra1 /homeextra2 /home33 /mindstore
cd /etc/
# backup files to -orig versions
scp 192.168.102.89:/etc/passwd /etc/passwd (and others)
scp 10.10.102.89:/etc/fstab /tmp
vi /etc/fstab
mount -a; df -h
# pick the kernel vendor used for now
grep ^menuentry /etc/grub2.cfg
grub2-set-default 1
ls -d /sys/firmware/efi && echo "EFI" || echo "Legacy"
grub2-mkconfig -o /boot/grub2/grub.cfg
#grub2-mkconfig -o /boot/efi/EFI/centos/grub.cfg
# old level 3
systemctl set-default multi-user.target
reboot
# switch to VGA
cd /usr/local/src/
tar zxf n37.chroot-keep.ul.tar.gz
cd usr/local/
mv amber16/ fsl-5.0.10/ gromacs-2018/ lammps-22Aug18/ /usr/local/
mv cuda-9.2/ /usr/local/n37-cuda-9.2/
cd /usr/local/bin/
rsync -vac 10.10.102.89:/usr/local/bin/ /usr/local/bin/
# test scripts gpu-free, gpu-info, gpu-process
0,1,2,3
id,name,temp.gpu,mem.used,mem.free,util.gpu,util.mem
0, GeForce RTX 2080 SUPER, 25, 126 MiB, 7855 MiB, 0 %, 0 %
1, GeForce RTX 2080 SUPER, 24, 11 MiB, 7971 MiB, 0 %, 0 %
2, GeForce RTX 2080 SUPER, 23, 11 MiB, 7971 MiB, 0 %, 0 %
3, GeForce RTX 2080 SUPER, 23, 11 MiB, 7971 MiB, 0 %, 0 %
gpu_name, gpu_bus_id, pid, process_name
GeForce RTX 2080 SUPER, 00000000:3B:00.0, 3109, python
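# the csv output above can be produced with nvidia-smi queries like these
# (an assumption about what gpu-info/gpu-process wrap, not taken from the scripts):
# nvidia-smi --query-gpu=index,name,temperature.gpu,memory.used,memory.free,utilization.gpu,utilization.memory --format=csv,noheader
# nvidia-smi --query-compute-apps=gpu_name,gpu_bus_id,pid,process_name --format=csv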
# done

</code>

==== What We Purchased ====
  
  * 12 nodes yielding a total of
    * 24 cpus
    * 288 cpu cores
    * 1,152 gb cpu mem
    * ~20 Tflops (dpfp)
    * 48 gpus
    * 384 gb gpu mem
    * ~700 Tflops (mixed mode)
  
  
{{:cluster:ssd_small.JPG?nolink&300|}} Yea, found 1TB SSD \\
{{:cluster:hdmi_small.JPG?nolink&300|}} Ports on gpu \\
{{:cluster:gpu_small.JPG?nolink&300|}} GPU detail, blower model \\
{{:cluster:back_small.JPG?nolink&300|}} Back, gpus stacked 2 on 2 \\
{{:cluster:front_small.JPG?nolink&300|}} Front, all drive bays empty \\
{{:cluster:rack_small.JPG?nolink&300|}} Racking \\
{{:cluster:boxes_small.JPG?nolink&300|}} Boxes \\
  
\\
**[[cluster:0|Back]]**