cluster:192 [2020/02/21 15:50] hmeij07
cluster:192 [2020/02/26 18:34] hmeij07 [Usage]
A page for me on how these 12 nodes were built up after they arrived, to make them "ala n37", which was the test node in redoing our K20 nodes; see [[cluster:172|K20 Redo]].
  
Page is best followed bottom to top.

==== Usage ====

The new queue ''exx96'' comprises nodes ''n79-n90''. Each node holds 4x RTX2080S gpus, 2x Xeon Silver 4214 2.2 GHz cpus, 96 GB memory and a 1 TB SSD. ''/localscratch'' is around 800 GB.

A new static resource is introduced for all nodes holding gpus, including ''n78'' in queue ''amber128'' and ''n33-n37'' in queue ''mwgpu''. The name of this resource is ''gpu4''. Moving forward please use it instead of ''gpu'' or ''gputest''.

The wrappers provided assume your cpu:gpu ratio is 1:1, hence in your submit script you will have ''#BSUB -n 1'' and in your resource allocation line ''gpu4=1''. If your ratio is something else you can set CPU_GPU_REQUEST, for example CPU_GPU_REQUEST=4:2, which expects the lines ''#BSUB -n 4'' and ''gpu4=2'' in your submit script.
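A wrapper that honors CPU_GPU_REQUEST presumably splits it on the colon; a minimal sketch of that parsing is below (the variable names ''ncpu''/''ngpu'' are illustrative, not taken from the actual wrapper):

<code bash>
# parse CPU_GPU_REQUEST as cpu:gpu, falling back to the 1:1 default
# (ncpu/ngpu are hypothetical names, not the wrapper's own)
CPU_GPU_REQUEST=${CPU_GPU_REQUEST:-1:1}
ncpu=${CPU_GPU_REQUEST%:*}   # text before the colon -> cpu count
ngpu=${CPU_GPU_REQUEST#*:}   # text after the colon  -> gpu count
echo "cpus=$ncpu gpus=$ngpu"
</code>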

The wrappers (78.mpich3.wrapper for n78, and n37.openmpi.wrapper for all others) are located in ''/usr/local/bin'' and will set up the environment and start these applications: amber, lammps, gromacs, matlab and namd.
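Putting the pieces together, a submit script at the default 1:1 ratio might look like the sketch below (the job name, output file names and the amber input files are placeholders, not from this page):

<code bash>
#!/bin/bash
# hypothetical amber job on the new queue, 1 cpu : 1 gpu (the wrappers' default)
#BSUB -q exx96
#BSUB -n 1
#BSUB -R "rusage[gpu4=1],span[hosts=1]"
#BSUB -J gpu4test
#BSUB -o out.%J
#BSUB -e err.%J

# the wrapper sets up the environment before launching the application
n37.openmpi.wrapper pmemd.cuda -O -i mdin -o mdout -p prmtop -c inpcrd
</code>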

<code>
bhosts -l n79
             gputest gpu4
 Total                3
 Reserved        0.0  0.1

lsload -l n79

HOST_NAME               status  r15s   r1m  r15m   ut    pg    io  ls    it   tmp   swp   mem    gpu
n79                         ok   0.0   0.0   0.0   0%   0.0       0 2e+08  826G   10G   90G    3.0

mdout.325288: Master Total CPU time:          982.60 seconds     0.27 hours  1:1
mdout.325289: Master Total CPU time:          611.08 seconds     0.17 hours  4:2
mdout.326208: Master Total CPU time:          537.97 seconds     0.15 hours 36:4

#BSUB -n 4
#BSUB -R "rusage[gpu4=2:mem=6288],span[hosts=1]"
export CPU_GPU_REQUEST=4:2

</code>
==== Miscellaneous ====

Install the scheduler RPM for CentOS7, reconfigure (hosts, queue, static resource), elim. Test it out with the old wrapper.

Edit the n37.openmpi.wrapper for n33-n37 and n79-n90, and the one on n78, for the new static resource ''gpu4''.
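To verify the new static resource is visible after the reconfigure, the standard LSF queries can be used (shown here against ''gpu4''; this assumes a stock LSF install and the output will vary per site):

<code bash>
# does the LIM know about the resource at all?
lsinfo | grep gpu4

# per-host totals and reservations for the shared numeric resource
lshosts -s gpu4
bhosts -s gpu4
</code>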

Add nodes to ZenOSS hpcmon.

Propagate the global ''known_hosts'' file into users' ~/.ssh/ dirs.
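A minimal sketch of that propagation, assuming the master copy lives at a hypothetical ''/usr/local/etc/ssh_known_hosts'' and home dirs sit under ''/home'' (both are assumptions, adjust for the local layout):

<code bash>
# push the global known_hosts into every user's ~/.ssh/
SRC=/usr/local/etc/ssh_known_hosts   # assumed location of the master copy
for d in /home/*; do
    u=$(basename "$d")
    [ -d "$d/.ssh" ] || continue     # skip accounts without a ~/.ssh
    cp "$SRC" "$d/.ssh/known_hosts"
    chown "$u": "$d/.ssh/known_hosts"
    chmod 644 "$d/.ssh/known_hosts"
done
</code>

Run as root so the chown succeeds; an rsync over the node list would work the same way for the passwd/shadow/group/hosts files mentioned below.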

Look at how accounting ties in with the resource request ''gpu4='' versus ''gpu='' ...

<code>

# propagate global passwd, shadow, group, hosts file

# add to date_ctt2.sh script, get and set date

NOW=`/bin/date +%m%d%H%M%Y.%S`
for i in `seq 79 90`; do echo n$i; ssh n$i date $NOW; done

# crontab

# ionice gaussian
0,15,30,45 * * * * /share/apps/scripts/ionice_lexes.sh  > /dev/null 2>&1

# cpu temps
40 * * * * /share/apps/scripts/lm_sensors.sh > /dev/null 2>&1

# rc.local, chmod o+x /etc/rc.d/rc.local, then add

# for mapd, 'All On' enable graphics rendering support
#/usr/bin/nvidia-smi --gom=0

# for amber16 -pm=ENABLED -c=EXCLUSIVE_PROCESS
#nvidia-smi --persistence-mode=1
#nvidia-smi --compute-mode=1

# for mwgpu/exx96 -pm=ENABLED -c=DEFAULT
nvidia-smi --persistence-mode=1
nvidia-smi --compute-mode=0

# turn ECC off (memory scrubbing)
#/usr/bin/nvidia-smi -e 0

# lm_sensor
modprobe coretemp
modprobe tmp401
#modprobe w83627ehf

reboot

</code>
  
==== Recipe ====
  
{{:cluster:ssd_small.JPG?nolink&300|}} Yea, found 1T SSD \\
{{:cluster:hdmi_small.JPG?nolink&300|}} ports on gpu \\
{{:cluster:gpu_small.JPG?nolink&300|}} GPU detail, blower model \\
{{:cluster:back_small.JPG?nolink&300|}} Back, gpus stacked 2 on 2 \\
cluster/192.txt · Last modified: 2022/03/08 18:29 by hmeij07