\\
**[[cluster:0|Back]]**

Update 
 --- //[[hmeij@wesleyan.edu|Henk]] 2021/02/12 14:27//


----

For CUDA_ARCH (or ''nvcc -arch'') versions, check the [[http://arnon.dk/matching-sm-architectures-arch-and-gencode-for-various-nvidia-cards/|Matching CUDA arch and CUDA gencode for various NVIDIA architectures]] web page. "When you compile CUDA code, you should always compile only one ‘-arch‘ flag that matches your most used GPU cards. This will enable faster runtime, because code generation will occur during compilation." //All Turing gpu models (RTX2080, RTX5000 and RTX6000) use CUDA_ARCH sm_75.// The first model is consumer grade, the latter two models are enterprise grade. See performance differences below. The consumer grade RTX3060Ti is CUDA_ARCH sm_86 (Ampere).
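
A minimal sketch of how the "one ''-arch'' flag for your most used cards" advice could be applied to a mixed inventory. The model-to-arch table is taken from the paragraph above; the helper name ''nvcc_arch_flag'' and the example inventory are hypothetical.

```python
from collections import Counter

# GPU model -> CUDA architecture, as listed in the paragraph above.
CUDA_ARCH = {
    "RTX2080": "sm_75",    # Turing, consumer grade
    "RTX5000": "sm_75",    # Turing, enterprise grade
    "RTX6000": "sm_75",    # Turing, enterprise grade
    "RTX3060Ti": "sm_86",  # Ampere, consumer grade
}


def nvcc_arch_flag(models):
    """Return a single nvcc -arch flag for the most common architecture."""
    counts = Counter(CUDA_ARCH[m] for m in models)
    arch, _ = counts.most_common(1)[0]
    return f"-arch={arch}"


# Hypothetical inventory: four RTX5000 cards and one RTX3060Ti.
inventory = ["RTX5000"] * 4 + ["RTX3060Ti"]
print(nvcc_arch_flag(inventory))  # -arch=sm_75
```

The resulting string would be passed to ''nvcc'' on the compile line; cards of another architecture would need their own build.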

----

A detailed review and comparison of GeForce gpus, including the Quadro RTX 5000 and RTX 2080 (Ti and S), can be found at this [[https://www.servethehome.com/nvidia-quadro-rtx-5000-review-gpu/|NVIDIA Quadro RTX 5000 Review The Balanced Quadro GPU]] website. Deep Learning oriented performance results showing most of the applicable precision modes are on **page 6** (INT8, FP16, FP32).

  * Noteworthy re RTX2080S
    * VendorB "RTX 2080 Super are EOL"
    * VendorA "nigh impossible to obtain any of them"
  * Noteworthy re RTX3060Ti
    * VendorB "The 3060 Ti does not have the proper cooling for data center use and are not built for that environment"
    * VendorA "Lead times on the new GPUs are generally 2 months or more."

^  VendorB1  ^^  Notes  ^  VendorA1  ^^^  VendorA2  ^^^
|  |||  |||  |||
^  Head Node   ^^  incl switches  ^  Head Node   ^^^  Head Node   ^^^
| Rack | 1U  |  |  1U |||  same |||
| Power | 1+1  |  208V  |  1+1 |||  same |||
| Nic | 2x1G+4x10G  |  +PCI  |  4x10G |||  same |||
| Rails | 25  |    |  25-33 |||  same |||
| CPU | 2x6226R  |  Gold  |  2x5222 |||  same |||
| cores | 2x16  |  Physical  |  2x4 |||  same |||
| ghz | 2.9  |    |  3.8 |||  same |||
| ddr4 | 192  |  gb  |  96 |||  same |||
| hdd | 2x480G  |   ssd (raid1)   |  2x960 |||  same |||
| centos | 8  |  yes  |  8 |||  same |||
| OpenHPC | yes  |  "best effort"  |  no |||  same |||
^  GPU Compute Node    ^^    ^  GPU Compute Node   ^^^  GPU Compute Node   ^^^
| Rack | 2U  |    |  4U |||  same |||
| Power | 1  |  208V  |  1+1 |||  same |||
| Nic | 2x1G+2x10G  |  +PCI  |  2x10G |||  same |||
| Rails | ?  |    |  26-36 |||  same |||
| CPU | 2x4214R  |  Silver  |  2x4214R |||  same |||
| cores | 2x12  |  Physical  |  2x12 |||  same |||
| ghz | 2.4  |    |  2.4 |||  same |||
| ddr4 | 192  |  gb  |  192 |||  same |||
| hdd | 480G  |  <ssd,sata>  |  2T |||  same |||
| centos | 8  |  with gpu drivers, toolkit  |  8 |||  same |||
| GPU | 4x(RTX 5000)  |  active cooling  |  4x(RTX 5000) |||  4x(RTX 6000) |||
| gddr6 | 16  |  gb  |  16 |||  24 |||
^   ^^^   ^^^  ^^^
| Switch | 1x(8+1)  |  <-- add self spare!  |  2x(16+2) |||  same |||
| S&H | tbd  |  |  tbd |||  tbd |||
| Δ | -5  |   target budget $k   |  -2.8 |||  +1.5 |||


  * RTX 5000 gpu compute capacity depends on compute mode
    * 0.35 TFLOPS (FP64), 11.2 TFLOPS (FP32), 22.3 TFLOPS (FP16), 178.4 TOPS (INT8)
  * RTX 6000 gpu compute capacity depends on compute mode
    * 0.51 TFLOPS (FP64), 16.3 TFLOPS (FP32), 32.6 TFLOPS (FP16), 261.2 TOPS (INT8)
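
To put the per-gpu figures above in node terms, here is a small sketch that scales them to the 4-gpu compute nodes in the table and compares the two models. The numbers are copied from the list above (the INT8 entry counts integer operations); the helper names are hypothetical.

```python
# Per-gpu figures from the list above, in tera-operations per second.
RTX5000 = {"FP64": 0.35, "FP32": 11.2, "FP16": 22.3, "INT8": 178.4}
RTX6000 = {"FP64": 0.51, "FP32": 16.3, "FP16": 32.6, "INT8": 261.2}
GPUS_PER_NODE = 4  # from the GPU row of the table (4x per compute node)


def node_capacity(per_gpu, gpus=GPUS_PER_NODE):
    """Scale per-gpu capacity to a whole node, per compute mode."""
    return {mode: round(value * gpus, 2) for mode, value in per_gpu.items()}


def ratio(a, b):
    """Capacity of model a relative to model b, per compute mode."""
    return {mode: round(a[mode] / b[mode], 2) for mode in a}


print(node_capacity(RTX5000)["FP32"])  # 44.8
print(node_capacity(RTX6000)["FP32"])  # 65.2
print(ratio(RTX6000, RTX5000))
```

In every mode the RTX 6000 comes out at roughly 1.46 times the RTX 5000, while its memory is 1.5 times larger (24 vs 16 gb).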

From NVIDIA's GeForce forums web site:

<code>

Quadro RTX 5000 vs RTX 2080 

both have effective 14000Mhz GDDR6
both have 64 ROPS.

5000 has 16GB vs 2080's 8GB
5000 has 192 TMU's vs the 2080's 184
5000 has 3072 shaders vs the 2080's 2944

the 5000 has a base clock of 1350 and average boost to 1730
the 2080 has a base clock of 1515 and average boost to 1710
the 5000 has 384 tensor cores vs the 2080's 368.
the 5000 has 48 RT cores vs the 2080's 46.

5000
Pixel Rate    110.7 GPixel/s 
Texture Rate    332.2 GTexel/s 
FP16 (half) performance    166.1 GFLOPS (1:64) 
FP32 (float) performance    10,629 GFLOPS 
FP64 (double) performance    332.2 GFLOPS (1:32)

2080
Pixel Rate    109.4 GPixel/s 
Texture Rate    314.6 GTexel/s 
FP16 (half) performance    157.3 GFLOPS (1:64) 
FP32 (float) performance    10,068 GFLOPS 
FP64 (double) performance    314.6 GFLOPS (1:32) 

</code>
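
The FP32 numbers in the forum quote above can be reproduced from the shader counts and boost clocks it lists. This is a sketch assuming 2 floating point operations per shader per cycle (one fused multiply-add); the 1:32 and 1:64 ratios are the ones printed in the quote. Function names are hypothetical.

```python
def fp32_gflops(shaders, boost_mhz, flops_per_cycle=2):
    """Theoretical FP32 GFLOPS; 2 FLOPs per cycle assumes one FMA per shader."""
    return shaders * flops_per_cycle * boost_mhz / 1000.0


def derived(shaders, boost_mhz):
    """FP32 plus the FP64 (1:32) and FP16 (1:64) figures quoted above."""
    fp32 = fp32_gflops(shaders, boost_mhz)
    return {
        "FP32": round(fp32),
        "FP64": round(fp32 / 32, 1),
        "FP16": round(fp32 / 64, 1),
    }


print(derived(3072, 1730))  # Quadro RTX 5000
print(derived(2944, 1710))  # RTX 2080
```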


==== Cottontail2 ====

The next step in the evolution of our HPCC platform involves a new primary login node (from ''cottontail'' to ''cottontail2'', to be purchased in early 2021) with a migration to the [[https://openhpc.community/|OpenHPC]] platform and the [[https://slurm.schedmd.com/documentation.html|Slurm]] scheduler. The proposals below are for one head node plus 2 compute nodes for a test and learn setup. The compute nodes are vastly different so that Slurm resource discovery and allocation can be tested, along with the scheduler's Fairshare policy. It is also a chance to test out the A100 gpu.
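
As an illustration of what the two different compute nodes might look like to Slurm, here is a sketch that prints ''slurm.conf'' style node definitions. The socket, core and memory values come from the VendorA column of the table below. Assumptions, not decisions: the node names ''n91'' and ''n92'' and which one holds the gpu, ''ThreadsPerCore=1'' (hyperthreading off), ''RealMemory'' rounded down to 192000 MB, and the gres type label ''a100''.

```python
def slurm_node_line(name, sockets, cores_per_socket, mem_gb, gres=None):
    """Build one slurm.conf NodeName line (RealMemory is given in MB)."""
    parts = [
        f"NodeName={name}",
        f"Sockets={sockets}",
        f"CoresPerSocket={cores_per_socket}",
        "ThreadsPerCore=1",                # assumption: hyperthreading off
        f"RealMemory={mem_gb * 1000}",     # assumption: conservative MB value
    ]
    if gres:
        parts.append(f"Gres={gres}")
    return " ".join(parts)


# Hypothetical assignment: n91 is the cpu node, n92 the cpu-gpu node.
print(slurm_node_line("n91", 2, 16, 192))
print(slurm_node_line("n92", 2, 10, 192, gres="gpu:a100:1"))
```

Two nodes that differ in core count and gres give the scheduler something to discover and allocate against.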

We are switching to an RJ45 10GBase-T network in this migration and adopting CentOS 8 (possibly the Stream version as events unfold ... [[https://www.hpcwire.com/off-the-wire/centos-project-shifts-focus-to-centos-stream/|CentOS Stream]] or [[http://rockylinux.org|Rocky Linux]]).


**Whoooo! Check this out** https://almalinux.org/
  * rhel 1:1 feature compatible, thus centos
  * simply switch repos
  * out Q1/2021

We are also sticking to a single private network for scheduler and home directory traffic, at 10G, for each node in the new environment. The second 10G interface (onboot=no) could be brought up for future use in some scenario. Maybe a second switch for network redundancy. Keeping private network 192.168.x.x for openlava/warewulf6 traffic and private network 10.10.x.x for slurm/warewulf8 traffic avoids conflicts.
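
A quick sanity check of the two private address ranges, sketched with Python's standard ''ipaddress'' module. The /16 prefix lengths are an assumption (the text only gives 192.168.x.x and 10.10.x.x), and the example host addresses are hypothetical.

```python
import ipaddress

# Assumed /16 networks for the two private ranges named above.
OPENLAVA_NET = ipaddress.ip_network("192.168.0.0/16")  # openlava/warewulf6
SLURM_NET = ipaddress.ip_network("10.10.0.0/16")       # slurm/warewulf8


def which_network(addr):
    """Return which private network an address belongs to, or None."""
    ip = ipaddress.ip_address(addr)
    if ip in OPENLAVA_NET:
        return "openlava"
    if ip in SLURM_NET:
        return "slurm"
    return None


print(OPENLAVA_NET.overlaps(SLURM_NET))  # False
print(which_network("10.10.102.91"))     # slurm
print(which_network("192.168.102.79"))   # openlava
```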

The storage network is on 1G; perhaps we can upgrade this later as the 10G network grows (options were 6x1G or 4x10G). Or we move to 10G by adding a replication partner in 3 years and switching roles between the TrueNAS/ZFS units. (LACP the 6x1G into 3x2G.)

Lots of old compute nodes will remain on the 1G network. Maybe the newest hardware (n79-n90 nodes with RTX2080S gpus) could be upgraded to 10G using PCI cards?

  * **Mental Note** should we participate in InCommon Federation Identity Management?
    * would likely be very messy and break current account creation/propagation
  * **Wait** for ERN's Architecture and Federation Working Group support services (mid 2021?)
    * InCommon, Internet2, OpenID, whatever ERN working group comes up with...
  * **Ugh** The entire cpu/gpu usage accounting needs to be recreated
    * both schedulers' results need merging together for some time?

^  ^VendorA  ^ VendorB  ^VendorC  ^Notes  ^
|  |||||
^  Head Node  ^^^^^
| Rack | 1U | 1U | 1U |  |  
| Power | 1+1 | 1+1 | 1+1 | 208V |
| Nic | 4x10G | 2x1G,2x10G | 4x10G  | B:4x10G on PCI? |
| Rails |26-33 | 25 | ? |  |
| CPU | 2x5222 | 2x6226R | 2x5222 | Gold, Gold, Gold |
| cores | 2x4 | 2x16 | 2x4 | Physical |
| ghz | 3.8 | 2.9 | 3.8 |  |
| ddr4 | 96 | 192 | 96 | gb |
| hdd | 2x960G | 2x480G | 2x480 | ssd, ssd, ssd (raid1) |
| centos | 8 | 8 | no |  |
| OpenHPC | no | yes | no | y="best effort" |
^  CPU Compute Node  ^^^^^
| Rack | 1U | 2U | 1U |  | 
| Power | 1+1 | 1 | 1+1 | 208V |
| Nic | 2x10G | 2x1G,2x10G | 2x10G | B:4x10G on PCI? |
| Rails | 26-33 | ? | ? |  |
| CPU | 2x6226R | 2x6226R | 2x6226R | Gold, Gold, Gold |
| cores | 2x16 | 2x16 | 2x16 | Physical |
| ghz | 2.9 | 2.9 | 2.9 |  |
| ddr4 | 192 | 192 | 192 | gb |
| hdd | 2T | 480G | 2x2T | sata, ssd, sata |
| centos | 8 | 8 | no |  |
^  CPU-GPU Compute Node  ^^^^^
| Rack | 4U | 2U | 1U |  |
| Power | 1+1 | 1 | 1+1 | 208V | 
| Nic | 2x10G | 2x1G,2x10G | 2x10G | B:4x10G on PCI? | 
| Rails | 26-36 | ? | ? |  |
| CPU | 2x4210R | 2x4214R | 2x4210R | Silver, Silver, Silver |
| cores | 2x10 | 2x12 | 2x10 | Physical |
| ghz | 2.4 | 2.4 | 2.4 |  |
| ddr4 | 192 | 192 | 192 | gb |
| hdd | 2T | 480G | 2x2T | sata, ssd, sata |
| centos | 8 | 8 | 8 | with gpu drivers, toolkit |
| GPU | 1xA100 | 1xA100 | 1xA100 | can hold 4, passive |
| hbm2 | 40 | 40 | 40 | gb memory |
| mig | yes | yes | yes | up to 7 vgpus |
| sdk | ? | - | - |  |
| ngc | ? | - | - |  |
|  |||||
| Switch | add! | 8+1 | 16+2 | NEED 2 OF THEM? |
| S&H | incl | tbd | tbd |  |
| Δ | +2.4 | +4.4 | +1.6 | target budget $k |

  * cpu and cpu-gpu teraflop compute capacity (FP64) 
    * for cpus on one compute node, theoretical performance: 
    * 2.96 TFLOPS for Gold cpus, 0.8 TFLOPS for Silver cpus
  * gpu teraflop compute capacity depends on compute mode
    * on one A100 gpu node, base performance:
    * 8.7 TFLOPS (FP64), 17.6 TFLOPS (FP32), 70 TFLOPS (FP16)


----


  GFLOPS = #chassis * #nodes/chassis * #sockets/node * #cores/socket * GHz/core * FLOPs/cycle

Note that the use of a GHz processor yields GFLOPS of theoretical performance. Divide GFLOPS by 1000 to get TeraFLOPS or TFLOPS.

http://en.community.dell.com/techcenter/high-performance-computing/w/wiki/2329 
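
The formula above, written out as a sketch for one compute node. The socket, core and clock values come from the tables on this page. The FLOPs/cycle values are assumptions not stated here: 32 for the Gold 6226R and 16 for the Silver 4210R (double precision AVX-512 with two and one FMA units per core respectively). With those values the results line up with the 2.96 and 0.8 TFLOPS quoted above.

```python
def theoretical_gflops(sockets, cores_per_socket, ghz, flops_per_cycle,
                       chassis=1, nodes_per_chassis=1):
    """Theoretical peak GFLOPS, term by term as in the formula above."""
    return (chassis * nodes_per_chassis * sockets * cores_per_socket
            * ghz * flops_per_cycle)


# One cpu compute node: 2x Gold 6226R, 16 cores each, 2.9 GHz.
gold = theoretical_gflops(2, 16, 2.9, flops_per_cycle=32)    # assumed 32
# One cpu-gpu compute node: 2x Silver 4210R, 10 cores each, 2.4 GHz.
silver = theoretical_gflops(2, 10, 2.4, flops_per_cycle=16)  # assumed 16

print(f"Gold:   {gold / 1000:.2f} TFLOPS")
print(f"Silver: {silver / 1000:.2f} TFLOPS")
```

VendorB's Silver 4214R node (2x12 cores) would come out higher under the same assumptions, since only the core count changes.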

==== Todos ====

  * unrack, recycle old cottontail2 (save /usr/local/src?)
  * relocate server whitetail (ww7) in the greentail rack
  * unrack, recycle hp disk array (takers? 1T SAS, 48 drives)
  * recycle cottontail disk array (takers? 2T SATA, 52 drives)


  * single 10G top of rack switches (private subnet)
  * rack new cottontail2, n91+n92 with 1U airflow spacer

