cluster:200

This is an old revision of the document!




Cottontail2

The next step in the evolution of our HPCC platform involves a new primary login node (from cottontail to cottontail2, to be purchased in early 2021) with a migration to the OpenHPC platform and the Slurm scheduler. Proposals are for one head node plus 2 compute nodes for a test and learn setup. The compute nodes are vastly different so that Slurm resource discovery and allocation can be tested, along with the scheduler's Fairshare policy. It is also a chance to test out the A100 gpu.

We are switching to an RJ45 10GBase-T network in this migration and adopting CentOS 8 (possibly the Stream version as events unfold … CentOS Stream or Rocky Linux).

Whoooo! Check this out https://almalinux.org/

  • rhel 1:1 feature compatible, thus centos
  • simply switch repos
  • out Q1/2021

We are also sticking to a single private network for scheduler and home directory traffic, at 10G, for each node in the new environment. The second 10G interface (onboot=no) could be brought up for future use in some scenario, maybe with a second switch for network redundancy. Keep private network 192.168.x.x for openlava/warewulf6 traffic and private network 10.10.x.x for slurm/warewulf8 traffic; this avoids conflicts.
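A minimal sketch of checking that the two private networks cannot collide, using Python's standard `ipaddress` module. The /16 prefix lengths are an assumption read from the "x.x" notation above, and the node address is purely hypothetical.

```python
import ipaddress

# Assumption: "192.168.x.x" and "10.10.x.x" are read as /16 networks.
legacy_net = ipaddress.ip_network("192.168.0.0/16")  # openlava/warewulf6 traffic
slurm_net = ipaddress.ip_network("10.10.0.0/16")     # slurm/warewulf8 traffic

def networks_conflict(a, b):
    """Return True if the two networks share any address."""
    return a.overlaps(b)

conflict = networks_conflict(legacy_net, slurm_net)
print("address ranges overlap:", conflict)

# Hypothetical address for a new node, only to illustrate a membership test.
example_node = ipaddress.ip_address("10.10.102.91")
print("example node on slurm network:", example_node in slurm_net)
print("example node on legacy network:", example_node in legacy_net)
```

Because the ranges are disjoint, a node can carry an address on each network during the migration without routing ambiguity.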

The storage network is on 1G; we wonder if we could upgrade this later as the 10G network grows (options were 6x1G or 4x10G). Or we move to 10G by adding a replication partner in 3 years and switching roles between the TrueNAS/ZFS units. (LACP the 6x1G into 3x2G.)

Lots of old compute nodes will remain on the 1G network. Maybe the newest hardware (n79-n90 nodes with RTX2080S gpus) could be upgraded to 10G using PCI cards?

  • Mental note: should we participate in InCommon Federation Identity Management?
    • would likely be very messy and break current account creation/propagation
  • Wait for ERN's Architecture and Federation Working Group support services (mid 2021?)
    • InCommon, Internet2, OpenID, whatever the ERN working group comes up with…
  • Ugh: the entire cpu/gpu usage accounting needs to be recreated
    • both schedulers' results need merging together for some time?

|                      | VendorA | VendorB     | VendorC | Notes                     |
| Head Node            |         |             |         |                           |
| Rack                 | 1U      | 1U          | 1U      |                           |
| Power                | 1+1     | 1+1         | 1+1     | 208V                      |
| Nic                  | 4x10G   | 2x1G,2x10G  | 4x10G   | B:4x10G on PCI?           |
| Rails                | 26-33   | 25          | ?       |                           |
| CPU                  | 2x5222  | 2x6226R     | 2x5222  | Gold, Gold, Gold          |
| cores                | 2x4     | 2x16        | 2x4     | Physical                  |
| ghz                  | 3.8     | 2.9         | 3.8     |                           |
| ddr4                 | 96      | 192         | 96      | gb                        |
| hdd                  | 2x960G  | 2x480G      | 2x480   | ssd, ssd, ssd (raid1)     |
| centos               | 8       | 8           | no      |                           |
| OpenHPC              | no      | yes         | no      | y="best effort"           |
| CPU Compute Node     |         |             |         |                           |
| Rack                 | 1U      | 2U          | 1U      |                           |
| Power                | 1+1     | 1           | 1+1     | 208V                      |
| Nic                  | 2x10G   | 2x1G,2x10G  | 2x10G   | B:4x10G on PCI?           |
| Rails                | 26-33   | ?           | ?       |                           |
| CPU                  | 2x6226R | 2x6226R     | 2x6226R | Gold, Gold, Gold          |
| cores                | 2x16    | 2x16        | 2x16    | Physical                  |
| ghz                  | 2.9     | 2.9         | 2.9     |                           |
| ddr4                 | 192     | 192         | 192     | gb                        |
| hdd                  | 2T      | 480G        | 2x2T    | sata, ssd, sata           |
| centos               | 8       | 8           | no      |                           |
| CPU-GPU Compute Node |         |             |         |                           |
| Rack                 | 4U      | 2U          | 1U      |                           |
| Power                | 1+1     | 1           | 1+1     | 208V                      |
| Nic                  | 2x10G   | 2x1G,2x10G  | 2x10G   | B:4x10G on PCI?           |
| Rails                | 26-36   | ?           | ?       |                           |
| CPU                  | 2x4210R | 2x4214R     | 2x4210R | Silver, Silver, Silver    |
| cores                | 2x10    | 2x12        | 2x10    | Physical                  |
| ghz                  | 2.4     | 2.4         | 2.4     |                           |
| ddr4                 | 192     | 192         | 192     | gb                        |
| hdd                  | 2T      | 480G        | 2x2T    | sata, ssd, sata           |
| centos               | 8       | 8           | 8       | with gpu drivers, toolkit |
| GPU                  | 1xA100  | 1xA100      | 1xA100  | can hold 4, passive       |
| hbm2                 | 40      | 40          | 40      | gb memory                 |
| mig                  | yes     | yes         | yes     | up to 7 vgpus             |
| sdk                  | ?       | -           | -       |                           |
| ngc                  | ?       | -           | -       |                           |
| Switch               | add!    | 8+1         | 16+2    | NEED 2 OF THEM?           |
| S&H                  | incl    | tbd         | tbd     |                           |
| Δ                    | +2.4    | +4.4        | +1.6    | target budget $k          |

  • cpu and cpu-gpu teraflop compute capacity (FP64)
    • for cpus on one compute node, theoretical performance:
    • 2.96 TFLOPS for Gold cpus, 0.8 TFLOPS for Silver cpus
  • gpu teraflop compute capacity depends on compute mode
    • on one A100 gpu node, base performance:
    • 8.7 TFLOPS (FP64), 17.6 TFLOPS (FP32), 70 TFLOPS (FP16)

GFLOPS = #chassis * #nodes/chassis * #sockets/node * #cores/socket * GHz/core * FLOPs/cycle

Note that the use of a GHz processor yields GFLOPS of theoretical performance. Divide GFLOPS by 1000 to get TeraFLOPS or TFLOPS.

http://en.community.dell.com/techcenter/high-performance-computing/w/wiki/2329
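
The formula above can be sketched in Python to check the per-node cpu numbers. The FLOPs/cycle values are assumptions, not taken from the vendor quotes: 32 double-precision FLOPs per cycle per core for the Gold 6226R and 16 for the Silver 4210R. With those values the results land on the quoted 2.96 and 0.8 TFLOPS (truncated and rounded respectively). The function name `peak_gflops` is just an illustrative choice.

```python
def peak_gflops(chassis, nodes_per_chassis, sockets_per_node,
                cores_per_socket, ghz_per_core, flops_per_cycle):
    """Theoretical peak in GFLOPS, following the formula term by term."""
    return (chassis * nodes_per_chassis * sockets_per_node
            * cores_per_socket * ghz_per_core * flops_per_cycle)

# Assumed FP64 FLOPs per cycle per core (not from the source table).
GOLD_FLOPS_PER_CYCLE = 32
SILVER_FLOPS_PER_CYCLE = 16

# One CPU compute node: 2 x Gold 6226R, 16 cores each, 2.9 GHz.
gold_tflops = peak_gflops(1, 1, 2, 16, 2.9, GOLD_FLOPS_PER_CYCLE) / 1000

# One CPU-GPU compute node (cpus only): 2 x Silver 4210R, 10 cores each, 2.4 GHz.
silver_tflops = peak_gflops(1, 1, 2, 10, 2.4, SILVER_FLOPS_PER_CYCLE) / 1000

print(f"Gold node:   {gold_tflops:.4f} TFLOPS")
print(f"Silver node: {silver_tflops:.4f} TFLOPS")
```

These are theoretical ceilings at the listed clock speed; sustained performance on real workloads will be lower.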

Todos

  • unrack, recycle old cottontail2 (save /usr/local/src?)
  • relocate server whitetail (ww7) into the greentail rack
  • unrack, recycle hp disk array (takers? 1T SAS, 48 drives)
  • recycle cottontail disk array (takers? 2T SATA, 52 drives)
  • single 10G top of rack switches (private subnet)
  • rack new cottontail2, n91+n92 with 1U airflow spacer



cluster/200.1610655617.txt.gz · Last modified: 2021/01/14 15:20 by hmeij07