HPC Users Meeting
- Brief history
- 2006 swallowtail (Dell PE1955, Infiniband, imw, emw)
- 2010 greentail (HP gen6 blade servers, hp12)
- 2013 sharptail (Microway storage, K20s, Infiniband, mw256/mwgpu)
- 2014 mw256fd (replacement of the 2006 Dell nodes with Supermicro nodes)
- 2015 tinymem (Supermicro bare metal, expansion for serial jobs)
- 2017 mw128 (first new faculty startup funds)
- 2018 6/25 Today's meeting
- Since 2006
- Grown from 256 to roughly 1,200 physical CPU cores
- Processed 3,165,752 jobs (by 18jun2018)
- Compute capacity over 60 teraflops double precision (38 CPU side, 25 GPU side; rough sanity check below)
- Total memory footprint is near 7.5 TB
- About 500 accounts have been created (incl 22 collaborator and 100 class accounts)
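A rough sanity check on the CPU-side number, assuming roughly 1,200 cores at an average clock near 2.5 GHz and 8 to 16 double-precision FLOPs per core per cycle (the clock and FLOPs-per-cycle figures are assumptions, since the node generations vary):

$$ R_{peak} \approx N_{cores} \times f_{clock} \times \mathrm{FLOPs/cycle} \approx 1200 \times 2.5\,\mathrm{GHz} \times (8\ \mathrm{to}\ 16) \approx 24\ \mathrm{to}\ 48\ \mathrm{TFLOPS} $$

which brackets the 38 TFLOPS quoted above for the CPU side.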
- Funding / charge scheme: is it working for you?
- Last 2 years, $15K target realized each year.
- Status of our cluster development fund
- $140K come July 1st, 2018
- Time for some new hardware? Retirement of hp12 nodes?
- 2017 Benchmarks of some new hardware
- A donation led to the purchase of a commercial-grade GPU server containing four GTX 1080 Ti GPUs
- Amber 16. Nucleosome bench runs 4.5x faster than on a K20
- Gromacs 5.1.4. Colin's multidir bench runs about 2x faster than on a K20
- Lammps 11Aug17. Colloid example runs about 11x faster than on a K20
- FSL 5.0.10. FDT bedpostx tests run 16x faster on the new CPUs, and a whopping 118x faster on GPU vs CPU.
- Price of a 128 GB node in 2017 was $8,250; price of a 256 GB node in 2018 is $10,500
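On a per-gigabyte basis (straight arithmetic from the two quotes above, ignoring any CPU or chassis differences between the configurations):

$$ \frac{\$8{,}250}{128\ \mathrm{GB}} \approx \$64/\mathrm{GB} \qquad\qquad \frac{\$10{,}500}{256\ \mathrm{GB}} \approx \$41/\mathrm{GB} $$

so the 2018 quote is actually cheaper per GB of memory despite the higher sticker price.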
- IBM bought Platform Computing, the developers of LSF (OpenLava is an open source branch of LSF 4.2)
- In 2016 IBM accused OpenLava v3/v4 of copyright infringement (US DMCA takedown, no proof needed)
- Fallback option: revert to OpenLava v2.2 (definitely free of infringement; minor disruption)
- Move-forward option: adopt SLURM (developed at LLNL; major disruption)
- If we adopt SLURM, should we transition to the OpenHPC Warewulf/SLURM recipe?
- A new login node and a couple of compute nodes to start? (See the bsub-to-sbatch sketch below.)
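To get a feel for the user-facing side of that disruption, here is a minimal sketch (not a supported tool) that maps a few common OpenLava/LSF bsub flags to their SLURM sbatch equivalents. The flag pairs are standard, documented options in both schedulers; the queue name in the example is one of ours, and the script itself is purely illustrative.

<code python>
# Illustrative only: translate a handful of common bsub options into their
# sbatch equivalents, to gauge how much user submit lines would change.
BSUB_TO_SBATCH = {
    "-q": "--partition",   # queue        -> partition
    "-n": "--ntasks",      # job slots    -> tasks
    "-J": "--job-name",    # job name
    "-o": "--output",      # stdout file
    "-e": "--error",       # stderr file
}

def translate(bsub_args):
    """Return the sbatch-style argument list for a flat bsub argument list."""
    out, i = [], 0
    while i < len(bsub_args):
        flag = bsub_args[i]
        if flag in BSUB_TO_SBATCH:
            out += [BSUB_TO_SBATCH[flag], bsub_args[i + 1]]
            i += 2
        else:
            out.append(flag)   # pass anything unrecognized through unchanged
            i += 1
    return out

if __name__ == "__main__":
    # bsub -q mw256 -n 8 -J amber_run -o run.out ./run.sh
    print(" ".join(["sbatch"] + translate(
        ["-q", "mw256", "-n", "8", "-J", "amber_run", "-o", "run.out", "./run.sh"])))
    # -> sbatch --partition mw256 --ntasks 8 --job-name amber_run --output run.out ./run.sh
</code>

The real migration effort is larger than flag renames (resource strings, job scripts, environment, accounting), but the submit-line changes themselves are mostly mechanical.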
- New HPC Advisory Group Member
- Tidbits
- Bought a deep 42U rack with onboard AC cooling and two PDUs
- Pushed the Angstrom rack (bss24) out of our area, ready to recycle it (done 06/20/2018)
- Currently we have two empty 42U racks with power
- Cooling needs to be provided with any new major purchases (provost, ITS, HPC?)
- 60 TB raw storage purchased for sharptail (/home2 for users with specific needs)
- Everything is out of warranty except:
- cottontail (03/2019)
- ringtail & n78 (10/2020)
- mw128 nodes & sharptaildr (06/2020)
- All Infiniband ports are in use
Notes
- First, make a page comparing CPU vs GPU usage, which may influence whether future purchases lean CPU or GPU
- $100K quote from 3 to 5 vendors, data points mid-2018
- One node (or all) should have the latest versions of Amber, Gromacs, LAMMPS, and NAMD configured on it
- Latest Nvidia software (driver/CUDA); optimal CPU:GPU ratio configurations per application:
- Amber 1:1 (may be 1:2 in future releases) - Amber-certified GPU!
- Gromacs 10:1 (could ramp up to claiming all resources per node)
- Namd 13:1 (could ramp up to claiming all resources per node)
- Lammps 2-4:1
- 128 GB of memory with enough CPU slots to take over hp12: dual ten-core minimum is the anticipated target (also to manage heat exchange)
- 2 x 10-core Xeon CPUs with 2 x GTX 1080 Ti GPUs; 25 GB of memory required for the GPU jobs, leaving ~100 GB for CPU work (see the sizing sketch after this list)
- As many nodes as fit the budget, but no more than 15 rack-wise
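A small sizing sketch under the assumptions above: the CPU:GPU ratios and the 2 x 10-core / 2 x GTX 1080 Ti node shape come from these notes, and the resulting core counts are estimates, not benchmarked configurations.

<code python>
# Rough core-allocation estimate for the anticipated GPU node, using the
# per-application CPU:GPU ratios listed in the notes above.
NODE_CORES = 2 * 10      # dual ten-core Xeons
NODE_GPUS  = 2           # two GTX 1080 Ti cards

CORES_PER_GPU = {
    "amber":   1,        # 1:1 (maybe 1:2 in future releases)
    "gromacs": 10,       # 10:1
    "namd":    13,       # 13:1
    "lammps":  4,        # 2-4:1, taking the high end
}

for app, ratio in CORES_PER_GPU.items():
    used = min(NODE_CORES, ratio * NODE_GPUS)
    print(f"{app:8s} two GPU jobs pin {used:2d} cores, "
          f"{NODE_CORES - used:2d} cores left for CPU-only work")
</code>

By this estimate Amber and LAMMPS leave most of the cores free for CPU-only jobs, while Gromacs and NAMD at their stated ratios would claim the whole node when both GPUs are busy.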
