  * [[cluster:182|P100 vs RTX 6000 & T4]] page
  
^  Option  ^  #1  ^  #2  ^  #  ^  #  ^  #  ^  #  ^  #  ^  #  ^    |
|  Nodes  |    |    |    |    |    |    |    |    |  U  |
|  CPUs  |    |    |    |    |    |    |    |    |    |
We are embarking on expanding our GPU compute capacity. To that end we tested some of the new GPU models. During a recent users group meeting the desire was also expressed to keep open the option of entering the deep learning (DL) field in the near future. We do not anticipate running Gaussian on these GPUs, so we are flexible regarding mixed precision GPU models. The list of software, with rough usage estimates and precision modes, is: amber (single, 25%), lammps (mixed, 20%), gromacs (mixed, 50%) and python bio-sequencing models (mixed or double, < 5%).
  
We anticipate the best solution to be 2-4 GPUs per node rather than an ultra dense setup. The job usage pattern is mostly one job per GPU with exclusive access to the allocated GPU, although that pattern may change based on GPU memory footprint. We were zooming in on the RTX 6000 or TITAN GPU models but are open to suggestions. The T4 looks intriguing but the passive heat sink bothers us (does that work under near constant 100% utilization rates?).
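
To enforce the one-job-per-GPU pattern at the device level we can set the compute mode to exclusive process and keep an eye on utilization and temperature (the latter being the T4 concern above). A minimal sketch; device indices and the polling interval are just examples.

<code bash>
# put every GPU in the node into exclusive-process compute mode (one context per GPU)
nvidia-smi -c EXCLUSIVE_PROCESS
# or limit it to a single device, e.g. GPU 0
nvidia-smi -i 0 -c EXCLUSIVE_PROCESS

# poll utilization and temperature every 30 seconds, e.g. to watch a passively
# cooled T4 under near constant 100% load
nvidia-smi --query-gpu=index,name,utilization.gpu,temperature.gpu --format=csv -l 30
</code>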
  
We do not have proven imaging functionality with CentOS7, Warewulf and UEFI booting, so all nodes should be imaged. Software to install is the latest versions of amber (Wes to provide proof of purchase), lammps (with packages yes-rigid, yes-gpu, yes-colloid, yes-class2, yes-kspace, yes-misc, yes-molecule), and gromacs (with -DGMX_BUILD_OWN_FFTW=ON). All MPI enabled with OpenMPI. Latest Nvidia CUDA drivers. Some details, if you need them, are at this web page: https://dokuwiki.wesleyan.edu/doku.php?id=cluster:172
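
A rough build sketch for the lammps and gromacs pieces listed above, assuming the traditional lammps make-based package system and an out-of-tree cmake build for gromacs; the CUDA architecture, precision, install prefix and job count are illustrative only.

<code bash>
# lammps: enable the requested packages, build the GPU library, then an OpenMPI binary
cd lammps/src
make yes-rigid yes-gpu yes-colloid yes-class2 yes-kspace yes-misc yes-molecule
make lib-gpu args="-b -a sm_75 -p mixed"   # arch/precision here are examples
make mpi                                   # produces lmp_mpi

# gromacs: cmake build with its own FFTW, CUDA and MPI enabled
cd gromacs && mkdir -p build && cd build
cmake .. -DGMX_BUILD_OWN_FFTW=ON -DGMX_GPU=on -DGMX_MPI=on \
      -DCMAKE_INSTALL_PREFIX=/share/apps/gromacs       # prefix is an assumption
make -j8 && make install

# amber: configure/build details vary by version and require the license Wes provides,
# roughly: ./configure -mpi -cuda gnu && make install
</code>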
  
DL software list: PyTorch, Caffe, TensorFlow. \\
Wes to install and configure scheduler client and queue. \\
Wes to provide two gigabit ethernet switches. \\
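
A hedged sketch of one way to stand up the DL stack in its own environment; the environment name is arbitrary, the package names reflect common pip/conda practice at the time, and Caffe in particular is often built from source against the system CUDA/cuDNN instead.

<code bash>
# isolated python environment for the DL frameworks ("dl" is just an example name)
conda create -y -n dl python=3.7
conda activate dl

pip install torch             # PyTorch wheels ship their own CUDA runtime
pip install tensorflow-gpu    # GPU-enabled TensorFlow package name of that era

# Caffe: typically compiled from source (or installed via a conda package if one
# matches the installed CUDA/cuDNN versions); see the upstream install docs
</code>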
  
Compute nodes should have 2 ethernet ports; single power supply is OK but redundant is preferred; dual CPUs with an optimized memory configuration around 96-128 GB. Starting IP addresses: nic1 192.168.102.89, nic2 10.10.102.89, ipmi 192.168.103.89, netmask 255.255.0.0 for all.
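
A minimal sketch of what that addressing looks like on the first node, assuming CentOS7 ifcfg-style configuration; the interface names eth0/eth1 are assumptions, and each subsequent node would increment the last octet (.90, .91, ...).

<code bash>
# nic1 on the 192.168 network
cat > /etc/sysconfig/network-scripts/ifcfg-eth0 <<EOF
DEVICE=eth0
BOOTPROTO=static
IPADDR=192.168.102.89
NETMASK=255.255.0.0
ONBOOT=yes
EOF

# nic2 on the 10.10 network
cat > /etc/sysconfig/network-scripts/ifcfg-eth1 <<EOF
DEVICE=eth1
BOOTPROTO=static
IPADDR=10.10.102.89
NETMASK=255.255.0.0
ONBOOT=yes
EOF

# ipmi/BMC address, usually set in the BIOS or with ipmitool
ipmitool lan set 1 ipsrc static
ipmitool lan set 1 ipaddr 192.168.103.89
ipmitool lan set 1 netmask 255.255.0.0
</code>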
  
Wes will provide a 208V powered rack with 7K BTU cooling AC. Standard 42U rack (rails at 30", up to 37" usable). We also have plenty of shelves to simply hold the servers if needed. The rack contains two PDUs (24A) supplying 2x30 C13 outlets. [[https://www.rackmountsolutions.net/rackmount-solutions-cruxial-cool-42u-7kbtu-air-conditioned-server-cabinet/|External Link]]
  