**[[cluster:

==== Turing/ ====

  * https://

==== AWS deploys T4 ====

  * https://

Look at this: the smallest Elastic Cloud Compute instances are **g4dn.xlarge**, yielding access to 4 vCPUs, 16 GiB memory and 1x T4 GPU. The largest is **g4dn.16xlarge**, yielding access to 64 vCPUs, 256 GiB memory and 1x T4 GPU. The smallest is priced at $0.526/hr, and running that card 24/7 for a year costs $4,607.76 ... meaning option #7 below, with 26 GPUs, would cost you a whopping $119,802. Annually! That's the low tide water mark.

The high tide water mark? The largest instance is priced at $4.352/hr and would cost you near one million dollars per year if you matched option #7.

Rival cloud vendor Google also offers Nvidia T4 GPUs in its cloud; Google announced global availability back in April. Google Cloud's T4 GPU availability includes three regions each in the U.S. and Asia and one each in South America and Europe. That page mentions a price of "as low as $0.29 per hour per GPU", which translates to $66K per year matching option #7 below. Still. Insane.

  * https://
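A quick back-of-the-envelope check of those annual figures (a minimal sketch; the hourly rates are the on-demand prices quoted above, and 26 instances matches the GPU count of option #7):

<code python>
# Annual cost of running N cloud GPU instances 24/7 at a given hourly rate.
HOURS_PER_YEAR = 24 * 365  # 8,760 hours

def annual_cost(rate_per_hour, instances=26):
    return rate_per_hour * HOURS_PER_YEAR * instances

print(f"AWS g4dn.xlarge   ${annual_cost(0.526):>10,.0f}")  # ~ $119,802
print(f"AWS g4dn.16xlarge ${annual_cost(4.352):>10,.0f}")  # ~ $991,212 (near one million)
print(f"Google Cloud T4   ${annual_cost(0.29):>10,.0f}")   # ~  $66,050
</code>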
==== 2019 GPU Expansion ====

More focus...

  * Vendor A:
    * Option 1: 48 gpus, 12 nodes, 24U, each: two 4214 12-core cpus (silver), 96 gb ram, 1tb SSD, four NVIDIA RTX 2080 SUPER 8GB GPUs, centos7 yes, cuda yes, 3 yr, 2x gbe nics, 17.2w 31.5d 3.46h

With the Deep Learning Ready docker containers... [[cluster:

The SUPER model quote above is what we selected\\
 --- //

Focus on RTX2080 model...

  * Vendor A:
    * Option 1: 48 gpus, 12 nodes, 24U, each: two 4116 12-core cpus (silver), 96 gb ram, 1tb SSD, four rtx2080 gpus (8gb),
    * Option 2: 40 gpus, 10 nodes, 20U, each: two 4116 12-core cpus (silver), 96 gb ram, 1tb SSD, four rtx2080ti gpus (11gb),
    * A1+A2 installed, configured and tested with the NGC Docker containers Deep Learning Software Stack: NVIDIA DIGITS, TensorFlow, Caffe, NVIDIA CUDA, PyTorch, RapidsAI, Portainer ... the NGC Catalog can be found at https:// (see the GPU sanity-check sketch after this list)

  * Vendor B:
    * Option 1: 36 gpus, 9 nodes, 18U, each: two 4214 12-core cpus (silver), 96 gb ram, 2x960gb SATA, four rtx2080tifsta gpus (11gb),

  * Vendor C:
    * Option 1: 40 gpus, 10 nodes, 40U, each: two 4214 12-core cpus (silver), 96 gb ram, 240 gb SSD, four rtx2080ti gpus (11gb),
    * Option 2: 48 gpus, 12 nodes, 48U, each: two 4214 12-core cpus (silver), 96 gb ram, 240 gb SSD, four rtx2080s gpus (8gb),

  * Vendor D:
    * Option 1: 48 gpus, 12 nodes, 12U, each: two 4214 12-core cpus (silver), 64 gb ram, 2x480gb SATA, four rtx2080s gpus (8gb),
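As a quick sanity check once a node is imaged, something like the following confirms the framework sees all four GPUs per node (a minimal sketch; it assumes the PyTorch from the NGC software stack above is installed):

<code python>
import torch  # from the NGC Deep Learning Software Stack (assumption)

# Report every CUDA device visible to the framework on this node.
print(f"CUDA available: {torch.cuda.is_available()}")
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"GPU {i}: {props.name}, {props.total_memory / 2**30:.0f} GiB")
</code>

Expect one line per card, e.g. four RTX 2080s at roughly 8 GiB each.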
Ok, we try this year. Here are some informational pages.
  * [[cluster:
  * [[cluster:
  * [[cluster:
  * [[cluster:
  * All GPU cards are able to do single and double precision (fp64/
  * Tensor cores perform mixed-precision matrix math (half-precision multiply with single-precision accumulate)
  * GPU card performance on double precision depends on the quantity of tensor cores
  * CPU model/type determines dpfp/cycle; silver 16, gold 32 (see the sketch below).

Criteria for selection (points of discussion raised at last meeting 08/
  - Continue with the current work load, just more of it (RTX2080ti/
  - Do the above, and also enable a beginner-level intro into Deep Learning (T4)
  - Do the above, but invest for future expansion into complex Deep Learning (RTX6000)

//**Pick your option and put it in the shopping cart**// \\
The table is best read from the bottom up to assess differences.
+ | |||
+ | ^ Options | ||
+ | ^ ^ #1 ^ #2 ^ #3 ^ #4 ^ #5 ^ #6 ^ #7 ^ #8 ^ #9 ^ #10 ^ ^ | ||
+ | ^ ^ rtx2080ti | ||
+ | | Nodes | 6 | 4 | 9 | 7 | 5 | 17 | 13 | 8 | 8 | 6 | total| | ||
+ | | Cpus | 12 | 8 | 18 | 14 | 10 | 34 | 26 | 16 | 16 | 12 | total| | ||
+ | | Cores | 96 | 64 | 180 | 140 | 100 | 272 | 208 | 192 | 128 | 72 | physical| | ||
+ | | Tflops | ||
+ | | Gpus | 48 | 16 | 36 | 28 | 20 | 34 | 26 | 16 | 28 | 60 | total| | ||
+ | | Cores | 209 | 74 | 157 | 72 | 92 | 75 | 67 | 74 | 72 | 138 | cuda K| | ||
+ | | Cores | 26 | 9 | 20 | 8.9 | 11.5 | 10 | 8 | 9 | 9 | 17 | tensor K| | ||
+ | | Tflops | ||
+ | | Tflops | ||
+ | | $/ | ||
+ | ^ Per Node ^^^^^^^^^^^^ | ||
+ | | Chassis | ||
+ | | CPU | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | total| | ||
+ | | | 4208 | 4208 | 5115 | 5115 | 5115 | 4208 | 4208 | 4214 | 4208 | 4208 | model| | ||
+ | | | silver | ||
+ | | | 2x8 | 2x8 | 2x10 | 2x10 | 2x10 | 2x8 | 2x8 | 2x12 | 2x8 | 2x8 | physical| | ||
+ | | | 2.1 | 2.1 | 2.4 | 2.4 | 2.4 | 2.1 | 2.1 | 2.2 | 2.1 | 2.1 | Ghz| | ||
+ | | | 85 | 85 | 85 | 85 | 85 | 85 | 85 | 85 | 85 | 85 | Watts| | ||
+ | | DDR4 | 192 | 192 | 192 | 192 | 192 | 192 | 192 | 192 | 192 | 192 | GB mem| | ||
+ | | | 2933 | 2933 | 2266 | 2666 | 2666 | 2666 | 2666 | 2933 | 2933 | 2666 | Mhz| | ||
+ | | Drives | ||
+ | | | 2.5 | 2.5 | 2.5 | 2.5 | 2.5 | 2.5 | 2.5 | 2.5 | 2.5 | 2.5 | SSD/HDD| | ||
+ | | GPU | 8 | 4 | 4 | 4 | 4 | 2 | 2 | 2 | 4 | 10 | total| | ||
+ | | | RTX | RTX | RTX | T | RTX | RTX | T | RTX | T | RTX | arch| | ||
+ | | | 2080ti | ||
+ | | | 11 | 24 | 11 | 16 | 24 | 8 | 16 | 24 | 16 | 8 | GB mem| | ||
+ | | | 250 | 295 | 250 | 70 | 295 | 160 | 70 | 295 | 70 | 160 | Watts| | ||
+ | | Power | 2200 | 1600 | 1600 | 1600 | 1600 | 1600 | 1600 | 2200 | 1600 | 2000 | Watts| | ||
+ | | | 1+1 | 1+1 | 1+1 | 1+1 | 1+1 | 1+1 | 1+1 | 1+1 | 1+1 | 2+2 | redundant| | ||
+ | | CentOS7 | ||
+ | | Nics | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | gigabit| | ||
+ | | Warranty | ||
+ | | | -3 | -6 | -1 | -1 | -5.5 | 0 | +1.6 | 0 | +1.5 | -1 | Δ | | ||
+ | |||
+ | * #1/#2 All GPU warranty requests will be filled by GPU maker. | ||
+ | * #7 up to 4 GPUs per node. Filling rack leaving 1U open between nodes, count=15 | ||
+ | * #8 fills intended rack with AC in rack. GPU Tower/4U rack mount. | ||
+ | * #8 includes NVLink connector (bridge kit). Up to 4 GPUs per node. | ||
+ | * Tariffs may affect all quotes when executed. | ||
+ | * S&H included (or estimated) | ||
+ | * More than 4-6 nodes would be lots of work if Warewulf/ | ||
+ | |||
+ | On the question of active versus passive cooling: | ||
+ | |||
+ | |||
**Exxactcorp**: For the GPU discussion, 2 to 4 GPUs per node is fine. The T4 GPU is 100% fine, and the passive heatsink is better, not worse. The system needs to be one that supports passive Tesla cards, and the chassis fans would simply ramp to cool the card properly, as in any passive Tesla situation.

**Microway**: For this mix of workloads two to four GPUs per node is a good balance. Passive GPUs are *better* for HPC usage. All Tesla GPUs for the last 5? years have been passive. I'd be happy to help allay any concerns you may have there. The short version is that the GPU and the server platform communicate as to the GPU's temperature. The server adjusts fan speeds appropriately and is able to move far more air than a built-in fan would ever be able to.

**ConRes**: In regards to the question around active vs. passive cooling on GPUs, the T4, V100, and other passively cooled GPUs are intended for 100% utilization and actually can offer better cooling and higher density in a system than active GPU models.

==== Summary ====

We are embarking on expanding our GPU compute capacity. To that end we tested some of the new GPU models. During a recent users group meeting the desire was also expressed to enable our option to enter the deep learning (DL) field in the near future. We do not anticipate running Gaussian on these GPUs, so we are flexible on the mixed precision mode models. The list of software, with rough usage estimates and precision modes, is: amber (single, 25%), lammps (mixed, 20%), gromacs (mixed, 50%) and python bio-sequencing models (mixed or double, < 5%).

We anticipate the best solution to be 2-4 GPUs per node, not an ultra dense setup.

We do not have proven imaging functionality with CentOS7, Warewulf and UEFI booting, so all nodes should be imaged. Software to install is the latest versions of amber (Wes to provide proof of purchase), lammps (with packages yes-rigid, yes-gpu, yes-colloid,

DL software list: Pytorch, Caffe, Tensorflow. \\
Wes to install and configure scheduler client and queue.\\
Wes to provide two gigabit ethernet switches.\\

Compute nodes should have 2 ethernet ports; single power is ok but redundant is preferred; dual CPUs with an optimized memory configuration around 96-128 GB. Start IP address ranges: nic1 192.168.102.89,

Wes will provide a 208V powered rack with 7K BTU cooling AC. Standard U42 rack (rails at 30", up to 37" usable). We also have plenty of shelves to simply hold the servers if needed. The rack contains two PDUs (24A) supplying 2x30 C13 outlets. [[https://
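A rough power-budget sketch for that rack (my assumptions: the usual 80% continuous-load derating on the 24A circuits, and the PSU nameplate ratings from the options table as a worst-case per-node draw):

<code python>
# Can the rack's 208V/24A PDUs carry the chosen nodes?
PDU_VOLTS, PDU_AMPS, DERATE = 208, 24, 0.8  # 80% continuous-load rule (assumption)

pdu_budget = PDU_VOLTS * PDU_AMPS * DERATE  # ~3,994 W usable per PDU

for node_watts in (1600, 2000, 2200):  # PSU ratings from the options table
    print(f"{node_watts} W nodes: {int(pdu_budget // node_watts)} per PDU (worst case)")
</code>

Actual draw will sit well under the PSU nameplate, so real density is higher; this is just the conservative bound.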