  * buy a single rack and test locally, start small (will future racks be compatible?)

==== Yale Qs ====

Tasked with getting GPU HPC going at Wesleyan and trying to gain insights into the project. If you acquired a GPU HPC ...

  * What was the most important design element of the cluster?
  * What factor(s) settled the CPU-to-GPU ratio?
  * Was single or double precision peak performance more important, or neither?
  * What was the software suite in mind (commercial, ...)?
  * How did you reach out to/educate users on the aspects of GPU computing?
  * What was the impact on the users? (recoding, recompiling)
  * Was the expected computational speed-up realized?
  * Were the PGI Accelerator compilers leveraged? If so, what were the results?
  * Do users compile with nvcc? (a minimal example is sketched after this list)
  * Does the scheduler have a resource for idle GPUs so they can be reserved?
  * How are the GPUs exposed/...?
  * Do you allow multiple serial jobs to access the same GPU? Or one parallel job to use multiple GPUs?
  * Can parallel jobs access multiple GPUs across nodes?
  * Any experiences with pmemd.cuda.MPI (part of Amber)?
  * What MPI flavor is used most in regard to GPU computing?
  * Do you leverage the CPU side of the GPU HPC? For example, if there are 16 GPUs and 64 CPU cores on a cluster, do you allow 48 standard jobs on the idle cores? (assuming the max of 16 serial GPU jobs)
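For context on the nvcc question above, a minimal CUDA program of the sort an end user would compile with nvcc; the file and kernel names are made up for illustration:

<code c>
// saxpy.cu -- minimal CUDA example; build with: nvcc -o saxpy saxpy.cu
#include <stdio.h>
#include <stdlib.h>

__global__ void saxpy(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

int main(void)
{
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *x = (float *) malloc(bytes), *y = (float *) malloc(bytes);
    for (int i = 0; i < n; i++) { x[i] = 1.0f; y[i] = 2.0f; }

    float *dx, *dy;
    cudaMalloc(&dx, bytes);
    cudaMalloc(&dy, bytes);
    cudaMemcpy(dx, x, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dy, y, bytes, cudaMemcpyHostToDevice);

    saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, dx, dy);   // y = 2*x + y on the GPU
    cudaMemcpy(y, dy, bytes, cudaMemcpyDeviceToHost);
    printf("y[0] = %f\n", y[0]);                         // expect 4.0

    cudaFree(dx); cudaFree(dy); free(x); free(y);
    return 0;
}
</code>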
+ | |||
+ | Notes 04/01/2012 ConfCall | ||
+ | |||
+ | * Applications drive the CPU-to-GPU ratio and most will be 1-to-1, certainly not larger then 1-to-3 | ||
+ | * Users did not share GPUs but could obtain more than one, always on same node | ||
+ | * Experimental setup with 36 gb/node, dual 8 core chips | ||
+ | * Nothing larger than that memory wise as CPU and GPU HPC work environments were not mixed | ||
+ | * No raw code development | ||
+ | * Speed ups was hard to tell | ||
+ | * PGI Accelerator was used because it is needed with any Fortran code (Note!) | ||
+ | * Double precision was most important in scientific applications | ||
+ | * MPI flavor was OpenMPI, and others (including MVApich) showed no advantages | ||
+ | * Book: Programming Massively Parallel Processors, Second Edition: | ||
+ | * A Hands-on Approach by David B. Kirk and Wen-mei W. Hwu (Dec 28, 2012) | ||
+ | * Has examples of how to expose GPUs across nodes | ||
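A minimal sketch of how GPUs can be exposed across nodes with MPI, assuming one rank per GPU; the file name, build line, and run line are assumptions, not the book's example. With OpenMPI, something like ''mpirun -np 8 -npernode 4 ./rank2gpu'' would place four ranks per node, each grabbing its own GPU:

<code c>
// rank2gpu.c -- each MPI rank binds to one GPU on its node
// build (assumed): mpicc rank2gpu.c -o rank2gpu -I$CUDA_HOME/include -L$CUDA_HOME/lib64 -lcudart
#include <mpi.h>
#include <stdio.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    int rank, ngpus;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    cudaGetDeviceCount(&ngpus);

    /* Simple scheme: with ranks packed per node, rank % ngpus picks a local GPU;
       a production setup would use the per-node (local) rank instead. */
    int dev = rank % ngpus;
    cudaSetDevice(dev);

    struct cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, dev);
    printf("rank %d -> GPU %d (%s)\n", rank, dev, prop.name);

    MPI_Finalize();
    return 0;
}
</code>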
==== ConfCall & Quote: AC ====
^ Topic ^ Description ^
| General | 2 CPUs (16 cores), 3 GPUs (7,500 cuda cores), 32 GB RAM/node |
| Head Node | None |
| Nodes | 1x 4U rackmountable chassis, 2x Xeon E5-2660 2.20 GHz 20 MB cache, 8 cores (16 cores/... |
  * < ...
  * This hardware can be tested at ExxactCorp, so a single tray purchase for testing is not a requirement
+ | |||
+ | * 2 chassis in 8U + 4 SL250s + each with 8 GPUs would be a massive GPU cruncher | ||
+ | * 8 CPUs, 32 GPUs = 64 cpu cores and 80,000 cuda cores (avg 1, | ||
+ | * peak performance: | ||
+ | * 1 chassis in 4U + 2 Sl250s + each with * GPUs would the " | ||
^ Topic ^ Description ^
  * To compare with “benchmark option” performance; ...
  * When quote is reduced to 1x s6500 chassis
  * To compare with “benchmark option” price-wise; 1.6% higher (50% fewer CPU cores)
  * To compare with “benchmark option” ...
  * HP on-site install
| 5x upgrade to 64 GB per node | |

  * 2% more expensive than benchmark option, else identical
  * But a new rack (advantageous for data center)
  * With lifetime technical support
  * solid state drives on compute nodes
  * 12 TB local storage

Then:
  * Replace the 36 port FDR switch with an 8 port QDR switch for savings (40 vs 56 Gbps)
    * and all server adapter cards to QDR (with one hooked up to the existing Voltaire switch)
  * Expand memory footprint
    * Go to 124 GB memory/node to beef up the CPU HPC side of things
    * 16 cpu cores/node minus 4 cpu/gpu cores/node = 12 cpu cores using 104 GB, which is about 8 GB per cpu core
  * Online testing available (K20, do this)
    * then decide on PGI compiler at purchase time
    * maybe all LAPACK libraries too
  * Make the head node a compute node (in/for the future, and beef it up too, 256 GB RAM?)
  * Leave the 6x2TB disk space (for backup)
    * 2U, 8 drives, up to 6x4=24 TB, possible?
  * Add an entry-level Infiniband/...
    * for parallel file locking
  * Spare parts
    * 8 port switch, HCAs and cables, drives ...
    * or get 5 years total warranty
  * Testing notes
    * Amber, LAMMPS, NAMD
    * CUDA v4 & v5
    * install/...
    * use gnu ... with openmpi
    * make deviceQuery (a minimal stand-in is sketched below)
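"make deviceQuery" builds the sanity-check sample that ships with the NVIDIA CUDA SDK/samples tree; the minimal stand-in below (file name made up) shows the kind of per-GPU information that check reports:

<code c>
// querygpus.cu -- minimal stand-in for the SDK deviceQuery check; build with: nvcc -o querygpus querygpus.cu
#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    int n = 0;
    if (cudaGetDeviceCount(&n) != cudaSuccess || n == 0) {
        printf("no CUDA-capable device found\n");
        return 1;
    }
    for (int i = 0; i < n; i++) {
        struct cudaDeviceProp p;
        cudaGetDeviceProperties(&p, i);
        printf("GPU %d: %s, compute capability %d.%d, %.1f GB memory, %d multiprocessors\n",
               i, p.name, p.major, p.minor,
               p.totalGlobalMem / (1024.0 * 1024.0 * 1024.0), p.multiProcessorCount);
    }
    return 0;
}
</code>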
\\
**[[cluster: