“GPU computing is the use of a GPU (graphics processing unit) together with a CPU to accelerate general-purpose scientific and engineering applications.”
source: http://www.nvidia.com/object/what-is-gpu-computing.html
By pairing a CPU with a GPU (via a dedicated add-on card), a job running on a CPU core can offload intensive calculations to the GPU, which can quickly perform highly repetitive calculations in parallel fashion.
Here is a picture of a dual CPU rack server and a dual GPU rack server connected:
http://www.advancedclustering.com/images/stories/gpu/gpu_cluster_header-600px.png
The leader in this area is Nvidia, which produces Tesla GPUs (S2050/S2070 and K10/K20). Nvidia has also developed the CUDA parallel programming model, making the GPU programmable with C/C++ and Fortran (and others, see below). This CUDA stack (software and drivers) needs to be installed and may alter the kernel (look this up?).
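Since the CUDA stack spans the kernel (driver module) and userland (toolkit), a quick way to see what is already installed on a node is a sketch like the one below; the tool names (nvidia-smi, nvcc) are the standard Nvidia ones, but treat exact locations as assumptions:

```shell
#!/bin/sh
# check_cuda_stack.sh - report which pieces of the CUDA stack are present.
# Assumes the standard Nvidia tool names (nvidia-smi, nvcc).

check() {
    # $1 = label, $2 = command to look for
    if command -v "$2" >/dev/null 2>&1; then
        echo "$1: found ($(command -v "$2"))"
    else
        echo "$1: missing"
    fi
}

# Kernel side: the nvidia module must be loaded for the driver to work.
if lsmod 2>/dev/null | grep -q '^nvidia'; then
    echo "kernel module: nvidia loaded"
else
    echo "kernel module: nvidia not loaded"
fi

check "driver utility (nvidia-smi)" nvidia-smi
check "toolkit compiler (nvcc)" nvcc
```

Running this on a compute node before and after the CUDA install would confirm both the driver (kernel side) and the toolkit (compiler side) came up.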
So once you stack multiple CPU/GPU units together, via gigabit or 10-gigabit ethernet or infiniband, you have a compute cluster.
Nvidia claims GPU clusters perform at 1/10th the cost and 1/20th the power of CPU-only systems based on the latest quad-core CPUs.
The K20 is Nvidia's top-of-the-line GPU.
ExxactCorp bundles K20 GPUs with CPUs into a “simcluster” (think oversized coffee table on wheels).
Intel's new many-core platform is the Xeon Phi. Intel is not going the GPU route.
“Intel is providing software tools so applications can be written or recompiled for the Phi chips. Curley said it is easy to recompile existing x86 code so that high-performance applications can take full advantage of the multicore chips.”
Source:http://www.computerworld.com/s/article/9233498/Intel_ships_60_core_Xeon_Phi_processor?taxonomyId=162&pageNumber=2
So the problem with GPU computing is that the code needs to change (just like MPI parallel programs). And there is no simple “GPU” compiler; there is a whole new development environment and set of tools (see next section). The job running on the CPU core needs to load data into the GPU cores, act on it, retrieve the results and march on, or redo the operation another bazillion times. That takes code. Matlab, in a wave-function rewrite example, shows that 85% of the original code remains untouched; still, that's a 15% redo rate.
Digging around, there are third parties that do this porting (it is massively complicated and driver/card hardware dependent). Some commercial software is GPU enabled and some open source software supports it. (There are entire lists dedicated just to compiling for GPUs.) Seems like a serious drawback.
Nvidia maintains a “Catalog of available software” which includes Amber, Lammps, Matlab, Mathematica, and Gaussian (in dev); we'll take a closer look at those next. (Might have to include a software cost component in GPU cluster cost. Can we even buy ready-to-go software? Check.)
http://www.nvidia.com/docs/IO/123576/nv-applications-catalog-lowres.pdf
Not sure if Amber/GPU is ready to go or you must do it yourself. Amber/MPI compilation is a nightmare; this is even worse.
http://ambermd.org/gpus/ pay particular attention to the scheduler discussion (GPU vs CPU “job slots”)
Ah, here we go, custom-order an Amber GPU machine:
http://ambermd.org/gpus/recommended_hardware.htm
and
http://exxactcorp.com/. If we go GPU we should get a quote. (We could run other stuff on top of this too, I suppose; check?)
Get DLBgroup to start testing.
(Check this out, it also lists some prices.) It also provides a test-drive site to check on the speedups expected with your code:
http://exxactcorp.com/testdrive/AMBER/
Your mileage may vary: http://ambermd.org/gpus/benchmarks.htm … this is beyond me; it looks like double precision is needed.
You can compile your own with the -cuda configure option (seems complicated).
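For reference, a build along those lines would look something like this sketch; the install paths are assumptions, and the exact flags should be checked against the Amber GPU page before trusting any of it:

```shell
# Hypothetical Amber 12 GPU build sketch; paths are assumptions and the
# flags should be verified against http://ambermd.org/gpus/
export AMBERHOME=/opt/amber12       # assumed Amber install location
export CUDA_HOME=/usr/local/cuda    # assumed CUDA toolkit location
cd "$AMBERHOME"
./configure -cuda gnu               # serial GPU binary (pmemd.cuda)
make install
./configure -cuda -mpi gnu          # parallel GPU binary (pmemd.cuda.MPI)
make install
```

Note the GPU code lives in pmemd, so this produces GPU-enabled binaries alongside, not instead of, the regular CPU ones.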
Oh, another gem, Lammps GPU cluster ready to go (+testing): http://exxactcorp.com/testdrive/LAMMPS/
Get StarrGroup to start testing.
SideBar: Looking at the Exxact site, they build clusters for: Amber, Barracuda (for weirlab?), Lammps, Namd, and more: http://exxactcorp.com/testdrive/
I need to get a K20 cluster quote! With pre-loaded software if we do this.
http://exxactcorp.com/index.php/solution/solu_list/12
This would be a purchase. The low-end rates seem barely worth it. I believe the “GPU enabled” Mathematica is a different software install (check).
So here is an example of commercial software that supports Nvidia GPUs:
http://www.mathworks.com/discovery/matlab-gpu.html
built into the Parallel Computing Toolbox (which we already have; I do not believe this is a different software install). For example, instead of operating on an ordinary array you push it to the GPU with gpuArray() and retrieve results with gather(); most of the surrounding code stays the same.
Now we're talking speedup; that seems worth an effort. But we have low VMD usage.
It appears “SAS On Demand”, that is, remote access to SAS on a virtual classroom machine for faculty and staff, is powered by an IBM BladeCenter leveraging GPU technology.
http://decisionstats.com/2010/09/16/sasbladesservers-gpu-benchmarks/
Development Tools
https://developer.nvidia.com/cuda-tools-ecosystem
GPU cluster management (includes Platform LSF, so very similar to Lava, which we use)
https://developer.nvidia.com/cluster-management
GPU Programming For The Rest Of Us
http://www.clustermonkey.net/Applications/gpu-programming-for-the-rest-of-us.html
Excellent article on the GPU programming problem
“The Portland Group® (PGI), a wholly-owned subsidiary of STMicroelectronics and the leading independent supplier of compilers and tools for high-performance computing, announced at SC12 today that the PGI 2013 release of its PGI Accelerator™ compilers due out in early December will add support for the new family of NVIDIA® Tesla® K20 GPU accelerators and the CUDA® 5 parallel computing platform and programming model.”
” … and … today announced plans to extend its PGI Accelerator™ compiler technology with OpenACC to Intel® Xeon Phi™ coprocessors, Intel's family of products based on the Intel Many Integrated Core (MIC) architecture.”
Source: http://www.pgroup.com/about/news.htm#56
So that solves some big items (it compiles for both GPU and Phi coprocessor):
What the compiler does is insert pragmas (compiler directives) that are executed if a GPU (or Phi coprocessor) is detected; otherwise the code just executes as normal. It looks like:
  !$acc region
  !$acc do parallel
  do j=1,m
     ...
Here is a very detailed article with excerpts below
http://www.theregister.co.uk/2012/11/12/intel_xeon_phi_coprocessor_launch/
“You add parallel directives to the code, and you compile the code to run on both x86 chips in standalone mode and on the x86-Xeon Phi combination. You get one set of compiled code, and if the Xeon Phi chips are present, the work is offloaded from the server CPUs to the x86 coprocessors, and they do the acceleration. If not, the CPUs in the cluster or the workstation do the math.”
That explains it in detail. So this is the same approach as the Portland GPU compiler: one compiled binary. But potentially a lower performance boost than the GPU approach.
While anticipating a hardware acquisition to delve into the world of GPU computing, I'm trying to abstract out the perceived Scheduler problem. For the sake of this discussion let's assume one 4U rack with 2 CPUs and 4 GPUs. Let's also assume the CPUs are Intel E5-2660, 8 cores per CPU at 2.2 Ghz (total of 16 cpu cores; “Romley” series, so each CPU can see all GPUs, which seems to be the current trend). Assume the GPUs are Nvidia Tesla K20 (roughly 2,500 cuda cores per GPU). We have 4 GB DDR3 memory per GPU, total of 16 GB, and need at least that much for the CPUs, but we'll load the board with 64 GB (more on that later).
To generalize: compute jobs that run on CPU cores only, in serial or parallel fashion, we'll call cpuHPC jobs. Jobs that run on the GPU cores we'll call gpuHPC jobs. From a Scheduler perspective we then have 16 job slots on the cpuHPC side, and 4 perceived job slots on the gpuHPC side. But a gpuHPC job, serial or parallel, runs its serial code on a cpuHPC core and offloads any parallel computations (like loops) to the GPUs. Thus at least 4 of the cpuHPC job slots need to be reserved for gpuHPC use, more if you allow many small gpuHPC jobs, but there are more complications.
In a simple setup you would not allow any non-gpuHPC jobs on this hardware. Even then the Scheduler needs to be aware of whether GPUs are idle or busy. Nvidia's CUDA SDK provides tools for this, like deviceQuery, which returns gpuIDs (0-3 in our case). The cpuIDs are provided by the OS (see /proc/cpuinfo; 0-15 in our example). There is a script (gpu-info) that returns the gpuIDs and % utilization. One approach then is to set CUDA_VISIBLE_DEVICES on the node where the gpuHPC job will run and expose only idle GPUs (script gpu-free). Source: http://ambermd.org/gpus/#Running. So in this case the Scheduler can submit up to 4 serial gpuHPC jobs, allowing each gpuHPC job to claim one idle GPU; fewer if a parallel MPI gpuHPC job runs across multiple GPUs.
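To make the CUDA_VISIBLE_DEVICES idea concrete, here is a minimal sketch of the gpu-free approach; the nvidia-smi query flags and the “under 10% utilization counts as idle” threshold are assumptions to check against the actual gpu-info/gpu-free scripts:

```shell
#!/bin/sh
# Sketch of the gpu-free idea: expose only idle GPUs to a gpuHPC job
# by exporting CUDA_VISIBLE_DEVICES before the job is launched.
# The <10% utilization threshold for "idle" is an arbitrary assumption.

idle_gpus() {
    # reads "index, utilization %" CSV lines, prints comma-separated idle ids
    awk -F', *' '{
        sub(/ *%/, "", $2)                      # strip the trailing " %"
        if ($2 + 0 < 10) { printf "%s%s", sep, $1; sep = "," }
    } END { print "" }'
}

# On a live node (commented out here; requires the Nvidia driver):
#   CUDA_VISIBLE_DEVICES=$(nvidia-smi --query-gpu=index,utilization.gpu \
#                              --format=csv,noheader | idle_gpus)
#   export CUDA_VISIBLE_DEVICES
```

With GPUs 0 and 2 idle this yields CUDA_VISIBLE_DEVICES=0,2, and the job then sees exactly two devices, renumbered 0 and 1 internally.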
So we need to build some functionality into the job submission process that exposes idle GPUs. But how could we do that before submission? In other words, let jobs PEND until an idle GPU is available? We need to build an “elim” script that reports the number of available GPUs (an integer) and define this as a Lava resource (lsf.shared). Submit scripts then need to request this resource and quantity. That's perhaps a start, and efficient, if each gpuHPC job used, say, 99% of a GPU's computational power.
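As a sketch of what such an elim could look like: the “1 ngpus <N>” reporting line is how I understand LSF-style elims report one name/value pair (verify against the Lava docs), and the resource name ngpus is made up here:

```shell
#!/bin/sh
# Sketch of an elim for Lava that reports idle GPUs as a numeric resource
# "ngpus".  The "1 ngpus <N>" output format, the 30s interval, and the
# <10% idle threshold are assumptions to verify against the Lava elim docs.

count_free_gpus() {
    # reads "index, utilization %" CSV lines, prints the count of idle GPUs
    awk -F', *' '{ sub(/ *%/, "", $2); if ($2 + 0 < 10) n++ } END { print n + 0 }'
}

# Pass "run" to start the reporting loop (kept behind a flag so the
# functions can be sourced for testing without looping forever).
if [ "${1:-}" = "run" ]; then
    while true; do
        free=$(nvidia-smi --query-gpu=index,utilization.gpu --format=csv,noheader \
               2>/dev/null | count_free_gpus)
        echo "1 ngpus $free"
        sleep 30
    done
fi
```

Once ngpus is defined in lsf.shared, submit scripts would request it with something like bsub -R "rusage[ngpus=1]".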
It gets way more complicated if that is not the case. Let's assume a parallel MPI gpuHPC job does not fit inside one GPU but needs 4 GPUs, and at peak time utilization is 60% per GPU. (In this scenario no other parallel MPI gpuHPC jobs should be submitted, only serial gpuHPC jobs that can use up to 40% of the computational power of the, now busy, GPUs.) Complicated. First, the gpuHPC job needing all 4 GPUs needs to make sure 4 are available and idle; when submission occurs, 4 CPU cores need to be allocated (it does not matter which ones in a one-node scenario) and each CPU core gets a unique gpuID assigned. We can do that with the scripts above. Second, the MPI thread count is 4 (-n 4), but we need to make sure they land on one host (span[hosts=1]) and presumably reserve some memory on the CPU side (rusage[mem=…]). That can all be done in the one-node scenario. In a multiple-node setting, hosts=1 constrains the idle GPUs to be on one node.
We could also design a smarter “elim” resource that reports % utilization (0-100) and use that. (Would we know how much of the GPU's computational power we need?) How would one keep a second (parallel MPI) gpuHPC job from being submitted after one is already running? Maybe another elim resource that looks at the presence of a lock file somewhere (0/1) created by a pre-exec script scanning for mpirun invocations?
And what happens when you have more than one node? Can a hosts=2 setting with a request for 8 GPUs work with the GPU-enabled MPI binary? Don't know; we'll see.
Regarding the memory: if one works along the one-gpuHPC-job-per-GPU model, then the hardware has a maximum of 4 simultaneous jobs in this example. That leaves 12 cpuHPC cores idle, and one can envision setting up another queue for those. So as a minimum memory footprint I'd suggest 16x4GB, but if you can afford it, double that.
Start simple.
(As posted on openlava.org forum confirming some things)
Had this thought during my dental cleaning … which I think may be useful in framing our decision:
09nov12:
AC Quote
Topic | Description |
---|---|
General | 2 CPUs (16 cores), 3 GPUs (7,500 cuda cores), 32 gb ram/node |
Head Node | None |
Nodes | 1x4U Rackmountable Chassis, 2xXeon E5-2660 2.20 Ghz 20MB Cache 8 cores (16cores/node), Romley series |
8x4GB 240-Pin DDR3 1600 MHz memory (32gb/node, 11gb/gpu, max 256gb) | |
1x120GB SATA 2.5“ Solid State Drive (OS drive), 7x3TB 7200RPM | |
3xNVIDIA Tesla K20 8 GB GPUs (3/node), 1CPU-1.5GPU ratio | |
2×10/100/1000 NIC, 3x PCIE 3.0 x16 Slots | |
1xConnectX-3 VPI adapter card, single-port 56Gb/s | |
2x1620W Redundant Power Supplies | |
Network | 1×36 port Infiniband FDR (56Gb/s) switch & 4xConnectX-3 single port FDR (56Gb/s) IB adapter + 2x 2 meter cables (should be 4) |
Power | Rack power ready |
Software | None |
Warranty | 3 Year Parts and Labor (AC technical support) |
GPU Teraflops | 3.51 double, 10.56 single |
Quote | <!-- $33,067.43 S&H included --> Arrived |
12nov12:
EC Quote
Topic | Description |
---|---|
General | 8 CPUs (64 cores), 16 GPUs (40,000 cuda cores), 64 gb ram/node, plus head node |
Head Node | 1x1U Rackmount System, 2xXeon E5-2660 2.20 Ghz 20MB Cache 8 cores |
8x8GB 240-Pin DDR3 1600 MHz ECC (max 256gb), 2×10/100/1000 NIC, 2x PCIe x16 Full | |
2x2TB 7200RPM (can hold 10), ConnectX-2 VPI adapter card, Single-Port, QDR 40Gb/s | |
600w Power Supply | |
Nodes | 4x2U Rackmountable Chassis, 8xXeon E5-2660 2.20 Ghz 20MB Cache 8 cores (16/node), Romley series |
32x8GB 240-Pin DDR3 1600 MHz (64gb/node memory, 16gb/gpu, max 256gb) | |
4x1TB 7200RPM, 16xNVIDIA Tesla K20 8 GB GPUs (4/node), 1CPU-2GPU ratio | |
2×10/100/1000 NIC, Dedicated IPMI Port, 4x PCIE 3.0 x16 Slots | |
4xConnectX-2 VPI adapter card, Single-Port, QDR 40Gb/s | |
4x1800W Redundant Power Supplies | |
Network | 1x Mellanox InfiniBand QDR Switch (8 ports)& HCAs (single port) + 7' cables |
1x 1U 16 Port Rackmount Switch, 10/100/1000, Unmanaged (+ 7' cables) | |
Power | 2xPDU, Basic, 1U, 30A, 208V, (10) C13, Requires 1x L6-30 Power Outlet Per PDU |
Software | CentOS, Bright Cluster Management (1 year support) |
Amber12 (cluster install), Lammps (shared filesystem), (Barracuda for weirlab?) | |
Warranty | 3 Year Parts and Labor (EC technical support?) |
GPU Teraflops | 18.72 double, 56.32 single |
Quote | <!-- $93,600 + S&H --> Arrived |
HP 19nov12: meeting notes
HP Quote
http://h18004.www1.hp.com/products/quickspecs/14405_div/14405_div.HTML
Topic | Description |
---|---|
General | 6 CPUs (total 48 cores), 18 GPUs (45,000 cuda cores), 64 gb ram/node, no head node |
Head Node | None |
Chassis | 2xs6500 Chassis (4U) can each hold 2 half-width SL250s(gen8, 4U) servers, rackmounted, 4x1200W power supplies, 1x4U rack blank |
Nodes | 3xSL250s(gen8), 3x2xXeon E5-2650 2.0 Ghz 20MB Cache 8 cores (total 16 cores/node), Romley series |
3x16x8GB 240-Pin DDR3 1600 MHz (64gb/node, 10+ gb/gpu, max 256gb) | |
3x2x500GB 7200RPM, 3x6xNVIDIA Tesla K20 5 GB GPUs (6 gpu/node), 1CPU-to-3GPU ratio | |
3x2x10/100/1000 NIC, Dedicated IPMI Port, 3x8x PCIE 3.0 x16 Slots (GPU), 3x2x PCIE 3.0 x8 | |
3x2xIB interconnect, QDR 40Gb/s, FlexibleLOM goes into PCI3x8 slot | |
chassis supplied power; 3x1x one PDU power cord (416151-B21)? - see below | |
Network | 1xVoltaire QDR 36-port infiniband 40 Gb/s switch, + 6x 5M QSFP IB cables |
No ethernet switch, 17x 7' CAT5 RJ45 cables | |
Power | rack PDU ready, what is 1x HP 40A HV Core Only Corded PDU??? |
Software | RHEL, CMU GPU enabled (1 year support) - not on quote??? |
Warranty | 3 Year Parts and Labor (HP technical support?) |
GPU Teraflops | 21.06 double, 63.36 single |
Quote | <!-- $128,370, for a 1x6500+2xSl250 setup estimate is $95,170 --> Arrived (S&H and insurance?) |
AX Quote
http://www.amax.com/hpc/productdetail.asp?product_id=simcluster Fremont, CA
Topic | Description |
---|---|
General | 8 CPUs (48 cores), 12 GPUs (30,000 cuda cores), 64 gb ram/node, plus head node |
Head Node | 1x1U Rackmount System, 2x Intel Xeon E5-2620 2.0GHz (12 cores total) |
64GB DDR3 1333MHz (max 256gb), 2×10/100/1000 NIC, 2x PCIe x16 Full | |
2x1TB (Raid 1) 7200RPM, InfiniBand adapter card, Single-Port, QSFP 40Gb/s | |
???w Power Supply, CentOS | |
Nodes | 4x1U, 4x2xIntel Xeon E5-2650 2.0GHz, with 6 cores (12cores/node) Romley series |
4x96GB 240-Pin DDR3 1600 MHz (96gb/node memory, 8gb/gpu, max 256gb) | |
4x1TB 7200RPM, 12xNVIDIA Tesla K20 8 GB GPUs (3/node), 1CPU-1.5GPU ratio | |
2×10/100/1000 NIC, Dedicated IPMI Port, 4x PCIE 3.0 x16 Slots | |
4xInfiniband adapter card, Single-Port, QSFP 40Gb/s | |
4x??00W Redundant Power Supplies | |
Network | 1x Infiniband Switch (18 ports)& HCAs (single port) + ?' cables |
1x 1U 24 Port Rackmount Switch, 10/100/1000, Unmanaged (+ ?' cables) | |
Power | there are 3 rack PDUs? What are the connectors, L6-30? |
Software | CUDA only |
Warranty | 3 Year Parts and Labor (AX technical support?) |
GPU Teraflops | 14.04 double, 42.24 single |
Quote | <!-- $73,965 (S&H $800 included) --> Arrived |
MW Quote
http://www.microway.com/tesla/clusters.html Plymouth, MA
Topic | Description |
---|---|
General | 8 CPUs (64 cores), 16 GPUs (40,000 cuda cores), 32 gb ram/node, plus head node |
Head Node | 1x2U Rackmount System, 2xXeon E5-2650 2.0 Ghz 20MB Cache 8 cores |
8x4GB 240-Pin DDR3 1600 MHz ECC (max 512gb), 2×10/100/1000 NIC, 3x PCIe x16 Full, 3x PCIe x8 | |
2x1TB 7200RPM (Raid 1) + 6x2TB (Raid 6), Areca Raid Controller | |
Low profile graphics card, ConnectX-3 VPI adapter card, Single-Port, FDR 56Gb/s | |
740w Power Supply 1+1 redundant | |
Nodes | 4x1U Rackmountable Chassis, 4×2 Xeon E5-2650 2.0 Ghz 20MB Cache 8 cores (16/node), Sandy Bridge series |
4x8x4GB 240-Pin DDR3 1600 MHz (32gb/node memory, 8gb/gpu, max 256gb) | |
4x1x120GB SSD 7200RPM, 4x4xNVIDIA Tesla K20 5 GB GPUs (4/node), 1CPU-2GPU ratio | |
2×10/100/1000 NIC, Dedicated IPMI Port, 4x PCIE 3.0 x16 Slots | |
4xConnectX-3 VPI adapter card, Single-Port, FDR 56Gb/s | |
4x1800W (non) Redundant Power Supplies | |
Network | 1x Mellanox InfiniBand FDR Switch (36 ports)& HCAs (single port) + 3m cable FDR to existing Voltaire switch |
1x 1U 48 Port Rackmount Switch, 10/100/1000, Unmanaged (cables) | |
Rack | |
Power | 2xPDU, Basic rack, 30A, 208V, Requires 1x L6-30 Power Outlet Per PDU (NEMA L6-30P) |
Software | CentOS, Bright Cluster Management (1 year support), MVAPich, OpenMPI, CUDA 5 |
scheduler and gnu compilers installed and configured | |
Amber12, Lammps, Barracuda (for weirlab?), and others if desired …bought through MW | |
Warranty | 3 Year Parts and Labor (lifetime technical support) |
GPU Teraflops | 18.72 double, 56.32 single |
Quote | <!-- estimated at $95,800 --> Arrived, includes S&H and Insurance |
Upgrades | Cluster pre-installation service |
5×2 E5-2660 2.20 Ghz 8 core CPUs | |
5x upgrade to 64 GB per node |