cluster:107

GPU History
What is a GPU cluster?
Nvidia Telsa K20
Intel Xeon Phi
GPU enabled software
GPU Tools
GPU Programming
Phi Programming
The Scheduler Problem
Things To Track
TechDocs
Mantra
Yale Qs
GPU Specs

GPU History

http://en.wikipedia.org/wiki/Graphics_processing_unit

What is a GPU cluster?

CPU = Central Processing Unit
- the chip on the motherboard with L1/L2 caches and often comprised of cores (like dual quad or 8)
- each core typically processes one computing job
- kernel also has ability to swap to disk (not a desired long term state)

GPU = Graphics Processing Unit
- also a chip, no swap, and comprised of many tiny cores (up to thousands per GPU)
- can process for example 10 million polygons per second
- often used for 3D modeling
- one can not add memory to card

“GPU computing is the use of a GPU (graphics processing unit) together with a CPU to accelerate general-purpose scientific and engineering applications.”
source: http://www.nvidia.com/object/what-is-gpu-computing.html

By connecting a CPU with a GPU (with a specialty card), a job running in a CPU core can offload intensive calculations to the GPU which can in parallel fashion quickly do redundant calculations.

Here is a picture of a dual CPU rack server and a dual GPU rack server connected:
http://www.advancedclustering.com/images/stories/gpu/gpu_cluster_header-600px.png

One rack contains two dual core CPUs and the other rack 4 GPUs (448 cores per GPU).
Double precision peak performance is 2 Teraflops for 2U of rack space.
- (Compare this to all our 3 clusters combined that together get close to 2.7 Teraflops in 4 racks!).
There needs to be at least 3GB of memory per GPU so 12GB for two CPUs with 4 cores linked to 4 GPUs.
- (This unit can hold up to 6GB per GPU allowing data to be cached in local CPU memory).

The leader in the area is Nvidia which produces Tesla GPUs (S2050/S2070 and K10/K20). Nvidia also has developed the CUDA parallel programming model making the GPU programmable with C/C++ and Fortran (and others, see below). This CUDA (software and drivers) needs to be installed and may alter the kernel (look this up?).

So once you stack multiple CPU/GPU units together, via gigabit or 10-gigabit ethernet or infiniband, you have a compute cluster.

GPU clusters perform at 1/10th the cost and 1/20th the power of CPU-only systems based on the latest quad-core CPUs.

Nvidia Telsa K20

Nvidia's top of the line GPU

1.17 teraflops peak double precision performance (K20X=1.31, K10=0.19, very poor in double precision, avoid)
3.52 teraflops peak single precision performance (K20X=3.95, K10=4.58)
delivers 10 times the performance of a single CPU for 1/20th of the power (seems to be a standard)
2496 CUDA cores per GPU
CPU memory 5GB GDDR5

ExxactCorp bundles K20 GPU with CPUs into a “simcluster”, think over sized coffee table on wheels

4 nodes with Infiniband
- (need cables and voltaire compatibility, and another voltaire switch model 4036 to HP rack)
2 Tesla K20 GPUs per node, total is 8 GPUs
2 Westmere X5650 Xeon 2.66 Ghz CPUs per node , total is 8 CPUs
6x 4GB DDR3-1333 (24GB/Node) is 24 GB/node, total is 96 GB
for a total of 9.36 Teraflops double precision mode,
for a total of 28.16 Teraflops single precision mode.
1 or 2 TB single disk (can be more)
Amber preinstall (there is a charge, it's very complicated, v12)
3 year warranty
picture http://www.google.com/imgres?imgurl=http://images.anandtech.com/doci/6446/TeslaK20.jpg&imgrefurl=http://www.anandtech.com/show/6446/nvidia-launches-tesla-k20-k20x-gk110-arrives-at-last&h=2037&w=2500&sz=988&tbnid=JpjMdigXE1LQqM:&tbnh=90&tbnw=110&zoom=1&usg=__kqo6Ny7C_sc5fL6oIlIyb_P4MXU=&docid=tVrJJ13rwkfi0M&sa=X&ei=ZW0wUu3BE4iC9QT_oYGABw&sqi=2&ved=0CGIQ9QEwBg&dur=41
quote coming (see below)
can i plug the power cords into HP rack PDU (PDU to PDU? Answer is no, unless you just buy the rack servers.)

Intel Xeon Phi

Intel new multi core platform. Intel is not going the GPU route.

picture http://www.itworldcanada.com/news/intel-ships-xeon-phi-coprocessor/146377
- box of cores on fire, goes into PCI-Express 2.0 slot
- runs it's own linux on top of system OS
The first “Knights Corner” chip to ship is the Xeon Phi 5110P (passively cooled)
- which has 60 cores and a clock speed of 1.05 GHz (really??)
- It offers 1,011 gigaflops (1 Teraflop) of peak double-precision performance
  - that's impressive with that clock speed
- 30 GB of cache and memory capacity of 6 GB. (that's different than system memory I assume)
To match hp12 12 gb/node availability for 8 cores you'd need
- 90 gb memory per 60 cores which is equivalent to 12 gb/8 cores
- double that if you 24 gb/8 cores type of availability.

“Intel is providing software tools so applications can be written or recompiled for the Phi chips. Curley said it is easy to recompile existing x86 code so that high-performance applications can take full advantage of the multicore chips.”
Source:http://www.computerworld.com/s/article/9233498/Intel_ships_60_core_Xeon_Phi_processor?taxonomyId=162&pageNumber=2

So a recompile is needed anyway but not a recode effort like for GPU enabled software, check.
Need to get a quote from somewhere (AC or HP)
So a 4 node set of 2U servers would deliver 4 Teraflops double precision, or about 50% of GPU performance.

GPU enabled software

So the problem with GPU is that the code needs to change (just like MPI parallel programs). But there is no “GPU” compiler, there is a whole new development environment and tools (see next section). The job running on the CPU core needs to load data into the GPU cores, act on it, and retrieve the results and march on, or redo the operation another bazilion times. That takes code. Matlab, in a wave function rewrite example, shows that 85% of the original code remains untouched; still that's a 15% redo rate.

Digging around, there are third parties that do this (it is massively complicated and driver/card hardware dependent). Some commercial software is GPU enabled and some open source software supports it. (But there are entire lists just dedicated to compiling for GPU). Seems like a serious drawback.

Nvidia maintains a “Catalog of available software” which includes Amber, Lammps, Matlab, Mathematica, and Gaussian (in dev); we'll take a closer look at those next. (Might have to include a software cost component in GPU cluster cost. Can we even buy ready to go software? check).
http://www.nvidia.com/docs/IO/123576/nv-applications-catalog-lowres.pdf

Application
- Description
- Supported features
- Expected speed up ←– nvidia in home tests
- Multi GPU support
- Release status

Amber
- Suite of programs to simulate molecular dynamics on biomolecules
- PMEMD: explicit and implicit solvent (Amber v12)
- 89.44 ns/day JAC NVE (gosh you got me)
- Yes
- Available now,v12

Not sure if Amber/GPU is ready-to-go or you must do it yourself. Amber/MPI compilation is a nightmare this is even worse
http://ambermd.org/gpus/ pay particular attention to the scheduler discussion (GPU vs CPU “job slots”)

“GPU accelerated PMEMD has been implemented using CUDA and thus will only run on NVIDIA GPUs at present.”
“the code makes use of fixed and double precision in several places” … thus a requirement
“ plan to run PMEMD are connected to PCI-E 2.0 x 16 lane slots or better” … thus a requirement
“experience to date shows that even up to 8 GPUs per node the performance impact is minimal”
“ the size of the GPU memory imposes hard limits on the number of atoms supported in a simulation”
“There can also be differences between CPU [only] and [CPU]/GPU runs … the random number stream is different between CPUs and GPUs.”
“GPUDirect peer-to-peer transfers and memory access are supported natively by the CUDA Driver. All you need is CUDA Toolkit v4.0 and R270 drivers (or later)“
gpu-info and gpu-free scripts

Ah, here we go, custom order an amber GPU
http://ambermd.org/gpus/recommended_hardware.htm
and http://exxactcorp.com/ if we go GPU we should get a quote (we could run other stuff on top of this too I suppose, check?)
Get DLBgroup to start testing.
(check this out, it also lists some prices) and provides a testing site to check on speed ups expected with your code
http://exxactcorp.com/testdrive/AMBER/

Your mileage may vary http://ambermd.org/gpus/benchmarks.htm … this is beyond me; double precision needed it looks like.

Lammps
- Classical molecular dynamics package
- Lennard-Jones, Gay-Berne (latest release)
- 3.5-15x
- Yes
- Available now

You can compile your own with the –cuda option (seems complicated).
Oh, another gem, Lammps GPU cluster ready to go (+testing): http://exxactcorp.com/testdrive/LAMMPS/
Get StarrGroup to start testing.

SideBar: Looking at the Exacct site, they build clusters for: Amber, Barracuda (for weir?), Lammps, Namd, and more http://exxactcorp.com/testdrive/
I need to get a K20 cluster quote! With pre-loaded software if we do this.
http://exxactcorp.com/index.php/solution/solu_list/12

Mathematica
- A symbolic technical computing language and development environment
- Development environment for CUDA and OpenCL
- 2-20x
- Yes
- Available now

This would be a purchase. The low end rates seem barely worth it. I believe the “gpu enabled” Mathematica is a different software install (check).

Matlab
- GPU acceleration for MATLAB (high-level Technical Computing Language)
- Support for over 200 of the most commonly used MATLAB functions
- 2-20x
- Yes
- Available now

So here is an example of commercial software that support Nvidia GPUs
http://www.mathworks.com/discovery/matlab-gpu.html
built into the Parallel Toolbox (that we already have, I do not believe this is a different software install). For example, instead of the function getArray{} you would recode it to GPUgetArray{}.

VMD
- Visualization and analyzing large bio-molecular systems in 3-D graphics
- GPU acceleration for computationally demanding analysis and visualization tasks
- 100-125x Kernels
- Yes
- Available now Version 1.9.x

Now we're talking speed up, that seems worth an effort. But we have low VMD usage.

It appears “SAS On Demand”, that is remote access to SAS on a virtual classroom machine for facculty and staff, is powered by an IBM BladeCenter leveraging GPU technology.
http://decisionstats.com/2010/09/16/sasbladesservers-gpu-benchmarks/

GPU Tools

Development Tools
https://developer.nvidia.com/cuda-tools-ecosystem

GPU cluster management (includes Platform LSF so identical to Lava which we use)
https://developer.nvidia.com/cluster-management

Webinars |http://www.gputechconf.com/page/gtc-express-webinar.html

GPU Programming

GPU Programming For The Rest Of Us
http://www.clustermonkey.net/Applications/gpu-programming-for-the-rest-of-us.html
Excellent article on the GPU programming problem

“The Portland Group® (PGI), a wholly-owned subsidiary of STMicroelectronics and the leading independent supplier of compilers and tools for high-performance computing, announced at SC12 today that thePGI 2013 release of its PGI Accelerator™ compilers due out in early December will add support for the new family of NVIDIA® Tesla® K20 GPU accelerators and the CUDA® 5 parallel computing platform and programming model.”

” … and … today announced plans to extend its PGI Accelerator™ compiler technology with OpenACC to Intel® Xeon Phi™ coprocessors, Intel's family of products based on the Intel Many Integrated Core (MIC) architecture.”

Source: http://www.pgroup.com/about/news.htm#56

So that solves some big items (it compiles for both GPU and Phi coprocessor):

No prior experience with OpenCL, CUDA or other low-level programming models required. Ideal for domain experts.
Well-designed algorithms that would perform well when written using a low-level programming model can perform just as well using directives.
No separate co-processor source code required. Compile the same program for multi-core CPUs using PGI or any other standard-compliant compiler.
Supports GPU accelerators and co-processors from multiple vendors.
Developers can port and tune parts of their application as time permits. No wholesale rewrite required.
Most developers see results with modest effort.
(Intel is not doing anything on the GPU side of things, only on the coprocessor side of things)

What the compiler does is inert pragmas (compiler directives) that are executed if a GPU (or Phi processor) are detected, otherwise the code just excutes like normal. Looks like:

!$acc region
   !$acc do parallel
   do j=1,m
    ...

More examples using the PGC compiler and openACC from Microway http://microway.com/hpc-tech-tips/2013/04/5-easy-first-steps-on-gpus-accelerating-your-applications/

Phi Programming

Here is a very detailed article with excerpts below
http://www.theregister.co.uk/2012/11/12/intel_xeon_phi_coprocessor_launch/

“Many Integrated Core” architecture (MIC)
generally available on January 28 to the rest of us for $2,649
Xeon Phi chips run Linux and an OpenMP multiprocessing as well as the message passing interface (MPI) protocol
Intel's C, C++, and Fortran compilers in its Parallel Studio XE set as well as the Cluster Studio XE extensions work on Xeon Phi chips.
a server with two Xeon E5-2670 processors, adding a single Xeon Phi card can boost the performance of various HPC workloads by between a factor of 2.2 to 2.9

“You add parallel directives to the code, and you compile the code to run on both x86 chips in standalone mode and on the x86-Xeon Phi combination. You get one set of compiled code, and if the Xeon Phi chips are present, the work is offloaded from the server CPUs to the x86 coprocessors, and they do the acceleration. If not, the CPUs in the cluster or the workstation do the math.”

That explains that in detail. So this is the same approach as the Portland GPU compiler. One compiled binary. But potentially a lower performance boost than the GPU approach.

The Scheduler Problem

While anticipating a hardware acquisition to delve into the world of GPU computing, I'm trying to abstract out the perceived Scheduler problem. For the sake of this discussion lets assume one 4U rack with 2 CPUs and 4 GPUs. Lets also assume the CPUs are Intel E5-2660 8 cores per CPU 2.0 Ghz (total of 16 cpu cores, “Romnley” series so each CPU can see all GPUs, seems to be the current trend). Assume the GPUs are Nvidia Telsa K20 (roughly 2,500 cuda cores per GPU). We have 4 GB DDR3 memory per GPU, total of 16 GB, and need at least that much for CPUs, but we'll load the board with 64 GB (more on that later).

To generalize; compute jobs that would run on the CPU cores only, in serial or parallel fashion, we'll call cpuHPC. Jobs that run on the GPU cores we'll call gpuHPC. From a Scheduler perspective then we have 16 job slots on the cpuHPC. And 4 perceived job slots on the gpuHPC. But a gpuHPC job, serial or parallel, runs it's serial code on a cpuHPC core and offloads any parallel computations (like loops) to the gpuHPC (serial or parallel). Thus at least 4 of the cpuHPC jobs slots need to be reserved for the gpuHPC, more if you will allow many small gpuHPC jobs, but there are more complications.

In a simple setup you would not allow any non-gpuHPC jobs on this hardware. Even then the Scheduler needs to be aware if GPUs are idle or busy. Nvidia's CUDA SDK provides tools for this like deviceQuery which returns gpuIDs (0-3 in our case). The cpuIDs are provided by the OS (see /proc/cpuinfo, 0-1 in our example). There is a script (gpu-info) that returns the gpuIDs and % utilization. One approach then is to set CUDA_VISIBLE_DEVICES on the node where the gpuHPC job will run and expose only idle GPUs (script gpu-free). Source:http://ambermd.org/gpus/#Running. So in this case the Scheduler can only submit up to 4 gpuHPC serial jobs allowing each gpuHPC job to claim one idle GPU. Less if a parallel MPI gpuHPC job runs across multiple GPUs.

So we need to build some functionality into the submit jobs process that exposes idle GPUs. But how could we do that before submission? IOW let the jobs PEND until an idle GPU is available? We need to build an “elim” script that reports number of available GPUs (integer) and define this Lava resource (lsf.shared). Submit scripts then need to request this resource and quantity. That's perhaps a start, and efficient, if each gpuHPC job used say 99% of the GPU's computational power.

Gets way more complicated if that is not the case. Lets assume a gpuHPC parallel MPI job does not fit inside one GPU but needs 4 GPUs and at peak time utilization is 60% per GPU. (In this scenario no other gpuHPC parallel MPI jobs should be submitted, only serial gpuHPC jobs that can use up to 40% of that computational power of the, now busy, GPUs). Complicated. First the cpuHPC job needing all 4 GPUs needs to make sure: 4 are available and idle, when submission occurs 4 CPU cores need to allocated (does not matter which ones in a one node scenario) and each CPU core gets a unique gpuID assigned. We can do that with the scripts above. Second, the MPI thread count is 4 (-n 4) but we need to make sure they land on one host (span[hosts=1]) and presumably reserve some memory on CPU side (rusage[mem=…]). That can all be done in the one node scenario. In a multiple node setting the hosts=1 constrains the idle GPUs to be on one node.

We could also design a smarter “elim” resource and have % utilization reported (0-100) and use that. (Would we know how much of the GPU computational power we need?). How would one keep a second (parallel MPI) gpuHPC job from being submitted after one is already running? Maybe another elim resource that looks at the presence of a lock file somewhere (0/1) created bu a PREEXEC script scanning for mpirun invocations?

And what happens when you have more than one node? Can a hosts=2 setting a request for 8 GPUs using the MPI GPU enabled binary work? Don't know, we'll see.

Regarding the memory. If one works along the one gpuHPC job per GPU model, then the hardware has a maximm of 4 simultaneous jobs in this example. That leaves 12 cpuHPC cores idle and one can envision setting up another queue for those. So in a minimum memory foot print I'd suggest 16x4GB but if you can afford it, double that.

Start simple.

(As posted on openlava.org forum confirming some things)

Things To Track

power/cooling requirements, connections (L6-30 or rack PDU) and rack U needed)
voltaire compatibility (and expansion) or HCA PCI2ex16 slot 1 (full height, full width riser) to cabinet
gaussian; in dev but K20/K20X compliant, what about cuda v4.2 vs 5.0?
local storage on GPU HPC?
lava configuration (cpu-cores to GPU ratios)

For final quote, add spare parts like HCA cards and (long) IB cables

TechDocs

http://www.ncsa.illinois.edu/~kindr/papers/ppac09_paper.pdf
Excllent technical document on GPU clusters, programming and scheduler implications
Pre/Post job submission GPU (de)allocations and GPU scrubbing
Portland PGI compiler discussion
MPI compiling using 'nvcc'
yea, it can get scary …

http://docs.nvidia.com/cuda/index.html
install guide
nvcc guide
…

Mantra

Had this thought during my dental cleaning … which I think may be useful in framing our decision:

Go for the GPU HPC solution (because of the teraflops boost)
And if it does not work out, be content with the CPU HPC left (# cores, ram gb/core, etc)

buy a single rack and test locally, start small (will future racks be compatible?)

Yale Qs

Tasked with getting GPU HPC going at Wesleyan and trying to gain insights into the project. If you acquired a GPU HPC …

What was the most important design element of the cluster?
What factor(s) settled the CPU to GPU ratio?
Was either, or neither, single or double precision peak performance more/less important?
What was the software suite in mind (commercial, open source, or custom code GPU “enabled”)?
How did you reach out/educate users on the aspects of GPU computing?
What was the impact on the users? (recoding, recompiling)
Was the expected computational speed up realized?
Was the PGI Accelerator compilers leveraged? If so what were the results?
Do users compile with nvcc?
Does the scheduler have a resource for idle GPUs so they can be reserved?
How are the GPUs exposed/assigned to jobs the scheduler submits?
Do you allow multiple serial jobs to access the same GPU? Or one parallel job multiples GPUs?
Can parallel jobs access mutliple GPUs across nodes?
Any experiences with pmemd.cuda.MPI (part of Amber)?
What MPI flavor is used most in regards to GPU computing?
Do you leverage the CPU HPC of the GPU HPC? For example, if there are 16 GPUs and 64 CPU cores on a cluster, do you allow 48 standard jobs on the idle cores? (assuming the max of 16 serial GPU jobs)

Notes 04/01/2012 ConfCall

Applications drive the CPU-to-GPU ratio and most will be 1-to-1, certainly not larger then 1-to-3
Users did not share GPUs but could obtain more than one, always on same node
Experimental setup with 36 gb/node, dual 8 core chips
Nothing larger than that memory wise as CPU and GPU HPC work environments were not mixed
No raw code development
Speed ups was hard to tell
PGI Accelerator was used because it is needed with any Fortran code (Note!)
Double precision was most important in scientific applications
MPI flavor was OpenMPI, and others (including MVApich) showed no advantages
Book: Programming Massively Parallel Processors, Second Edition:
- A Hands-on Approach by David B. Kirk and Wen-mei W. Hwu (Dec 28, 2012)
- Has examples of how to expose GPUs across nodes

GPU Specs

Moved to separate page

Back

Table of Contents