“GPU computing is the use of a GPU (graphics processing unit) together with a CPU to accelerate general-purpose scientific and engineering applications.”
source: http://www.nvidia.com/object/what-is-gpu-computing.html
By connecting a CPU with a GPU (with a specialty card), a job running on a CPU core can offload intensive calculations to the GPU, which can quickly do repetitive calculations in parallel.
Here is a picture of a dual CPU rack server and a dual GPU rack server connected:
http://www.advancedclustering.com/images/stories/gpu/gpu_cluster_header-600px.png
The leader in the area is Nvidia, which produces Tesla GPUs (S2050/S2070 and K10/K20). Nvidia has also developed the CUDA parallel programming model, making the GPU programmable with C/C++ and Fortran (and others, see below). CUDA (software and drivers) needs to be installed; the driver loads its own kernel module (look this up?).
So once you stack multiple CPU/GPU units together, via gigabit or 10-gigabit Ethernet or InfiniBand, you have a compute cluster.
GPU clusters reportedly deliver their performance at 1/10th the cost and 1/20th the power of CPU-only systems based on the latest quad-core CPUs.
Nvidia's top of the line GPU
ExxactCorp bundles K20 GPUs with CPUs into a “simcluster”; think oversized coffee table on wheels
Intel's new multi-core platform. Intel is not going the GPU route.
“Intel is providing software tools so applications can be written or recompiled for the Phi chips. Curley said it is easy to recompile existing x86 code so that high-performance applications can take full advantage of the multicore chips.”
Source: http://www.computerworld.com/s/article/9233498/Intel_ships_60_core_Xeon_Phi_processor?taxonomyId=162&pageNumber=2
So the problem with GPUs is that the code needs to change (just like MPI parallel programs). But there is no “GPU” compiler; there is a whole new development environment and tools (see next section). The job running on the CPU core needs to load data into the GPU cores, act on it, retrieve the results, and march on, or redo the operation another bazillion times. That takes code. Matlab, in a wave function rewrite example, shows that 85% of the original code remains untouched; still, that's a 15% redo rate.
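To get a feel for whether that load/act/retrieve cycle pays off, here is a rough back-of-the-envelope sketch in Python. It assumes a simple Amdahl's-law style model plus a fixed data transfer cost; the function name and all the numbers are made up for illustration and are not measurements of any real code.

```python
def offload_runtime(total_s, offload_fraction, gpu_speedup, transfer_s=0.0):
    """Estimate wall time when part of a job is offloaded to a GPU.

    total_s          -- CPU-only runtime in seconds
    offload_fraction -- share of that runtime spent in offloadable code (0..1)
    gpu_speedup      -- how much faster the GPU runs the offloaded part
    transfer_s       -- time spent copying data to and from the GPU
    """
    if not 0.0 <= offload_fraction <= 1.0:
        raise ValueError("offload_fraction must be between 0 and 1")
    if gpu_speedup <= 0:
        raise ValueError("gpu_speedup must be positive")
    serial_s = total_s * (1.0 - offload_fraction)            # stays on the CPU core
    parallel_s = total_s * offload_fraction / gpu_speedup    # runs on the GPU
    return serial_s + parallel_s + transfer_s


if __name__ == "__main__":
    # Hypothetical job: 1000 s on the CPU, 90% offloadable, GPU 20x faster
    # on that part, 10 s spent copying data back and forth.
    new_s = offload_runtime(1000.0, 0.90, 20.0, transfer_s=10.0)
    print("estimated runtime: %.1f s, speedup: %.2fx" % (new_s, 1000.0 / new_s))
```

The point of the sketch: the serial remainder and the copy time put a ceiling on the gain, no matter how fast the GPU is on the offloaded loops.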
Digging around, there are third parties that do this (it is massively complicated and driver/card hardware dependent). Some commercial software is GPU enabled and some open source software supports it. (But there are entire lists just dedicated to compiling for GPU). Seems like a serious drawback.
Nvidia maintains a “Catalog of available software” which includes Amber, Lammps, Matlab, Mathematica, and Gaussian (in dev); we'll take a closer look at those next. (Might have to include a software cost component in GPU cluster cost. Can we even buy ready to go software? check).
http://www.nvidia.com/docs/IO/123576/nv-applications-catalog-lowres.pdf
Not sure if Amber/GPU is ready-to-go or you must do it yourself. Amber/MPI compilation is a nightmare; this is even worse.
http://ambermd.org/gpus/ pay particular attention to the scheduler discussion (GPU vs CPU “job slots”)
Ah, here we go, custom order an amber GPU
http://ambermd.org/gpus/recommended_hardware.htm
and
http://exxactcorp.com/ if we go GPU we should get a quote (we could run other stuff on top of this too I suppose, check?)
Get DLBgroup to start testing.
(Check this out: it also lists some prices and provides a testing site to check the speedups expected with your code.)
http://exxactcorp.com/testdrive/AMBER/
Your mileage may vary http://ambermd.org/gpus/benchmarks.htm … this is beyond me; double precision needed it looks like.
You can compile your own with the -cuda option (seems complicated).
Oh, another gem, Lammps GPU cluster ready to go (+testing): http://exxactcorp.com/testdrive/LAMMPS/
Get StarrGroup to start testing.
SideBar: Looking at the Exxact site, they build clusters for: Amber, Barracuda (for weir?), Lammps, Namd, and more http://exxactcorp.com/testdrive/
I need to get a K20 cluster quote! With pre-loaded software if we do this.
http://exxactcorp.com/index.php/solution/solu_list/12
This would be a purchase. The low end rates seem barely worth it. I believe the “gpu enabled” Mathematica is a different software install (check).
So here is an example of commercial software that supports Nvidia GPUs
http://www.mathworks.com/discovery/matlab-gpu.html
built into the Parallel Computing Toolbox (which we already have; I do not believe this is a different software install). For example, you recode to move an array onto the GPU with gpuArray() and bring the result back with gather().
Now we're talking speed up, that seems worth an effort. But we have low VMD usage.
It appears “SAS On Demand”, that is, remote access to SAS on a virtual classroom machine for faculty and staff, is powered by an IBM BladeCenter leveraging GPU technology.
http://decisionstats.com/2010/09/16/sasbladesservers-gpu-benchmarks/
Development Tools
https://developer.nvidia.com/cuda-tools-ecosystem
GPU cluster management (includes Platform LSF so identical to Lava which we use)
https://developer.nvidia.com/cluster-management
Webinars: http://www.gputechconf.com/page/gtc-express-webinar.html
GPU Programming For The Rest Of Us
http://www.clustermonkey.net/Applications/gpu-programming-for-the-rest-of-us.html
Excellent article on the GPU programming problem
“The Portland Group® (PGI), a wholly-owned subsidiary of STMicroelectronics and the leading independent supplier of compilers and tools for high-performance computing, announced at SC12 today that the PGI 2013 release of its PGI Accelerator™ compilers due out in early December will add support for the new family of NVIDIA® Tesla® K20 GPU accelerators and the CUDA® 5 parallel computing platform and programming model.”
“ … and … today announced plans to extend its PGI Accelerator™ compiler technology with OpenACC to Intel® Xeon Phi™ coprocessors, Intel's family of products based on the Intel Many Integrated Core (MIC) architecture.”
Source: http://www.pgroup.com/about/news.htm#56
So that solves some big items (it compiles for both GPU and Phi coprocessor):
What you do is insert pragmas (compiler directives) into the code; the compiled regions are offloaded if a GPU (or Phi coprocessor) is detected, otherwise the code just executes like normal. Looks like:
!$acc region
!$acc do parallel
do j=1,m
...
More examples using the PGI compiler and OpenACC from Microway http://microway.com/hpc-tech-tips/2013/04/5-easy-first-steps-on-gpus-accelerating-your-applications/
Here is a very detailed article with excerpts below
http://www.theregister.co.uk/2012/11/12/intel_xeon_phi_coprocessor_launch/
“You add parallel directives to the code, and you compile the code to run on both x86 chips in standalone mode and on the x86-Xeon Phi combination. You get one set of compiled code, and if the Xeon Phi chips are present, the work is offloaded from the server CPUs to the x86 coprocessors, and they do the acceleration. If not, the CPUs in the cluster or the workstation do the math.”
That explains it in detail. So this is the same approach as the Portland Group compiler for GPUs: one compiled binary. But potentially a lower performance boost than the GPU approach.
While anticipating a hardware acquisition to delve into the world of GPU computing, I'm trying to abstract out the perceived Scheduler problem. For the sake of this discussion let's assume one 4U rack with 2 CPUs and 4 GPUs. Let's also assume the CPUs are Intel E5-2660, 8 cores per CPU, 2.0 GHz (total of 16 CPU cores, “Romley” series so each CPU can see all GPUs, which seems to be the current trend). Assume the GPUs are Nvidia Tesla K20 (roughly 2,500 CUDA cores per GPU). We have 4 GB DDR3 memory per GPU, total of 16 GB, and need at least that much for the CPUs, but we'll load the board with 64 GB (more on that later).
To generalize: compute jobs that would run on the CPU cores only, in serial or parallel fashion, we'll call cpuHPC. Jobs that run on the GPU cores we'll call gpuHPC. From a Scheduler perspective, then, we have 16 job slots on the cpuHPC and 4 perceived job slots on the gpuHPC. But a gpuHPC job, serial or parallel, runs its serial code on a cpuHPC core and offloads any parallel computations (like loops) to the gpuHPC (serial or parallel). Thus at least 4 of the cpuHPC job slots need to be reserved for the gpuHPC, more if you will allow many small gpuHPC jobs, but there are more complications.
In a simple setup you would not allow any non-gpuHPC jobs on this hardware. Even then the Scheduler needs to be aware of whether GPUs are idle or busy. Nvidia's CUDA SDK provides tools for this, like deviceQuery, which returns gpuIDs (0-3 in our case). The cpuIDs are provided by the OS (see /proc/cpuinfo, 0-1 in our example). There is a script (gpu-info) that returns the gpuIDs and % utilization. One approach then is to set CUDA_VISIBLE_DEVICES on the node where the gpuHPC job will run and expose only idle GPUs (script gpu-free). Source: http://ambermd.org/gpus/#Running. So in this case the Scheduler can only submit up to 4 gpuHPC serial jobs, allowing each gpuHPC job to claim one idle GPU. Fewer if a parallel MPI gpuHPC job runs across multiple GPUs.
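Here is a minimal sketch of what a gpu-free style helper could look like in Python. It is not the Amber script itself. It assumes the installed nvidia-smi supports the --query-gpu and --format options, and the 5% idle threshold is my own arbitrary choice.

```python
import subprocess
import sys

IDLE_THRESHOLD = 5  # percent; below this we call a GPU idle (assumption)


def parse_gpu_utilization(csv_text):
    """Parse 'index, utilization' lines into a list of (gpu_id, percent)."""
    gpus = []
    for line in csv_text.strip().splitlines():
        index, util = [field.strip() for field in line.split(",")]
        gpus.append((int(index), int(util)))
    return gpus


def idle_gpu_ids(gpus, threshold=IDLE_THRESHOLD):
    """Return the ids of GPUs whose utilization is below the threshold."""
    return [gpu_id for gpu_id, util in gpus if util < threshold]


def query_nvidia_smi():
    """Ask nvidia-smi for per-GPU utilization (needs the NVIDIA driver)."""
    return subprocess.check_output(
        ["nvidia-smi", "--query-gpu=index,utilization.gpu",
         "--format=csv,noheader,nounits"],
        text=True,
    )


if __name__ == "__main__":
    try:
        free = idle_gpu_ids(parse_gpu_utilization(query_nvidia_smi()))
        # Print a value suitable for CUDA_VISIBLE_DEVICES, e.g. "1,2".
        print(",".join(str(g) for g in free))
    except (OSError, ValueError, subprocess.CalledProcessError) as err:
        sys.stderr.write("could not query GPUs: %s\n" % err)
```

A submit script could capture the printed list and export it as CUDA_VISIBLE_DEVICES before launching the gpuHPC binary, so the job only sees GPUs that were idle at that moment.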
So we need to build some functionality into the job submission process that exposes idle GPUs. But how could we do that before submission? IOW, let the jobs PEND until an idle GPU is available? We need to build an “elim” script that reports the number of available GPUs (integer) and define this Lava resource (lsf.shared). Submit scripts then need to request this resource and quantity. That's perhaps a start, and efficient, if each gpuHPC job used say 99% of the GPU's computational power.
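A sketch of the reporting side of such an elim, in Python. It assumes Lava follows the LSF elim convention of writing lines of the form "number-of-resources name value ..." to stdout at a regular interval. The resource name gpu and the 15 second interval are placeholders of mine, and count_idle_gpus stands in for whatever actually counts idle GPUs.

```python
import sys
import time


def format_elim_line(resources):
    """Format resources as an elim report: '<count> <name> <value> ...'."""
    parts = [str(len(resources))]
    for name, value in resources.items():
        parts.extend([name, str(value)])
    return " ".join(parts)


def report(count_idle_gpus, interval_s=15, max_reports=None, out=sys.stdout):
    """Write one report line per interval; max_reports=None loops forever."""
    sent = 0
    while max_reports is None or sent < max_reports:
        out.write(format_elim_line({"gpu": count_idle_gpus()}) + "\n")
        out.flush()  # the batch system reads the pipe line by line
        sent += 1
        if max_reports is None or sent < max_reports:
            time.sleep(interval_s)
```

The real elim would call report() with a function that counts idle GPUs and no max_reports, and the resource name would have to match whatever gets defined in lsf.shared.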
Gets way more complicated if that is not the case. Let's assume a gpuHPC parallel MPI job does not fit inside one GPU but needs 4 GPUs, and at peak time utilization is 60% per GPU. (In this scenario no other gpuHPC parallel MPI jobs should be submitted, only serial gpuHPC jobs that can use up to 40% of the computational power of the now busy GPUs.) Complicated. First, the gpuHPC job needing all 4 GPUs needs to make sure that 4 are available and idle; when submission occurs, 4 CPU cores need to be allocated (it does not matter which ones in a one-node scenario) and each CPU core gets a unique gpuID assigned. We can do that with the scripts above. Second, the MPI process count is 4 (-n 4), but we need to make sure they land on one host (span[hosts=1]) and presumably reserve some memory on the CPU side (rusage[mem=…]). That can all be done in the one-node scenario. In a multiple-node setting, hosts=1 constrains the idle GPUs to be on one node.
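To make the two steps concrete, here is a small Python sketch that pairs MPI ranks with GPU ids and assembles the matching bsub request. The custom resource name gpu is hypothetical (it would be the elim resource), the memory amount is left as a parameter because I have not settled on a value, and the colon-separated rusage form is my assumption about what Lava accepts.

```python
def assign_gpus(n_ranks, idle_gpu_ids):
    """Give each MPI rank its own GPU id; fail if there are not enough."""
    if n_ranks > len(idle_gpu_ids):
        raise ValueError("not enough idle GPUs for every rank")
    return {rank: idle_gpu_ids[rank] for rank in range(n_ranks)}


def build_bsub_command(n_gpus, mem_mb, job_command, gpu_resource="gpu"):
    """Build a bsub argument list for a one-host, multi-GPU MPI job.

    n_gpus       -- number of GPUs, and therefore CPU job slots (-n)
    mem_mb       -- host memory to reserve, chosen by the caller
    job_command  -- the command to run, as a list of strings
    gpu_resource -- name of the custom GPU resource (hypothetical)
    """
    if n_gpus < 1:
        raise ValueError("need at least one GPU")
    resreq = "rusage[{0}={1}:mem={2}] span[hosts=1]".format(
        gpu_resource, n_gpus, mem_mb)
    return ["bsub", "-n", str(n_gpus), "-R", resreq] + list(job_command)
```

Keeping the slot count and the GPU count tied to the same number is deliberate: in the one gpuHPC process per GPU model, every rank needs exactly one CPU core and one GPU.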
We could also design a smarter “elim” resource and have % utilization reported (0-100) and use that. (Would we know how much of the GPU computational power we need?) How would one keep a second (parallel MPI) gpuHPC job from being submitted after one is already running? Maybe another elim resource that looks at the presence of a lock file somewhere (0/1), created by a PREEXEC script scanning for mpirun invocations?
And what happens when you have more than one node? Can a hosts=2 setting with a request for 8 GPUs work using the MPI GPU-enabled binary? Don't know, we'll see.
Regarding the memory. If one works along the one gpuHPC job per GPU model, then the hardware has a maximum of 4 simultaneous jobs in this example. That leaves 12 cpuHPC cores idle and one can envision setting up another queue for those. So for a minimum memory footprint I'd suggest 16x4GB, but if you can afford it, double that.
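The lock file idea could be sketched like this in Python. The lock path is whatever we would agree on (hypothetical here), and wiring acquire_lock and release_lock into pre-exec and post-exec scripts is my assumption about how it would be hooked up.

```python
import os


def acquire_lock(lock_path):
    """Create the lock file atomically; return False if it already exists."""
    try:
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    os.close(fd)
    return True


def release_lock(lock_path):
    """Remove the lock file; a missing file is not an error."""
    try:
        os.remove(lock_path)
    except FileNotFoundError:
        pass


def lock_resource_value(lock_path):
    """Value an elim could report: 1 while a parallel job holds the lock, else 0."""
    return 1 if os.path.exists(lock_path) else 0
```

Creating the file with O_EXCL means two jobs starting at the same moment cannot both think they got the lock. A crashed job would leave a stale lock behind, so some cleanup would still be needed.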
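The slot and memory arithmetic for this example node, spelled out in Python. The numbers are the assumptions from the discussion above; reading “16x4GB” as 16 units of 4 GB each is my interpretation.

```python
CPU_CORES = 2 * 8            # two 8-core CPUs
GPUS = 4
GPU_MEM_GB = 4               # per GPU, as assumed in this discussion
MEM_UNITS, UNIT_GB = 16, 4   # the suggested 16x4GB

gpu_job_slots = GPUS                        # one gpuHPC job per GPU
reserved_cpu_slots = gpu_job_slots          # each gpuHPC job needs one CPU core
free_cpu_slots = CPU_CORES - reserved_cpu_slots
total_gpu_mem_gb = GPUS * GPU_MEM_GB
host_mem_gb = MEM_UNITS * UNIT_GB

print(free_cpu_slots, total_gpu_mem_gb, host_mem_gb)  # prints: 12 16 64
```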
Start simple.
(As posted on openlava.org forum confirming some things)
Had this thought during my dental cleaning … which I think may be useful in framing our decision:
Or
Tasked with getting GPU HPC going at Wesleyan and trying to gain insights into the project. If you acquired a GPU HPC …
Notes 04/01/2012 ConfCall