GPU History

What is a GPU cluster?

  • CPU = Central Processing Unit
    • the chip on the motherboard with L1/L2 caches, typically comprising multiple cores (dual, quad, or 8)
    • each core typically processes one computing job
    • the kernel can also swap to disk (not a desirable long-term state)
  • GPU = Graphics Processing Unit
    • also a chip, with no swap, comprising many tiny cores (up to thousands per GPU)
    • can process, for example, 10 million polygons per second
    • often used for 3D modeling
    • one cannot add memory to the card

“GPU computing is the use of a GPU (graphics processing unit) together with a CPU to accelerate general-purpose scientific and engineering applications.”
source: http://www.nvidia.com/object/what-is-gpu-computing.html

By connecting a CPU with a GPU (via a specialty card), a job running on a CPU core can offload intensive calculations to the GPU, which can quickly perform repetitive calculations in parallel fashion.
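
To make that offload pattern concrete, here is a minimal CUDA sketch (ours, not from the Nvidia page quoted above): the CPU job copies an array to the GPU, launches a kernel that runs one tiny thread per element, and copies the result back before marching on.

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

__global__ void scale(float *x, float a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  /* one GPU thread per element */
    if (i < n) x[i] *= a;
}

int main(void)
{
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *h = (float *)malloc(bytes), *d;
    for (int i = 0; i < n; i++) h[i] = 1.0f;
    cudaMalloc(&d, bytes);                            /* memory on the GPU card */
    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);  /* CPU -> GPU */
    scale<<<(n + 255) / 256, 256>>>(d, 2.0f, n);      /* offload the loop */
    cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost);  /* GPU -> CPU, job marches on */
    printf("h[0] = %f\n", h[0]);
    cudaFree(d);
    free(h);
    return 0;
}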

Here is a picture of a dual CPU rack server and a dual GPU rack server connected:
http://www.advancedclustering.com/images/stories/gpu/gpu_cluster_header-600px.png

  • One rack server contains two dual-core CPUs and the other holds 4 GPUs (448 cores per GPU).
  • Double-precision peak performance is 2 teraflops for 2U of rack space.
    • (Compare this to our 3 clusters combined, which together get close to 2.7 teraflops in 4 racks!)
  • There needs to be at least 3 GB of memory per GPU, so 12 GB for two 4-core CPUs linked to 4 GPUs.
    • (This unit can hold up to 6 GB per GPU, allowing data to be cached in local CPU memory.)

The leader in this area is Nvidia, which produces the Tesla GPUs (S2050/S2070 and K10/K20). Nvidia has also developed the CUDA parallel programming model, making the GPU programmable with C/C++ and Fortran (and others, see below). This CUDA stack (software and drivers) needs to be installed and may alter the kernel (look this up?).

So once you stack multiple CPU/GPU units together, connected via gigabit or 10-gigabit ethernet or InfiniBand, you have a compute cluster.

Nvidia claims GPU clusters perform at 1/10th the cost and 1/20th the power of CPU-only systems based on the latest quad-core CPUs.

Nvidia Tesla K20

Nvidia's top-of-the-line GPU

  • 1.17 teraflops peak double-precision performance (K20X=1.31; K10=0.19, very poor in double precision, avoid)
  • 3.52 teraflops peak single-precision performance (K20X=3.95, K10=4.58)
  • delivers 10 times the performance of a single CPU for 1/20th of the power (seems to be a standard claim)
  • 2,496 CUDA cores per GPU
  • GPU memory 5 GB GDDR5

ExxactCorp bundles K20 GPUs with CPUs into a “simcluster”; think oversized coffee table on wheels.

  • 4 nodes with InfiniBand
    • (need cables and Voltaire compatibility, and another Voltaire switch, model 4036, in the HP rack)
  • 2 Tesla K20 GPUs per node, total is 8 GPUs
  • 2 Westmere X5650 Xeon 2.66 GHz CPUs per node, total is 8 CPUs
  • 6x 4GB DDR3-1333 per node (24 GB/node), total is 96 GB
  • for a total of 9.36 teraflops in double-precision mode,
  • for a total of 28.16 teraflops in single-precision mode.
  • 1 or 2 TB single disk (can be more)
  • Amber preinstalled (there is a charge; it's very complicated; v12)
  • 3-year warranty
  • quote coming (see below)
  • can I plug the power cords into the HP rack PDU? (PDU to PDU? Answer is no, unless you just buy the rack servers.)

Intel Xeon Phi

Intel's new many-core platform. Intel is not going the GPU route.

  • box of cores on fire; goes into a PCI-Express 2.0 slot
  • runs its own Linux on top of the system OS
  • The first “Knights Corner” chip to ship is the Xeon Phi 5110P (passively cooled)
    • which has 60 cores and a clock speed of 1.05 GHz (really??)
    • It offers 1,011 gigaflops (1 teraflop) of peak double-precision performance
      • that's impressive at that clock speed
    • 30 MB of cache and a memory capacity of 8 GB (that's separate from system memory, I assume)
  • To match hp12's 12 GB/node availability for 8 cores you'd need
    • 90 GB of memory per 60 cores, which is equivalent to 12 GB per 8 cores
    • double that if you want a 24 GB/8 cores type of availability.

“Intel is providing software tools so applications can be written or recompiled for the Phi chips. Curley said it is easy to recompile existing x86 code so that high-performance applications can take full advantage of the multicore chips.”
Source:http://www.computerworld.com/s/article/9233498/Intel_ships_60_core_Xeon_Phi_processor?taxonomyId=162&pageNumber=2

  • So a recompile is needed anyway, but not a recode effort like for GPU-enabled software; check.
  • Need to get a quote from somewhere (AC or HP)
  • So a 4-node set of 2U servers would deliver 4 teraflops double precision, or about 50% of GPU performance.

GPU enabled software

So the problem with GPUs is that the code needs to change (just as with MPI parallel programs). But there is no generic “GPU” compiler; there is a whole new development environment and toolset (see next section). The job running on the CPU core needs to load data into the GPU cores, act on it, retrieve the results and march on, or redo the operation another bazillion times. That takes code. Matlab, in a wave-function rewrite example, shows that 85% of the original code remains untouched; still, that's a 15% redo rate.

Digging around, there are third parties that do this (it is massively complicated and dependent on driver and card hardware). Some commercial software is GPU-enabled and some open-source software supports it. (There are entire mailing lists dedicated just to compiling for GPUs.) Seems like a serious drawback.

Nvidia maintains a “Catalog of available software” which includes Amber, Lammps, Matlab, Mathematica, and Gaussian (in dev); we'll take a closer look at those next. (We might have to include a software cost component in the GPU cluster cost. Can we even buy ready-to-go software? Check.)
http://www.nvidia.com/docs/IO/123576/nv-applications-catalog-lowres.pdf

  • Application
    • Description
    • Supported features
    • Expected speed-up <- per Nvidia in-house tests
    • Multi-GPU support
    • Release status
  • Amber
    • Suite of programs to simulate molecular dynamics on biomolecules
    • PMEMD: explicit and implicit solvent (Amber v12)
    • 89.44 ns/day JAC NVE (gosh you got me)
    • Yes
    • Available now, v12

Not sure if Amber/GPU is ready to go or you must do it yourself. Amber/MPI compilation is a nightmare; this is even worse.
See http://ambermd.org/gpus/ and pay particular attention to the scheduler discussion (GPU vs CPU “job slots”).

  • “GPU accelerated PMEMD has been implemented using CUDA and thus will only run on NVIDIA GPUs at present.”
  • “the code makes use of fixed and double precision in several places” … thus a requirement
  • “plan to run PMEMD are connected to PCI-E 2.0 x16 lane slots or better” … thus a requirement
  • “experience to date shows that even up to 8 GPUs per node the performance impact is minimal”
  • “the size of the GPU memory imposes hard limits on the number of atoms supported in a simulation”
  • “There can also be differences between CPU [only] and [CPU]/GPU runs … the random number stream is different between CPUs and GPUs.”
  • “GPUDirect peer-to-peer transfers and memory access are supported natively by the CUDA Driver. All you need is CUDA Toolkit v4.0 and R270 drivers (or later)”
  • gpu-info and gpu-free scripts
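
The GPUDirect peer-to-peer capability quoted above can be probed from the CUDA runtime API; a minimal sketch (ours, not part of the Amber tooling):

#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    int n = 0;
    cudaGetDeviceCount(&n);
    for (int a = 0; a < n; a++)
        for (int b = 0; b < n; b++) {
            if (a == b) continue;
            int ok = 0;
            cudaDeviceCanAccessPeer(&ok, a, b);  /* can GPU a address GPU b's memory? */
            printf("GPU %d -> GPU %d: %s\n", a, b, ok ? "P2P capable" : "no P2P");
            /* a job would then call cudaDeviceEnablePeerAccess(b, 0) while GPU a is current */
        }
    return 0;
}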

Ah, here we go: custom-order an Amber GPU system
http://ambermd.org/gpus/recommended_hardware.htm
and http://exxactcorp.com/ ; if we go GPU we should get a quote (we could run other stuff on top of this too, I suppose; check?).
Get DLBgroup to start testing.
(Check this out: it also lists some prices) and provides a testing site to check on the speed-ups expected with your code:
http://exxactcorp.com/testdrive/AMBER/

Your mileage may vary: http://ambermd.org/gpus/benchmarks.htm … this is beyond me; it looks like double precision is needed.

  • Lammps
    • Classical molecular dynamics package
    • Lennard-Jones, Gay-Berne (latest release)
    • 3.5-15x
    • Yes
    • Available now

You can compile your own with the -cuda option (seems complicated).
Oh, another gem: a Lammps GPU cluster ready to go (+ testing): http://exxactcorp.com/testdrive/LAMMPS/
Get StarrGroup to start testing.

SideBar: Looking at the Exxact site, they build clusters for Amber, Barracuda (for weir?), Lammps, Namd, and more: http://exxactcorp.com/testdrive/
I need to get a K20 cluster quote! With pre-loaded software if we do this.
http://exxactcorp.com/index.php/solution/solu_list/12

  • Mathematica
    • A symbolic technical computing language and development environment
    • Development environment for CUDA and OpenCL
    • 2-20x
    • Yes
    • Available now

This would be a purchase. The low-end rates seem barely worth it. I believe the “GPU-enabled” Mathematica is a different software install (check).

  • Matlab
    • GPU acceleration for MATLAB (high-level Technical Computing Language)
    • Support for over 200 of the most commonly used MATLAB functions
    • 2-20x
    • Yes
    • Available now

So here is an example of commercial software that supports Nvidia GPUs:
http://www.mathworks.com/discovery/matlab-gpu.html
It is built into the Parallel Computing Toolbox (which we already have; I do not believe this is a different software install). For example, instead of operating on an array A, you recode to operate on gpuArray(A) and gather() the result back to the CPU.

  • VMD
    • Visualization and analyzing large bio-molecular systems in 3-D graphics
    • GPU acceleration for computationally demanding analysis and visualization tasks
    • 100-125x (kernels)
    • Yes
    • Available now, version 1.9.x

Now we're talking speed-up; that seems worth an effort. But we have low VMD usage.

It appears “SAS On Demand” (remote access to SAS on a virtual classroom machine for faculty and staff) is powered by an IBM BladeCenter leveraging GPU technology.
http://decisionstats.com/2010/09/16/sasbladesservers-gpu-benchmarks/

GPU Tools

Development Tools
https://developer.nvidia.com/cuda-tools-ecosystem

GPU cluster management (includes Platform LSF so identical to Lava which we use)
https://developer.nvidia.com/cluster-management

GPU Programming

GPU Programming For The Rest Of Us
http://www.clustermonkey.net/Applications/gpu-programming-for-the-rest-of-us.html
Excellent article on the GPU programming problem

“The Portland Group® (PGI), a wholly-owned subsidiary of STMicroelectronics and the leading independent supplier of compilers and tools for high-performance computing, announced at SC12 today that the PGI 2013 release of its PGI Accelerator™ compilers due out in early December will add support for the new family of NVIDIA® Tesla® K20 GPU accelerators and the CUDA® 5 parallel computing platform and programming model.”

“… and … today announced plans to extend its PGI Accelerator™ compiler technology with OpenACC to Intel® Xeon Phi™ coprocessors, Intel's family of products based on the Intel Many Integrated Core (MIC) architecture.”

Source: http://www.pgroup.com/about/news.htm#56

So that solves some big items (one compiler technology targets both the GPU and the Phi coprocessor):

  • No prior experience with OpenCL, CUDA or other low-level programming models required. Ideal for domain experts.
  • Well-designed algorithms that would perform well when written using a low-level programming model can perform just as well using directives.
  • No separate co-processor source code required. Compile the same program for multi-core CPUs using PGI or any other standard-compliant compiler.
  • Supports GPU accelerators and co-processors from multiple vendors.
  • Developers can port and tune parts of their application as time permits. No wholesale rewrite required.
  • Most developers see results with modest effort.
  • (Intel is not doing anything on the GPU side, only on the coprocessor side)

What the compiler does is insert pragmas (compiler directives) that are executed if a GPU (or Phi coprocessor) is detected; otherwise the code just executes as normal. It looks like:

!$acc region
   !$acc do parallel
   do j=1,m
    ...
   end do
!$acc end region
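
The same idea in C, using an OpenACC directive (our sketch; PGI's OpenACC support is mentioned above). Compiled without an accelerator target, the pragma is simply ignored and the loop runs on the CPU:

#include <stdio.h>

#define M 1000000

static float a[M];

int main(void)
{
    /* offloaded to the GPU when one is present, a plain loop otherwise */
    #pragma acc parallel loop
    for (int j = 0; j < M; j++)
        a[j] = 2.0f * (float)j;
    printf("a[10] = %f\n", a[10]);
    return 0;
}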

Phi Programming

Here is a very detailed article with excerpts below
http://www.theregister.co.uk/2012/11/12/intel_xeon_phi_coprocessor_launch/

  • “Many Integrated Core” architecture (MIC)
  • generally available on January 28 to the rest of us for $2,649
  • Xeon Phi chips run Linux and support OpenMP multiprocessing as well as the Message Passing Interface (MPI) protocol
  • Intel's C, C++, and Fortran compilers in its Parallel Studio XE set, as well as the Cluster Studio XE extensions, work on Xeon Phi chips.
  • in a server with two Xeon E5-2670 processors, adding a single Xeon Phi card can boost the performance of various HPC workloads by a factor of 2.2 to 2.9

“You add parallel directives to the code, and you compile the code to run on both x86 chips in standalone mode and on the x86-Xeon Phi combination. You get one set of compiled code, and if the Xeon Phi chips are present, the work is offloaded from the server CPUs to the x86 coprocessors, and they do the acceleration. If not, the CPUs in the cluster or the workstation do the math.”

That explains it in detail. So this is the same approach as the Portland GPU compiler: one compiled binary. But potentially a lower performance boost than the GPU approach.

The Scheduler Problem

While anticipating a hardware acquisition to delve into the world of GPU computing, I'm trying to abstract out the perceived scheduler problem. For the sake of this discussion let's assume one 4U rack with 2 CPUs and 4 GPUs. Let's also assume the CPUs are Intel E5-2660, 8 cores per CPU at 2.0 GHz (total of 16 CPU cores; “Romley” series, so each CPU can see all GPUs, which seems to be the current trend). Assume the GPUs are Nvidia Tesla K20s (roughly 2,500 CUDA cores per GPU). We have 4 GB DDR3 memory per GPU, a total of 16 GB, and need at least that much for the CPUs, but we'll load the board with 64 GB (more on that later).

To generalize: compute jobs that run on the CPU cores only, in serial or parallel fashion, we'll call cpuHPC. Jobs that run on the GPU cores we'll call gpuHPC. From a scheduler perspective we then have 16 job slots on the cpuHPC side and 4 perceived job slots on the gpuHPC side. But a gpuHPC job, serial or parallel, runs its serial code on a cpuHPC core and offloads any parallel computations (like loops) to the gpuHPC side (serial or parallel). Thus at least 4 of the cpuHPC job slots need to be reserved for the gpuHPC, more if you will allow many small gpuHPC jobs, but there are more complications.

In a simple setup you would not allow any non-gpuHPC jobs on this hardware. Even then the scheduler needs to be aware of whether GPUs are idle or busy. Nvidia's CUDA SDK provides tools for this, like deviceQuery, which returns gpuIDs (0-3 in our case). The cpuIDs are provided by the OS (see /proc/cpuinfo, 0-1 in our example). There is a script (gpu-info) that returns the gpuIDs and % utilization. One approach then is to set CUDA_VISIBLE_DEVICES on the node where the gpuHPC job will run and expose only idle GPUs (script gpu-free; see the sketch below). Source: http://ambermd.org/gpus/#Running. So in this case the scheduler can only submit up to 4 serial gpuHPC jobs, allowing each gpuHPC job to claim one idle GPU. Fewer if a parallel MPI gpuHPC job runs across multiple GPUs.
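
Here is a sketch of what a gpu-free style helper could do (the idle test and output format are our assumptions, not the actual ambermd.org script): query per-GPU utilization via NVML and print the IDs of idle GPUs in a form suitable for CUDA_VISIBLE_DEVICES.

#include <stdio.h>
#include <nvml.h>  /* link with -lnvidia-ml */

int main(void)
{
    unsigned int i, n;
    if (nvmlInit() != NVML_SUCCESS) return 1;
    nvmlDeviceGetCount(&n);
    for (i = 0; i < n; i++) {
        nvmlDevice_t dev;
        nvmlUtilization_t util;
        nvmlDeviceGetHandleByIndex(i, &dev);
        nvmlDeviceGetUtilizationRates(dev, &util);
        if (util.gpu == 0)        /* treat 0% utilization as idle */
            printf("%u,", i);     /* e.g. "1,3," -> CUDA_VISIBLE_DEVICES=1,3 */
    }
    printf("\n");
    nvmlShutdown();
    return 0;
}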

So we need to build some functionality into the job submission process that exposes idle GPUs. But how could we do that before submission? In other words, let the jobs PEND until an idle GPU is available? We need to build an “elim” script that reports the number of available GPUs (an integer) and define this Lava resource (lsf.shared). Submit scripts then need to request this resource and quantity; see the sketch after this paragraph. That's perhaps a start, and efficient, if each gpuHPC job used, say, 99% of a GPU's computational power.
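
A minimal elim sketch along those lines, assuming a numeric resource we might call ngpus (the name, interval, and idle test are ours): an elim binary repeatedly writes "<number of resources> <name> <value>" to stdout, and once ngpus is defined in lsf.shared a submit script could request it with something like bsub -R "rusage[ngpus=1]".

#include <stdio.h>
#include <unistd.h>
#include <cuda_runtime.h>

/* placeholder: count all visible GPUs; a real elim would subtract
   busy ones, e.g. with the NVML utilization check sketched above */
static int free_gpus(void)
{
    int n = 0;
    cudaGetDeviceCount(&n);
    return n;
}

int main(void)
{
    for (;;) {
        printf("1 ngpus %d\n", free_gpus());  /* one resource reported */
        fflush(stdout);                       /* the scheduler reads each report */
        sleep(30);
    }
    return 0;
}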

It gets way more complicated if that is not the case. Let's assume a gpuHPC parallel MPI job does not fit inside one GPU but needs 4 GPUs, and at peak time utilization is 60% per GPU. (In this scenario no other gpuHPC parallel MPI jobs should be submitted, only serial gpuHPC jobs that can use up to 40% of the computational power of the, now busy, GPUs.) Complicated. First, the gpuHPC job needing all 4 GPUs needs to make sure 4 are available and idle; when submission occurs, 4 CPU cores need to be allocated (it does not matter which ones in a one-node scenario) and each CPU core gets a unique gpuID assigned. We can do that with the scripts above. Second, the MPI thread count is 4 (-n 4), but we need to make sure they land on one host (span[hosts=1]) and presumably reserve some memory on the CPU side (rusage[mem=…]). That can all be done in the one-node scenario. In a multiple-node setting the hosts=1 constrains the idle GPUs to be on one node.

We could also design a smarter “elim” resource and have % utilization reported (0-100) and use that. (Would we know how much of the GPU's computational power we need?) How would one keep a second (parallel MPI) gpuHPC job from being submitted after one is already running? Maybe another elim resource that looks at the presence of a lock file somewhere (0/1) created by a pre-exec script scanning for mpirun invocations?

And what happens when you have more than one node? Can a hosts=2 setting with a request for 8 GPUs work with the GPU-enabled MPI binary? Don't know; we'll see.

Regarding the memory: if one works along the one-gpuHPC-job-per-GPU model, then the hardware has a maximum of 4 simultaneous jobs in this example. That leaves 12 cpuHPC cores idle, and one can envision setting up another queue for those. So for a minimum memory footprint I'd suggest 16x4GB, but if you can afford it, double that.

Start simple.

(As posted on the openlava.org forum, confirming some things)

Things To Track

  • power/cooling requirements, connections (L6-30 or rack PDU), and rack U needed
  • Voltaire compatibility (and expansion), or HCA PCIe2 x16 slot 1 (full-height, full-width riser) to cabinet
  • Gaussian: in dev but K20/K20X compliant; what about CUDA v4.2 vs 5.0?
  • local storage on the GPU HPC?
  • Lava configuration (CPU-core to GPU ratios)
  • For the final quote, add spare parts like HCA cards and (long) IB cables

TechDocs

Mantra

Had this thought during my dental cleaning … which I think may be useful in framing our decision:

  • Go for the GPU HPC solution (because of the teraflops boost)
  • And if it does not work out, be content with the CPU HPC that's left (# of cores, RAM GB/core, etc.)

Or

  • buy a single rack and test locally; start small (will future racks be compatible?)

Yale Qs

Tasked with getting GPU HPC going at Wesleyan and trying to gain insights into the project. If you acquired a GPU HPC …

  • What was the most important design element of the cluster?
  • What factor(s) settled the CPU to GPU ratio?
  • Was either, or neither, single or double precision peak performance more/less important?
  • What was the software suite in mind (commercial, open source, or custom code GPU “enabled”)?
  • How did you reach out/educate users on the aspects of GPU computing?
  • What was the impact on the users? (recoding, recompiling)
  • Was the expected computational speed up realized?
  • Were the PGI Accelerator compilers leveraged? If so, what were the results?
  • Do users compile with nvcc?
  • Does the scheduler have a resource for idle GPUs so they can be reserved?
  • How are the GPUs exposed/assigned to jobs the scheduler submits?
  • Do you allow multiple serial jobs to access the same GPU? Or one parallel job to use multiple GPUs?
  • Can parallel jobs access multiple GPUs across nodes?
  • Any experiences with pmemd.cuda.MPI (part of Amber)?
  • What MPI flavor is used most in regards to GPU computing?
  • Do you leverage the CPU HPC of the GPU HPC? For example, if there are 16 GPUs and 64 CPU cores on a cluster, do you allow 48 standard jobs on the idle cores? (assuming the max of 16 serial GPU jobs)

ConfCall & Quote: AC

09nov12:

  • /home and /apps are mounted on the CPU side. How does the GPU access these? Or is the job on the CPU responsible for this?
  • Single versus double precision? Both needed, I assume.
  • The unit above is the Nvidia “Fermi” series, being phased out. The “Kepler” K10 and K20 series are coming out. Get an earlybird unit; Jim will find out.
  • Lava compatibility (almost certain, but need to check); AC uses SGE.
  • We do not really “know” if our current jobs would experience a boost in speed (hence one unit first - but there is a software problem here)
  • Intel Xeon Phi co-processors: Intel compilers will work on this platform (which is huge!) and no programming learning curve. (HP ProLiant servers with 50+ cores.) Jim will find out.
  • Vendor states the scheduler sees GPUs directly (but how does it then get access to home dirs? check this out) … update: this is not true; the CPU job offloads to the GPU.

AC Quote

General: 2 CPUs (16 cores), 3 GPUs (7,500 CUDA cores), 32 GB RAM/node
Head Node: none
Nodes: 1x 4U rackmountable chassis; 2x Xeon E5-2660 2.20 GHz 20MB cache, 8 cores (16 cores/node), Romley series
  8x 4GB 240-pin DDR3 1600 MHz memory (32 GB/node, 11 GB/GPU, max 256 GB)
  1x 120GB SATA 2.5" solid-state drive (OS drive), 7x 3TB 7200 RPM
  3x NVIDIA Tesla K20 8 GB GPUs (3/node), 1 CPU to 1.5 GPU ratio
  2x 10/100/1000 NIC, 3x PCIe 3.0 x16 slots
  1x ConnectX-3 VPI adapter card, single-port 56Gb/s
  2x 1620W redundant power supplies
Network: 1x 36-port InfiniBand FDR (56Gb/s) switch & 4x ConnectX-3 single-port FDR (56Gb/s) IB adapters + 2x 2-meter cables (should be 4)
Power: rack power ready
Software: none
Warranty: 3-year parts and labor (AC technical support)
GPU Teraflops: 3.51 double, 10.56 single
Quote: $33,067.43 (S&H included); arrived
  • In order to match the “benchmark option” we need 5 units
    • 8,100 Watts; would still fit power-wise but not rack-wise (we'd need 20U)
  • Single rack, 21 TB of disk space (RAID 5/6)
  • The IB switch (plus 4 spare cards/cables) is roughly 1/3rd of the price
    • If we remove it, we need QDR Voltaire-compliant HCAs and cables (3 ports free)
  • The config does not pack as many teraflops for the dollars; we'll see

ConfCall & Quote: EC

12nov12:

  • GPU hardware only
  • the scheduler never sees GPUs, just CPUs
  • CPU to GPU is one-to-one when using Westmere chips
  • Bright cluster management (image-based); we can front-end with Lava
  • what's the memory connection CPU/GPU???
  • home dirs: cascade via Voltaire 4036; need to make sure this is compatible!
  • software on local disk? home dirs via InfiniBand IPoIB, yes, but self-install
  • Amber (charge for this) and Lammps preinstalled; must be no problem, will be confirmed
  • 2 K20s per 2 CPUs per rack, 900-1000W; 1200W power supply on each node
  • PDU on simcluster; each node has a power connection
  • quote coming for a 4-node simcluster
  • testing periods can be staged so you are testing exactly what we're buying, if the simcluster is within budget (see K20 above)

EC Quote

General: 8 CPUs (64 cores), 16 GPUs (40,000 CUDA cores), 64 GB RAM/node, plus head node
Head Node: 1x 1U rackmount system; 2x Xeon E5-2660 2.20 GHz 20MB cache, 8 cores
  8x 8GB 240-pin DDR3 1600 MHz ECC (max 256 GB); 2x 10/100/1000 NIC; 2x PCIe x16 full
  2x 2TB 7200 RPM (can hold 10); ConnectX-2 VPI adapter card, single-port, QDR 40Gb/s
  600W power supply
Nodes: 4x 2U rackmountable chassis; 8x Xeon E5-2660 2.20 GHz 20MB cache, 8 cores (16/node), Romley series
  32x 8GB 240-pin DDR3 1600 MHz (64 GB/node memory, 16 GB/GPU, max 256 GB)
  4x 1TB 7200 RPM; 16x NVIDIA Tesla K20 8 GB GPUs (4/node), 1 CPU to 2 GPU ratio
  2x 10/100/1000 NIC, dedicated IPMI port, 4x PCIe 3.0 x16 slots
  4x ConnectX-2 VPI adapter card, single-port, QDR 40Gb/s
  4x 1800W redundant power supplies
Network: 1x Mellanox InfiniBand QDR switch (8 ports) & HCAs (single port) + 7' cables
  1x 1U 16-port rackmount switch, 10/100/1000, unmanaged (+ 7' cables)
Power: 2x PDU, basic, 1U, 30A, 208V, (10) C13; requires 1x L6-30 power outlet per PDU
Software: CentOS, Bright Cluster Management (1 year support)
  Amber12 (cluster install), Lammps (shared filesystem), (Barracuda for weirlab?)
Warranty: 3-year parts and labor (EC technical support?)
GPU Teraflops: 18.72 double, 56.32 single
Quote: $93,600 + S&H; arrived
  • Let's make this the “benchmark option” based on double precision
  • In order to match this with Xeon Phis we'd need 18 of them (probably 5 4U trays)
  • This is the (newest) simcluster design (that can be tested starting Jan 2013)
    • 24U cabinet
  • We could deprecate 50% of the bss24 queue, freeing two L6-30 connectors
  • Spare parts:
    • Add another HCA card to greentail and connect to the Mellanox switch (long cable)
      • also isolates GPU traffic from the other clusters
    • 1 8-port switch, 4 HCA cards, 4 long cables (for petal/swallow tails plus a spare)
  • New head node
    • First let EC install Bright/OpenLava (64 CPU cores implies 64 job slots)
      • 16 GPUs implies 16x2,500 or 40,000 CUDA cores (625 per job slot on average)
    • Use as a standalone cluster or move the GPU queue to greentail
    • If so, turn this head node into a 16-job-slot RAM-heavy compute node?
      • 256-512 GB (order?)
      • add local storage? (up to 10 1-or-2 TB disks)
  • Compute nodes
    • add local storage? (up to 10 1-or-2 TB disks)
  • Bright supports OpenLava and GPU monitoring (get this installed)
  • EC software install
    • sander, sander.MPI, pmemd, pmemd.cuda (single-GPU version), pmemd.cuda.MPI (the multi-GPU version)
    • NVIDIA Toolkit v4.2; please note that v5.0 is NOT currently supported
    • MVAPICH2 v1.8 or later / MPICH2 v1.4p1 or later recommended; OpenMPI is NOT recommended
    • make sure they do not clean the source; analyze how they compiled
    • which compiler will they use? which MPI? (we prefer OpenMPI and have a wrapper script for that)

ConfCall & Quote: HP

HP 19nov12: meeting notes

  • HP ProLiant SL270s Generation 8 (Gen8); 4U half-width with 2 CPUs + 8 (max) GPUs
    • The s6500 chassis is a 4U tray holding two SL270s servers
  • max 8 GPUs (20,000 CUDA cores) + 2 CPUs (total 16 cores), dual drives, 256 GB max
    • K20 availability will be confirmed by Charlie
  • power
    • Charlie will crunch the numbers on the existing HPC and assess if we can use the current rack
    • otherwise a standalone half-rack solution
  • one IB cable to Voltaire per chassis? get a new FDR InfiniBand switch, period.
    • connect greentail with an additional HCA card, or Voltaire to Voltaire?
  • our software compilation problem is huge
    • but they have great connections with Nvidia for compilation help (how to qualify that?)
  • CMU for GPU monitoring, 3D rendering of what the GPU is doing
  • This SL270s can also support up to 8 Xeon Phi coprocessors
    • but expect very lengthy delays; Intel is not ready for delivery (1 Phi = 1 double-precision teraflop)

HP Quote

http://h18004.www1.hp.com/products/quickspecs/14405_div/14405_div.HTML

  • First unit: single tray in chassis
  • This hardware can be tested at ExxactCorp, so a single-tray purchase for testing is not a requirement
  • 2 chassis in 8U + 4x SL270s, each with 8 GPUs, would be a massive GPU cruncher
    • 8 CPUs, 32 GPUs = 64 CPU cores and 80,000 CUDA cores (avg 1,250 CUDA cores per CPU core)
    • peak performance: 37.44 double, 112.64 single precision (twice the “benchmark option”)
  • 1 chassis in 4U + 2x SL270s, each with 8 GPUs, would be the “benchmark option”
General: 6 CPUs (total 48 cores), 18 GPUs (45,000 CUDA cores), 64 GB RAM/node, no head node
Head Node: none
Chassis: 2x s6500 chassis (4U), each holding 2 half-width SL250s (Gen8) servers; rackmounted; 4x 1200W power supplies; 1x 4U rack blank
Nodes: 3x SL250s (Gen8); 3x 2x Xeon E5-2650 2.0 GHz 20MB cache, 8 cores (total 16 cores/node), Romley series
  3x 16x 8GB 240-pin DDR3 1600 MHz (64 GB/node, 10+ GB/GPU, max 256 GB)
  3x 2x 500GB 7200 RPM; 3x 6x NVIDIA Tesla K20 5 GB GPUs (6 GPUs/node), 1 CPU to 3 GPU ratio
  3x 2x 10/100/1000 NIC, dedicated IPMI port; 3x 8x PCIe 3.0 x16 slots (GPU), 3x 2x PCIe 3.0 x8
  3x 2x IB interconnect, QDR 40Gb/s; FlexibleLOM goes into a PCIe3 x8 slot
  chassis-supplied power; 3x 1x one PDU power cord (416151-B21)? - see below
Network: 1x Voltaire QDR 36-port InfiniBand 40Gb/s switch + 6x 5M QSFP IB cables
  no ethernet switch; 17x 7' CAT5 RJ45 cables
Power: rack PDU ready; what is "1x HP 40A HV Core Only Corded PDU"???
Software: RHEL, CMU GPU-enabled (1 year support) - not on quote???
Warranty: 3-year parts and labor (HP technical support?)
GPU Teraflops: 21.06 double, 63.36 single
Quote: $128,370; for a 1x s6500 + 2x SL250s setup the estimate is $95,170; arrived (S&H and insurance?)
  • To compare with the “benchmark option” price-wise: 37% higher (25% fewer CPU cores)
  • To compare with the “benchmark option” performance-wise: 12.5% higher (double-precision peak)
  • When the quote is reduced to 1x s6500 chassis and 2x SL250s:
    • To compare with the “benchmark option” price-wise: 1.6% higher (50% fewer CPU cores)
    • To compare with the “benchmark option” performance-wise: 25% lower (double-precision peak)
  • HP on-site install
  • we have 9U in the HP rack available (1U for the new switch)
    • L6-30: 7,500 Watts x 3 PDUs (non-UPS) = 22,500 Watts; HP cluster 10,600 Watts
    • leaves 11,898 Watts, which should be sufficient for 4 SL270s (redundant power supplies)
  • a new InfiniBand switch isolates GPU cluster traffic from the rest of the HPC
    • a 36-port IB switch is overkill
    • still need an IB connection from greentail to the new switch (home dirs via IPoIB)
  • 1 TB local storage per node
  • our software install problem remains; so is the 12.5% worth it? (with 3 trays)

ConfCall & Quote: AX

  • Cluster management is ROCKS (we'll pass)
  • No scheduler (that's OK, we'll use OpenLava)
  • They do not install software, only the operating system and CUDA driver setup and installation

AX Quote

http://www.amax.com/hpc/productdetail.asp?product_id=simcluster (Fremont, CA)

General: 8 CPUs (48 cores), 12 GPUs (30,000 CUDA cores), 64 GB RAM/node, plus head node
Head Node: 1x 1U rackmount system; 2x Intel Xeon E5-2620 2.0 GHz (12 cores total)
  64GB DDR3 1333 MHz (max 256 GB); 2x 10/100/1000 NIC; 2x PCIe x16 full
  2x 1TB (RAID 1) 7200 RPM; InfiniBand adapter card, single-port, QSFP 40Gb/s
  ???W power supply; CentOS
Nodes: 4x 1U; 4x 2x Intel Xeon E5-2650 2.0 GHz, with 6 cores (12 cores/node), Romley series
  4x 96GB 240-pin DDR3 1600 MHz (96 GB/node memory, 8 GB/GPU, max 256 GB)
  4x 1TB 7200 RPM; 12x NVIDIA Tesla K20 8 GB GPUs (3/node), 1 CPU to 1.5 GPU ratio
  2x 10/100/1000 NIC, dedicated IPMI port, 4x PCIe 3.0 x16 slots
  4x InfiniBand adapter card, single-port, QSFP 40Gb/s
  4x ??00W redundant power supplies
Network: 1x InfiniBand switch (18 ports) & HCAs (single port) + ?' cables
  1x 1U 24-port rackmount switch, 10/100/1000, unmanaged (+ ?' cables)
Power: there are 3 rack PDUs? What are the connectors, L6-30?
Software: CUDA only
Warranty: 3-year parts and labor (AX technical support?)
GPU Teraflops: 14.04 double, 42.96 single
Quote: $73,965 (S&H $800 included); arrived
  • 22U cabinet
  • Insurance during shipping is our problem (non-returnable)
  • To compare with the “benchmark option” price-wise: 21% lower (25% fewer CPU cores)
  • To compare with the “benchmark option” performance-wise: 22% lower (double-precision peak)
  • If we go with turnkey systems, having the software installed is huge

ConfCall & Quote: MW

  • sells both individual racks and turn-key systems
    • racks are 4U with 2 CPUs and 8 GPUs, 2,200 Watts, K20X GPUs
    • turn-key units are per customer specifications
  • they will install all software components (if license keys are provided)
    • includes CUDA drivers and setup, Amber (pmemd.cuda & pmemd.cuda.MPI, check) and Lammps
    • but also Matlab and Mathematica if needed (wow!)
  • standard 2-year warranty though (no biggie)

MW Quote

http://www.microway.com/tesla/clusters.html (Plymouth, MA)

General: 8 CPUs (64 cores), 16 GPUs (40,000 CUDA cores), 32 GB RAM/node, plus head node
Head Node: 1x 2U rackmount system; 2x Xeon E5-2650 2.0 GHz 20MB cache, 8 cores
  8x 4GB 240-pin DDR3 1600 MHz ECC (max 512 GB); 2x 10/100/1000 NIC; 3x PCIe x16 full, 3x PCIe x8
  2x 1TB 7200 RPM (RAID 1) + 6x 2TB (RAID 6), Areca RAID controller
  low-profile graphics card; ConnectX-3 VPI adapter card, single-port, FDR 56Gb/s
  740W power supply, 1+1 redundant
Nodes: 4x 1U rackmountable chassis; 4x 2x Xeon E5-2650 2.0 GHz 20MB cache, 8 cores (16/node), Sandy Bridge series
  4x 8x 4GB 240-pin DDR3 1600 MHz (32 GB/node memory, 8 GB/GPU, max 256 GB)
  4x 1x 120GB SSD; 4x 4x NVIDIA Tesla K20 5 GB GPUs (4/node), 1 CPU to 2 GPU ratio
  2x 10/100/1000 NIC, dedicated IPMI port, 4x PCIe 3.0 x16 slots
  4x ConnectX-3 VPI adapter card, single-port, FDR 56Gb/s
  4x 1800W (non-)redundant power supplies
Network: 1x Mellanox InfiniBand FDR switch (36 ports) & HCAs (single port) + 3m FDR cable to the existing Voltaire switch
  1x 1U 48-port rackmount switch, 10/100/1000, unmanaged (cables)
Power: 2x PDU, basic rack, 30A, 208V; requires 1x L6-30 power outlet per PDU (NEMA L6-30P)
Software: CentOS, Bright Cluster Management (1 year support), MVAPICH, OpenMPI, CUDA 5
  scheduler and GNU compilers installed and configured
  Amber12, Lammps, Barracuda (for weirlab?), and others if desired … bought through MW
Warranty: 3-year parts and labor (lifetime technical support)
GPU Teraflops: 18.72 double, 56.32 single
Quote: estimated at $95,800; arrived; includes S&H and insurance
Upgrades: cluster pre-installation service
  5x 2x E5-2660 2.20 GHz 8-core CPUs
  5x upgrade to 64 GB per node
  • At full load: 5,900 Watts and 20,131 BTUs/hour
  • 2% more expensive than the “benchmark option” (as described above, with the upgrades), else identical
    • But a new rack (advantageous for the data center)
    • With lifetime technical support
    • solid-state drives on the compute nodes
    • 12 TB local storage

Then

  • replace the 36-port FDR switch with an 8-port QDR switch for savings (40 vs 56 Gbps)
    • and all server adapter cards to QDR (with one hooked up to the existing Voltaire switch)
  • Expand the memory footprint
    • Go to 124 GB memory/node to beef up the CPU HPC side of things
    • 16 CPU cores/node minus 4 CPU-for-GPU cores/node = 12 CPU cores using 104 GB, which is about 8 GB/CPU core
  • Online testing available (K20; do this)
    • then decide on the PGI compiler at purchase time
    • maybe all the LAPACK libraries too
  • Make the head node a compute node (in/for the future, and beef it up too: 256 GB RAM?)
  • Leave the 6x 2TB disk space (for backup)
    • 2U, 8 drives, up to 6x4 = 24 TB, possible?
  • Add an entry-level InfiniBand/Lustre solution
    • for parallel file locking
  • Spare parts
    • 8-port switch, HCAs and cables, drives …
    • or get 5 years total warranty

