Done! — Meij, Henk 2014/02/21 09:54
Soon (Feb 2014) we'll have to power down the Dell racks, grab one of the L6-30 circuits supplying power to those racks, and use it to power up the new Microway servers.
That leaves some spare L6-30 circuits (the Dell racks use 4 each), so we could contemplate grabbing two of them and powering up two more shelves of the Blue Sky Studio hardware. That would double the Hadoop cluster and the bss24 queue when needed (a total of 100 job slots), and offer access to 1.2 TB of memory. This hardware is generally powered off when not in use.
The new Microway hardware is identical to the GPU-HPC hardware we bought previously, minus the GPUs. A total of 8 1U servers will offer 256 GB of memory each plus a fast local disk, hence the "fd" in the mw256fd queue name. It is to be used just like ehwfd (and bss24). /home and /sanscratch are served via IPoIB.
Queues: the per-node job slot limit on mw256fd is higher than the 28 on mw256, because on mw256 the GPUs also need access to cores (4 per node for now). For now, it may be that the max will be set to 8 if too many jobs grab too many job slots. You should benchmark your job to understand what is optimal.
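If you are unsure what the current limits are at any moment, the standard LSF query commands will show them, for example:

  bqueues -l mw256fd    # detailed queue settings, including job slot limits
  bhosts                # per-host MAX slots and how many are in use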
#BSUB -R "rusage[mem=X]"
Gaussian: use #BSUB -n X (where X is equal to or less than the max jobs per node) together with #BSUB -R "span[hosts=1]" so that all slots land on a single node.
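Putting those two directives together, a Gaussian submission might look roughly like the sketch below; the slot count, memory reservation, file names and the bare g09 invocation are illustrative, and the environment setup is site specific:

  #!/bin/bash
  #BSUB -q mw256fd
  #BSUB -n 8                     # equal to or less than the max jobs per node
  #BSUB -R "span[hosts=1]"       # keep all slots on one node
  #BSUB -R "rusage[mem=16384]"   # illustrative memory reservation
  #BSUB -o g09.out.%J
  #BSUB -e g09.err.%J
  # module loads / environment setup are site specific
  g09 < input.com > output.log

Keep %NProcShared in the Gaussian input file consistent with the -n value you request.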
MPI: on mw256fd, just like on hp12 or imw and mw256, you may run either flavor of MPI with the appropriate binaries. On mwgpu you must use MVApich2 when running the GPU-enabled software (Amber, Gromacs, Lammps, Namd).
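As a rough sketch of an MPI submission (the binary name is hypothetical, and how mpirun is invoked, including any wrapper scripts or module loads, depends on which MPI flavor your binaries were built against):

  #!/bin/bash
  #BSUB -q mw256fd
  #BSUB -n 32                    # total MPI ranks; may span several nodes
  #BSUB -o mpi.out.%J
  #BSUB -e mpi.err.%J
  # assumes a plain mpirun is on the PATH; adjust for your MPI flavor
  mpirun -np $LSB_DJOB_NUMPROC ./my_mpi_program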
Scratch: the mw256fd nodes sport a 15K hard disk and /localscratch is 175 GB (replacing the ehwfd functionality).
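A minimal sketch of using that local disk from inside a job, assuming you create and clean up your own per-job directory under /localscratch (file and program names are illustrative):

  # stage data to the node's fast local disk
  MYSCRATCH=/localscratch/$LSB_JOBID     # $LSB_JOBID is set by LSF
  mkdir -p $MYSCRATCH
  cp big_input.dat $MYSCRATCH/           # stage input in
  cd $MYSCRATCH
  ./my_program big_input.dat             # hypothetical executable
  cp results.dat $LS_SUBCWD/             # copy results back to the submission directory
  rm -rf $MYSCRATCH                      # clean up the local disk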
Savings:

Workshop: a workshop on the new mw256fd queue will be held once it has been deployed: Feb 26th, ST 509a, 4-5 PM.
There is a significant need to run many, many programs that require very little memory (on the order of 1-5 MB). When such programs run they each consume a job slot. When many such programs consume many job slots, like on the large servers in the mw256 or mw256fd queues, lots of memory remains idle and inaccessible to other programs.
So we could enable hyperthreading on the nodes of the hp12 queue and double the job slots (from 256 to 512). Testing reveals that when hyperthreading is on, the nodes present 16 cores.
So it appears that we could turn hyperthreading on and, despite the nodes presenting 16 cores, limit the number of jobs per node to 8 until the need arises to run many small jobs, and then reset the limit to 16.
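For reference, the per-node cap being discussed would normally live in LSF's lsb.hosts file (the MXJ column); a hypothetical fragment with a placeholder host name:

  # lsb.hosts fragment: cap a hyperthreaded hp12 node at 8 job slots
  Begin Host
  HOST_NAME   MXJ   r1m   pg   ls   tmp   DISPATCH_WINDOW
  n99         8     ()    ()   ()   ()    ()
  End Host

After editing lsb.hosts, badmin reconfig makes the new limit take effect; raising MXJ from 8 to 16 later would be the "reset the limit" step mentioned above.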