Done! — Meij, Henk 2014/02/21 09:54
Soon (Feb/2014) we'll have to power down the Dell racks, grab one of the L6-30 circuits supplying power to those racks, and use it to power up the new Microway servers.
That leaves some spare L6-30 circuits (the Dell racks use 4 each), so we could contemplate grabbing two and powering up two more shelves of the Blue Sky Studio hardware. That would double the Hadoop cluster and the bss24 queue when needed (total of 100 job slots), and offer access to 1.2 TB of memory. This hardware is generally powered off when not in use.
The new Microway hardware is identical to the GPU-HPC hardware we bought previously minus the GPUs. A total of 8 1U servers will offer
mw256fd queue name. It is to be used just like
/sanscratch are served IPoIB.
mw256fd but 28 on
mw256 because the GPUs also need access to cores (4 per node for now) … for now, it may be that the max is set to 8 if too many jobs grab too many job slots. You should benchmark your job to understand what is optimal.
#BSUB -R "rusage[mem=X]"
#BSUB -n X (where X is equal to or less than the max jobs per node)
#BSUB -R "span[hosts=1]"
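As a minimal sketch of how the directives above fit together, the shell function below prints a complete job script with the placeholders filled in. The function name make_job_script, the job name, the output file names and the program ./my_program are hypothetical, and the unit of the mem value depends on how LSF is configured at the site, so treat the numbers as illustration only.

```shell
#!/bin/bash
# Sketch: print an LSF job script that keeps all requested slots on one host.
# The arguments are placeholders chosen by the caller, not site defaults.
make_job_script() {
    local queue="$1"   # queue name, for example mw256fd
    local slots="$2"   # job slots; keep at or below the max jobs per node
    local mem="$3"     # memory to reserve; unit depends on the LSF configuration

    cat <<EOF
#!/bin/bash
#BSUB -q ${queue}
#BSUB -n ${slots}
#BSUB -R "rusage[mem=${mem}]"
#BSUB -R "span[hosts=1]"
#BSUB -J example_job
#BSUB -o out.%J
#BSUB -e err.%J

# Replace with the real program; my_program is a hypothetical name.
./my_program
EOF
}

# Print a script that asks for 4 slots on a single host.
make_job_script mw256fd 4 1024
```

Redirect the output to a file (say job.sh) and submit it with bsub < job.sh. Benchmark with a few values of -n before settling on one.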
mw256 you may run either flavor of MPI with the appropriate binaries.
mwgpu you must use MVApich2 when running the GPU enabled software (Amber, Gromacs, Lammps, Namd).
mw256fd sport a 15K hard disk and /localscratch is 175 GB (replacing the
mw256fd has been deployed. Feb 26th ST 509a 4-5 PM.
There is a significant need to run many, many programs that require very little memory (on the order of 1-5 MB). Each such program consumes a job slot when it runs. When many of them consume many job slots, like on the large servers in the mw256fd queues, lots of memory remains idle and inaccessible to other programs.
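To put a rough number on the idle memory, here is a small sketch. The node size (256 GB), the slot count (28) and the per-job footprint (5 MB) are assumptions picked for illustration, and idle_memory_mb is a hypothetical helper name.

```shell
#!/bin/bash
# Sketch: memory left unused on a node once every job slot holds a tiny job.
# Usage: idle_memory_mb NODE_MEM_MB SLOTS MB_PER_JOB
idle_memory_mb() {
    local node_mem_mb="$1" slots="$2" mb_per_job="$3"
    echo $(( node_mem_mb - slots * mb_per_job ))
}

# Assumed node: 256 GB of memory, 28 job slots, 5 MB per job.
idle_memory_mb $(( 256 * 1024 )) 28 5   # prints 262004
```

With these assumed figures practically all of the node's memory sits idle while all of its job slots are taken.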
So we could enable hyperthreading on the nodes of the hp12 queue and double the job slots (from 256 to 512). Testing reveals that when hyperthreading is on
So it appears that we could turn hyperthreading on and, although the nodes would then present 16 cores, limit the number of jobs per node to 8 until the need arises to run many small jobs, then raise the limit to 16.
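The slot arithmetic behind this can be sketched as follows. The node count of 32 is an assumption, inferred from 256 job slots at 8 jobs per node, and total_slots is a hypothetical helper name.

```shell
#!/bin/bash
# Sketch: total job slots in a queue for a given per-node job limit.
# Usage: total_slots NODES PER_NODE_LIMIT
total_slots() {
    local nodes="$1" per_node_limit="$2"
    echo $(( nodes * per_node_limit ))
}

nodes=32   # assumed: 256 job slots / 8 jobs per node
echo "limit 8:  $(total_slots "$nodes" 8) job slots"    # 256
echo "limit 16: $(total_slots "$nodes" 16) job slots"   # 512
```

Only the per-node limit changes between the two cases; hyperthreading itself can stay on in both.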