GTX 1080 Ti
We have enterprise-level Tesla K20 GPU (graphical processing unit) compute nodes: four GPUs per node for a total of 20 K20s, capable of roughly 23 total teraflops (floating point, double precision). $100K in 2013.
We have received a donation for a special-purpose research project and have acquired a 1U rack-mounted compute node which will hold 4 GTX 1080 Ti GPUs (with 128 GB of CPU memory and two E5-2650v4 12-core chips). $8K in 2017.
- GTX 4-gpu node: About 44 FP32 and 1.5 FP64 teraflops in 1U. $8,200 in 2017.
- K20 4-gpu node: About 14 FP32 and 4.7 FP64 teraflops in 2U. $20,000 in 2013.
We will disable Persistence and Exclusive Compute modes on our K20 GPU nodes, which is what Gromacs and Lammps tend to prefer. This new GTX GPU node will have these modes enabled, as is suggested on the Amber web site. Amber, except for starting and finishing a computational job, does not use the CPU once the job is launched on the GPU; Gromacs and Lammps do.
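For reference, a rough sketch of how these modes get toggled with nvidia-smi (run as root on the node; the mode names are the stock nvidia-smi options):

# enable persistence mode on all GPUs (lasts until reboot unless a persistence daemon is used)
nvidia-smi -pm 1
# set compute mode to exclusive process on all GPUs, as suggested for Amber
nvidia-smi -c EXCLUSIVE_PROCESS
# revert to the shared default compute mode (what Gromacs/Lammps prefer on the K20 nodes)
nvidia-smi -c DEFAULT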
The new GTX environment will be similar to our K20 environment, scheduler-wise. There will be a tiny queue, amber128, with 24 CPU cores, 4 GPUs and one node. New utilities are provided to interact with the GPUs and feed usage information to the scheduler.
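A quick sketch of how this looks from the scheduler side (assuming the standard LSF/Lava client commands and the run.gtx submit script at the bottom of this page):

bqueues amber128        # queue state and job slots
bhosts n78              # the single node backing the queue
# one CPU core, one GPU, memory reservation matching the script below
bsub -q amber128 -n 1 -R "rusage[gpu=1:mem=12288],span[hosts=1]" < run.gtx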
# query
[root@cottontail ~]# ssh n78 gtx-info
index, name, temperature.gpu, memory.used [MiB], memory.free [MiB], utilization.gpu [%], utilization.memory [%]
0, GeForce GTX 1080 Ti, 31, 0 MiB, 11172 MiB, 0 %, 0 %
1, GeForce GTX 1080 Ti, 24, 0 MiB, 11172 MiB, 0 %, 0 %
2, GeForce GTX 1080 Ti, 26, 0 MiB, 11172 MiB, 0 %, 0 %
3, GeForce GTX 1080 Ti, 26, 0 MiB, 11172 MiB, 0 %, 0 %

[root@cottontail ~]# ssh n78 gtx-free
3,1,0,2

[root@cottontail ~]# lsload -l n78
HOST_NAME  status  r15s  r1m  r15m  ut  pg   io  ls  it     tmp    swp  mem   gpu
n78        ok      0.0   0.1  0.1   0%  0.0  0   0   2e+08  1710G  10G  125G  4.0 <--
Our mvapich2 wrapper will be cloned to mpich3 for this new server (posted in /usr/local/bin so as not to confuse other nodes). It will still allow you to request up to 8 CPU cores per GPU resource for non-Amber jobs. For Amber jobs, which have priority usage of this node for the next three years, you only need to request one CPU core per GPU job (-n 1, gpu=1); see the example script below. Amber can run across two GPUs with peer-to-peer support enabled, but the wrapper does not support this currently (it can only be done in physical pairs… GPU 0&2 and 1&3). We will see if we need that. This would be the case if your simulation needs between 11-22 GB of memory on the GPU side.
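To see which GPUs are physically paired for peer-to-peer transfers, something like the following should work on n78 (a sketch; nvidia-smi topo is a standard subcommand, and the CUDA samples, if built, include peer-to-peer tests):

# show the GPU/PCIe topology matrix; GPUs sharing a PCIe switch can use peer-to-peer
nvidia-smi topo -m
# optional: a peer-to-peer test from the CUDA samples (hypothetical path, if the samples were built)
# /usr/local/cuda/samples/bin/x86_64/linux/release/simpleP2P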
Look at /usr/local/[Amber16_Benchmark_Suite|GPU_Validation_Test] on n78; there is lots of good Amber material to browse there. Amber itself is installed in /usr/local/amber16. Please note that the default pmemd.cuda[.MPI] points to SPFP (single precision floating point), as is/was the case in all our other HPC Amber installations, but the DPFP (double precision) binary is available.
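To confirm which precision model the default binary resolves to, something like this can be run on n78 (a sketch; the _SPFP/_DPFP suffixes are the usual Amber 16 naming):

ls -l /usr/local/amber16/bin/pmemd.cuda /usr/local/amber16/bin/pmemd.cuda.MPI
# expected to point at pmemd.cuda_SPFP and pmemd.cuda_SPFP.MPI;
# call pmemd.cuda_DPFP[.MPI] explicitly for double precision runs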
There are no BLCR kernel modules, and X11 support is disabled on the GPUs. Video mode is onboard, meaning users cannot use the node interactively with VMD or other 3D applications.
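This can be checked from the node itself with the standard nvidia-smi display query fields (a sketch; all GPUs should report display disabled since video runs off the onboard chipset):

nvidia-smi --query-gpu=index,display_mode,display_active --format=csv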
The operating system is CentOS 7.4.
/localscratch on n78's internal hard drive has 1.5 TB available.
# sample script, see bottom of page
# my test run, and I really do not know what I'm doing here,
# showed this GTX1080ti to be 4x faster than a K20 gpu
# a K20 runs this same test 10x faster than a CPU
We will also bring another 33 TB of storage online for the estimated content this new GPU server will generate in the next three years. Some users' home directories will be moved to this storage array with no quota policies. That will free up space in sharptail:/home, so I'm again delaying enlarging this partition. Maybe over Christmas break if an "island power test" is scheduled and we need to shut down.
Server arrived Oct 12, 2017.
# design seems to yield one gpu much hotter than the 3 in the front of the server
# (air flows bottom to top in picture), so we'll probably blow that one first
# (max 91C or 196F, 83C is 181F)
# Infrared thermometer shows 56-60C on that gpu pointed directly at it

index, name, temperature.gpu, memory.used [MiB], memory.free [MiB], utilization.gpu [%], utilization.memory [%]
0, GeForce GTX 1080 Ti, 83, 352 MiB, 10820 MiB, 55 %, 1 %
1, GeForce GTX 1080 Ti, 67, 352 MiB, 10820 MiB, 66 %, 1 %
2, GeForce GTX 1080 Ti, 66, 352 MiB, 10820 MiB, 56 %, 1 %
3, GeForce GTX 1080 Ti, 63, 352 MiB, 10820 MiB, 57 %, 1 %

# note, from nvidia-smi --help-query-gpu:
# "temperature.gpu" Core GPU temperature. in degrees C.
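For keeping an eye on that hot rear GPU, a simple polling loop over the same query interface works (a sketch; adjust the interval as needed):

# refresh GPU temperatures and utilization every 30 seconds
watch -n 30 "nvidia-smi --query-gpu=index,temperature.gpu,utilization.gpu --format=csv"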
Surprise. The server came with 34" rails but the server itself is 36" deep. That does not fit in my racks with the power distribution running vertically in the back of the rack. Oh well. Squeezed it into the very bottom 1U of a rack, just clearing the power cords.
Bench
- Amber 16. My sample script runs 3-4x faster than on a K20
  - Do not have enough expertise to assess this, need stats from Kelly
- Gromacs 5.1.4. My (Colin's) multidir bench runs about 2x faster than on a K20
  - Can probably be improved
  - 4 multidirs on 4 gpus achieves sweet spot at roughly 350 ns/day
- Lammps
Scripts
/home/hmeij/sharptail/run.gtx, an Amber/Gromacs/Lammps-specific submit script
#!/bin/bash
# submit via 'bsub < run.gtx'
rm -f out err
#BSUB -e err
#BSUB -o out
#BSUB -q amber128
#BSUB -J "GTX test"

# cuda 8 & mpich
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
export PATH=/usr/local/mpich-3.1.4/bin:$AMBERHOME/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/mpich-3.1.4/lib:$LD_LIBRARY_PATH

## leave sufficient time between job submissions (30-60 secs)
## the number of GPUs allocated matches -n value automatically
## always reserve GPU (gpu=1), setting this to 0 is a cpu job only
## reserve 12288 MB (11 GB + 1 GB overhead) memory per GPU
## run all processes (1<=n<=4) on same node (hosts=1)

# unique job scratch dirs
MYSANSCRATCH=/sanscratch/$LSB_JOBID
MYLOCALSCRATCH=/localscratch/$LSB_JOBID
export MYSANSCRATCH MYLOCALSCRATCH
cd $MYLOCALSCRATCH

# uncomment one software block by removing one # on each line

## AMBER we need to recreate env, $AMBERHOME is already set
##BSUB -n 1
##BSUB -R "rusage[gpu=1:mem=12288],span[hosts=1]"
#export PATH=/share/apps/CENTOS6/python/2.7.9/bin:$PATH
#export LD_LIBRARY_PATH=/share/apps/CENTOS6/python/2.7.9/lib:$LD_LIBRARY_PATH
#source /usr/local/amber16/amber.sh
## stage the data
#cp -r ~/sharptail/* .
## feed the wrapper
#n78.mpich3.wrapper pmemd.cuda.MPI \
#-O -o mdout.$LSB_JOBID -inf mdinfo.1K10 -x mdcrd.1K10 -r restrt.1K10 -ref inpcrd
## save results
#scp mdout.$LSB_JOBID ~/sharptail/

# GROMACS (using all GPUs example)
#BSUB -n 4
#BSUB -R "rusage[gpu=4:mem=49152],span[hosts=1]"
export CPU_GPU_REQUEST=4:4
# signal GMXRC is a gpu run with: 1=thread_mpi 2=mpich3
export GMXRC=2
export PATH=/usr/local/gromacs-5.1.4/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/gromacs-5.1.4/lib64:$LD_LIBRARY_PATH
. /usr/local/gromacs-5.1.4/bin/GMXRC.bash
cd /home/hmeij/gromacs_bench/gpu/
n78.mpich3.wrapper gmx_mpi mdrun \
 -nsteps 600000 -multidir 01 02 03 04 -gpu_id 0123 -ntmpi 0 -npme 0 -s topol.tpr -ntomp 0 -pin on -nb gpu

# LAMMPS
#BSUB -n 1
#BSUB -R "rusage[gpu=1:mem=12288],span[hosts=1]"
# GPUIDX=1 use allocated GPU(s), GPUIDX=0 cpu run only (view header input file)
export GPUIDX=1 # use with -var $GPUIDX in input file, view au.in, or use -suffix
export PATH=/usr/local/lammps-11Aug17:$PATH
# stage the data
cp -r ~/sharptail/* .
# feed the wrapper
n78.mpich3.wrapper lmp_mpi-double-double-with-gpu \
 -suffix gpu -var GPUIDX $GPUIDX -in in.colloid -l out.colloid.$LSB_JOBID
# save results
scp out.colloid.$LSB_JOBID ~/sharptail/
/usr/local/bin/gtx-info
#!/bin/bash
nvidia-smi \
 --query-gpu=index,gpu_name,temperature.gpu,memory.used,memory.free,utilization.gpu,utilization.memory \
 --format=csv
/usr/local/bin/gtx-free
#!/bin/bash
# adapted for GTX1080ti -hmeij
# gpu-free
# Written by Jodi Hadden, University of Georgia, 2011. Updated 2013.
# Print free GPU deviceQuery IDs to export as visible with CUDA_VISIBLE_DEVICES
# Works for NVIDIA-SMI 3.295.41 Driver Version: 295.41 for Tesla series cards
# Check for deviceQuery
if ! type nvidia-smi >/dev/null
then
echo "Please add the location of nvidia-smi to your PATH!"
exit
fi
# Make all GPUs temporarily visible
num_gpus=4
export CUDA_VISIBLE_DEVICES=`seq -s , 0 $((num_gpus-1))`
pci_bus_array=( 0 1 2 3 )
# For each PCI bus ID, check to see if that GPU is being utilized
gpu_id=0
while [ $gpu_id -lt ${#pci_bus_array[*]} ]
do
# Get utilization from NVIDIA-SMI
utilization=`nvidia-smi --id=$gpu_id --query-gpu=utilization.gpu --format=csv | grep -v util | awk '{print $1}'`
# If a GPU is not being utilized, add its deviceQuery ID to the free GPU array
# Note: GPUs can show 1% utilization if NVIDIA-SMI is running in the background, so used -le 1 instead of -eq 0 here
if [ $utilization -le 1 ]
then
free_gpu_array[$gpu_id]=$gpu_id
fi
let gpu_id=$gpu_id+1
done
# Print free GPUs to export as visible
free_gpus=`shuf -e ${free_gpu_array[*]}`
echo $free_gpus | sed "s/ /,/g"
- and the new wrapper referenced in the submit script
/usr/local/bin/n78.mpich3.wrapper
#!/bin/sh
# This is a copy of lava.openmpi.wrapper which came with lava OCS kit
# Trying to make it work with mpich3 for n78 Amber16
# -hmeij 18oct2017
#
# Copyright (c) 2007 Platform Computing
#
# This script is a wrapper for openmpi mpirun
# it generates the machine file based on the hosts
# given to it by Lava.
#
# RLIMIT_MEMLOCK problem with libibverbs -hmeij
ulimit -l unlimited
usage() {
cat <<USEEOF
USAGE: $0
This command is a wrapper for mpirun (openmpi). It can
only be run within Lava using bsub e.g.
bsub -n # "$0 -np # {my mpi command and args}"
The wrapper will automatically generate the
machinefile used by mpirun.
NOTE: The list of hosts cannot exceed 4KBytes.
USEEOF
}
if [ x"${LSB_JOBFILENAME}" = x -o x"${LSB_HOSTS}" = x ]; then
usage
exit -1
fi
MYARGS=$*
WORKDIR=`dirname ${LSB_JOBFILENAME}`
MACHFILE=${WORKDIR}/mpi_machines.${LSB_JOBID}
ARGLIST=${WORKDIR}/mpi_args.${LSB_JOBID}
# Check if mpirun is in the PATH, should be /usr/local/mpich-3.1.4/bin/mpirun -hmeij
T=`which --skip-alias mpirun`
if [ $? -ne 0 ]; then
echo "Error: mpirun is not in your PATH."
exit -2
fi
echo "${MYARGS}" > ${ARGLIST}
#T=`grep -- -machinefile ${ARGLIST} |wc -l`
#T=`grep -- -hostfile ${ARGLIST} |wc -l`
T=`grep -- -f ${ARGLIST} |wc -l`
if [ $T -gt 0 ]; then
echo "Error: Do not provide the machinefile for mpirun."
echo " It is generated automatically for you."
exit -3
fi
# Make the open-mpi machine file
echo "${LSB_HOSTS}" > ${MACHFILE}.lst
tr '\/ ' '\r\n' < ${MACHFILE}.lst > ${MACHFILE}
MPIRUN=`which --skip-alias mpirun`
# sanity checks number of cpu processes 1-8
np=`wc -l ${MACHFILE} | awk '{print $1}'`
if [ $np -lt 1 -o $np -gt 8 ]; then
echo "Error: Incorrect number of processes ($np)"
echo " -n can be an integer in the range of 1 to 8"
exit -4
fi
# sanity check single node
nh=`cat ${MACHFILE} | sort -u | wc -l`
if [ $nh -ne 1 ]; then
echo "Error: No host or more than one host specified ($nh)"
exit -5
fi
# always on one host
gpuhost=`cat ${MACHFILE} | sort -u | tr -d '\n'`
# next choose default policy (one cpu per gpu) or custom policy (eg 2:1, 4:2, max 8:4)
if [ -n "$CPU_GPU_REQUEST" ]; then
gpunp=`echo $CPU_GPU_REQUEST | awk -F: '{print$2}'`
cpunp=`echo $CPU_GPU_REQUEST | awk -F: '{print$1}'`
gpuid=( $(for i in `ssh $gpuhost gtx-free | sed "s/,/ /g"`; do echo $i; done | head -$gpunp) )
echo "REQUEST allocation $gpuhost:gpunp=$gpunp, cpunp=$cpunp"
else
cpunp=`cat ${MACHFILE} | wc -l | awk '{print $1}'`
gpunp=1
gpuid=( $(for i in `ssh $gpuhost gtx-free | sed "s/,/ /g"`; do echo $i; done | head -$gpunp) )
fi
if [ $gpunp -eq 1 ]; then
CUDA_VISIBLE_DEVICES=$gpuid
echo "GPU allocation $gpuhost:$gpuid"
else
gpuids=`echo ${gpuid[@]} | sed "s/ /,/g"`
CUDA_VISIBLE_DEVICES="$gpuids"
echo "GPU allocation $gpuhost:$CUDA_VISIBLE_DEVICES"
fi
# namd ignores this
export CUDA_VISIBLE_DEVICES
#debug# setid=`ssh $gpuhost echo $CUDA_VISIBLE_DEVICES | tr '\n' ' '`
#debug# echo "setid=$setid";
# gromacs
if [ -n "$GMXRC" ]; then
# gromacs needs them from base 0, so gpu 2,3 is string 01
if [ ${#gpuid[*]} -eq 1 ]; then
gmxrc_gpus="0"
elif [ ${#gpuid[*]} -eq 2 ]; then
gmxrc_gpus="01"
elif [ ${#gpuid[*]} -eq 3 ]; then
gmxrc_gpus="012"
elif [ ${#gpuid[*]} -eq 4 ]; then
gmxrc_gpus="0123"
fi
if [ $GMXRC -eq 1 ]; then
newargs=`echo ${MYARGS} | sed "s/mdrun/mdrun -gpu_id $gmxrc_gpus/g"`
echo "executing: $newargs"
$newargs
elif [ $GMXRC -eq 2 ]; then
newargs=`echo ${MYARGS} | sed "s/mdrun_mpi/mdrun_mpi -gpu_id $gmxrc_gpus/g"`
echo "executing: ${MPIRUN} -launcher ssh -f ${MACHFILE} -n $cpunp $newargs"
${MPIRUN} -launcher ssh -f ${MACHFILE} -n $cpunp $newargs
fi
# matlab
elif [ -n "$MATGPU" ] && [ $MATGPU -eq 1 ]; then
echo "executing: ${MYARGS}"
${MYARGS}
# namd
elif [ -n "$CHARMRUN" ] && [ $CHARMRUN -eq 1 ]; then
cat ${MACHFILE}.lst | tr '\/ ' '\r\n' | sed 's/^/host /g' > ${MACHFILE}
echo "executing: charmrun $NAMD_DIR/namd2 +p$cpunp ++nodelist ${MACHFILE} +idlepoll +devices $CUDA_VISIBLE_DEVICES ${MYARGS}"
charmrun $NAMD_DIR/namd2 +p$cpunp ++nodelist ${MACHFILE} +idlepoll +devices $CUDA_VISIBLE_DEVICES ${MYARGS}
# all else (lammps, amber, ...)
else
echo "executing: ${MPIRUN} -launcher ssh -f ${MACHFILE} -n $cpunp ${MYARGS}"
${MPIRUN} -launcher ssh -f ${MACHFILE} -n $cpunp ${MYARGS}
fi
exit $?


