GTX 1080 Ti


GPU	GTX 1080 Ti
Transistor Count	12 billion
Nvidia Cores	3,584
FP32 (single precision) Teraflops	11.4
FP64 (double precision) Teraflops	0.355
Memory Capacity	11GB
Power	250 Watt
Price	$700
Maximum GPU Temperature	91 (in C)

We have enterprise-level Tesla K20 GPU (graphics processing unit) compute nodes: four K20s per node across five nodes, for a total of 20 K20s capable of roughly 23 teraflops in total (floating point, double precision). $100K in 2013.

We have received a donation for a special purpose research project and have acquired a 1U rack mounted compute node which will hold 4 GTX1080Ti GPUs (with 128 GB of CPU memory and two E5-2650v4 8 core chips). $8K in 2017

GTX 4-gpu node: About 44 FP32 and 1.5 FP64 teraflops in 1U. $8,200 in 2017.
K20 4-gpu node: About 14 FP32 and 4.7 FP64 teraflops in 2U. $20,000 in 2013.
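
The per-node figures above follow from the per-card numbers (four cards per node), and dividing the node price by its teraflops gives a rough cost comparison. A minimal sketch of that arithmetic, using the rounded node numbers quoted on this page; the function names `node_tflops` and `cost_per_tflop` are made up for illustration and are not installed anywhere:

```shell
#!/bin/bash
# Hypothetical helpers for illustration only.

# node_tflops CARDS TFLOPS_PER_CARD: teraflops for a whole node
node_tflops() {
    awk -v n="$1" -v t="$2" 'BEGIN { printf "%.1f\n", n * t }'
}

# cost_per_tflop PRICE TFLOPS: dollars per teraflop, rounded
cost_per_tflop() {
    awk -v p="$1" -v t="$2" 'BEGIN { printf "%.0f\n", p / t }'
}

node_tflops 4 11.4         # GTX FP32, from the spec table
node_tflops 4 0.355        # GTX FP64, from the spec table
cost_per_tflop 8200 44     # GTX node, $ per FP32 teraflop
cost_per_tflop 20000 14    # K20 node, $ per FP32 teraflop
cost_per_tflop 8200 1.5    # GTX node, $ per FP64 teraflop
cost_per_tflop 20000 4.7   # K20 node, $ per FP64 teraflop
```

With these inputs the GTX node works out to about $186 per FP32 teraflop against about $1,429 for the K20 node, while per FP64 teraflop the K20 node stays somewhat cheaper (about $4,255 against about $5,467).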

We will disable Persistence and Exclusive Compute modes on our K20 GPU nodes, the setting Gromacs and Lammps tend to prefer. The new GTX GPU node will have these modes enabled, as suggested on the Amber web site. Except for starting and finishing a computational job, Amber does not use the CPU once the job is launched on the GPU. Gromacs and Lammps do.

The new GTX environment will be similar to our K20 environment, scheduler-wise. There will be a tiny queue, amber128, with 24 cpu cores, 4 gpus and one node. New utilities are provided to interact with the GPUs and feed usage information to the scheduler.

# query
[root@cottontail ~]# ssh n78 gtx-info

index, name, temperature.gpu, memory.used [MiB], memory.free [MiB], utilization.gpu [%], utilization.memory [%]
0, GeForce GTX 1080 Ti, 31, 0 MiB, 11172 MiB, 0 %, 0 %
1, GeForce GTX 1080 Ti, 24, 0 MiB, 11172 MiB, 0 %, 0 %
2, GeForce GTX 1080 Ti, 26, 0 MiB, 11172 MiB, 0 %, 0 %
3, GeForce GTX 1080 Ti, 26, 0 MiB, 11172 MiB, 0 %, 0 %

[root@cottontail ~]# ssh n78 gtx-free
3,1,0,2

[root@cottontail ~]# lsload -l n78
HOST_NAME               status  r15s   r1m  r15m   ut    pg    io  ls    it   tmp   swp   mem    gpu
n78                         ok   0.0   0.1   0.1   0%   0.0     0   0 2e+08 1710G   10G  125G    4.0 <--

Our mvapich2 wrapper will be cloned to mpich3 for this new server (posted in /usr/local/bin so as not to confuse other nodes). It will still allow you to request up to 8 CPU cores per GPU resource for non-Amber jobs. For Amber jobs, the priority usage of this node for the next three years, you only need to request one CPU core for each GPU job (-n 1, gpu=1); see the example script below. Amber can run across two GPUs with peer-to-peer support enabled, but the wrapper does not currently support this (it can only be done in physical pairs: GPU 0&2 and 1&3). We will see if we need that. It would be the case if your simulation needs between 11 and 22 GB of memory on the GPU side.

Look at /usr/local/[Amber16_Benchmark_Suite|GPU_Validation_Test] on n78; there is lots of good Amber material to browse there. Amber itself is installed in /usr/local/amber16. Please note that the default pmemd.cuda[.MPI] points to SPFP (single precision floating point), as is the case in all our other HPC Amber installations. A DPFP (double precision) binary is also available.

There are no BLCR kernel modules, and X11 support is disabled on the GPUs. Video mode is onboard, meaning users cannot use the system interactively with VMD or other 3-D applications.

Operating system is CentOS7.4

/localscratch on n78 internal hard drive has 1.5T available.

# sample script, see bottom of page
# my test run, and I really do not know what I'm doing here,
# showed this GTX1080ti to be 4x faster than a K20 gpu
# a K20 runs this same test 10x faster than a CPU

We will also bring another 33T of storage online for the content this new GPU server is estimated to generate in the next three years. Some users' home directories will be moved to this storage array with no quota policies. That will free up space in sharptail:/home, so I am again delaying enlarging this partition. Maybe over Christmas break, if an "island power test" is scheduled and we need to shut down.

Server arrived Oct 12, 2017.

# design seems to yield one gpu much hotter than the 3 in the front of the server 
# (air flows bottom to top in picture), so we'll probably blow that one first 
# (max 91C or 196F, 83C is 181F). Infrared Thermometer shows 56-60C on that gpu pointed directly at it.
index, name, temperature.gpu, memory.used [MiB], memory.free [MiB], utilization.gpu [%], utilization.memory [%]
0, GeForce GTX 1080 Ti, 83, 352 MiB, 10820 MiB, 55 %, 1 %
1, GeForce GTX 1080 Ti, 67, 352 MiB, 10820 MiB, 66 %, 1 %
2, GeForce GTX 1080 Ti, 66, 352 MiB, 10820 MiB, 56 %, 1 %
3, GeForce GTX 1080 Ti, 63, 352 MiB, 10820 MiB, 57 %, 1 %

# note, from nvidia-smi --help-query-gpu
"temperature.gpu"
 Core GPU temperature. in degrees C.

From vendor "80-83C is good actually. In some warmer environments you would be seeing 85-87C which is still just fine for 24/7 operation anyway."
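
To keep an eye on the hot card, the CSV that gtx-info prints can be filtered for GPUs at or above a chosen temperature. This is only a sketch: `hot_gpus` is a hypothetical helper, not an installed utility, and the threshold of 80 is simply taken from the vendor's comment above. It reads canned output here so it can be tried anywhere; on the cluster one would pipe in `ssh n78 gtx-info` instead.

```shell
#!/bin/bash
# hot_gpus THRESHOLD: read gtx-info style CSV on stdin and print the index
# of every GPU whose temperature.gpu (3rd column) is at or above THRESHOLD.
# Hypothetical helper for illustration.
hot_gpus() {
    awk -F', *' -v limit="$1" 'NR > 1 && $3 + 0 >= limit { print $1 }'
}

# canned sample copied from the output above
hot_gpus 80 <<'EOF'
index, name, temperature.gpu, memory.used [MiB], memory.free [MiB], utilization.gpu [%], utilization.memory [%]
0, GeForce GTX 1080 Ti, 83, 352 MiB, 10820 MiB, 55 %, 1 %
1, GeForce GTX 1080 Ti, 67, 352 MiB, 10820 MiB, 66 %, 1 %
2, GeForce GTX 1080 Ti, 66, 352 MiB, 10820 MiB, 56 %, 1 %
3, GeForce GTX 1080 Ti, 63, 352 MiB, 10820 MiB, 57 %, 1 %
EOF
```

For the sample above only GPU 0 is reported. The first line is skipped as the CSV header, so the filter assumes the header is always present.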

Surprise. The server came with 34" rails but the server itself is 36" deep. That does not fit in my racks with the power distribution running vertically in the back of the rack. Oh well. Squeezed it into the very bottom 1U of a rack, just clearing the power cords.

Bench

Amber 16. Nucleosome bench runs 4.5x faster than on a K20
- Not sure it is representative of our work load
- Adding more MPI threads decreases performance
- Running across more gpus (2 or 4) decreases performance
- One Amber process per MPI thread per GPU is optimal

Wow, I just realized the most important metric: our 20 K20s have a job throughput of 20 per unit of time. The amber128 queue will have a throughput of 4*4.5 or 18 per same unit of time. So, from an Amber-only perspective, one new server nearly matches the five old ones purchased in 2013.
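
The throughput comparison is simple arithmetic on the single-GPU nucleosome results reported on this page (12.24 ns/day on the GTX, 2.71 ns/day on the K20). A minimal sketch, with `ratio` and `throughput` as made-up helper names:

```shell
#!/bin/bash
# Hypothetical helpers for illustration only.

# ratio A B: print A / B with two decimals
ratio() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", a / b }'
}

# throughput GPUS SPEEDUP: jobs per unit of time, counted in K20-sized jobs
throughput() {
    awk -v g="$1" -v s="$2" 'BEGIN { printf "%g\n", g * s }'
}

ratio 12.24 2.71    # GTX vs K20, one GPU each
throughput 20 1     # twenty K20s, one job each
throughput 4 4.5    # four GTX cards at about 4.5x a K20
```

This assumes every job looks like the nucleosome benchmark and that each job occupies exactly one GPU, which may not hold for the real workload.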

nvidia-smi -pm 0; nvidia-smi -c 0
# gpu_id is done via CUDA_VISIBLE_DEVICES
export CUDA_VISIBLE_DEVICES=$STRING_2
# on n78
/usr/local/mpich-3.1.4/bin/mpirun -launcher ssh -f /home/hmeij/amber/nucleosome/hostfile \
-n $STRING_1 $AMBERHOME/bin/pmemd.cuda.MPI -O -o /tmp/mdout -i mdin.GPU \
-p prmtop -c inpcrd -ref inpcrd ; grep 'ns/day' /tmp/mdout
# on n34
/cm/shared/apps/mvapich2/gcc/64/1.6/bin/mpirun_rsh -ssh -hostfile /home/hmeij/amber/nucleosome/hostfile2 \
-np $STRING_1  pmemd.cuda.MPI -O -o /tmp/mdout -i mdin.GPU -p prmtop -c inpcrd -ref inpcrd; grep 'ns/day' /tmp/mdout


Nucleosome Metric ns/day, seconds/ns  across all steps  x  nr of gpus


GTX on n78

-n 1, -gpu_id 0
|         ns/day =      12.24   seconds/ns =    7058.94   x4 = 48.96  (4.5x faster than K20)
-n 2, -gpu_id 0
|         ns/day =      11.50   seconds/ns =    7509.97
-n 4, -gpu_id 0
|         ns/day =      10.54   seconds/ns =    8197.80
-n 4, -gpu_id 01
|         ns/day =      20.70   seconds/ns =    4173.55   x2 = 41.40
-n 8, -gpu_id 01
|         ns/day =      17.44   seconds/ns =    4953.04
-n 4, -gpu_id 0123
|         ns/day =      32.90   seconds/ns =    2626.27   x1
-n 8, -gpu_id 0123
|         ns/day =      28.43   seconds/ns =    3038.72   x1


K20 on n34 

-n 1, -gpu_id 0
|             ns/day =       2.71   seconds/ns =   31883.03
-n 4, -gpu_id 0
|             ns/day =       1.53   seconds/ns =   56325.00
-n 4, -gpu_id 0123
|             ns/day =       5.87   seconds/ns =   14730.45
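
One way to read the multi-GPU numbers above is as parallel efficiency: the ns/day of a multi-GPU run divided by what the same number of independent single-GPU runs would deliver. A small sketch of that calculation, where `efficiency` is a hypothetical helper name:

```shell
#!/bin/bash
# efficiency NSDAY_MULTI NSDAY_SINGLE NGPUS
# Percent of ideal scaling reached by one job spread over NGPUS GPUs.
# Hypothetical helper for illustration.
efficiency() {
    awk -v m="$1" -v s="$2" -v n="$3" \
        'BEGIN { printf "%.0f\n", 100 * m / (s * n) }'
}

efficiency 20.70 12.24 2   # GTX, -n 4 on GPUs 0 and 1
efficiency 32.90 12.24 4   # GTX, -n 4 on GPUs 0,1,2,3
efficiency 5.87 2.71 4     # K20, -n 4 on GPUs 0,1,2,3
```

For these inputs the results are 85, 67 and 54 percent, which is the reason four independent single-GPU jobs give the best node throughput for this benchmark.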

Gromacs 5.1.4 My (Colin's) multidir bench runs about 2x faster than on a K20
- Can probably be improved
- 4 multidirs on 4 gpus achieves sweet spot at roughly 350 ns/day

# about 20 mins per run
/usr/local/mpich-3.1.4/bin/mpirun -launcher ssh -f ./hostfile $STRING_1 \
gmx_mpi mdrun -nsteps 600000 $STRING_2 -gpu_id $STRING_3 \
-ntmpi 0 -npme 0 -s topol.tpr -ntomp 0 -pin on -nb gpu  

# Gromacs seems to have a mind of its own
On host n78 4 GPUs user-selected for this run.
Mapping of GPU IDs to the 4 PP ranks in this node: 0,1,2,3 (-n<=4)

Metric:          (ns/day)    (hour/ns) (x? = ??? ns/day)

-n 1, -multidir 01, -gpu_id 0
Using 4 MPI processes
Using 8 OpenMP threads per MPI process
Performance:      123.679        0.194 (x1)

-n 2, -multidir 01 02, -gpu_id 01
Using 2 MPI processes
Using 8 OpenMP threads per MPI process
Performance:       95.920        0.250 (x2 = 191.84)

-n 4, -multidir 01 02 03 04, -gpu_id 0123
Using 1 MPI process
Using 8 OpenMP threads 
Performance:       87.220        0.275 (x4 = 348.88)

-n 8, -multidir 01 02 03 04 05 06 07 08, -gpu_id 00112233
Using 1 MPI process
Using 4 OpenMP threads                            
cudaMallocHost of size 1024128 bytes failed: all CUDA-capable devices are busy or unavailable
Ahh, nvidia compute modes need to be -pm 0 & -c 0 for gromacs ...
NOTE: The GPU has >25% less load than the CPU. This imbalance causes performance loss.
Performance:       45.070        0.533 (x8 = 360.56)

-n 16 (max physical cpu cores), -multidir 01 02 ... 15 16, -gpu_id 0000111122223333
Using 1 MPI process
Using 2 OpenMP threads 
Mapping of GPU IDs to the 16 PP ranks in this node: 0,0,0,0,1,1,1,1,2,2,2,2,3,3,3,3
Performance:       19.814        1.211 (x16 = 317.024)

# UPDATE Gromacs 2018, check out these new performance stats for -n 4, -gpu=4

# K20, redone with cuda 9

[root@cottontail gpu]# egrep 'ns/day|Performance' 0[0-4]/md.log
01/md.log:                 (ns/day)    (hour/ns)
01/md.log:Performance:       74.275        0.323
02/md.log:                 (ns/day)    (hour/ns)
02/md.log:Performance:       74.111        0.324
03/md.log:                 (ns/day)    (hour/ns)
03/md.log:Performance:       73.965        0.324
04/md.log:                 (ns/day)    (hour/ns)
04/md.log:Performance:       74.207        0.323

# GTX1080 cuda 8
 
[hmeij@cottontail gpu]$ egrep 'ns/day|Performance' 0[1-4]/md.log
01/md.log:                 (ns/day)    (hour/ns)
01/md.log:Performance:      229.229        0.105
02/md.log:                 (ns/day)    (hour/ns)
02/md.log:Performance:      221.936        0.108
03/md.log:                 (ns/day)    (hour/ns)
03/md.log:Performance:      217.618        0.110
04/md.log:                 (ns/day)    (hour/ns)
04/md.log:Performance:      228.854        0.105

Almost 900 ns/day for a single server.
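
The node total is just the sum of the four per-directory ns/day values. A sketch of how that sum could be taken straight from the egrep output; `sum_nsday` is a hypothetical helper, and the sample lines are the GTX numbers shown above:

```shell
#!/bin/bash
# sum_nsday: add up the ns/day column (second to last field) of every
# "Performance:" line read from stdin. Hypothetical helper for illustration.
sum_nsday() {
    awk '/Performance:/ { total += $(NF - 1) } END { printf "%.1f\n", total }'
}

# canned sample; on the cluster this would be
#   egrep 'ns/day|Performance' 0[1-4]/md.log | sum_nsday
sum_nsday <<'EOF'
01/md.log:                 (ns/day)    (hour/ns)
01/md.log:Performance:      229.229        0.105
02/md.log:                 (ns/day)    (hour/ns)
02/md.log:Performance:      221.936        0.108
03/md.log:                 (ns/day)    (hour/ns)
03/md.log:Performance:      217.618        0.110
04/md.log:                 (ns/day)    (hour/ns)
04/md.log:Performance:      228.854        0.105
EOF
```

The header lines are ignored because they do not contain "Performance:". Feeding in the K20 lines the same way gives the total for the older node.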

Lammps 11Aug17 runs about 11x faster than K20
- used the colloid example, not sure if that's a good example
- like gromacs, lots of room for improvements
- used the double-double binary, surprised at speed
  - single-double binary might run faster?

nvidia-smi -pm 0; nvidia-smi -c 0
# gpu_id is done via CUDA_VISIBLE_DEVICES
export CUDA_VISIBLE_DEVICES=$STRING_2
# on n78
/usr/local/mpich-3.1.4/bin/mpirun -launcher ssh -f ./hostfile -n $STRING_1 \
/usr/local/lammps-11Aug17/lmp_mpi-double-double-with-gpu -suffix gpu \
$STRING_3 -in in.colloid > /tmp/out ; grep tau /tmp/out
# on n34
/cm/shared/apps/mvapich2/gcc/64/1.6/bin/mpirun_rsh -ssh \
-hostfile /home/hmeij/sharptail/hostfile2 -np $STRING_1 \
/share/apps/CENTOS6/lammps/31Mar17/lmp_gpu_double \
-suffix gpu $STRING_3  -in in.colloid > /tmp/out ; grep tau /tmp/out



Created 5625 atoms

-n 1, -gpu_id 0
Performance: 581,359 tau/day, 1,345 timesteps/s 
-n 2, -gpu_id 01
Performance: 621,822 tau/day, 1,439 timesteps/s 
-n 4, -gpu_id 0123
Performance: 479,795 tau/day, 1,110 timesteps/s 

-n 4, -gpu_id 01, -pk gpu 2
Performance: 819,207 tau/day, 1,896 timesteps/s 
-n 8, -gpu_id 01, -pk gpu 2
Performance: 519,173 tau/day, 1,201 timesteps/s 
-n 6, -gpu_id 0123, -pk gpu 4
Performance: 881,981 tau/day, 2,041 timesteps/s
-n 8, -gpu_id 0123, -pk gpu 4
Performance: 932,493 tau/day, 2,158 timesteps/s (11x K20)
-n 16, -gpu_id 0123, -pk gpu 4
Performance: 582,717 tau/day, 1,348 timesteps/s


K20 on n34 

-n 8, -gpu_id 0123, -pk gpu 4
Performance: 84985 tau/day, 196 timesteps/s 


GTX on n78 again 
-n 8, -gpu_id 0123, -pk gpu 4

Created 22500 atoms
Performance: 552,986 tau/day, 1,280 timesteps/s
Created 90000 atoms
Performance: 210,864 tau/day, 488 timesteps/s

Scripts

/home/hmeij/sharptail/run.gtx, an Amber/Gromacs/Lammps specific submit script

#!/bin/bash
# submit via 'bsub < run.gtx'
rm -f out err 
#BSUB -e err
#BSUB -o out
#BSUB -q amber128
#BSUB -J "GTX test"

# cuda 8 & mpich
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
export PATH=/usr/local/mpich-3.1.4/bin:$AMBERHOME/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/mpich-3.1.4/lib:$LD_LIBRARY_PATH


## leave sufficient time between job submissions (30-60 secs)
## the number of GPUs allocated matches -n value automatically
## always reserve GPU (gpu=1), setting this to 0 is a cpu job only
## reserve 12288 MB (11 GB + 1 GB overhead) memory per GPU
## run all processes (1<=n<=4) on same node (hosts=1).


# unique job scratch dirs
MYSANSCRATCH=/sanscratch/$LSB_JOBID
MYLOCALSCRATCH=/localscratch/$LSB_JOBID
export MYSANSCRATCH MYLOCALSCRATCH
cd $MYLOCALSCRATCH


# uncomment one software block by removing one # on each line


## AMBER we need to recreate env, $AMBERHOME is already set
##BSUB -n 1
##BSUB -R "rusage[gpu=1:mem=12288],span[hosts=1]"
#export PATH=/share/apps/CENTOS6/python/2.7.9/bin:$PATH
#export LD_LIBRARY_PATH=/share/apps/CENTOS6/python/2.7.9/lib:$LD_LIBRARY_PATH
#source /usr/local/amber16/amber.sh
## stage the data
#cp -r ~/sharptail/* .
## feed the wrapper
#n78.mpich3.wrapper pmemd.cuda.MPI \
#-O -o mdout.$LSB_JOBID -inf mdinfo.1K10 -x mdcrd.1K10 -r restrt.1K10 -ref inpcrd
## save results
#scp mdout.$LSB_JOBID ~/sharptail/


# GROMACS (using all GPUs example)
#BSUB -n 4
#BSUB -R "rusage[gpu=4:mem=49152],span[hosts=1]"
export CPU_GPU_REQUEST=4:4
# signal GMXRC is a gpu run with: 1=thread_mpi 2=mpich3
export GMXRC=2
export PATH=/usr/local/gromacs-5.1.4/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/gromacs-5.1.4/lib64:$LD_LIBRARY_PATH
. /usr/local/gromacs-5.1.4/bin/GMXRC.bash
cd /home/hmeij/gromacs_bench/gpu/
n78.mpich3.wrapper gmx_mpi mdrun \
  -nsteps 600000 -multidir 01 02 03 04 -gpu_id 0123 -ntmpi 0 -npme 0 -s topol.tpr -ntomp 0 -pin on -nb gpu



# LAMMPS
#BSUB -n 1
#BSUB -R "rusage[gpu=1:mem=12288],span[hosts=1]"
# GPUIDX=1 use allocated GPU(s), GPUIDX=0 cpu run only (view header input file)
export GPUIDX=1 # use with -var $GPUIDX in input file, view au.in, or use -suffix 
export PATH=/usr/local/lammps-11Aug17:$PATH
# stage the data
cp -r ~/sharptail/* .
# feed the wrapper
n78.mpich3.wrapper lmp_mpi-double-double-with-gpu \
-suffix gpu -var GPUIDX $GPUIDX -in in.colloid -l out.colloid.$LSB_JOBID
# save results
scp out.colloid.$LSB_JOBID ~/sharptail/
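
The resource strings in run.gtx follow one pattern: 12288 MB of memory per reserved GPU, all on a single host. A sketch that builds the string for a given GPU count; `rusage_for_gpus` is a hypothetical helper and not part of the submit script:

```shell
#!/bin/bash
# rusage_for_gpus N: print the -R resource string for N GPUs on one host,
# reserving 12288 MB (11 GB + 1 GB overhead) per GPU as run.gtx does.
# Hypothetical helper for illustration.
rusage_for_gpus() {
    local ngpus=$1 mem_per_gpu=12288
    if [ "$ngpus" -lt 1 ] || [ "$ngpus" -gt 4 ]; then
        echo "rusage_for_gpus: need 1 to 4 GPUs, got $ngpus" >&2
        return 1
    fi
    echo "rusage[gpu=${ngpus}:mem=$((ngpus * mem_per_gpu))],span[hosts=1]"
}

rusage_for_gpus 1   # matches the AMBER and LAMMPS blocks
rusage_for_gpus 4   # matches the GROMACS block
```

The upper bound of 4 reflects the four GPUs in n78; anything outside that range is rejected rather than silently passed to the scheduler.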

/usr/local/bin/gtx-info

#!/bin/bash

nvidia-smi \
--query-gpu=index,gpu_name,temperature.gpu,memory.used,memory.free,utilization.gpu,utilization.memory \
--format=csv

/usr/local/bin/gtx-free

#!/bin/bash

# adapted for GTX1080ti -hmeij
 
 # gpu-free
 # Written by Jodi Hadden, University of Georgia, 2011. Updated 2013.
 # Print free GPU deviceQuery IDs to export as visible with CUDA_VISIBLE_DEVICES
 # Works for NVIDIA-SMI 3.295.41 Driver Version: 295.41 for Tesla series cards
 
 # Check for nvidia-smi
 if ! type nvidia-smi >/dev/null
 then
         echo "Please add the location of nvidia-smi to your PATH!"
         exit 1
 fi
 
 # Make all GPUs temporarily visible (ids run from 0 to num_gpus-1)
 num_gpus=4
 export CUDA_VISIBLE_DEVICES=`seq -s , 0 $((num_gpus - 1))`
 
 pci_bus_array=( 0 1 2 3 )
 
 # For each PCI bus ID, check to see if that GPU is being utilized 
 gpu_id=0
 while [ $gpu_id -lt ${#pci_bus_array[*]} ]
 do
 
         # Get utilization from NVIDIA-SMI
         utilization=`nvidia-smi --id=$gpu_id --query-gpu=utilization.gpu --format=csv | grep -v util | awk '{print $1}'`
         # If a GPU is not being utilized, add its deviceQuery ID to the free GPU array
         # Note: GPUs can show 1% utilization if NVIDIA-SMI is running in the background, so used -le 1 instead of -eq 0 here
         if [ $utilization -le 1 ]
         then
                 free_gpu_array[$gpu_id]=$gpu_id
         fi
 
         let gpu_id=$gpu_id+1
 
 done
 
 # Print free GPUs to export as visible 
 free_gpus=`shuf -e ${free_gpu_array[*]}`
 echo $free_gpus | sed "s/ /,/g"
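
The selection logic in gtx-free can also be written as a single filter over "index, utilization" pairs. The sketch below is an alternative illustration, not the installed script: `free_gpu_ids` is a hypothetical name, it reads canned pairs so it can be tried without a GPU, and unlike gtx-free it does not shuffle the result.

```shell
#!/bin/bash
# free_gpu_ids: read "index, utilization" pairs on stdin and print a
# comma-separated list of the indexes at or below 1% utilization.
# Hypothetical helper for illustration.
free_gpu_ids() {
    awk -F', *' '$2 + 0 <= 1 { printf "%s%s", sep, $1; sep = "," }
                 END { print "" }'
}

# canned input; on the node the pairs could come from
#   nvidia-smi --query-gpu=index,utilization.gpu --format=csv,noheader,nounits
printf '%s\n' '0, 55' '1, 0' '2, 1' '3, 57' | free_gpu_ids
```

For the canned input the result is `1,2`. When every GPU is busy the function prints an empty line, so a caller should check for an empty string before exporting CUDA_VISIBLE_DEVICES.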

/usr/local/bin/n78.mpich3.wrapper, the new wrapper referenced in the submit script run.gtx

#!/bin/sh

# This is a copy of lava.openmpi.wrapper which came with lava OCS kit
# Trying to make it work with mpich3 for n78 Amber16
# -hmeij 18oct2017

#
#  Copyright (c) 2007 Platform Computing
#
# This script is a wrapper for openmpi mpirun
# it generates the machine file based on the hosts
# given to it by Lava.
#

# RLIMIT_MEMLOCK problem with libibverbs -hmeij
ulimit -l unlimited


usage() {
        cat <<USEEOF
USAGE:  $0
        This command is a wrapper for mpirun (openmpi).  It can
        only be run within Lava using bsub e.g.
                bsub -n # "$0 -np # {my mpi command and args}"

        The wrapper will automatically generate the
        machinefile used by mpirun.

        NOTE:  The list of hosts cannot exceed 4KBytes.
USEEOF
}

if [ x"${LSB_JOBFILENAME}" = x -o x"${LSB_HOSTS}" = x ]; then
    usage
    exit -1
fi

MYARGS=$*
WORKDIR=`dirname ${LSB_JOBFILENAME}`
MACHFILE=${WORKDIR}/mpi_machines.${LSB_JOBID}
ARGLIST=${WORKDIR}/mpi_args.${LSB_JOBID}

# Check if mpirun is in the PATH, should be /usr/local/mpich-3.1.4/bin/mpirun -hmeij
T=`which --skip-alias mpirun`
if [ $? -ne 0 ]; then
    echo "Error:  mpirun is not in your PATH."
    exit -2
fi

echo "${MYARGS}" > ${ARGLIST}
#T=`grep -- -machinefile ${ARGLIST} |wc -l`
#T=`grep -- -hostfile ${ARGLIST} |wc -l`
T=`grep -- -f ${ARGLIST} |wc -l`
if [ $T -gt 0 ]; then
    echo "Error:  Do not provide the machinefile for mpirun."
    echo "        It is generated automatically for you."
    exit -3
fi

# Make the open-mpi machine file
echo "${LSB_HOSTS}" > ${MACHFILE}.lst
tr '\/ ' '\r\n' < ${MACHFILE}.lst > ${MACHFILE}

MPIRUN=`which --skip-alias mpirun`

# sanity checks number of cpu processes 1-8
np=`wc -l ${MACHFILE} | awk '{print $1}'` 
if [ $np -lt 1 -o $np -gt 8 ]; then
    echo "Error:  Incorrect number of processes ($np)"
    echo "        -n can be an integer in the range of 1 to 8"
    exit -4
fi

# sanity check single node
nh=`cat ${MACHFILE} | sort -u | wc -l` 
if [ $nh -ne 1 ]; then
    echo "Error:  No host or more than one host specified ($nh)"
    exit -5
fi

# always on one host
gpuhost=`cat ${MACHFILE} | sort -u | tr -d '\n'`

# next choose default policy (one cpu per gpu) or custom policy (eg 2:1, 4:2, max 8:4)
if [ -n "$CPU_GPU_REQUEST" ]; then
        gpunp=`echo $CPU_GPU_REQUEST | awk -F: '{print$2}'`
        cpunp=`echo $CPU_GPU_REQUEST | awk -F: '{print$1}'`
        gpuid=( $(for i in `ssh $gpuhost gtx-free | sed "s/,/ /g"`; do echo $i; done | head -$gpunp) )
        echo "REQUEST allocation $gpuhost:gpunp=$gpunp, cpunp=$cpunp"
else
        cpunp=`cat ${MACHFILE} | wc -l | awk '{print $1}'`
        gpunp=1
        gpuid=( $(for i in `ssh $gpuhost gtx-free | sed "s/,/ /g"`; do echo $i; done | head -$gpunp) )
fi

if [ $gpunp -eq 1 ]; then
        CUDA_VISIBLE_DEVICES=$gpuid
        echo "GPU allocation $gpuhost:$gpuid"
else
        gpuids=`echo ${gpuid[@]} | sed "s/ /,/g"`
        CUDA_VISIBLE_DEVICES="$gpuids"
        echo "GPU allocation $gpuhost:$CUDA_VISIBLE_DEVICES"
fi

# namd ignores this
export CUDA_VISIBLE_DEVICES
#debug# setid=`ssh $gpuhost echo $CUDA_VISIBLE_DEVICES | tr '\n' ' '`
#debug# echo "setid=$setid";

# gromacs
if [ -n "$GMXRC" ]; then
        # gromacs needs them from base 0, so gpu 2,3 is string 01
        if [ ${#gpuid[*]} -eq 1 ]; then
                gmxrc_gpus="0"
        elif [ ${#gpuid[*]} -eq 2 ]; then
                gmxrc_gpus="01"
        elif [ ${#gpuid[*]} -eq 3 ]; then
                gmxrc_gpus="012"
        elif [ ${#gpuid[*]} -eq 4 ]; then
                gmxrc_gpus="0123"
        fi

        if [ $GMXRC -eq 1 ]; then
                newargs=`echo ${MYARGS} | sed "s/mdrun/mdrun -gpu_id $gmxrc_gpus/g"`
                echo "executing: $newargs"
                $newargs
        elif [ $GMXRC -eq 2 ]; then
                newargs=`echo ${MYARGS} | sed "s/mdrun_mpi/mdrun_mpi -gpu_id $gmxrc_gpus/g"`
                echo "executing: ${MPIRUN} -launcher ssh -f ${MACHFILE} -n $cpunp $newargs"
                ${MPIRUN} -launcher ssh -f ${MACHFILE} -n $cpunp $newargs
        fi

# matlab
elif [ -n "$MATGPU" ] && [ $MATGPU -eq 1 ]; then
        echo "executing: ${MYARGS}"
        ${MYARGS}

# namd
elif [ -n "$CHARMRUN" ] && [ $CHARMRUN -eq 1 ]; then
        cat ${MACHFILE}.lst | tr '\/ ' '\r\n' | sed 's/^/host /g' > ${MACHFILE}
        echo "executing: charmrun $NAMD_DIR/namd2 +p$cpunp ++nodelist ${MACHFILE} +idlepoll +devices $CUDA_VISIBLE_DEVICES ${MYARGS}"
        charmrun $NAMD_DIR/namd2 +p$cpunp ++nodelist ${MACHFILE} +idlepoll +devices $CUDA_VISIBLE_DEVICES ${MYARGS}

# all else (lammps, amber, ...)
else
        echo "executing: ${MPIRUN} -launcher ssh -f ${MACHFILE} -n $cpunp ${MYARGS}"
        ${MPIRUN} -launcher ssh -f ${MACHFILE} -n $cpunp ${MYARGS}
fi

exit $?
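
Three small steps carry most of the wrapper's logic: splitting CPU_GPU_REQUEST into a cpu and a gpu count, renumbering the allocated GPUs from 0 for Gromacs, and checking that all slots in LSB_HOSTS are on one host. The sketch below isolates those steps so they can be tried outside the scheduler. The function names are made up for illustration, and it assumes LSB_HOSTS is a space-separated list with one host name per slot.

```shell
#!/bin/bash
# Hypothetical helpers mirroring the wrapper's logic; not part of the wrapper.

# split_request CPUS:GPUS -> "cpunp=<CPUS> gpunp=<GPUS>"
split_request() {
    local req=$1
    echo "cpunp=${req%%:*} gpunp=${req##*:}"
}

# gmx_gpu_string N -> ids of N GPUs renumbered from 0, e.g. 3 -> 012
gmx_gpu_string() {
    local n=$1 i=0 ids=""
    while [ "$i" -lt "$n" ]; do
        ids="${ids}${i}"
        i=$((i + 1))
    done
    echo "$ids"
}

# count_slots_and_hosts "LSB_HOSTS value" -> "np=<slots> nh=<distinct hosts>"
count_slots_and_hosts() {
    local np nh
    np=$(echo "$1" | tr ' ' '\n' | grep -c .)
    nh=$(echo "$1" | tr ' ' '\n' | grep . | sort -u | wc -l | tr -d ' ')
    echo "np=$np nh=$nh"
}

split_request 4:4
gmx_gpu_string 2
count_slots_and_hosts "n78 n78 n78 n78"
```

The wrapper itself rejects a job when the slot count is outside 1 to 8 or when more than one distinct host shows up; the last helper prints the two numbers those checks are based on.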

PMMA Bench

- Runs fastest when constrained to one gpu with 4 mpi threads
- Room for improvement, as gpu and gpu memory are not fully utilized
- Adding mpi threads or more gpus reduces ns/day performance
- No idea if adding omp threads shows a different picture
- No idea how it compares to K20 gpus

nvidia-smi -pm 0; nvidia-smi -c 0
# gpu_id is done via CUDA_VISIBLE_DEVICES
export CUDA_VISIBLE_DEVICES=0,1,2,3   # or a subset, e.g. 3 or 0,1

# on n78
cd /home/hmeij/lammps/benchmark
rm -f /tmp/lmp-run.log;rm -f *.jpg;\
time /usr/local/mpich-3.1.4/bin/mpirun -launcher ssh -f ./hostfile  -n $STRING_1 \
/usr/local/lammps-11Aug17/lmp_mpi-double-double-with-gpu -suffix gpu -pk gpu $STRING_2 \
-in nvt.in -var t 310 > /dev/null 2>&1; grep ^Performance /tmp/lmp-run.log


PMMA Benchmark Performance Metric ns/day (x  nr of gpus for node output)


Lammps 11Aug17 on GTX1080Ti (n78)

-n 1, -gpu_id 3
Performance: 19.974 ns/day, 1.202 hours/ns, 231.176 timesteps/s
3, GeForce GTX 1080 Ti, 38, 219 MiB, 10953 MiB, 30 %, 1 %                                                      
-n 2, -gpu_id 3
Performance: 33.806 ns/day, 0.710 hours/ns, 391.277 timesteps/s
3, GeForce GTX 1080 Ti, 57, 358 MiB, 10814 MiB, 47 %, 3 %
-n 4, -gpu_id 3
Performance: 48.504 ns/day, 0.495 hours/ns, 561.388 timesteps/s (x4 = 194 ns/day/node)
3, GeForce GTX 1080 Ti, 59, 690 MiB, 10482 MiB, 76 %, 4 %
-n 8, -gpu_id 3
Performance: 37.742 ns/day, 0.636 hours/ns, 436.833 timesteps/s
3, GeForce GTX 1080 Ti, 47, 1332 MiB, 9840 MiB, 90 %, 4 %
-n 4, -gpu_id 01
Performance: 57.621 ns/day, 0.417 hours/ns, 666.912 timesteps/s 
0, GeForce GTX 1080 Ti, 48, 350 MiB, 10822 MiB, 50 %, 3 %
1, GeForce GTX 1080 Ti, 37, 344 MiB, 10828 MiB, 49 %, 3 %
-n 8, -gpu_id 01
Performance: 63.625 ns/day, 0.377 hours/ns, 736.400 timesteps/s (x2 = 127 ns/day/node)
0, GeForce GTX 1080 Ti, 66, 670 MiB, 10502 MiB, 77 %, 4 %
1, GeForce GTX 1080 Ti, 51, 670 MiB, 10502 MiB, 81 %, 4 %
-n 12, -gpu_id 01
Performance: 61.198 ns/day, 0.392 hours/ns, 708.315 timesteps/s
0, GeForce GTX 1080 Ti, 65, 988 MiB, 10184 MiB, 82 %, 4 %
1, GeForce GTX 1080 Ti, 50, 990 MiB, 10182 MiB, 85 %, 4 %
-n 8, -gpu_id 0123
Performance: 86.273 ns/day, 0.278 hours/ns, 998.534 timesteps/s 
0, GeForce GTX 1080 Ti, 56, 340 MiB, 10832 MiB, 57 %, 3 %
1, GeForce GTX 1080 Ti, 41, 340 MiB, 10832 MiB, 52 %, 2 %
2, GeForce GTX 1080 Ti, 43, 340 MiB, 10832 MiB, 57 %, 3 %
3, GeForce GTX 1080 Ti, 42, 340 MiB, 10832 MiB, 55 %, 2 %
-n 12, -gpu_id 0123
Performance: 108.905 ns/day, 0.220 hours/ns, 1260.478 timesteps/s (x1 = 109 ns/day/node)
-n 16
Performance: 88.989 ns/day, 0.270 hours/ns, 1029.964 timesteps/s



# on n34
unable to get it to run...

K20 on n34 

-n 1, -gpu_id 0
-n 4, -gpu_id 0
-n 4, -gpu_id 0123

# comparison of binaries running PMMA
# 1 gpu 4 mpi threads each run

# lmp_mpi-double-double-with-gpu.log
Performance: 49.833 ns/day, 0.482 hours/ns, 576.769 timesteps/s
# lmp_mpi-single-double-with-gpu.log
Performance: 58.484 ns/day, 0.410 hours/ns, 676.899 timesteps/s
# lmp_mpi-single-single-with-gpu.log
Performance: 56.660 ns/day, 0.424 hours/ns, 655.793 timesteps/s

FSL

User Time Reported from time command

mwgpu cpu run
2013 model name : Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20GHz
- All tests 45m
- Bft test 16m28s (bedpostx)

amber128 cpu run
2017 model name : Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz
- All tests 17m - 2.5x faster
- Bft test 3m39s - 4.5x faster (bedpostx)

amber128 gpu run
2017 CUDA Device Name: GeForce GTX 1080 Ti
- Bft gpu test 0m1.881s (what!? from command line) - 116x faster (bedpostx_gpu)
- Bft gpu test 0m1.850s (what!? via scheduler) - 118x faster (bedpostx_gpu)
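
The GPU speedups are taken relative to the amber128 cpu run (3m39s). A sketch of the conversion from `time` style values to seconds and the resulting ratios; `to_seconds` and `speedup` are hypothetical helper names:

```shell
#!/bin/bash
# Hypothetical helpers for illustration only.

# to_seconds XmYs: convert a value such as 3m39s or 0m1.881s to seconds
to_seconds() {
    awk -v t="$1" 'BEGIN {
        split(t, part, "m")          # part[1] = minutes, part[2] = "39s"
        sub(/s$/, "", part[2])       # drop the trailing "s"
        printf "%.3f\n", part[1] * 60 + part[2]
    }'
}

# speedup SLOW FAST: how many times faster FAST is, rounded
speedup() {
    awk -v s="$(to_seconds "$1")" -v f="$(to_seconds "$2")" \
        'BEGIN { printf "%.0f\n", s / f }'
}

to_seconds 3m39s
speedup 3m39s 0m1.881s   # gpu run from the command line
speedup 3m39s 0m1.850s   # gpu run via the scheduler
```

Only the minutes-and-seconds form shown on this page is handled; values with an hours part would need an extra split.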

FreeSurfer

http://freesurfer.net/fswiki/DownloadAndInstall#TestyourFreeSurferInstallation
Example using sample-001.mgz

Node n37 (mwgpu cpu run)
(2013) Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20GHz
recon-all -s bert finished without error
example 1 user    0m3.516s
example 2 user    893m1.761s ~15 hours
example 3 user    ???m       ~15 hours (estimated)

Node n78 (amber128 cpu run)
(2017) Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz
recon-all -s bert finished without error
example 1 user    0m2.315s
example 2 user    488m49.215s ~8 hours
example 3 user    478m44.622s ~8 hours


freeview -v \
    bert/mri/T1.mgz \
    bert/mri/wm.mgz \
    bert/mri/brainmask.mgz \
    bert/mri/aseg.mgz:colormap=lut:opacity=0.2 \
    -f \
    bert/surf/lh.white:edgecolor=blue \
    bert/surf/lh.pial:edgecolor=red \
    bert/surf/rh.white:edgecolor=blue \
    bert/surf/rh.pial:edgecolor=red

Development code for the GPU http://surfer.nmr.mgh.harvard.edu/fswiki/freesurfer_linux_developers_page
