\\
**[[cluster:0|Back]]**

==== Submitting GPU Jobs ====

Please leave plenty of time between multiple GPU job submissions, on the order of minutes.

Jobs need to be submitted to the scheduler via cottontail to queues mwgpu, amber128, exx96.

This page is old; the gpu resource ''gpu4'' should be used. A more recent page can be found at [[cluster:173|K20 Redo Usage]]. There might still be some useful information on this page explaining gpu jobs.

 --- //[[hmeij@wesleyan.edu|Henk]] 2021/06/17 15:29//

**Articles**

  * [[http://www.pgroup.com/lit/articles/insider/v5n2a1.htm]] Tesla vs. Xeon Phi vs. Radeon: A Compiler Writer's Perspective
  * [[http://www.pgroup.com/lit/articles/insider/v5n2a5.htm]] Calling CUDA Fortran kernels from MATLAB

The process for submitting GPU jobs on the mwgpu queue is described below. Until we get more data and some experience, we have made the initial assumption to implement the following simplistic model:

  * one cpu core is needed for one GPU job, with exclusive access to that GPU
  * each GPU has 5 GB of memory, so the job reserves that amount plus 20% (6144 MB) of host memory per GPU
  * if a job uses more than one GPU, they must all be on the same node (max=4)

An "elim" has been written for the Lava scheduler that reports the number of available idle GPUs. To write and set up an "elim" read this page: [[cluster:49|eLIM]]. We can view what the scheduler has access to by using ''lsload'' to observe the idle GPUs. After we submit the job, the scheduler reserves the GPU. That can be viewed with the command ''bhosts''. Note that the GPU could still be idle at that point, because it may take a bit of time for the code to spin up and actually use that GPU.



[hmeij@sharptail sharptail]$ lsload -l n33

HOST_NAME               status  r15s   r1m  r15m   ut    pg    io  ls    it   tmp   swp   mem    gpu <---
n33                         ok  25.0  26.1  26.0  80%   5.0   710   3  1464   72G   25G   30G    4.0

[hmeij@sharptail sharptail]$ bsub < run.gpu 

Job <23259> is submitted to queue <mwgpu>.

[hmeij@sharptail sharptail]$ bjobs
                                                             
JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
23259   hmeij   RUN   mwgpu      sharptail   n33         test       Aug 15 

[hmeij@sharptail sharptail]$ bhosts -l n33
HOST  n33
STATUS           CPUF  JL/U    MAX  NJOBS    RUN  SSUSP  USUSP    RSV DISPATCH_WINDOW
ok             100.00     -     28      5      5      0      0      0      -

 CURRENT LOAD USED FOR SCHEDULING:
              r15s   r1m  r15m    ut    pg    io   ls    it   tmp   swp   mem    gpu  
 Total         0.1   0.1   0.1   64%   2.2   188    1  3028   72G   27G   56G    3.0
 Reserved      0.0   0.0   0.0    0%   0.0     0    0     0    0M    0M 6144M    1.0  <---
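
To read the idle GPU count from a script rather than by eye, the ''gpu'' column can be picked out by name. The following is a minimal sketch, not part of the cluster tooling: the function name ''idle_gpus'' is hypothetical, and it assumes the output layout shown above (a header line starting with HOST_NAME, followed by one line per host).

```shell
#!/bin/sh
# Sketch: print the value of the "gpu" column from `lsload -l <host>` style output.
# Assumption: a header line whose first field is HOST_NAME, then one data line per host.
idle_gpus() {
    awk '
        # locate the gpu column by name in the header line
        $1 == "HOST_NAME" { for (i = 1; i <= NF; i++) if ($i == "gpu") col = i; next }
        # print that column for every following line that is long enough
        col && NF >= col  { print $col }
    '
}

# Demo with the sample values shown above
printf '%s\n' \
    'HOST_NAME  status  r15s  r1m  r15m  ut  pg  io  ls  it  tmp  swp  mem  gpu' \
    'n33  ok  25.0  26.1  26.0  80%  5.0  710  3  1464  72G  25G  30G  4.0' | idle_gpus
```

On the cluster this would be used as ''lsload -l n33 | idle_gpus''.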

With ''gpu-info'' we can view our running job. ''gpu-info'' and ''gpu-free'' are available at ~~[[http://ambermd.org/gpus/]]~~ [[http://ambermd.org/gpus12/#Running]] (I had to hard code my GPU string information as they came in at 02, 03, 82 and 83; you can use deviceQuery to find them).



[hmeij@sharptail sharptail]$ ssh n33 gpu-info
unloading gcc module
====================================================
Device  Model           Temperature     Utilization
====================================================
0       Tesla K20m      25 C            0 %
1       Tesla K20m      27 C            0 %
2       Tesla K20m      32 C            99 %
3       Tesla K20m      21 C            0 %
====================================================

[hmeij@sharptail sharptail]$ ssh n33 gpu-free
1,3,0

This is obviously our job running on GPU instance '2'. But Amber reports that GPU instance '0' is being used. Here is what mdout reports:



|   CUDA Capable Devices Detected:      1
|           CUDA Device ID in use:      0

The reason is that the wrapper program uses CUDA_VISIBLE_DEVICES to mask the GPU instance IDs. The job is told that one GPU is available, and from the job's point of view it is the first one, thus '0'. As a result, when all the GPUs are busy it is hard to tell which one is running your job. So inside the wrapper we grab the real GPU instance ID and report it to standard out. For this job, STDOUT reports it just before MPIRUN is started. (With Lava, look for STDOUT in a file called ~/.lsbatch/[0-9]*.LSF_JOBPID.out).



GPU allocation instance n33:2                                                                        
executing: /cm/shared/apps/mvapich2/gcc/64/1.6/bin/mpirun_rsh -ssh -hostfile \
/home/hmeij/.lsbatch/mpi_machines -np 1 pmemd.cuda.MPI \
-O -o mdout.23259 -inf mdinfo.1K10 -x mdcrd.1K10 -r restrt.1K10 -ref inpcrd
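
The masking can be undone by hand: CUDA numbers the visible devices from 0 in the order they are listed in CUDA_VISIBLE_DEVICES. The sketch below illustrates that lookup; the function name ''physical_gpu'' is hypothetical and is not part of the wrapper.

```shell
#!/bin/sh
# Sketch: translate the device ID an application reports back to the real GPU ID.
# $1 = the CUDA_VISIBLE_DEVICES value the job ran with (for example "2" or "1,3,0")
# $2 = the device index as seen by the application (for example 0)
physical_gpu() {
    echo "$1" | awk -F, -v i="$2" 'i >= 0 && i < NF { print $(i + 1) }'
}

# The Amber job above ran with CUDA_VISIBLE_DEVICES=2 and reported device 0
physical_gpu 2 0        # prints 2
# With a mask of 1,3,0 the application's device 1 is physical GPU 3
physical_gpu 1,3,0 1    # prints 3
```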

The code that assigns and exports CUDA_VISIBLE_DEVICES is shown in the lava.mvapich2.wrapper section below. The run.gpu script below shows examples of how to run Amber and Lammps jobs. Please note that you should always reserve a GPU (gpu=1), otherwise jobs may crash and GPUs can become overcommitted. You may wish to perform a cpu-only run (gpu=0) for comparison or debugging purposes (such runs are limited to a max of 4 job slots per node). Do not launch GPU enabled software during such a run: pmemd.cuda.MPI is GPU enabled, so invoke pmemd.MPI instead. In the case of Lammps you can toggle between GPU enabled software and MPI enabled software using an environment variable. Here is the header of the Lammps input file:



# Enable GPU code if variable is set.
if "(${GPUIDX} > 0)" then &
        "suffix gpu" &
        "newton off" &
        "package gpu force 0 ${GPUIDX} 1.0"
#       "package gpu force 0 ${GPUIDX} 1.0 threads_per_atom 2"
echo    both

NAMD works slightly differently again and has everything built in; its hostfile is also a bit different. The wrapper will set up the environment and invoke NAMD with several preset flags and then add whatever arguments you provide. Again, -n will match the number of GPUs allocated on a single node. For example:



GPU allocation instance n37:1,3,0
charmrun /cm/shared/apps/namd/ibverbs-smp-cuda/2013-06-02//namd2 \
+p3 ++nodelist /home/hmeij/.lsbatch/mpi_machines +idlepoll +devices 1,3,0 \
apoa1/apoa1.namd
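
The "bit different" hostfile is a list with one ''host <name>'' line per allocated slot, which the wrapper builds from the scheduler's host list. A minimal sketch of that transformation follows; the function name ''namd_nodelist'' is hypothetical, and the input is assumed to be a space separated host list such as the scheduler places in LSB_HOSTS.

```shell
#!/bin/sh
# Sketch: turn a space separated host list into charmrun nodelist lines.
# $1 = host list, one entry per allocated slot (assumed to match LSB_HOSTS)
namd_nodelist() {
    for h in $1; do
        echo "host $h"
    done
}

# Three slots on n37, matching the +p3 example above
namd_nodelist "n37 n37 n37"
```

For the three-GPU job above this prints ''host n37'' three times.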

==== run.gpu ====



#!/bin/bash
# submit via 'bsub < run.gpu'
rm -f mdout.[0-9]* auout.[0-9]* apoa1out.[0-9]*
#BSUB -e err
#BSUB -o out
#BSUB -q mwgpu
#BSUB -J test

# from greentail we need to set up the module env
export PATH=/home/apps/bin:/cm/local/apps/cuda50/libs/304.54/bin:\
/cm/shared/apps/cuda50/sdk/5.0.35/bin/linux/release:/cm/shared/apps/lammps/cuda/2013-01-27/:\
/cm/shared/apps/amber/amber12/bin:/cm/shared/apps/namd/ibverbs-smp-cuda/2013-06-02/:\
/usr/lib64/qt-3.3/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/sbin:\
/usr/sbin:/cm/shared/apps/cuda50/toolkit/5.0.35/bin:/cm/shared/apps/cuda50/sdk/5.0.35/bin/linux/release:\
/cm/shared/apps/cuda50/libs/current/bin:/cm/shared/apps/cuda50/toolkit/5.0.35/open64/bin:\
/cm/shared/apps/mvapich2/gcc/64/1.6/bin:/cm/shared/apps/mvapich2/gcc/64/1.6/sbin
export LD_LIBRARY_PATH=/cm/local/apps/cuda50/libs/304.54/lib64:\
/cm/shared/apps/cuda50/toolkit/5.0.35/lib64:/cm/shared/apps/amber/amber12/lib:\
/cm/shared/apps/amber/amber12/lib64:/cm/shared/apps/namd/ibverbs-smp-cuda/2013-06-02/:\
/cm/shared/apps/cuda50/toolkit/5.0.35/lib64:/cm/shared/apps/cuda50/libs/current/lib64:\
/cm/shared/apps/cuda50/toolkit/5.0.35/open64/lib:/cm/shared/apps/cuda50/toolkit/5.0.35/extras/CUPTI/lib:\
/cm/shared/apps/mvapich2/gcc/64/1.6/lib


## leave sufficient time between job submissions (30-60 secs)
## the number of GPUs allocated matches -n value automatically
## always reserve GPU (gpu=1), setting this to 0 is a cpu job only
## reserve 6144 MB (5 GB + 20%) memory per GPU
## run all processes (1<=n<=4) on same node (hosts=1).

#BSUB -n 1
#BSUB -R "rusage[gpu=1:mem=6144],span[hosts=1]"

# unique job scratch dirs
MYSANSCRATCH=/sanscratch/$LSB_JOBID
MYLOCALSCRATCH=/localscratch/$LSB_JOBID
export MYSANSCRATCH MYLOCALSCRATCH
cd $MYSANSCRATCH

# AMBER
# stage the data
cp -r ~/sharptail/* .
# feed the wrapper
lava.mvapich2.wrapper pmemd.cuda.MPI \
-O -o mdout.$LSB_JOBID -inf mdinfo.1K10 -x mdcrd.1K10 -r restrt.1K10 -ref inpcrd
# save results
cp mdout.$LSB_JOBID ~/sharptail/

# LAMMPS
# GPUIDX=1 use allocated GPU(s), GPUIDX=0 cpu run only (view header au.inp)
export GPUIDX=1 
# stage the data
cp -r ~/sharptail/* .
# feed the wrapper
lava.mvapich2.wrapper lmp_nVidia \
-c off -var GPUIDX $GPUIDX -in au.inp -l auout.$LSB_JOBID
# save results
cp auout.$LSB_JOBID ~/sharptail/

# NAMD 
# signal that this is charmrun/namd job
export CHARMRUN=1
# stage the data
cp -r ~/sharptail/* .
# feed the wrapper
lava.mvapich2.wrapper \
apoa1/apoa1.namd > apoa1out.$LSB_JOBID
# save results
cp apoa1out.$LSB_JOBID ~/sharptail/
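
The 6144 MB in the resource request comes from the rule "GPU memory plus 20%", expressed in MB. The sketch below shows that arithmetic; the helper name ''gpu_mem_mb'' is hypothetical and nothing on the cluster requires it.

```shell
#!/bin/sh
# Sketch: host memory to reserve per GPU, in MB.
# $1 = GPU memory in GB, $2 = extra headroom in percent
gpu_mem_mb() {
    echo $(( $1 * 1024 * (100 + $2) / 100 ))
}

# 5 GB plus 20% gives the value used in run.gpu
echo "rusage[gpu=1:mem=$(gpu_mem_mb 5 20)],span[hosts=1]"
```

This prints ''rusage[gpu=1:mem=6144],span[hosts=1]'', the same string passed to ''#BSUB -R'' in run.gpu.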

==== gromacs.sub ====



#!/bin/bash

rm -rf gromacs.out gromacs.err \#* *.log

# from greentail we need to recreate module env
export PATH=/home/apps/bin:/cm/local/apps/cuda50/libs/304.54/bin:\
/cm/shared/apps/cuda50/sdk/5.0.35/bin/linux/release:/cm/shared/apps/lammps/cuda/2013-01-27/:\
/cm/shared/apps/amber/amber12/bin:/cm/shared/apps/namd/ibverbs-smp-cuda/2013-06-02/:\
/usr/lib64/qt-3.3/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/sbin:\
/usr/sbin:/cm/shared/apps/cuda50/toolkit/5.0.35/bin:/cm/shared/apps/cuda50/sdk/5.0.35/bin/linux/release:\
/cm/shared/apps/cuda50/libs/current/bin:/cm/shared/apps/cuda50/toolkit/5.0.35/open64/bin:\
/cm/shared/apps/mvapich2/gcc/64/1.6/bin:/cm/shared/apps/mvapich2/gcc/64/1.6/sbin
export LD_LIBRARY_PATH=/cm/local/apps/cuda50/libs/304.54/lib64:\
/cm/shared/apps/cuda50/toolkit/5.0.35/lib64:/cm/shared/apps/amber/amber12/lib:\
/cm/shared/apps/amber/amber12/lib64:/cm/shared/apps/namd/ibverbs-smp-cuda/2013-06-02/:\
/cm/shared/apps/cuda50/toolkit/5.0.35/lib64:/cm/shared/apps/cuda50/libs/current/lib64:\
/cm/shared/apps/cuda50/toolkit/5.0.35/open64/lib:/cm/shared/apps/cuda50/toolkit/5.0.35/extras/CUPTI/lib:\
/cm/shared/apps/mvapich2/gcc/64/1.6/lib

#BSUB -o gromacs.out
#BSUB -e gromacs.err
#BSUB -N
#BSUB -J 325monolayer

# read /share/apps/gromacs/build.sh
. /share/apps/intel/composerxe/bin/iccvars.sh intel64
export VMDDIR=/share/apps/vmd/1.8.6

## CPU RUN: queue mw256, n<=28, must run on one node (thread_mpi)
##BSUB -q mw256
##BSUB -n 2
##BSUB -R "rusage[gpu=0],span[hosts=1]"
#export PATH=/share/apps/gromacs/4.6-icc-gpu/bin:$PATH
#. /share/apps/gromacs/4.6-icc-gpu/bin/GMXRC.bash
#mdrun -nt 2 -s 325topol.tpr -c 325monolayer.gro -e 325ener.edr -o 325traj.trr -x 325traj.xtc

## GPU RUN: gpu (1-4), queue mwgpu, n (1-4, matches gpu count), must run on one node
##BSUB -q mwgpu
##BSUB -n 1
##BSUB -R "rusage[gpu=1:mem=7000],span[hosts=1]"
## signal GMXRC is a gpu run with: 1=thread_mpi
#export GMXRC=1
#export PATH=/share/apps/gromacs/4.6-icc-gpu/bin:$PATH
#. /share/apps/gromacs/4.6-icc-gpu/bin/GMXRC.bash
#lava.mvapich2.wrapper mdrun \
#-testverlet -s 325topol.tpr -c 325monolayer.gro -e 325ener.edr -o 325traj.trr -x 325traj.xtc

# GPU RUN: gpu (1-4), queue mwgpu, n (1-4, matches gpu count), must run on one node
#BSUB -q mwgpu
#BSUB -n 1
#BSUB -R "rusage[gpu=1:mem=7000],span[hosts=1]"
# signal GMXRC is a gpu run with: 2=mvapich2
export GMXRC=2
export PATH=/share/apps/gromacs/4.6-mpi-gpu/bin:$PATH
. /share/apps/gromacs/4.6-mpi-gpu/bin/GMXRC.bash
lava.mvapich2.wrapper mdrun_mpi \
-testverlet -s 325topol.tpr -c 325monolayer.gro -e 325ener.edr -o 325traj.trr -x 325traj.xtc

==== matlab.sub ====



#!/bin/bash

rm -rf out err *.out

# from greentail we need to recreate module env
export PATH=/home/apps/bin:/cm/local/apps/cuda50/libs/304.54/bin:\
/cm/shared/apps/cuda50/sdk/5.0.35/bin/linux/release:/cm/shared/apps/lammps/cuda/2013-01-27/:\
/cm/shared/apps/amber/amber12/bin:/cm/shared/apps/namd/ibverbs-smp-cuda/2013-06-02/:\
/usr/lib64/qt-3.3/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/sbin:\
/usr/sbin:/cm/shared/apps/cuda50/toolkit/5.0.35/bin:/cm/shared/apps/cuda50/sdk/5.0.35/bin/linux/release:\
/cm/shared/apps/cuda50/libs/current/bin:/cm/shared/apps/cuda50/toolkit/5.0.35/open64/bin:\
/cm/shared/apps/mvapich2/gcc/64/1.6/bin:/cm/shared/apps/mvapich2/gcc/64/1.6/sbin
export PATH=/share/apps/matlab/2013a/bin:$PATH
export LD_LIBRARY_PATH=/cm/local/apps/cuda50/libs/304.54/lib64:\
/cm/shared/apps/cuda50/toolkit/5.0.35/lib64:/cm/shared/apps/amber/amber12/lib:\
/cm/shared/apps/amber/amber12/lib64:/cm/shared/apps/namd/ibverbs-smp-cuda/2013-06-02/:\
/cm/shared/apps/cuda50/toolkit/5.0.35/lib64:/cm/shared/apps/cuda50/libs/current/lib64:\
/cm/shared/apps/cuda50/toolkit/5.0.35/open64/lib:/cm/shared/apps/cuda50/toolkit/5.0.35/extras/CUPTI/lib:\
/cm/shared/apps/mvapich2/gcc/64/1.6/lib

#BSUB -o out
#BSUB -e err
#BSUB -N
#BSUB -J test

# GPU RUN: (1-4), queue mwgpu, n (1-4, matches gpu count), must run on one node
#BSUB -q mwgpu
#BSUB -n 1
#BSUB -R "rusage[gpu=1:mem=7000],span[hosts=1]"
# signal MATGPU is a gpu run
export MATGPU=1
lava.mvapich2.wrapper matlab -nodisplay  -r test

==== lava.mvapich2.wrapper ====



#!/bin/sh

# This is a copy of lava.openmpi.wrapper which came with lava OCS kit
# Trying to make it work with mvapich2
# -hmeij 13aug2013

#
#  Copyright (c) 2007 Platform Computing
#
# This script is a wrapper for openmpi mpirun
# it generates the machine file based on the hosts
# given to it by Lava.
#

# RLIMIT_MEMLOCK problem with libibverbs -hmeij
ulimit -l unlimited


usage() {
        cat < ${ARGLIST}
#T=`grep -- -machinefile ${ARGLIST} |wc -l`
T=`grep -- -hostfile ${ARGLIST} |wc -l`
if [ $T -gt 0 ]; then
    echo "Error:  Do not provide the machinefile for mpirun."
    echo "        It is generated automatically for you."
    exit -3
fi

# Make the open-mpi machine file
echo "${LSB_HOSTS}" > ${MACHFILE}.lst
tr '\/ ' '\r\n' < ${MACHFILE}.lst > ${MACHFILE}

MPIRUN=`which --skip-alias mpirun_rsh`
#MPIRUN=/share/apps/openmpi/1.2+intel-9/bin/mpirun
#echo "executing: ${MPIRUN} -x LD_LIBRARY_PATH -machinefile ${MACHFILE} ${MYARGS}"

# sanity checks number of processes 1-4
np=`wc -l ${MACHFILE} | awk '{print $1}'` 
if [ $np -lt 1 -o $np -gt 4 ]; then
    echo "Error:  Incorrect number of processes ($np)"
    echo "        -n can be an integer in the range of 1 to 4"
    exit -4
fi

# sanity check single node
nh=`cat ${MACHFILE} | sort -u | wc -l` 
if [ $nh -ne 1 ]; then
    echo "Error:  No host or more than one host specified ($nh)"
    exit -5
fi

# one host, one to four gpus
gpunp=`cat ${MACHFILE} | wc -l | awk '{print $1}'`
gpuhost=`cat ${MACHFILE} | sort -u | tr -d '\n'`
gpuid=( $(for i in `ssh $gpuhost gpu-free | sed "s/,/ /g"`; do echo $i; done | shuf | head -$gpunp) )
if [ $gpunp -eq 1 ]; then
        CUDA_VISIBLE_DEVICES=$gpuid
        echo "GPU allocation instance $gpuhost:$gpuid"
else
        gpuids=`echo ${gpuid[@]} | sed "s/ /,/g"`
        CUDA_VISIBLE_DEVICES="$gpuids"
        echo "GPU allocation instance $gpuhost:$CUDA_VISIBLE_DEVICES"
fi
# namd ignores this
export CUDA_VISIBLE_DEVICES
#debug# setid=`ssh $gpuhost echo $CUDA_VISIBLE_DEVICES | tr '\n' ' '`
#debug# echo "setid=$setid";


if [ -n "$GMXRC" ]; then
        # gromacs needs them from base 0, so gpu 2,3 is string 01
        if [ ${#gpuid[*]} -eq 1 ]; then
                gmxrc_gpus="0"
        elif [ ${#gpuid[*]} -eq 2 ]; then
                gmxrc_gpus="01"
        elif [ ${#gpuid[*]} -eq 3 ]; then
                gmxrc_gpus="012"
        elif [ ${#gpuid[*]} -eq 4 ]; then
                gmxrc_gpus="0123"
        fi

        if [ $GMXRC -eq 1 ]; then
                newargs=`echo ${MYARGS} | sed "s/mdrun/mdrun -gpu_id $gmxrc_gpus/g"`
                echo "executing: $newargs"
                $newargs
        elif [ $GMXRC -eq 2 ]; then
                newargs=`echo ${MYARGS} | sed "s/mdrun_mpi/mdrun_mpi -gpu_id $gmxrc_gpus/g"`
                echo "executing: ${MPIRUN} -ssh -hostfile ${MACHFILE} -np $gpunp $newargs"
                ${MPIRUN} -ssh -hostfile ${MACHFILE} -np $gpunp $newargs
        fi

elif [ -n "$MATGPU" ] && [ $MATGPU -eq 1 ]; then
        echo "executing: ${MYARGS}"
        ${MYARGS}
elif [ -n "$CHARMRUN" ] && [ $CHARMRUN -eq 1 ]; then
        cat ${MACHFILE}.lst | tr '\/ ' '\r\n' | sed 's/^/host /g' > ${MACHFILE}
        echo "executing: charmrun $NAMD_DIR/namd2 +p$gpunp ++nodelist ${MACHFILE} +idlepoll +devices $CUDA_VISIBLE_DEVICES ${MYARGS}"
        charmrun $NAMD_DIR/namd2 +p$gpunp ++nodelist ${MACHFILE} +idlepoll +devices $CUDA_VISIBLE_DEVICES ${MYARGS}
else
        echo "executing: ${MPIRUN} -ssh -hostfile ${MACHFILE} -np $gpunp ${MYARGS}"
        ${MPIRUN} -ssh -hostfile ${MACHFILE} -np $gpunp ${MYARGS}
fi

exit $?
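
The GROMACS branch of the wrapper maps a GPU count to the string "0", "01", "012" or "0123", because the masked devices are always renumbered from 0. The same string could be generated with a loop. This is only a sketch of an alternative, not the code the wrapper uses, and the function name ''gmx_gpu_ids'' is hypothetical.

```shell
#!/bin/sh
# Sketch: build the GPU id string for N masked GPUs, counting from 0.
# $1 = number of GPUs allocated to the job (1 to 4 on these nodes)
gmx_gpu_ids() {
    n=$1
    i=0
    ids=""
    while [ "$i" -lt "$n" ]; do
        ids="$ids$i"
        i=$((i + 1))
    done
    echo "$ids"
}

gmx_gpu_ids 3    # prints 012
```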

===== elim code =====



#!/usr/bin/perl

while (1) {

        $gpu = 0;
        $log = '';
        if (-e "/usr/local/bin/gpu-info" ) {
                $tmp = `/usr/local/bin/gpu-info | egrep "Tesla K20"`;
                @tmp = split(/\n/,$tmp);
                foreach $i (0..$#tmp) {
                        ($a,$b,$c,$d,$e,$f,$g) = split(/\s+/,$tmp[$i]);
                        if ( $f == 0 ) { $gpu = $gpu + 1; }
                        #print "$a $f $gpu\n";
                        $log .= "$f,";
                }
        }
        # nr_of_args name1 value1 
        $string = "1 gpu $gpu";

        $h = `hostname`; chop($h);
        $d = `date +%m/%d/%y_%H:%M:%S`; chop($d);
        foreach $i ('n33','n34','n35','n36','n37') {
                if ( "$h" eq "$i" ) {
                        `echo "$d,$log" >> /share/apps/logs/$h.gpu.log`;
                }
        }

        # you need the \n to flush -hmeij
        # you also need the space before the line feed -hmeij
        print "$string \n"; 
        # or use
        #syswrite(OUT,$string,1);

        # smaller than specified in lsf.shared
        sleep 10;

}
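
The counting step of the elim can be tried offline on saved ''gpu-info'' output. The sketch below repeats that one step in awk; it is not a replacement for the Perl elim (no loop, no logging). The function name ''elim_report'' is hypothetical, and it assumes the utilization percentage is the 6th whitespace separated field, as in the sample output earlier on this page.

```shell
#!/bin/sh
# Sketch: count idle GPUs in gpu-info style output and print one elim report line.
# Assumption: utilization is the 6th whitespace separated field of each device line.
elim_report() {
    awk '
        /Tesla K20/ { if ($6 == 0) idle++ }
        # format is: nr_of_args name1 value1, with a space before the newline
        END         { printf "1 gpu %d \n", idle }
    '
}

# Demo with the four device lines from the gpu-info sample
printf '%s\n' \
    '0       Tesla K20m      25 C            0 %' \
    '1       Tesla K20m      27 C            0 %' \
    '2       Tesla K20m      32 C            99 %' \
    '3       Tesla K20m      21 C            0 %' | elim_report
```

With one of the four GPUs at 99% utilization, the report line is ''1 gpu 3 ''.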

\\
**[[cluster:0|Back]]**