Jobs need to be submitted to the scheduler on host sharptail itself for now and will be dispatched to nodes n33-n37 in queue mwgpu. — Meij, Henk 2013/08/21 11:01
The process for submitting GPU jobs on the mwgpu queue is described below. We have made the initial assumption, until we get more data and some experience, to implement the following simplistic model:
An “elim” has been written for the Lava scheduler that reports the number of available idle GPUs. To write and set up an “elim”, read this page: eLIM. We can view what the scheduler has access to by using lsload to observe the idle GPUs. After we submit the job, the scheduler reserves the GPU; that can be viewed with the command bhosts. Note that the GPU could still be idle at that point, since it may take a bit of time for the code to spin up and actually use that GPU.
[hmeij@sharptail sharptail]$ lsload -l n33
HOST_NAME  status  r15s  r1m   r15m  ut   pg   io   ls  it    tmp  swp  mem  gpu   <---
n33        ok      25.0  26.1  26.0  80%  5.0  710  3   1464  72G  25G  30G  4.0

[hmeij@sharptail sharptail]$ bsub < run.gpu
Job <23259> is submitted to queue <mwgpu>.

[hmeij@sharptail sharptail]$ bjobs
JOBID  USER   STAT  QUEUE  FROM_HOST  EXEC_HOST  JOB_NAME  SUBMIT_TIME
23259  hmeij  RUN   mwgpu  sharptail  n33        test      Aug 15

[hmeij@sharptail sharptail]$ bhosts -l n33
HOST  n33
STATUS  CPUF    JL/U  MAX  NJOBS  RUN  SSUSP  USUSP  RSV  DISPATCH_WINDOW
ok      100.00  -     28   5      5    0      0      0    -

CURRENT LOAD USED FOR SCHEDULING:
          r15s  r1m  r15m  ut   pg   io   ls  it    tmp  swp  mem    gpu
Total     0.1   0.1  0.1   64%  2.2  188  1   3028  72G  27G  56G    3.0
Reserved  0.0   0.0  0.0   0%   0.0  0    0   0     0M   0M   6144M  1.0   <---
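The idle GPU count reported by the elim can also be pulled out of the lsload output in a script. The following is a minimal sketch, not part of the original setup: the function name idle_gpus is hypothetical, and it assumes the header row starts with HOST_NAME and contains a column literally named gpu, as in the listing above.

```shell
# idle_gpus: read `lsload -l <host>` output on stdin and print
# "<host> <gpu value>" for every host row.
# Assumption: the header row starts with HOST_NAME and has a column
# named "gpu" (the custom elim resource shown above).
idle_gpus() {
  awk '
    $1 == "HOST_NAME" {
      # header row: remember which column holds the gpu resource
      for (i = 1; i <= NF; i++) if ($i == "gpu") col = i
      next
    }
    # data rows: print host name and the value in the gpu column
    col && NF >= col { print $1, $col }
  '
}

# Usage on the cluster (not executed here):
#   lsload -l n33 | idle_gpus
```

Looking the column up by name rather than by position keeps the sketch working if other load indices are added to the lsload output.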
With gpu-info we can view our running job. gpu-info and gpu-free are available at http://ambermd.org/gpus/ (I had to hard-code my GPU string information, as the GPUs came in at 02, 03, 82 and 83; you can use deviceQuery to find them).
[hmeij@sharptail sharptail]$ ssh n33 gpu-info
unloading gcc module
====================================================
Device  Model       Temperature  Utilization
====================================================
0       Tesla K20m  25 C         0 %
1       Tesla K20m  27 C         0 %
2       Tesla K20m  32 C         99 %
3       Tesla K20m  21 C         0 %
====================================================
This is obviously our job running on GPU instance '2'. But Amber reports that GPU instance '0' is being used. Here is what mdout reports:
| CUDA Capable Devices Detected: 1
| CUDA Device ID in use: 0
The reason for this is that the wrapper program uses CUDA_VISIBLE_DEVICES to mask the GPU instance IDs. The job is told that one GPU is available, and from its point of view that is the first one, hence '0'. As a result, when all the GPUs are busy it is hard to find your job. So inside the wrapper we grab the real GPU instance ID and report it to standard out. For this job, STDOUT reports it just before MPIRUN is started. (With Lava, look for STDOUT in a file called ~/.lsbatch/[0-9]*.LSF_JOBPID.out).
GPU allocation instance n33:2
executing: /cm/shared/apps/mvapich2/gcc/64/1.6/bin/mpirun_rsh -ssh -hostfile \
/home/hmeij/.lsbatch/mpi_machines -np 1 pmemd.cuda.MPI \
-O -o mdout.23259 -inf mdinfo.1K10 -x mdcrd.1K10 -r restrt.1K10 -ref inpcrd
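To translate the device ID an application reports back to the physical GPU, recall that with CUDA_VISIBLE_DEVICES set the application sees the listed devices renumbered from 0 in the order given. The helper below is a small sketch of that lookup; the name real_gpu_id is hypothetical and not part of the wrapper.

```shell
# real_gpu_id: map a device index as seen by the application back to
# the physical GPU ID.
#   $1  value of CUDA_VISIBLE_DEVICES, e.g. "2" or "1,3,0"
#   $2  zero-based device index reported by the application
# Prints the physical ID, or returns non-zero if the index is out of range.
real_gpu_id() {
  printf '%s\n' "$1" | awk -F, -v i="$2" '{
    n = i + 1                      # awk fields are 1-based
    if (n >= 1 && n <= NF) print $n
    else exit 1
  }'
}

# For the Amber job above: mask "2", application reports device 0
real_gpu_id 2 0
```

For the job shown above this prints 2, which matches the busy device in the gpu-info listing.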
The code that assigns, exports, and handles CUDA_VISIBLE_DEVICES is shown in the lava.mvapich2.wrapper section below.
The program below shows examples of how to run Amber and Lammps jobs. Please note that you should always reserve a GPU (gpu=1); otherwise jobs may crash and GPUs can become overcommitted. You may wish to perform a CPU-only run (gpu=0) for comparison or debugging purposes, but such runs are limited to a maximum of 4 job slots per node. Do not launch GPU-enabled software during such a run: pmemd.cuda.MPI is GPU enabled, so invoke pmemd.MPI instead. In the case of Lammps you can toggle between GPU-enabled software and MPI-enabled software using an environment variable. Here is the header of the Lammps input file:
# Enable GPU code if variable is set.
if "(${GPUIDX} > 0)" then &
  "suffix gpu" &
  "newton off" &
  "package gpu force 0 ${GPUIDX} 1.0"
# "package gpu force 0 ${GPUIDX} 1.0 threads_per_atom 2"
echo both
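Amber has no such switch inside its input file, so the equivalent toggle has to happen in the job script by choosing the binary. A minimal sketch follows, assuming a hypothetical variable USE_GPU that you keep in step with the gpu= value in your rusage string; neither the variable nor the function is part of the original scripts.

```shell
# amber_binary: pick the Amber executable for a run.
#   $1  1 (or more) for a GPU run, 0 or empty for a CPU-only run
# USE_GPU is a hypothetical variable for illustration only.
amber_binary() {
  if [ "${1:-0}" -ge 1 ]; then
    echo pmemd.cuda.MPI   # GPU-enabled build
  else
    echo pmemd.MPI        # CPU-only MPI build
  fi
}

USE_GPU=0
echo "would run: $(amber_binary "$USE_GPU")"
```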
NAMD works slightly differently again and has everything built in; its hostfile is also a bit different. The wrapper will set up the environment and invoke NAMD with several preset flags and then add whatever arguments you provide. Again, -n will match the number of GPUs allocated on a single node. For example:
GPU allocation instance n37:1,3,0
charmrun /cm/shared/apps/namd/ibverbs-smp-cuda/2013-06-02//namd2 \
+p3 ++nodelist /home/hmeij/.lsbatch/mpi_machines +idlepoll +devices 1,3,0 \
apoa1/apoa1.namd
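The difference in the hostfile is that charmrun expects each entry prefixed with the keyword host, whereas mpirun_rsh takes bare host names. The sketch below mirrors what the wrapper does with the scheduler's host list; make_nodelist is a hypothetical name, and the input is assumed to be a space-separated list such as the one Lava supplies in LSB_HOSTS.

```shell
# make_nodelist: turn a space-separated host list into charmrun
# nodelist entries, one "host <name>" line per allocated slot.
#   $1  host list, e.g. "n37 n37 n37" for a 3-GPU job on n37
make_nodelist() {
  for h in $1; do          # unquoted on purpose: split on whitespace
    printf 'host %s\n' "$h"
  done
}

make_nodelist "n37 n37 n37"
```

For the three-GPU job above this yields three identical host n37 lines, which lines up with the +p3 flag passed to charmrun.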
#!/bin/bash
# submit via 'bsub < run.gpu'
rm -f mdout.[0-9]* auout.[0-9]* apoa1out.[0-9]*
#BSUB -e err
#BSUB -o out
#BSUB -q mwgpu
#BSUB -J test

## leave sufficient time between job submissions (30-60 secs)
## the number of GPUs allocated matches -n value automatically
## always reserve GPU (gpu=1), setting this to 0 is a cpu job only
## reserve 6144 MB (5 GB + 20%) memory per GPU
## run all processes (1<=n<=4) on same node (hosts=1).

#BSUB -n 1
#BSUB -R "rusage[gpu=1:mem=6144],span[hosts=1]"

# unique job scratch dirs
MYSANSCRATCH=/sanscratch/$LSB_JOBID
MYLOCALSCRATCH=/localscratch/$LSB_JOBID
export MYSANSCRATCH MYLOCALSCRATCH
cd $MYSANSCRATCH

# AMBER
# stage the data
cp -r ~/sharptail/* .
# feed the wrapper
lava.mvapich2.wrapper pmemd.cuda.MPI \
-O -o mdout.$LSB_JOBID -inf mdinfo.1K10 -x mdcrd.1K10 -r restrt.1K10 -ref inpcrd
# save results
cp mdout.$LSB_JOBID ~/sharptail/

# LAMMPS
# GPUIDX=1 use allocated GPU(s), GPUIDX=0 cpu run only (view header au.inp)
export GPUIDX=1
# stage the data
cp -r ~/sharptail/* .
# feed the wrapper
lava.mvapich2.wrapper lmp_nVidia \
-c off -var GPUIDX $GPUIDX -in au.inp -l auout.$LSB_JOBID
# save results
cp auout.$LSB_JOBID ~/sharptail/

# NAMD
# signal that this is charmrun/namd job
export CHARMRUN=1
# stage the data
cp -r ~/sharptail/* .
# feed the wrapper
lava.mvapich2.wrapper \
apoa1/apoa1.namd > apoa1out.$LSB_JOBID
# save results
cp apoa1out.$LSB_JOBID ~/sharptail/
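The mem=6144 value in the script above follows from the comment "5 GB + 20%": 5 x 1024 MB x 1.2 = 6144 MB. If your job has a different footprint, the same rule can be applied with a small helper. This is only a sketch; mem_reserve_mb is a hypothetical name and the 20% headroom is taken from the script comment, not from any scheduler requirement.

```shell
# mem_reserve_mb: memory to reserve in MB for a job expected to use
# $1 GB (integer), with 20% headroom as in the run.gpu comment.
mem_reserve_mb() {
  echo $(( $1 * 1024 * 120 / 100 ))
}

# Build the resource string for a 5 GB job
echo "rusage[gpu=1:mem=$(mem_reserve_mb 5)],span[hosts=1]"
```

Note that #BSUB lines are comments to the shell and are not expanded, so the computed number has to be written literally into the directive.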
#!/bin/bash

rm -rf gromacs.out gromacs.err \#* *.log

# from greentail we need to recreate module env
export PATH=/home/apps/bin:/cm/local/apps/cuda50/libs/304.54/bin:/cm/shared/apps/cuda50/sdk/5.0.35/bin/linux/release:/cm/shared/apps/lammps/cuda/2013-01-27/:/cm/shared/apps/amber/amber12/bin:/cm/shared/apps/namd/ibverbs-smp-cuda/2013-06-02/:/usr/lib64/qt-3.3/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/sbin:/usr/sbin:/cm/shared/apps/cuda50/toolkit/5.0.35/bin:/cm/shared/apps/cuda50/sdk/5.0.35/bin/linux/release:/cm/shared/apps/cuda50/libs/current/bin:/cm/shared/apps/cuda50/toolkit/5.0.35/open64/bin:/cm/shared/apps/mvapich2/gcc/64/1.6/bin:/cm/shared/apps/mvapich2/gcc/64/1.6/sbin
export LD_LIBRARY_PATH=/cm/local/apps/cuda50/libs/304.54/lib64:/cm/shared/apps/cuda50/toolkit/5.0.35/lib64:/cm/shared/apps/amber/amber12/lib:/cm/shared/apps/amber/amber12/lib64:/cm/shared/apps/namd/ibverbs-smp-cuda/2013-06-02/:/cm/shared/apps/cuda50/toolkit/5.0.35/lib64:/cm/shared/apps/cuda50/libs/current/lib64:/cm/shared/apps/cuda50/toolkit/5.0.35/open64/lib:/cm/shared/apps/cuda50/toolkit/5.0.35/extras/CUPTI/lib:/cm/shared/apps/mvapich2/gcc/64/1.6/lib

#BSUB -o gromacs.out
#BSUB -e gromacs.err
#BSUB -N
#BSUB -J 325monolayer

# read /share/apps/gromacs/build.sh
. /share/apps/intel/composerxe/bin/iccvars.sh intel64
export VMDDIR=/share/apps/vmd/1.8.6

## CPU RUN: queue mw256, n<=28, must run on one node (thread_mpi)
##BSUB -q mw256
##BSUB -n 2
##BSUB -R "rusage[gpu=0],span[hosts=1]"
#export PATH=/share/apps/gromacs/4.6-icc-gpu/bin:$PATH
#. /share/apps/gromacs/4.6-icc-gpu/bin/GMXRC.bash
#mdrun -nt 2 -s 325topol.tpr -c 325monolayer.gro -e 325ener.edr -o 325traj.trr -x 325traj.xtc

## GPU RUN: gpu (1-4), queue mwgpu, n (1-4, matches gpu count), must run on one node
##BSUB -q mwgpu
##BSUB -n 1
##BSUB -R "rusage[gpu=1],span[hosts=1]"
## signal GMXRC is a gpu run with: 1=thread_mpi
#export GMXRC=1
#export PATH=/share/apps/gromacs/4.6-icc-gpu/bin:$PATH
#. /share/apps/gromacs/4.6-icc-gpu/bin/GMXRC.bash
#lava.mvapich2.wrapper mdrun \
#-testverlet -s 325topol.tpr -c 325monolayer.gro -e 325ener.edr -o 325traj.trr -x 325traj.xtc

# GPU RUN: gpu (1-4), queue mwgpu, n (1-4, matches gpu count), must run on one node
#BSUB -q mwgpu
#BSUB -n 1
#BSUB -R "rusage[gpu=1],span[hosts=1]"
# signal GMXRC is a gpu run with: 2=mvapich2
export GMXRC=2
export PATH=/share/apps/gromacs/4.6-mpi-gpu/bin:$PATH
. /share/apps/gromacs/4.6-mpi-gpu/bin/GMXRC.bash
lava.mvapich2.wrapper mdrun_mpi \
-testverlet -s 325topol.tpr -c 325monolayer.gro -e 325ener.edr -o 325traj.trr -x 325traj.xtc
#!/bin/sh

# This is a copy of lava.openmpi.wrapper which came with lava OCS kit
# Trying to make it work with mvapich2
# -hmeij 13aug2013

#
# Copyright (c) 2007 Platform Computing
#
# This script is a wrapper for openmpi mpirun
# it generates the machine file based on the hosts
# given to it by Lava.
#

# RLIMIT_MEMLOCK problem with libibverbs -hmeij
ulimit -l unlimited

usage() {
    cat <<USEEOF
USAGE: $0
    This command is a wrapper for mpirun (openmpi). It can
    only be run within Lava using bsub e.g.
        bsub -n # "$0 -np # {my mpi command and args}"

    The wrapper will automatically generate the
    machinefile used by mpirun.

    NOTE: The list of hosts cannot exceed 4KBytes.
USEEOF
}

if [ x"${LSB_JOBFILENAME}" = x -o x"${LSB_HOSTS}" = x ]; then
    usage
    exit -1
fi

MYARGS=$*
WORKDIR=`dirname ${LSB_JOBFILENAME}`
MACHFILE=${WORKDIR}/mpi_machines
ARGLIST=${WORKDIR}/mpi_args

# Check if mpirun is in the PATH -hmeij
T=`which --skip-alias mpirun_rsh`
#T=`which mpirun_rsh`
if [ $? -ne 0 ]; then
    echo "Error: mpirun_rsh is not in your PATH."
    exit -2
fi

echo "${MYARGS}" > ${ARGLIST}
#T=`grep -- -machinefile ${ARGLIST} |wc -l`
T=`grep -- -hostfile ${ARGLIST} |wc -l`
if [ $T -gt 0 ]; then
    echo "Error: Do not provide the machinefile for mpirun."
    echo " It is generated automatically for you."
    exit -3
fi

# Make the open-mpi machine file
echo "${LSB_HOSTS}" > ${MACHFILE}.lst
tr '\/ ' '\r\n' < ${MACHFILE}.lst > ${MACHFILE}

MPIRUN=`which --skip-alias mpirun_rsh`
#MPIRUN=/share/apps/openmpi/1.2+intel-9/bin/mpirun
#echo "executing: ${MPIRUN} -x LD_LIBRARY_PATH -machinefile ${MACHFILE} ${MYARGS}"

# sanity checks number of processes 1-4
np=`wc -l ${MACHFILE} | awk '{print $1}'`
if [ $np -lt 1 -o $np -gt 4 ]; then
    echo "Error: Incorrect number of processes ($np)"
    echo " -n can be an integer in the range of 1 to 4"
    exit -4
fi

# sanity check single node
nh=`cat ${MACHFILE} | sort -u | wc -l`
if [ $nh -ne 1 ]; then
    echo "Error: No host or more than one host specified ($nh)"
    exit -5
fi

# one host, one to four gpus
gpunp=`cat ${MACHFILE} | wc -l | awk '{print $1}'`
gpuhost=`cat ${MACHFILE} | sort -u | tr -d '\n'`
gpuid=( $(for i in `ssh $gpuhost gpu-free | sed "s/,/ /g"`; do echo $i; done | shuf | head -$gpunp) )
if [ $gpunp -eq 1 ]; then
    CUDA_VISIBLE_DEVICES=$gpuid
    echo "GPU allocation instance $gpuhost:$gpuid"
else
    gpuids=`echo ${gpuid[@]} | sed "s/ /,/g"`
    CUDA_VISIBLE_DEVICES="$gpuids"
    echo "GPU allocation instance $gpuhost:$CUDA_VISIBLE_DEVICES"
fi
# namd ignores this
export CUDA_VISIBLE_DEVICES
#debug# setid=`ssh $gpuhost echo $CUDA_VISIBLE_DEVICES | tr '\n' ' '`
#debug# echo "setid=$setid";

# default to 0 so the test does not error out when CHARMRUN is unset
if [ "${CHARMRUN:-0}" -eq 1 ]; then
    cat ${MACHFILE}.lst | tr '\/ ' '\r\n' | sed 's/^/host /g' > ${MACHFILE}
    echo "executing: charmrun $NAMD_DIR/namd2 +p$gpunp ++nodelist ${MACHFILE} +idlepoll +devices $CUDA_VISIBLE_DEVICES ${MYARGS}"
    charmrun $NAMD_DIR/namd2 +p$gpunp ++nodelist ${MACHFILE} +idlepoll +devices $CUDA_VISIBLE_DEVICES ${MYARGS}
else
    echo "executing: ${MPIRUN} -ssh -hostfile ${MACHFILE} -np $gpunp ${MYARGS}"
    ${MPIRUN} -ssh -hostfile ${MACHFILE} -np $gpunp ${MYARGS}
fi

exit $?
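The GPU selection step in the wrapper can be read in isolation as: take the comma-separated list of free GPU IDs, shuffle it, and keep as many IDs as there are job slots. The sketch below restates that step as a standalone function for experimentation. The name pick_gpus is hypothetical, and it assumes gpu-free prints a comma-separated ID list, as the wrapper's sed call implies.

```shell
# pick_gpus: choose N GPU IDs at random from a comma-separated list.
#   $1  free GPU IDs, e.g. "0,1,3" (assumed output format of gpu-free)
#   $2  number of GPUs wanted (the job's -n value)
# Prints the chosen IDs as a comma-separated string suitable for
# CUDA_VISIBLE_DEVICES. If fewer IDs are free than requested, fewer
# are returned, so callers may want to check the count.
pick_gpus() {
  printf '%s\n' "$1" | tr ',' '\n' | grep -v '^$' \
    | shuf | head -n "$2" | paste -sd, -
}

# Usage on the cluster (not executed here):
#   CUDA_VISIBLE_DEVICES=$(pick_gpus "$(ssh n33 gpu-free)" 2)
```

Shuffling before taking the head spreads jobs across devices, but it does not lock the chosen GPU. Two jobs starting at the same moment could still pick the same ID, which is presumably why the job script asks you to leave 30-60 seconds between submissions.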