\\ **[[cluster:0|Back]]**

===== Lammps GPU Testing (EC) =====

  * 32 cores E2660
  * 4 K20 GPU
  * workstation
  * MPICH2 flavor

Same tests (12 cpu cores) using lj/cut, eam, lj/expand, and morse: **AU.reduced**

  * CPU only: 6 mins 1 sec
  * 1 GPU: 1 min 1 sec (a 5-6 times speed up)
  * 2 GPUs: 1 min 0 secs (never saw the 2nd GPU used; problem set too small?)

Same tests (12 cpu cores) using a restart file and gayberne: **GB**

  * CPU only: 1 hour 5 mins
  * 1 GPU: 5 mins 15 secs (an 18-19 times speed up)
  * 2 GPUs: 2 mins

The results above seem a bit slower overall than at the other vendor, but show the same pattern.

**Francis's Melt problem set**

^ 3d Lennard-Jones melt: 10,000 steps with 32,000 atoms ^^^^^^
| CPU only   | -np 1 | -np 6 | -np 12 | -np 24 | -np 36 |
| loop times | 329s  | 63s   | 39s    | 29s    | 45s    |
| GPU only   | 1xK20 | 2xK20 | 3xK20  | 4xK20  | (-np 1-4) |
| loop times | 28s   | 16s   | 11s    | 10s    |           |

^ 3d Lennard-Jones melt: 100,000 steps with 32,000 atoms ^^^^^^
| GPU only   | 1xK20 | 2xK20 | 3xK20 | 4xK20 | (-np 1-4) |
| loop times | 274s  | 162s  | 120s  | 98s   |           |

  * The serial CPU time of 329s is reduced to 29s with MPI (-np 24), an 11x speed up
  * A single GPU (28s) already matches MPI -np 24, and 4 GPUs bring it down to 10s, a further ~3x speed up

==== Redoing Above ====

**10/16/2013**

Redoing the melt problem on our own K20 hardware I get the following (observing with gpu-info that utilization runs at about 20-25% on the allocated GPU):

<code>
Loop time of 345.936 on 1 procs for 100000 steps with 32000 atoms
</code>

<code bash>
#!/bin/bash
# submit via 'bsub < run.gpu'
rm -f log.lammps melt.log

#BSUB -e err
#BSUB -o out
#BSUB -q mwgpu
#BSUB -J test

## leave sufficient time between job submissions (30-60 secs)
## the number of GPUs allocated matches the -n value automatically
## always reserve a GPU (gpu=1), setting this to 0 is a cpu-only job
## reserve 6144 MB (5 GB + 20%) memory per GPU
## run all processes (1<=n<=4) on the same node (hosts=1)
#BSUB -n 1
#BSUB -R "rusage[gpu=1:mem=6144],span[hosts=1]"

# from greentail we need to recreate the module env
export PATH=/home/apps/bin:/cm/local/apps/cuda50/libs/304.54/bin:\
/cm/shared/apps/cuda50/sdk/5.0.35/bin/linux/release:/cm/shared/apps/lammps/cuda/2013-01-27/:\
/cm/shared/apps/amber/amber12/bin:/cm/shared/apps/namd/ibverbs-smp-cuda/2013-06-02/:\
/usr/lib64/qt-3.3/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/sbin:\
/usr/sbin:/cm/shared/apps/cuda50/toolkit/5.0.35/bin:\
/cm/shared/apps/cuda50/sdk/5.0.35/bin/linux/release:/cm/shared/apps/cuda50/libs/current/bin:\
/cm/shared/apps/cuda50/toolkit/5.0.35/open64/bin:/cm/shared/apps/mvapich2/gcc/64/1.6/bin:\
/cm/shared/apps/mvapich2/gcc/64/1.6/sbin
export PATH=/share/apps/bin:$PATH
export LD_LIBRARY_PATH=/cm/local/apps/cuda50/libs/304.54/lib64:\
/cm/shared/apps/cuda50/toolkit/5.0.35/lib64:/cm/shared/apps/amber/amber12/lib:\
/cm/shared/apps/amber/amber12/lib64:/cm/shared/apps/namd/ibverbs-smp-cuda/2013-06-02/:\
/cm/shared/apps/cuda50/toolkit/5.0.35/lib64:/cm/shared/apps/cuda50/libs/current/lib64:\
/cm/shared/apps/cuda50/toolkit/5.0.35/open64/lib:\
/cm/shared/apps/cuda50/toolkit/5.0.35/extras/CUPTI/lib:\
/cm/shared/apps/mvapich2/gcc/64/1.6/lib

# unique job scratch dirs
MYSANSCRATCH=/sanscratch/$LSB_JOBID
MYLOCALSCRATCH=/localscratch/$LSB_JOBID
export MYSANSCRATCH MYLOCALSCRATCH
cd $MYSANSCRATCH

# LAMMPS
# GPUIDX=1 use allocated GPU(s), GPUIDX=0 cpu run only (view header au.inp)
export GPUIDX=1

# stage the data
cp ~/gpu_testing/fstarr/lj/* .
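# --- sketch added to these notes, not part of the original run.gpu script ---
# record which GPU device(s) are visible to this job, so the utilization
# observation quoted above can be tied to a specific device; 'gpu-info' is
# the site tool shown further down this page, 'nvidia-smi' ships with the
# driver, and 'gpu_allocation.log' is just an illustrative file name
gpu-info   > gpu_allocation.log 2>&1 || true
nvidia-smi >> gpu_allocation.log 2>&1 || true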
# feed the wrapper
lava.mvapich2.wrapper lmp_nVidia \
-c off -var GPUIDX $GPUIDX -in in.melt

# save results
cp log.lammps melt.log ~/gpu_testing/fstarr/lj/
</code>

===== Lammps GPU Testing (MW) =====

Vendor: "There are currently two systems available, each with two 8-core Xeon E5-2670 processors, 32GB memory, 120GB SSD and two Tesla K20 GPUs. The hostnames are master and node2. You will see that a GPU-accelerated version of LAMMPS with MPI support is installed in /usr/local/LAMMPS."

Actually, it turns out there are 32 cores on the node, so I suspect four CPUs.

First, we expose the GPUs to Lammps in our input file (so running with a value of -1 ignores the GPUs).

<code>
# Enable GPUs if variable is set.
if "(${GPUIDX} >= 0)" then &
  "suffix gpu" &
  "newton off" &
  "package gpu force 0 ${GPUIDX} 1.0"
</code>

Then we invoke the Lammps executable with MPI.

<code bash>
NODES=1     # number of nodes [=>1]
GPUIDX=0    # GPU indices range from [0,1], this is the upper bound.
            # set GPUIDX=0 for 1 GPU/node or GPUIDX=1 for 2 GPU/node
CORES=12    # cores per node (i.e. 2 CPUs with 6 cores ea = 12 cores per node)

which mpirun
echo "*** GPU run with one MPI process per core ***"
date
mpirun -np $((NODES*CORES)) -bycore ./lmp_ex1 -c off -var GPUIDX $GPUIDX \
  -in film.inp -l film_1_gpu_1_node.log
date
</code>

Some tests using **lj/cut**, **eam**, **lj/expand**, and **morse**:

  * CPU only: 4 mins 30 secs
  * 1 GPU: 0 mins 47 secs (a 5-6 times speed up)
  * 2 GPUs: 0 mins 46 secs (never saw the 2nd GPU used; problem set too small?)

Some tests using a restart file and **gayberne**:

  * CPU only: 1 hour 5 mins
  * 1 GPU: 3 mins 33 secs (an 18-19 times speed up)
  * 2 GPUs: 2 mins (see below)

<code>
node2$ gpu-info
====================================================
Device  Model       Temperature  Utilization
====================================================
0       Tesla K20m  36 C         96 %
1       Tesla K20m  34 C         92 %
====================================================
</code>

\\ **[[cluster:0|Back]]**
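For comparison with the "CPU only" timings above, the same driver script can produce a CPU-only reference run by passing a negative GPU index, which (as noted above) makes the ''if "(${GPUIDX} >= 0)"'' block skip the gpu suffix and package commands. A minimal sketch, not from the original notes, assuming the same lmp_ex1 binary and film.inp input as the GPU run; the log file name is illustrative only.

<code bash>
NODES=1
CORES=12
GPUIDX=-1   # negative index: the input file skips 'suffix gpu' and 'package gpu'

echo "*** CPU-only reference run with one MPI process per core ***"
date
mpirun -np $((NODES*CORES)) -bycore ./lmp_ex1 -c off -var GPUIDX $GPUIDX \
  -in film.inp -l film_cpu_only_1_node.log
date
</code>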