cluster:223 [2023/09/18 16:56] (current) hmeij07
  
# update link to this version: yes
# no -silent -driver ...

===========
  
  
==== Testing ====
  
Script ~hmeij/slurm/run.centos, cuda 11.2, pmemd.cuda of local install of amber20 with

  * #SBATCH --mem-per-gpu=7168
  
For some reason this yields cpus=8, which is different behavior (expected cpus=1). Slurm overrides the above setting with the partition setting DefCpuPerGPU=8. Slurm has not changed but the cuda version has. Odd. The good news is that Amber runs fine, so there is no need to recompile.
  
<code>
  
</code>

  * #SBATCH --cpus-per-gpu=1

Adding this does force Slurm to allocate just a single cpu. Now try 4 gpu jobs per node. No CUDA_VISIBLE_DEVICES setting is needed.

<code>

JOBID   PARTITION         NAME          USER  ST          TIME NODES  CPUS    MIN_MEMORY NODELIST(REASON)
1053992 mwgpu             test         hmeij            0:04                                  n33

[hmeij@cottontail2 slurm]$ for i in `seq 1 6`; do sbatch run.centos; sleep 30; squeue | grep hmeij; done

# output
Submitted batch job 1054000
1054000 mwgpu             test         hmeij            0:30                                  n33
Submitted batch job 1054001
1054001 mwgpu             test         hmeij            0:30                                  n33
1054000 mwgpu             test         hmeij            1:00                                  n33
Submitted batch job 1054002
1054002 mwgpu             test         hmeij            0:30                                  n33
1054001 mwgpu             test         hmeij            1:00                                  n33
1054000 mwgpu             test         hmeij            1:30                                  n33
Submitted batch job 1054003
1054003 mwgpu             test         hmeij            0:30                                  n33
1054002 mwgpu             test         hmeij            1:00                                  n33
1054001 mwgpu             test         hmeij            1:30                                  n33
1054000 mwgpu             test         hmeij            2:00                                  n33
Submitted batch job 1054004
1054004 mwgpu             test         hmeij  PD          0:00                          (Resources)
1054003 mwgpu             test         hmeij            1:00                                  n33
1054002 mwgpu             test         hmeij            1:30                                  n33
1054001 mwgpu             test         hmeij            2:00                                  n33
1054000 mwgpu             test         hmeij            2:30                                  n33
Submitted batch job 1054005
1054005 mwgpu             test         hmeij  PD          0:00                     0(Nodes required f
1054004 mwgpu             test         hmeij  PD          0:00                          (Resources)
1054003 mwgpu             test         hmeij            1:30                                  n33
1054002 mwgpu             test         hmeij            2:00                                  n33
1054001 mwgpu             test         hmeij            2:30                                  n33
1054000 mwgpu             test         hmeij            3:00                                  n33


[hmeij@cottontail2 slurm]$ ssh n33 gpu-info
id,name,temp.gpu,mem.used,mem.free,util.gpu,util.mem
0, Tesla K20m, 40, 95 MiB, 4648 MiB, 100 %, 25 %
1, Tesla K20m, 40, 95 MiB, 4648 MiB, 94 %, 23 %
2, Tesla K20m, 35, 95 MiB, 4648 MiB, 93 %, 21 %
3, Tesla K20m, 28, 95 MiB, 4648 MiB, 97 %, 25 %

</code>
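The job header that produces this one-cpu, one-gpu behavior might look like the following sketch. This is hypothetical: the actual contents of run.centos are not shown on this page, and the partition, gres, time, and job-name values are assumptions; only the --mem-per-gpu and --cpus-per-gpu lines come from the text above.

```shell
#!/bin/bash
# Hypothetical sketch of the run.centos header; values other than
# --mem-per-gpu and --cpus-per-gpu are assumptions, adjust as needed.
#SBATCH --job-name=test
#SBATCH --partition=mwgpu
#SBATCH --gres=gpu:1          # one K20m per job, so four jobs fill a node
#SBATCH --mem-per-gpu=7168    # MB of host memory per allocated gpu
#SBATCH --cpus-per-gpu=1      # overrides the partition's DefCpuPerGPU=8

# Slurm sets CUDA_VISIBLE_DEVICES to the allocated gpu, no manual export needed
pmemd.cuda -O -i mdin -o mdout -p prmtop -c inpcrd
```

With --cpus-per-gpu present, sbatch pins the cpu count explicitly instead of falling back to the partition default.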

Other software does need to be recompiled as it links to specific versions of libraries rather than the generic libName.so (lammps).
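A quick way to spot a binary that needs recompiling after a library upgrade is to run ldd and look for unresolved entries. A minimal sketch; the default path /bin/ls is only a stand-in, in practice you would point it at a binary such as lmp_mpi-cuda-single-single.

```shell
#!/bin/bash
# Report whether a binary's shared-library dependencies all resolve.
# "not found" entries mean a recompile (or an LD_LIBRARY_PATH fix) is needed.
BIN=${1:-/bin/ls}   # stand-in default; pass the real binary as $1
if ldd "$BIN" | grep -q "not found"; then
    echo "unresolved libraries: recompile needed"
else
    echo "all shared libraries resolved"
fi
```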

Script ~hmeij/slurm/run.centos.lammps: set up the environment, get the help page.

<code>

/share/apps/intel/parallel_studio_xe_2016_update3/compilers_and_libraries_2016.3.210/linux/bin/intel64/ifort
/usr/local/cuda/bin/nvcc
/share/apps/CENTOS7/openmpi/4.0.4/bin/mpirun
/share/apps/CENTOS7/python/3.8.3/bin/python
/share/apps/CENTOS7/lammps/25Apr2023/cuda-11.2/lmp_mpi-cuda-single-single
        linux-vdso.so.1 =>  (0x00007ffd714ec000)
        libjpeg.so.62 => /lib64/libjpeg.so.62 (0x00007fe443b9a000)
        libcudart.so.11.0 => /usr/local/cuda/lib64/libcudart.so.11.0 (0x00007fe44390b000)
        libcuda.so.1 => /lib64/libcuda.so.1 (0x00007fe442223000)
        libcufft.so.10 => /usr/local/cuda/lib64/libcufft.so.10 (0x00007fe436a74000)
        libdl.so.2 => /lib64/libdl.so.2 (0x00007fe436870000)
        libmpi.so.40 => /share/apps/CENTOS7/openmpi/4.0.4/lib/libmpi.so.40 (0x00007fe43655b000)
        libstdc++.so.6 => /share/apps/CENTOS7/gcc/6.5.0/lib64/libstdc++.so.6 (0x00007fe4361d9000)
        libm.so.6 => /lib64/libm.so.6 (0x00007fe435ed7000)
        libgcc_s.so.1 => /share/apps/CENTOS7/gcc/6.5.0/lib64/libgcc_s.so.1 (0x00007fe435cc0000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x00007fe435aa4000)
        libc.so.6 => /lib64/libc.so.6 (0x00007fe4356d6000)
        /lib64/ld-linux-x86-64.so.2 (0x00007fe443def000)
        librt.so.1 => /lib64/librt.so.1 (0x00007fe4354ce000)
        libopen-rte.so.40 => /share/apps/CENTOS7/openmpi/4.0.4/lib/libopen-rte.so.40 (0x00007fe435218000)
        libopen-pal.so.40 => /share/apps/CENTOS7/openmpi/4.0.4/lib/libopen-pal.so.40 (0x00007fe434f09000)
        libutil.so.1 => /lib64/libutil.so.1 (0x00007fe434d06000)
        libz.so.1 => /lib64/libz.so.1 (0x00007fe434af0000)

Large-scale Atomic/Molecular Massively Parallel Simulator - 28 Mar 2023 - Development

Usage example: lmp_mpi-cuda-single-single -var t 300 -echo screen -in in.alloy

List of command line options supported by this LAMMPS executable:
<snip>

# hmmm, using -suffix gpu it does not jump on gpus, generic non-gpu libpthread error
# same version rocky8/cuda-11.6 works, centos7/cuda-10.2 works, all "make" compiles
# try "cmake" compile on n33-n36
# libpace tarball download fails on file hash and
# yields a status: [1;"Unsupported protocol" error for ML-PACE

# without ML-PACE the hash fails for the opencl-loader third party, bad url
# https://download.lammps.org/thirdparty/opencl-loader-opencl-loadewer-version...tgz
# then extract in _deps/ dir
# and added -D GPU_LIBRARY=../lib/gpu/libgpu.a ala QUIP_LIBRARY
# that works, cmake compile binary jumps on multiple gpus


[hmeij@n35 sharptail]$ mpirun -n 2 \
/share/apps/CENTOS7/lammps/25Apr2023/cuda-11.2/cmake/single-single/lmp \
-suffix gpu -in in.colloid

[root@greentail52 ~]# ssh n35 gpu-process
gpu_name, gpu_id, pid, process_name
Tesla K20m, 0, 9911, /share/apps/CENTOS7/lammps/25Apr2023/cuda-11.2/cmake/single-single/lmp
Tesla K20m, 1, 9912, /share/apps/CENTOS7/lammps/25Apr2023/cuda-11.2/cmake/single-single/lmp

# some stats, colloid example

1 cpu, 1 gpu
Total wall time: 0:05:49
2 cpus, 2 gpus
Total wall time: 0:03:58
4 cpus, 4 gpus
Total wall time: 0:02:23
8 cpus, 4 gpus
Total wall time: 0:02:23

# but the ML-PACE hash error is different, so no go there

</code>
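The cmake configure step hinted at in the comments above might look roughly like this sketch. PKG_GPU, GPU_API, and GPU_ARCH are standard LAMMPS CMake options and the GPU_LIBRARY line is quoted from the notes, but sm_35 (the K20m's compute capability), the directory layout, and the job count are assumptions, not the exact command used.

```shell
# Sketch of a LAMMPS cmake configure for the K20m nodes (assumptions noted above)
cd lammps-25Apr2023/build
cmake ../cmake \
  -D PKG_GPU=on \
  -D GPU_API=cuda \
  -D GPU_ARCH=sm_35 \
  -D GPU_LIBRARY=../lib/gpu/libgpu.a \
  -D BUILD_MPI=yes
make -j 8
```

Pointing GPU_LIBRARY at the make-built lib/gpu/libgpu.a, as the notes describe, sidesteps cmake rebuilding the gpu library itself.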

\\
**[[cluster:0|Back]]**
cluster/223.1694090708.txt.gz · Last modified: 2023/09/07 08:45 by hmeij07