\\
**[[cluster:0|Back]]**

==== Comments ====

Here are some useful comments from lists, vendors, etc.

  * //LAMMPS uses a hybrid OpenMP/MPI model. If you don't set the number of OpenMP threads (ompthreads or OMP_NUM_THREADS) explicitly, it will likely take the number of CPU cores (ncpus) as its default value and you will end up with having too many OpenMP threads and MPI processes on a physical core. You can see this by logging in to the compute node and do "top".//
  * Note: I will have to verify this; I did not pay attention to OMP.
  * LAMMPS developer here: 95.8% CPU use with 8 MPI tasks x no OpenMP threads
  * You can trust the output. There are no OpenMP threads. While you added USER-OMP, the "no OpenMP" means you didn't compile with -fopenmp in the first place.
  * //Do you observe the new CPU boosting when under load? I remember we had to adjust the CPU governor for our Intel(R) Xeon(R) Silver 4110 CPU @ 2.10GHz. The CentOS 7 default did something silly like use power-save.//
  * Note: Need to pay attention to this too, really...

<code>
[root@n79 ~]# cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
powersave
...

# From lammps developer...
# Try switching to the performance governor with tuned-adm.
/usr/lib/tuned/ondemand/tuned.conf:
[main]
include=throughput-performance
[cpu]
governor=performance

$ tuned-adm profile performance
</code>

==== Openmpi Problem ====

We encountered a performance issue with Openmpi and are unsure what is going on. This page documents the problem so that others can look at it and perhaps suggest a resolution.

The Openmpi version is 4.0.4 and is basically a vanilla compilation with only ''--prefix='' specified, pointing to ''/share'', the HPC-wide NFS mount where we store software.
<code>
export PATH=/share/apps/CENTOS7/openmpi/4.0.4/bin:$PATH
export LD_LIBRARY_PATH=/share/apps/CENTOS7/openmpi/4.0.4/lib:$LD_LIBRARY_PATH
export PATH=/share/apps/CENTOS6/python/2.7.9/bin:$PATH
export LD_LIBRARY_PATH=/share/apps/CENTOS6/python/2.7.9/lib:$LD_LIBRARY_PATH
</code>

The Lammps version is 3Mar2020 and is compiled for CPU runs only, using the ''src/MAKE/Makefile.mpi'' file with a few edits:

  * ''-std=c++11'' is added to the compiler flags
  * ''-DLAMMPS_JPEG'' is added to LMP_INC, and the following is specified
    * JPG_INC = -I/usr/include
    * JPG_PATH = -L/usr/lib64
    * JPG_LIB = -ljpeg
  * with these packages

<code>
make purge
make yes-rigid
make yes-colloid
make yes-class2
make yes-kspace
make yes-misc
make yes-molecule
make yes-python
make yes-user-omp
make package-update
</code>

Then plain ''make mpi''.

==== greentail52 ====

Both software applications were compiled on host ''greentail52'', our scratch space server.

<code>
# dual 4-core cpus
model name : Intel(R) Xeon(R) Silver 4112 CPU @ 2.60GHz

# CPU,Core,Socket,Node,,L1d,L1i,L2,L3 (lscpu -p)
0,0,0,0,,0,0,0,0
1,1,0,0,,1,1,1,0
2,2,0,0,,2,2,2,0
3,3,0,0,,3,3,3,0
4,4,1,1,,4,4,4,1
5,5,1,1,,5,5,5,1
6,6,1,1,,6,6,6,1
7,7,1,1,,7,7,7,1

# running CentOS Linux release 7.6.1810 (Core)
Linux greentail52 3.10.0-957.5.1.el7.x86_64
</code>

==== Testing ====

The Lammps binary was copied to ''/tmp/foo/'' but Openmpi was left in the NFS share ''/share/''. The input data is the basic colloid example. Results are shown for a run without mpirun, then for mpirun with -n 1, 2, 4 and 8. As expected, no mpirun and mpirun -n 1 are nearly identical. Run time improves from 7 1/2 minutes to under 1 1/2 minutes when using all eight physical cpu cores, roughly a 5.5x improvement on this server.
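The speedup figure can be checked from the "Total wall time" lines in the runs listed below. This is a minimal sketch and not part of the original test procedure: the helper names ''to_seconds'' and ''speedup'' are hypothetical, and the two wall times are the no-mpirun and mpirun -n 8 results on ''greentail52''.

```shell
#!/bin/bash
# Hypothetical helpers for illustration; not part of LAMMPS or Openmpi.

# Convert an h:mm:ss wall time (as printed by LAMMPS) to seconds.
to_seconds() {
    echo "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

# Print speedup and parallel efficiency for a run with N MPI tasks.
# usage: speedup <serial h:mm:ss> <parallel h:mm:ss> <tasks>
speedup() {
    local t1 tn
    t1=$(to_seconds "$1")
    tn=$(to_seconds "$2")
    awk -v t1="$t1" -v tn="$tn" -v n="$3" \
        'BEGIN { s = t1 / tn; printf "%.2fx speedup, %.0f%% efficiency\n", s, 100 * s / n }'
}

# greentail52: no mpirun (0:07:30) versus mpirun -n 8 (0:01:21)
speedup 0:07:30 0:01:21 8
```

Efficiency here is simply the speedup divided by the number of MPI tasks, so it drops below 100% as communication overhead grows with more workers.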
<code>
# no mpirun invocation
time ./lmp_mpi -suffix cpu -in in.colloid > out.colloid &

Total wall time: 0:07:30
Performance: 479162.946 tau/day, 1109.173 timesteps/s
100.0% CPU use with 1 MPI tasks x no OpenMP threads

# similar to above, one worker
time mpirun -n 1 --nooversubscribe --bind-to core --report-bindings \
--host localhost \
./lmp_mpi -suffix cpu -in in.colloid > out.colloid &

[greentail52:217279] MCW rank 0 bound to socket 0[core 0[hwt 0-1]]: [BB/../../..][../../../..]

Total wall time: 0:07:29
Performance: 480542.270 tau/day, 1112.366 timesteps/s
100.0% CPU use with 1 MPI tasks x no OpenMP threads

# two workers
time mpirun -n 2 --nooversubscribe --bind-to core --report-bindings \
--host localhost,localhost \
./lmp_mpi -suffix cpu -in in.colloid > out.colloid &

[greentail52:217517] MCW rank 0 bound to socket 0[core 0[hwt 0-1]]: [BB/../../..][../../../..]
[greentail52:217517] MCW rank 1 bound to socket 0[core 1[hwt 0-1]]: [../BB/../..][../../../..]

Total wall time: 0:03:57
Performance: 908664.629 tau/day, 2103.390 timesteps/s
98.2% CPU use with 2 MPI tasks x no OpenMP threads

# four workers
time mpirun -n 4 --nooversubscribe --bind-to core --report-bindings \
--host localhost,localhost,localhost,localhost \
./lmp_mpi -suffix cpu -in in.colloid > out.colloid &

[greentail52:217763] MCW rank 0 bound to socket 0[core 0[hwt 0-1]]: [BB/../../..][../../../..]
[greentail52:217763] MCW rank 1 bound to socket 1[core 4[hwt 0-1]]: [../../../..][BB/../../..]
[greentail52:217763] MCW rank 2 bound to socket 0[core 1[hwt 0-1]]: [../BB/../..][../../../..]
[greentail52:217763] MCW rank 3 bound to socket 1[core 5[hwt 0-1]]: [../../../..][../BB/../..]

Total wall time: 0:02:16
Performance: 1584289.072 tau/day, 3667.336 timesteps/s
96.1% CPU use with 4 MPI tasks x no OpenMP threads

# eight workers
time mpirun -n 8 --nooversubscribe --bind-to core --report-bindings \
--host localhost,localhost,localhost,localhost,localhost,localhost,localhost,localhost \
./lmp_mpi -suffix cpu -in in.colloid > out.colloid &

[1] 217864
[greentail52:217865] MCW rank 0 bound to socket 0[core 0[hwt 0-1]]: [BB/../../..][../../../..]
[greentail52:217865] MCW rank 1 bound to socket 1[core 4[hwt 0-1]]: [../../../..][BB/../../..]
[greentail52:217865] MCW rank 2 bound to socket 0[core 1[hwt 0-1]]: [../BB/../..][../../../..]
[greentail52:217865] MCW rank 3 bound to socket 1[core 5[hwt 0-1]]: [../../../..][../BB/../..]
[greentail52:217865] MCW rank 4 bound to socket 0[core 2[hwt 0-1]]: [../../BB/..][../../../..]
[greentail52:217865] MCW rank 5 bound to socket 1[core 6[hwt 0-1]]: [../../../..][../../BB/..]
[greentail52:217865] MCW rank 6 bound to socket 0[core 3[hwt 0-1]]: [../../../BB][../../../..]
[greentail52:217865] MCW rank 7 bound to socket 1[core 7[hwt 0-1]]: [../../../..][../../../BB]

Total wall time: 0:01:21
Performance: 2648317.900 tau/day, 6130.366 timesteps/s
95.8% CPU use with 8 MPI tasks x no OpenMP threads
</code>

==== n79 ====

Our latest hardware purchase from a company in CA, a dozen of these.

<code>
# dual 12-core cpus
model name : Intel(R) Xeon(R) Silver 4214 CPU @ 2.20GHz

# running CentOS Linux release 7.7.1908 (Core)
Linux n79 3.10.0-1062.9.1.el7.x86_64
</code>

==== New hardware Testing ====

Same setup as above, same sequence, but strange results! We recompiled the binary on node ''n79'' and ran it from ''/tmp/foo'', making sure we got physical cpu cores allocated. But no mpirun and mpirun -n 1 are very different: mpirun -n 1 takes 11:33 where no mpirun takes 7:43. There is a serious performance degradation, but why? Adding workers improves the results, but we never get anything decent until we allocate 8 workers, and even then we run about 1.7x slower than on host ''greentail52'' (2:15 versus 1:21).
What could explain this?

<code>
# no mpirun
time ./lmp_mpi-n79 ....

Total wall time: 0:07:43
Performance: 465951.274 tau/day, 1078.591 timesteps/s
99.7% CPU use with 1 MPI tasks x no OpenMP threads

# 1 worker
time mpirun ... ./lmp_mpi-n79 ...

Total wall time: 0:11:33
Performance: 311317.460 tau/day, 720.642 timesteps/s
99.7% CPU use with 1 MPI tasks x no OpenMP threads

# 2 workers
time mpirun -n 2 ... ./lmp_mpi-n79 ...

Total wall time: 0:06:02
Performance: 595262.663 tau/day, 1377.923 timesteps/s
97.9% CPU use with 2 MPI tasks x no OpenMP threads

# 4 workers
time mpirun -n 4 ... ./lmp_mpi-n79 ...

Total wall time: 0:03:44
Performance: 960223.978 tau/day, 2222.741 timesteps/s
96.5% CPU use with 4 MPI tasks x no OpenMP threads

# eight workers
time mpirun -n 8 ... ./lmp_mpi-n79 ...

Total wall time: 0:02:15
Performance: 1600344.593 tau/day, 3704.501 timesteps/s
96.6% CPU use with 8 MPI tasks x no OpenMP threads
</code>

==== Summary ====

Wall-clock processing times (m:ss or h:mm:ss) of multiple runs for the colloid lammps example.

^ ^ greentail52 ^ n79 ^ n79 perf ^ n79 hpc-perf ^
| no mpirun | 7:30 | 7:43 | 0:07:08 | 0:08:22 |
| mpirun -n 1 | 7:29 | 11:33 | 0:07:12 | 0:08:23 |
| mpirun -n 2 | 3:57 | 6:02 | 0:03:45 | 0:04:19 |
| mpirun -n 4 | 2:16 | 3:44 | 0:02:09 | 0:02:28 |
| mpirun -n 8 | 1:21 | 2:15 | 0:01:20 | 0:01:27 |
| mpirun -n 12 | | | 0:01:08 | 0:01:11 |
| mpirun -n 24 | | | 0:00:49 | 0:00:49 |

The "n79 perf" column shows the results after I created a tuned profile named "performance", switched to that profile using ''tuned-adm'', and echoed the string "performance" into the relevant ''/sys'' files. It also **does not matter** whether I run the lmp_mpi version (compiled on ''greentail'') or lmp_mpi-n79 (compiled on ''n79''). Yea.

The "n79 hpc-perf" column shows the results after I spliced the ''/usr/lib/tuned/hpc-compute/tuned.conf'' settings into my "performance" profile above. That file contains a lot of memory settings.
We swap out "throughput-performance" for "latency-performance" and make sure the ''[cpu]'' section still points to performance. Going with the short version, we comment out all other options. We can enable them when needed.

<code>
# original
[main]
include=throughput-performance
[cpu]
governor=performance

# current files
for i in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
    echo performance > "$i"
done
cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
# holds on reboot? yes
cd /usr/lib/tuned/
mkdir performance
vi performance/tuned.conf
tuned-adm profile
tuned-adm profile performance
tuned-adm active
</code>
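To confirm that every core picked up the governor, the per-core sysfs files can be tallied instead of read one by one. This is a minimal sketch, assuming the sysfs layout used above; the function name ''governor_summary'' and its optional root-directory argument are hypothetical and not part of tuned.

```shell
#!/bin/bash
# Count how many cores use each cpufreq governor.
# The optional first argument (an alternate sysfs root) is a hypothetical
# hook for testing; it defaults to the location used on this page.
governor_summary() {
    local root="${1:-/sys/devices/system/cpu}"
    local f
    for f in "$root"/cpu*/cpufreq/scaling_governor; do
        # Skip quietly if the glob matched nothing or the file is unreadable.
        if [ -r "$f" ]; then
            cat "$f"
        fi
    done | sort | uniq -c
}

governor_summary
```

On a node where the switch took effect, a single line naming ''performance'' with a count equal to the number of logical cpus would be expected; any remaining ''powersave'' line points at cores that were missed.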