Here are some useful comments from mailing lists, vendors, etc.
```
# (output line quoted from our runs below)
95.8% CPU use with 8 MPI tasks x no OpenMP threads

[root@n79 ~]# cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
powersave
...

# From lammps developer...
# Try switching to the performance governor with tuned-adm.

# /usr/lib/tuned/ondemand/tuned.conf:
[main]
include=throughput-performance

[cpu]
governor=performance

$ tuned-adm profile performance
```
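A quick way to confirm the governor switch actually took effect is a check like the following; a minimal sketch, assuming the standard cpufreq sysfs interface is present:

```bash
# Verify the active governor on every core and the resulting clock speeds.
cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor | sort | uniq -c
grep MHz /proc/cpuinfo | sort | uniq -c   # cores should sit near max frequency
tuned-adm active                          # should report the new profile
```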
We encountered a performance issue with Open MPI and are unsure what is going on, so this page details the problem so that others can look at it and perhaps suggest a resolution.
The Open MPI version is 4.0.4 and is a basically vanilla compilation with only --prefix= specified, pointing to /share, an HPC-wide NFS mount where we store software.
```
export PATH=/share/apps/CENTOS7/openmpi/4.0.4/bin:$PATH
export LD_LIBRARY_PATH=/share/apps/CENTOS7/openmpi/4.0.4/lib:$LD_LIBRARY_PATH
export PATH=/share/apps/CENTOS6/python/2.7.9/bin:$PATH
export LD_LIBRARY_PATH=/share/apps/CENTOS6/python/2.7.9/lib:$LD_LIBRARY_PATH
```
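With those variables in effect, a sanity check that the intended NFS-hosted Open MPI is the one being picked up might look like this (a minimal sketch; the expected paths are from the exports above):

```bash
which mpirun              # expect /share/apps/CENTOS7/openmpi/4.0.4/bin/mpirun
mpirun --version          # expect "mpirun (Open MPI) 4.0.4"
ldd ./lmp_mpi | grep mpi  # LAMMPS should resolve libmpi from /share/...
```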
The LAMMPS version is 3Mar2020, compiled for CPU runs only using the src/MAKE/Makefile.mpi file with a few edits: -std=c++11 is added to the compiler flags.

```
make purge
make yes-rigid
make yes-colloid
make yes-class2
make yes-kspace
make yes-misc
make yes-molecule
make yes-python
make yes-user-omp
make package-update
```

Then a plain make mpi.
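A quick sanity check of the freshly built binary: the help output of LAMMPS versions of this vintage should list the packages compiled in, so one can verify the yes-* selections above actually made it into the build (hedged; exact wording of the help output may differ):

```bash
cd src
./lmp_mpi -h | grep -A 10 -i "installed packages"
```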
Both software applications were compiled on host greentail52, our scratch space server.
```
# dual 4-core cpus
model name : Intel(R) Xeon(R) Silver 4112 CPU @ 2.60GHz

# CPU,Core,Socket,Node,,L1d,L1i,L2,L3 (lscpu -p)
0,0,0,0,,0,0,0,0
1,1,0,0,,1,1,1,0
2,2,0,0,,2,2,2,0
3,3,0,0,,3,3,3,0
4,4,1,1,,4,4,4,1
5,5,1,1,,5,5,5,1
6,6,1,1,,6,6,6,1
7,7,1,1,,7,7,7,1

# running CentOS Linux release 7.6.1810 (Core)
Linux greentail52 3.10.0-957.5.1.el7.x86_64
```
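The topology above came from lscpu -p. On a box with numactl or hwloc installed, the same layout can be cross-checked (a sketch; the last two tools may not be present on every node):

```bash
lscpu -p=CPU,CORE,SOCKET,NODE   # machine-readable topology, as shown above
numactl --hardware              # NUMA nodes and memory per node
lstopo-no-graphics              # hwloc view of sockets and caches
```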
The LAMMPS binary was copied to /tmp/foo/, but Open MPI was left in the NFS share /share/. The input data is the basic colloid example. Below are results using no mpirun, then mpirun with -n 1, 2, 4, and 8.
As expected, the no-mpirun and mpirun -n 1 runs are near identical. Run time improves from about 7.5 minutes to about 1.5 minutes when using all the physical CPU cores, roughly a 5.5x improvement on this server.
```
# no mpirun invocation
time ./lmp_mpi -suffix cpu -in in.colloid > out.colloid &

Total wall time: 0:07:30
Performance: 479162.946 tau/day, 1109.173 timesteps/s
100.0% CPU use with 1 MPI tasks x no OpenMP threads

# similar to above, one worker
time mpirun -n 1 --nooversubscribe --bind-to core --report-bindings \
  --host localhost \
  ./lmp_mpi -suffix cpu -in in.colloid > out.colloid &

[greentail52:217279] MCW rank 0 bound to socket 0[core 0[hwt 0-1]]: [BB/../../..][../../../..]

Total wall time: 0:07:29
Performance: 480542.270 tau/day, 1112.366 timesteps/s
100.0% CPU use with 1 MPI tasks x no OpenMP threads

# two workers
time mpirun -n 2 --nooversubscribe --bind-to core --report-bindings \
  --host localhost,localhost \
  ./lmp_mpi -suffix cpu -in in.colloid > out.colloid &

[greentail52:217517] MCW rank 0 bound to socket 0[core 0[hwt 0-1]]: [BB/../../..][../../../..]
[greentail52:217517] MCW rank 1 bound to socket 0[core 1[hwt 0-1]]: [../BB/../..][../../../..]

Total wall time: 0:03:57
Performance: 908664.629 tau/day, 2103.390 timesteps/s
98.2% CPU use with 2 MPI tasks x no OpenMP threads

# four workers
time mpirun -n 4 --nooversubscribe --bind-to core --report-bindings \
  --host localhost,localhost,localhost,localhost \
  ./lmp_mpi -suffix cpu -in in.colloid > out.colloid &

[greentail52:217763] MCW rank 0 bound to socket 0[core 0[hwt 0-1]]: [BB/../../..][../../../..]
[greentail52:217763] MCW rank 1 bound to socket 1[core 4[hwt 0-1]]: [../../../..][BB/../../..]
[greentail52:217763] MCW rank 2 bound to socket 0[core 1[hwt 0-1]]: [../BB/../..][../../../..]
[greentail52:217763] MCW rank 3 bound to socket 1[core 5[hwt 0-1]]: [../../../..][../BB/../..]

Total wall time: 0:02:16
Performance: 1584289.072 tau/day, 3667.336 timesteps/s
96.1% CPU use with 4 MPI tasks x no OpenMP threads

# eight workers
time mpirun -n 8 --nooversubscribe --bind-to core --report-bindings \
  --host localhost,localhost,localhost,localhost,localhost,localhost,localhost,localhost \
  ./lmp_mpi -suffix cpu -in in.colloid > out.colloid &

[1] 217864
[greentail52:217865] MCW rank 0 bound to socket 0[core 0[hwt 0-1]]: [BB/../../..][../../../..]
[greentail52:217865] MCW rank 1 bound to socket 1[core 4[hwt 0-1]]: [../../../..][BB/../../..]
[greentail52:217865] MCW rank 2 bound to socket 0[core 1[hwt 0-1]]: [../BB/../..][../../../..]
[greentail52:217865] MCW rank 3 bound to socket 1[core 5[hwt 0-1]]: [../../../..][../BB/../..]
[greentail52:217865] MCW rank 4 bound to socket 0[core 2[hwt 0-1]]: [../../BB/..][../../../..]
[greentail52:217865] MCW rank 5 bound to socket 1[core 6[hwt 0-1]]: [../../../..][../../BB/..]
[greentail52:217865] MCW rank 6 bound to socket 0[core 3[hwt 0-1]]: [../../../BB][../../../..]
[greentail52:217865] MCW rank 7 bound to socket 1[core 7[hwt 0-1]]: [../../../..][../../../BB]

Total wall time: 0:01:21
Performance: 2648317.900 tau/day, 6130.366 timesteps/s
95.8% CPU use with 8 MPI tasks x no OpenMP threads
```
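For repeatability, the whole sweep can be scripted. A minimal sketch under the same assumptions as above (lmp_mpi and in.colloid in the working directory; the out.colloid.$np and bind.$np file names are hypothetical):

```bash
#!/bin/bash
# Run the colloid example at increasing MPI rank counts, capture the
# binding report (stderr) and the LAMMPS output per rank count.
for np in 1 2 4 8; do
    hosts=$(printf 'localhost,%.0s' $(seq $np)); hosts=${hosts%,}
    mpirun -n $np --nooversubscribe --bind-to core --report-bindings \
        --host $hosts \
        ./lmp_mpi -suffix cpu -in in.colloid > out.colloid.$np 2> bind.$np
    grep "Total wall time" out.colloid.$np
done
```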
Our latest hardware purchase, a dozen of these servers from a company in California:
```
# dual 12-core cpus
model name : Intel(R) Xeon(R) Silver 4214 CPU @ 2.20GHz

# running CentOS Linux release 7.7.1908 (Core)
Linux n79 3.10.0-1062.9.1.el7.x86_64
```
Same setup as above, same sequence, but strange results! We recompiled the binary on node n79 and ran it from /tmp/foo, making sure we got physical CPU cores allocated.
But the no-mpirun and mpirun -n 1 runs are very different: there is a serious performance degradation, but why? Adding workers improves the results, but we never get anything decent until we allocate 8+ workers, and even then we run almost 2x slower than on host greentail52. What could explain this?
```
# no mpirun
time ./lmp_mpi-n79 ....

Total wall time: 0:07:43
Performance: 465951.274 tau/day, 1078.591 timesteps/s
99.7% CPU use with 1 MPI tasks x no OpenMP threads

# 1 worker
time mpirun ... ./lmp_mpi-n79 ...

Total wall time: 0:11:33
Performance: 311317.460 tau/day, 720.642 timesteps/s
99.7% CPU use with 1 MPI tasks x no OpenMP threads

# 2 workers
time mpirun -n 2 ... ./lmp_mpi-n79 ...

Total wall time: 0:06:02
Performance: 595262.663 tau/day, 1377.923 timesteps/s
97.9% CPU use with 2 MPI tasks x no OpenMP threads

# 4 workers
time mpirun -n 4 ... ./lmp_mpi-n79 ...

Total wall time: 0:03:44
Performance: 960223.978 tau/day, 2222.741 timesteps/s
96.5% CPU use with 4 MPI tasks x no OpenMP threads

# eight workers
time mpirun -n 8 ... ./lmp_mpi-n79 ...

Total wall time: 0:02:15
Performance: 1600344.593 tau/day, 3704.501 timesteps/s
96.6% CPU use with 8 MPI tasks x no OpenMP threads
```
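In hindsight, the developer comment at the top of this page points at the first thing worth comparing between greentail52 and n79: the cpufreq governor. A quick side-by-side check, run on each host (n79 was found sitting in "powersave", per the snippet at the top):

```bash
# Compare governor, live clock speed, and active tuned profile per host.
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq   # in kHz
tuned-adm active
```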
Processing times (minutes) from multiple runs of the colloid LAMMPS example.
| | greentail52 | n79 | n79 perf | n79 hpc-perf |
|---|---|---|---|---|
| no mpirun | 7:30 | 7:43 | 0:07:08 | 0:08:22 |
| mpirun -n 1 | 7:29 | 11:33 | 0:07:12 | 0:08:23 |
| mpirun -n 2 | 3:57 | 6:02 | 0:03:45 | 0:04:19 |
| mpirun -n 4 | 2:16 | 3:44 | 0:02:09 | 0:02:28 |
| mpirun -n 8 | 1:21 | 2:15 | 0:01:20 | 0:01:27 |
| mpirun -n 12 | | | 0:01:08 | 0:01:11 |
| mpirun -n 24 | | | 0:00:49 | 0:00:49 |
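To put the table in perspective, speedup relative to the single-rank run is easy to compute. A small helper with the "n79 perf" times transcribed from the table (values converted to seconds by hand):

```bash
# Speedup of the tuned n79 ("n79 perf") runs relative to mpirun -n 1.
base=432                          # 0:07:12 in seconds
for t in 432 225 129 80 68 49; do # -n 1, 2, 4, 8, 12, 24
    echo "scale=1; $base/$t" | bc
done
# -> 1.0, 1.9, 3.3, 5.4, 6.3, 8.8
```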
The “n79 perf” column shows the results after I created a tuned profile named “performance”, switched to that profile using tuned-adm, and echoed the string “performance” into the relevant /sys files. It also does not matter whether I run lmp_mpi (compiled on greentail52) or lmp_mpi-n79 (compiled on n79). Yea.
The “n79 hpc-perf” column shows the results after I spliced the /usr/lib/tuned/hpc-compute/tuned.conf settings into my “performance” profile above. That file contains a lot of memory settings. We swap out “throughput-performance” for “latency-performance” and make sure the [cpu] section still points to the performance governor.
Going with the short version, we comment out all other options; we can enable them when needed.
```
# original profile
[main]
include=throughput-performance

[cpu]
governor=performance

# set the governor directly in the current sysfs files
for i in `ls /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor`; do
    echo performance > $i
done
cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
# holds on reboot? yes

# create and activate the tuned profile
cd /usr/lib/tuned/
mkdir performance
vi performance/tuned.conf
tuned-adm profile
tuned-adm profile performance
tuned-adm active
```
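For reference, the spliced “hpc-perf” variant of the profile might look something like this. This is a sketch, not an exact copy of hpc-compute/tuned.conf; the [vm] entry is one example of the memory settings mentioned above:

```
# hypothetical /usr/lib/tuned/performance/tuned.conf after splicing
[main]
include=latency-performance

[cpu]
governor=performance

[vm]
# example hpc-compute memory setting; remaining options stay
# commented out until needed
transparent_hugepages=always
```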