Abaqus & OpenMPI
Modern processors have an increasing number of cores. AMD integrates up to 96 Zen 4 cores in the Genoa and up to 128 Zen 4c cores in the Bergamo microarchitecture - the 4th generation of EPYC processors. Intel offers up to 40 cores in the 3rd generation Xeon Scalable server processors based on the Sunny Cove microarchitecture, and the Xeon Phi 72x5 series can execute up to 288 threads on 72 cores. Nowadays even a common workstation usually offers up to 36 cores in a single-CPU machine and up to 64 in the case of two CPUs in the box.
These CPU architectures are quite complex, especially from the perspective of memory access - in particular the last-level cache (LLC) and off-chip memory. Each socket, or even each die, is physically integrated with its own block of memory, and access to this local memory is significantly faster than access to the other blocks. For some applications, to achieve optimal performance, each process or thread should use allocated local resources as close as possible to the core on which it is executed. A microarchitecture with separate memory for each processor, where the memory access time depends on the memory location relative to the processor, is called Non-Uniform Memory Access (NUMA). To take advantage of the NUMA architecture and the trend of an increasing number of cores per socket, Abaqus jobs should be executed in a specific configuration to balance performance with available memory resources.
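Before tuning anything, it is worth checking how many NUMA nodes a machine exposes and which cores and memory belong to each of them; a quick sketch using standard Linux tools:
# number of NUMA nodes and the CPU list of each node
$ lscpu | grep -i numa
# per-node memory sizes and the node distance matrix
$ numactl --hardware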
Intel Xeon and AMD EPYC processors offer hardware support for user control of processor affinity and the allocation of both LLC and off-chip memory. Processor affinity binds processes or threads to specific CPU cores and memory blocks; the bound processes or threads are then executed on the selected cores only. Because Intel and AMD processors differ significantly in terms of architecture, setting the correct processor affinity becomes even more challenging on modern CPUs.
Parallel Execution in Abaqus
The latest Abaqus releases support three parallelization schemes: thread-based, MPI-based, and a hybrid MPI- and thread-based approach. The hybrid mode is the one recommended on NUMA machines. On a single machine with multi-core CPUs or on a multi-socket machine, Abaqus/Standard uses thread-based parallelization by default, while Abaqus/Explicit uses MPI-based parallelization. A job can also easily be run on several nodes of a cluster using the Abaqus launcher if passwordless remote execution (e.g. via SSH) is configured correctly on all nodes. In this case, Abaqus/Standard uses hybrid parallelization - a single MPI process is run on each node and the number of threads per MPI process is equal to the number of processor cores assigned to that node. However, MPI and hybrid modes can be enforced on a single machine as well. The threads_per_mpi_process option can be used in conjunction with the cpus option to run an Abaqus job in MPI or hybrid mode on a single machine. The value of threads_per_mpi_process should be a divisor of the number of cores requested with the cpus option.
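For multi-node runs, the list of hosts and the number of cores used on each can also be defined in the environment file via the mp_host_list parameter; a minimal sketch, with node1 and node2 as placeholder hostnames:
# abaqus_v6.env - run across two nodes with 16 cores each (hostnames are examples)
mp_host_list=[['node1', 16], ['node2', 16]]
With such a list in place, abaqus -job job_name -cpus 32 spreads the job over both nodes, provided passwordless remote execution is configured as mentioned above.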
Let's take a closer look at running an Abaqus job in parallel in the different modes. To run an Abaqus/Standard job with thread-based parallelization, only the cpus option is needed:
abaqus -job job_name -cpus 16
By default, the job runs as one process with more than 16 threads:
This is also called a Shared Memory Parallelization (SMP) configuration. Please note that the number of threads is greater than 16. The additional "non-computational" threads are responsible for I/O operations, license checking, signaling, interprocess communication, etc.
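One quick way to see this is to list the thread count of the solver process; a small sketch, relying on the fact that the Abaqus/Standard solver process is named standard on Linux:
# NLWP = number of threads; expect a single process with more than 16 of them
$ ps -C standard -o pid,nlwp,cmd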
The same command runs Abaqus/Explicit in MPI mode - the job will be run on 16 cores with MPI-based domain-level parallelization using 16 MPI processes (ranks):
To run an Abaqus/Standard job with MPI-based parallelization, the threads_per_mpi_process option set to 1 is used:
abaqus -job job_name -cpus 16 -threads_per_mpi_process 1
This is a pure Distributed Memory Parallelization (DMP) configuration:
The threads_per_mpi_process parameter can also be used to limit the number of threads per MPI process and run a job with hybrid MPI- and thread-based parallelization:
abaqus -job job_name -cpus 16 -threads_per_mpi_process 4
In such a case, the Abaqus/Standard or Abaqus/Explicit job is run with 4 MPI processes, each with 4 computational threads (plus the "non-computational" ones):
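It is also easy to check on which CPU each thread of each rank actually runs; a sketch under the same process-name assumption as above:
# PSR = the CPU a thread last ran on; without binding these values wander across sockets
$ ps -C standard -L -o pid,tid,psr,comm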
Message Passing Interface and Abaqus
IBM Platform MPI is the default MPI implementation in the latest Abaqus releases. Intel MPI is stated as qualified, so both of them can theoretically be used. The support policy for MPI depends on the Abaqus version and sometimes even on the HotFix! Moreover, there are quite a few known issues with particular MPI implementations and Abaqus releases - for some configurations the Abaqus solvers can go into an infinite loop or crash - so be cautious with that. For more details, please see the Abaqus Program Directories. However, there is one more option: OpenMPI can be used with Abaqus.
The Open MPI Project is an open source Message Passing Interface implementation that is developed and maintained by a strong and active community - a consortium of academic, research, and industry members and contributors such as AWS, AMD, ARM, Intel, IBM, NVIDIA, Oracle and many others. OpenMPI is heavily used by HPC centers around the world, such as Sandia National Laboratories, Oak Ridge National Laboratory and Los Alamos National Laboratory.
When Abaqus is used with Platform MPI, process affinity is applied only when a job is run on more than two hosts/nodes and all cores on each host/node are used. In the past, when machines had only a few cores, running jobs on all of them was a common scenario, but this is no longer the case. Today, for many users the number of licenses defines the limit of usable cores, and this limit is usually lower than the total number of cores. As a result, in many cases the Abaqus job processes are not bound at all. This can decrease performance by 5-15% for typical jobs, and even much more for specific hardware configurations and jobs. The plot below shows the normalized total time of Abaqus benchmark jobs run in Abaqus 2023 in hybrid mode (2 MPI ranks x 7 threads each) on a NUMA machine with the default Platform MPI and with OpenMPI bound to NUMA nodes:
OpenMPI and Abaqus
OpenMPI has a few more advantages over IBM Platform MPI. Platform MPI has been replaced by IBM with Spectrum MPI (which itself is based on OpenMPI), so there are no more updates for Platform MPI and its documentation is limited. OpenMPI, in contrast, is actively developed, supported and well documented - the documentation for all versions is available on the project website. In case of any problems, the OpenMPI GitHub repository, where more than 500 issues have been reported, is a useful place to find information and help.
OpenMPI offers a very straightforward, flexible and efficient approach to process affinity. In general, it is a three-stage approach: the process is first mapped to a specific slot, then ranked, and finally bound to it. In the case of Abaqus, the user can control the last stage - binding. The Abaqus processes and their threads can be bound to a hardware thread (SMT), core, L1 cache, L2 cache, L3 cache, socket, NUMA node or an explicitly defined list of cores. On NUMA-based multi-socket machines or modern complex CPU architectures like the AMD EPYC family, the most interesting options are binding to the last-level cache (L3 cache) or to sockets/NUMA nodes.
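The corresponding mpirun option is --bind-to. A quick way to see how a given policy lays ranks out, independently of Abaqus, is to bind a trivial command and print the bindings (hostname is just a placeholder workload):
# bind 2 ranks to L3 cache domains and print the resulting affinity masks
$ mpirun -np 2 --bind-to l3cache --report-bindings hostname
# other accepted values include hwthread, core, l1cache, l2cache, socket and numa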
Last but not least, OpenMPI is seamlessly integrated with one of the best, highly scalable workload managers - Slurm. Slurm can be used effectively both on a single machine and on thousand-node clusters to manage computational resources and schedule tasks such as Abaqus jobs. It can handle license management, user access and software usage reporting as well. There are three different modes in which Slurm launches MPI jobs:
- Slurm directly launches the tasks and performs initialization of communications through the PMI APIs. This is supported by most modern MPI implementations but cannot be used with Abaqus, because an Abaqus job is not a set of simple tasks from Slurm's perspective.
- Slurm creates a resource allocation for the job and then mpirun launches the tasks using Slurm's infrastructure (srun). This mode is supported with Abaqus if OpenMPI is used.
- Slurm creates a resource allocation for the job and then mpirun launches the tasks using some mechanism other than Slurm, such as SSH or RSH. This mode is supported with Abaqus if Platform MPI is used. In this case the Abaqus job is initiated outside of Slurm's monitoring or control. To manage such tasks, access to the nodes from the batch node is required (e.g. via SSH with Host-based Authentication).
Using the second mode makes running Abaqus jobs with MPI on Slurm trivial; a minimal batch script sketch is shown below. If you are interested in installing and configuring Slurm and Abaqus, feel free to reach out to me directly or contact TECHNIA.
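The sketch assumes two nodes with 16 cores each; the job name, the core counts and the way the allocated node list is handed over to Abaqus via mp_host_list are assumptions to adapt to your site:
#!/bin/bash
#SBATCH --job-name=abaqus_job        # placeholder job name
#SBATCH --nodes=2                    # two nodes ...
#SBATCH --ntasks-per-node=1          # ... one MPI rank per node
#SBATCH --cpus-per-task=16           # 16 threads per rank

# hand the Slurm allocation to Abaqus via mp_host_list in a job-local env file
HOSTLIST=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | \
           awk -v c="$SLURM_CPUS_PER_TASK" '{printf "[\"%s\",%d],", $1, c}')
echo "mp_host_list=[${HOSTLIST%,}]" >> abaqus_v6.env

# Abaqus builds the mpirun command itself; inside the allocation OpenMPI's
# mpirun starts the remote ranks through Slurm (srun), not SSH
abaqus -job job_name -cpus 32 -interactive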
Installing and configuring OpenMPI with Abaqus
Please find below short instructions on how to install and configure OpenMPI 4.0.7 with Abaqus 2023 on a RHEL-like Linux distribution.
First, download OpenMPI from the project website, rebuild the RPM package and install it:
$ wget https://download.open-mpi.org/release/open-mpi/v4.0/openmpi-4.0.7-1.src.rpm
$ sudo rpmbuild --rebuild openmpi-4.0.7-1.src.rpm
$ sudo yum localinstall -y /root/rpmbuild/RPMS/x86_64/openmpi-4.0.7-1.el7.x86_64.rpm
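A quick sanity check that the freshly installed OpenMPI is the one found in /usr/bin before touching any Abaqus settings:
$ /usr/bin/mpirun --version
$ ompi_info | head -n 5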
Add to abaqus_v6.env (user level) or custom_v6.env (system level):
###################################################
# OpenMPI (4.0.7)
mp_mpi_implementation=OMPI
mp_mpirun_path={OMPI: '/usr/bin/mpirun'}
That's it! Now Abaqus will use OpenMPI to run jobs in MPI and hybrid parallel modes.
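To confirm that a running MPI-mode job really goes through OpenMPI, look for the launcher in the process table while the job is running; a quick sketch:
# the launcher should now be /usr/bin/mpirun (OpenMPI) instead of the bundled Platform MPI one
$ ps -ef | grep "[m]pirun"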
Binding Abaqus jobs with OpenMPI
To configure Abaqus process affinity with OpenMPI and bind the MPI ranks to NUMA nodes, add the following line to your abaqus_v6.env or custom_v6.env file:
mp_mpirun_options="--bind-to numa -report-bindings"
The first option, --bind-to numa, binds the ranks to NUMA nodes. The second option, -report-bindings, writes the affinity masks to the *.log file, showing how the Abaqus processes are bound to cores and sockets:
[budsoft20:24939] MCW rank 0 bound to socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 7[hwt 0-1]]:
[BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..]
[budsoft20:24939] MCW rank 1 bound to socket 1[core 8[hwt 0-1]], socket 1[core 9[hwt 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]], socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket 1[core 14[hwt 0-1]], socket 1[core 15[hwt 0-1]]:
[../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB]
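On CPUs where several chiplets share a socket but not an L3 cache (e.g. the AMD EPYC family), binding to the last-level cache instead of the whole NUMA node may be worth benchmarking; the same environment parameter is used, only the bind target changes:
# bind each MPI rank to an L3 cache domain instead of a whole NUMA node
mp_mpirun_options="--bind-to l3cache -report-bindings"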
From the viewpoint of the microarchitecture topology, process binding enforces the usage of the memory block assigned to the socket (node, in NUMA terminology) on which the process runs. This can be clearly shown with the numastat command.
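The per-process tables below can be produced with per-PID queries like these (25056 and 25057 are the two solver ranks of the bound run; numastat -p standard would match them by process name as well):
$ numastat -p 25056
$ numastat -p 25057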
Per-node process memory usage (in MBs) for PID 25056 (standard)
                          Node 0          Node 1           Total
                 --------------- --------------- ---------------
Huge                        0.00            0.00            0.00
Heap                        5.74            0.00            5.74
Stack                       0.28            0.00            0.28
Private                  2299.59           34.72         2334.31
---------------- --------------- --------------- ---------------
Total                    2305.61           34.72         2340.33

Per-node process memory usage (in MBs) for PID 25057 (standard)
                          Node 0          Node 1           Total
                 --------------- --------------- ---------------
Huge                        0.00            0.00            0.00
Heap                        0.00            5.74            5.74
Stack                       0.00            0.27            0.27
Private                    54.18         2238.79         2292.97
---------------- --------------- --------------- ---------------
Total                      54.18         2244.80         2298.98
As we can see, the two Abaqus standard processes (PIDs 25056 and 25057) each use Private memory almost exclusively on a different node. When process binding is not used, both processes use Private memory on both nodes at the same time:
Per-node process memory usage (in MBs) for PID 31842 (standard)
                          Node 0          Node 1           Total
                 --------------- --------------- ---------------
Huge                        0.00            0.00            0.00
Heap                        0.35            5.46            5.81
Stack                       0.20            0.07            0.27
Private                  1075.83         1223.94         2299.77
---------------- --------------- --------------- ---------------
Total                    1076.39         1229.47         2305.86

Per-node process memory usage (in MBs) for PID 31843 (standard)
                          Node 0          Node 1           Total
                 --------------- --------------- ---------------
Huge                        0.00            0.00            0.00
Heap                        5.09            0.74            5.83
Stack                       0.07            0.21            0.28
Private                   798.13         1294.96         2093.09
---------------- --------------- --------------- ---------------
Total                     803.29         1295.91         2099.20
The data bandwidth between socket 0 and the memory in node 1 (and vice versa) is significantly lower than between a socket and the memory in the same NUMA node, which decreases performance by ~10% in this case (Abaqus benchmark job s4d).
Final remarks
OpenMPI can be successfully used with the latest Abaqus releases on Linux machines. Please note that this is an unofficial and undocumented configuration - in the case of any problems, DS SIMULIA will not support it. But my experience is that OpenMPI with Abaqus tends to solve problems rather than create new ones. If you use OpenMPI with Abaqus, feel free to share your experience with other Abaqus users on my GitHub.
Happy running Abaqus jobs in parallel with OpenMPI on Linux!