Optimal execution of workloads with Kubernetes/OpenShift

Workloads in areas such as Telco 5G, financial services and data analytics often have requirements for optimal, low-latency execution.

A popular answer is to deploy DPDK network applications on Kubernetes/OpenShift with the special performance-focused configurations outlined in this article.

DPDK ("Data Plane Development Kit") explained: bypass Kernel for Performance!

As demonstrated here, even when the OS and application are optimised to the extreme, DPDK still holds up to a 51% performance lead over the kernel networking stack.


Let's have a look at how that "magic" works!

DPDK is the "Data Plane Development Kit". It aims at achieving high I/O performance and high packet-processing rates, largely by bypassing the kernel for network processing, so your networking applications operate in userspace.

For example, instead of the NIC raising an interrupt to the CPU when a frame is received, the CPU runs a "poll mode driver" (PMD) to constantly poll the NIC for new packets.



However, you need a dedicated NIC for DPDK, which can be an issue within virtualised/containerised environments.

SR-IOV or "single root I/O virtualisation" to the rescue! It is explained in the next section.

SR-IOV ("single root I/O virtualisation") explained: Turn one NIC into many!


Providing network connectivity to VMs/containers on heavily virtualised/containerised servers is a challenge. The hypervisor can share NICs between the VMs/containers using software, but at reduced network speed and with high CPU overhead.

A better approach is to build a single NIC that appears as multiple NICs to the software. It has one physical Ethernet socket, but appears on the PCI Express bus as multiple NICs. The SR-IOV standard calls the master NIC the Physical Function (PF) and its VM-facing virtual NICs the Virtual Functions (VFs).

DPDK supports several SR-IOV network drivers, enabling the creation of a PF (Physical Function) and VFs, and allowing containers/VMs to be launched with VFs assigned to them using PCI passthrough.
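
For illustration, here is a minimal SriovNetworkNodePolicy sketch for the OpenShift SR-IOV Network Operator; the interface name, node label, resource name and VF count are placeholder assumptions to adapt to your own hardware:

# Sketch: ask the SR-IOV Network Operator to carve 8 VFs out of the
# physical function (PF) "ens3f0" and expose them to pods as the
# resource "openshift.io/dpdk_nics". All values are illustrative.
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: dpdk-vfs-policy
  namespace: openshift-sriov-network-operator
spec:
  resourceName: dpdk_nics                  # exposed as openshift.io/dpdk_nics
  nodeSelector:
    feature.node.kubernetes.io/network-sriov.capable: "true"
  numVfs: 8                                # how many VFs to create on the PF
  nicSelector:
    pfNames: ["ens3f0"]                    # the physical NIC (assumption)
  deviceType: vfio-pci                     # bind VFs to vfio-pci for userspace (DPDK) use

Setting deviceType to vfio-pci is what makes the resulting VFs usable by a DPDK application in userspace rather than by the kernel networking stack.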

"NUMA nodes" explained: get fast CPU-memory interaction!

Simply put, by default a CPU can interact with memory (RAM) that is physically attached either to itself or to other CPUs.

The combination of a CPU and its own physically attached RAM is called a "NUMA node" where supported. Local memory access is a major advantage, as it combines low latency with high bandwidth.



Thanks to the Kubernetes/OpenShift "Topology Manager", you can control the "placement" (affinity) of your containers among NUMA nodes.
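
As a minimal sketch (assuming your worker MachineConfigPool carries the label custom-kubelet: topology), a KubeletConfig like the one below enables the static CPU manager and the single-numa-node Topology Manager policy; on OpenShift the same can also be driven via a PerformanceProfile:

# Sketch: enable the static CPU manager and require CPU/memory/device
# allocations for Guaranteed pods to land on a single NUMA node.
# The machineConfigPoolSelector label is an assumption for your cluster.
apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: single-numa-kubelet
spec:
  machineConfigPoolSelector:
    matchLabels:
      custom-kubelet: topology             # label your worker MCP accordingly (assumption)
  kubeletConfig:
    cpuManagerPolicy: static               # pin whole CPUs to Guaranteed pods
    cpuManagerReconcilePeriod: 5s
    topologyManagerPolicy: single-numa-node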

Dedicated CPUs and "huge page" memory

In general, memory is managed in blocks known as pages. On most systems, a page is 4Ki. CPUs have a built-in memory management unit that manages a list of these pages in hardware, using a deliberately small hardware cache of virtual-to-physical page mappings: the Translation Lookaside Buffer (TLB). CPU instructions reference the virtual addresses of pages. When such a virtual address is already in the TLB, the virtual-to-physical mapping can be resolved very quickly. If not, we have a "TLB miss", and the address has to be resolved through the much slower page-table lookup, often resulting in performance issues at high load. Since the limited size of the TLB hardware cache is fixed, the only way to reduce the chance of a TLB miss is to... increase the page size, so there are fewer pages to keep the address-resolution mechanism busy!

In this context, a "huge page" is a memory page that is much larger than the standard 4Ki. For your reference, on x86_64 architectures there are two common huge page sizes: 2Mi and 1Gi. For example, mapping 1Gi of memory requires 262,144 standard 4Ki pages, but only a single 1Gi huge page, i.e. 262,144 times fewer TLB entries to juggle.

Read here about the "huge pages" support in OpenShift.
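
To give a feel for how a workload asks for huge pages, here is a minimal pod sketch; the image name and the resource amounts are placeholders. Huge pages are requested like any other resource and can be mounted via an emptyDir volume of medium HugePages:

# Sketch: a pod requesting 2Gi worth of 1Gi huge pages, mounted at /dev/hugepages.
# Image name and resource amounts are illustrative placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: hugepages-demo
spec:
  containers:
  - name: app
    image: registry.example.com/my-app:latest         # placeholder image
    resources:
      requests:
        memory: "1Gi"
        cpu: "2"
        hugepages-1Gi: "2Gi"
      limits:                                          # equal to requests => Guaranteed QoS
        memory: "1Gi"
        cpu: "2"
        hugepages-1Gi: "2Gi"
    volumeMounts:
    - name: hugepages
      mountPath: /dev/hugepages
  volumes:
  - name: hugepages
    emptyDir:
      medium: HugePages

Because requests equal limits (and the CPU request is a whole number), the pod also lands in the Guaranteed QoS class, which is what the static CPU manager needs in order to give it dedicated CPUs.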

Dedicated CPUs are largely self-explanatory -- you explicitly control the "placement" (affinity) of your containers on specific CPUs.

OpenShift supports the "PerformanceProfile" YAML config (a "Kind"), which helps with the latter as well as with plenty of other relevant performance tuning.
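
To give you a taste, here is a minimal PerformanceProfile sketch; the CPU ranges, huge page counts and node selector are assumptions to adapt to your hardware (see the "typical baremetal" example linked below for a fuller one):

# Sketch: reserve CPUs 0-3 for housekeeping, isolate 4-15 for latency-sensitive
# pods, pre-allocate 1Gi huge pages and enforce single-NUMA-node placement.
# All values are illustrative assumptions.
apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
  name: dpdk-performance
spec:
  cpu:
    reserved: "0-3"                        # housekeeping / OS threads
    isolated: "4-15"                       # dedicated to low-latency workloads
  hugepages:
    defaultHugepagesSize: "1G"
    pages:
    - size: "1G"
      count: 8
  numa:
    topologyPolicy: "single-numa-node"
  nodeSelector:
    node-role.kubernetes.io/worker-cnf: "" # placeholder node role label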

See also:

  • Performance addons operator advanced configuration

  • A "typical baremetal" PerformanceProfile : https://github.com/openshift-kni/cnf-features-deploy/blob/master/feature-configs/typical-baremetal/performance/performance_profile.patch.yaml

Conclusion

In OpenShift 4, it is possible to use the DPDK libraries and attach a network interface (virtual function) directly to the pod.

To simplify the application-building process, you can leverage Red Hat's DPDK builder image from the Red Hat registry. This base image allows developers to build applications powered by DPDK.
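
Putting the pieces together, a DPDK pod might look like the sketch below; the network attachment name (sriov-dpdk-net), the VF resource name (openshift.io/dpdk_nics) and the image are assumptions that depend on your SriovNetwork / SriovNetworkNodePolicy objects:

# Sketch: a DPDK pod that receives an SR-IOV VF via a network attachment,
# plus huge pages and whole CPUs (Guaranteed QoS). Names are assumptions.
apiVersion: v1
kind: Pod
metadata:
  name: dpdk-app
  annotations:
    k8s.v1.cni.cncf.io/networks: sriov-dpdk-net       # SriovNetwork / net-attach-def name (assumption)
spec:
  containers:
  - name: dpdk
    image: registry.example.com/my-dpdk-app:latest    # e.g. built on a DPDK base image (placeholder)
    securityContext:
      capabilities:
        add: ["IPC_LOCK"]                             # DPDK apps typically need to lock huge page memory
    resources:
      requests:
        cpu: "4"
        memory: "1Gi"
        hugepages-1Gi: "4Gi"
        openshift.io/dpdk_nics: "1"                   # the VF resource exposed by the SR-IOV operator (assumption)
      limits:
        cpu: "4"
        memory: "1Gi"
        hugepages-1Gi: "4Gi"
        openshift.io/dpdk_nics: "1"
    volumeMounts:
    - name: hugepages
      mountPath: /dev/hugepages
  volumes:
  - name: hugepages
    emptyDir:
      medium: HugePages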

Another convenient simplification of performance-focused networking configuration is declarative node network configuration ("nmstate"), which abstracts away the "zoo" of networking configuration tools.
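
For a small flavour of nmstate, here is a hedged NodeNetworkConfigurationPolicy sketch that declaratively configures an interface on all worker nodes; the interface name and the MTU value are assumptions:

# Sketch: declaratively set up interface "ens3f0" (placeholder name)
# on worker nodes via kubernetes-nmstate.
apiVersion: nmstate.io/v1
kind: NodeNetworkConfigurationPolicy
metadata:
  name: ens3f0-policy
spec:
  nodeSelector:
    node-role.kubernetes.io/worker: ""
  desiredState:
    interfaces:
    - name: ens3f0                         # placeholder interface name
      type: ethernet
      state: up
      mtu: 9000                            # e.g. jumbo frames (assumption)
      ipv4:
        enabled: false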



You can certainly benefit from it with OpenShift as well. Example:

  • https://github.com/openshift-kni/baremetal-deploy/tree/master/features/kubernetes-nmstate

Thank you!

In 2021 I published an Udemy course about Kubernetes configuration.

In 2022 I got my "Professional Cloud Network Engineer" certification (GCP) out of curiosity about networking.

In 2023 I published this intro article about achieving performance with Kubernetes/OpenShift.

I hope you are now comfortable reading the following sentence :-)

"DPDK based network applications may require dedicated CPUs, huge page memory, and SR-IOV VF-s on the same NUMA node for optimal, low-latency execution."
