Optimal execution of workloads with Kubernetes/OpenShift

Workloads in areas such as Telco 5G, financial services and data analytics often have requirements for optimal, low-latency execution.

A popular answer is to deploy DPDK network applications on Kubernetes/OpenShift with the special performance-focused configurations outlined in this article.

DPDK ("Data Plane Development Kit") explained: bypass Kernel for Performance!

As demonstrated here, even when the OS and application are optimised to the extreme, DPDK still holds up to a 51% performance lead over the kernel networking stack.


Let's have a look at how that "magic" works!

DPDK is the "Data Plane Development Kit". It aims at achieving high I/O performance and high packet-processing rates, largely by bypassing the kernel for network processing, so your networking applications operate in userspace.

For example, instead of the NIC raising an interrupt to the CPU when a frame is received, the CPU runs a "poll mode driver" (PMD) to constantly poll the NIC for new packets.



However, you need a dedicated NIC for DPDK, which can be an issue within virtualised/containerised environments.

SR-IOV or "single root I/O virtualisation" to the rescue! It is explained in the next section.

SR-IOV ("single root I/O virtualisation") explained: Turn one NIC into many!


Providing network connectivity to VMs/containers on heavily virtualised/containerised servers is a challenge. The hypervisor can share NICs between the VMs/containers using software, but at reduced network speed and with high CPU overhead.

A better approach is to build a single NIC that appears as multiple NICs to the software. It has one physical Ethernet socket, but appears on the PCI Express bus as multiple NICs. The SR-IOV standard calls the master NIC the Physical Function (PF) and its VM-facing virtual NICs the Virtual Functions (VFs).

DPDK supports several SR-IOV network drivers, enabling the creation of a PF (Physical Function) and VFs, and allowing containers/VMs to be launched with VFs assigned to them using PCI passthrough.
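
For illustration, here is a minimal SriovNetworkNodePolicy sketch for the OpenShift SR-IOV Network Operator; the interface name, node label, resource name and VF count are placeholder assumptions to adapt to your own hardware:

# Sketch: ask the SR-IOV Network Operator to carve 8 VFs out of the
# physical function (PF) "ens3f0" and expose them to pods as the
# resource "openshift.io/dpdk_nics". All values are illustrative.
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: dpdk-vfs-policy
  namespace: openshift-sriov-network-operator
spec:
  resourceName: dpdk_nics                  # exposed as openshift.io/dpdk_nics
  nodeSelector:
    feature.node.kubernetes.io/network-sriov.capable: "true"
  numVfs: 8                                # how many VFs to create on the PF
  nicSelector:
    pfNames: ["ens3f0"]                    # the physical NIC (assumption)
  deviceType: vfio-pci                     # bind VFs to vfio-pci for userspace (DPDK) use

Setting deviceType to vfio-pci is what makes the resulting VFs usable by a DPDK application in userspace rather than by the kernel networking stack.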

"NUMA nodes" explained: get fast CPU-memory interaction!

Simply put, by default a CPU can interact with memory (RAM) that is physically attached either to itself or to other CPUs.

The combination of a CPU and its own physically attached RAM is called a "NUMA node" where supported. Local memory access is a major advantage, as it combines low latency with high bandwidth.



Thanks to the Kubernetes/OpenShift "Topology Manager", you can control the "placement" (affinity) of your containers among NUMA nodes.
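
As a minimal sketch (assuming your worker MachineConfigPool carries the label custom-kubelet: topology), a KubeletConfig like the one below enables the static CPU manager and the single-numa-node Topology Manager policy; on OpenShift the same can also be driven via a PerformanceProfile:

# Sketch: enable the static CPU manager and require CPU/memory/device
# allocations for Guaranteed pods to land on a single NUMA node.
# The machineConfigPoolSelector label is an assumption for your cluster.
apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: single-numa-kubelet
spec:
  machineConfigPoolSelector:
    matchLabels:
      custom-kubelet: topology             # label your worker MCP accordingly (assumption)
  kubeletConfig:
    cpuManagerPolicy: static               # pin whole CPUs to Guaranteed pods
    cpuManagerReconcilePeriod: 5s
    topologyManagerPolicy: single-numa-node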

Dedicated CPUs and "huge page" memory

In general, memory is managed in blocks known as pages. On most systems, a page is 4Ki. CPUs have a built-in memory management unit that manages a list of these pages in hardware, using a deliberately small hardware cache of virtual-to-physical page mappings: the Translation Lookaside Buffer (TLB). CPU instructions reference the virtual addresses of pages. When such a virtual address is already in the TLB, the virtual-to-physical mapping can be resolved very quickly. If not, we have a "TLB miss", and the address has to be resolved through the much slower page-table lookup, often resulting in performance issues at high load. Since the limited size of the TLB hardware cache is fixed, the only way to reduce the chance of a TLB miss is to... increase the page size, so there are fewer pages to keep the address-resolution mechanism busy!

In this context, a "huge page" is a memory page that is much larger than the standard 4Ki. For your reference, on x86_64 architectures there are two common huge page sizes: 2Mi and 1Gi. For example, mapping 1Gi of memory requires 262,144 standard 4Ki pages, but only a single 1Gi huge page, i.e. 262,144 times fewer TLB entries to juggle.

Read here about the "huge pages" support in OpenShift.
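
To give a feel for how a workload asks for huge pages, here is a minimal pod sketch; the image name and the resource amounts are placeholders. Huge pages are requested like any other resource and can be mounted via an emptyDir volume of medium HugePages:

# Sketch: a pod requesting 2Gi worth of 1Gi huge pages, mounted at /dev/hugepages.
# Image name and resource amounts are illustrative placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: hugepages-demo
spec:
  containers:
  - name: app
    image: registry.example.com/my-app:latest         # placeholder image
    resources:
      requests:
        memory: "1Gi"
        cpu: "2"
        hugepages-1Gi: "2Gi"
      limits:                                          # equal to requests => Guaranteed QoS
        memory: "1Gi"
        cpu: "2"
        hugepages-1Gi: "2Gi"
    volumeMounts:
    - name: hugepages
      mountPath: /dev/hugepages
  volumes:
  - name: hugepages
    emptyDir:
      medium: HugePages

Because requests equal limits (and the CPU request is a whole number), the pod also lands in the Guaranteed QoS class, which is what the static CPU manager needs in order to give it dedicated CPUs.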

Dedicated CPUs are largely self-explanatory -- you explicitly control the "placement" (affinity) of your containers on specific CPUs.

OpenShift supports the "PerformanceProfile" YAML config (a "Kind"), which helps with the latter as well as with plenty of other relevant performance tuning.
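
To give you a taste, here is a minimal PerformanceProfile sketch; the CPU ranges, huge page counts and node selector are assumptions to adapt to your hardware (see the "typical baremetal" example linked below for a fuller one):

# Sketch: reserve CPUs 0-3 for housekeeping, isolate 4-15 for latency-sensitive
# pods, pre-allocate 1Gi huge pages and enforce single-NUMA-node placement.
# All values are illustrative assumptions.
apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
  name: dpdk-performance
spec:
  cpu:
    reserved: "0-3"                        # housekeeping / OS threads
    isolated: "4-15"                       # dedicated to low-latency workloads
  hugepages:
    defaultHugepagesSize: "1G"
    pages:
    - size: "1G"
      count: 8
  numa:
    topologyPolicy: "single-numa-node"
  nodeSelector:
    node-role.kubernetes.io/worker-cnf: "" # placeholder node role label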

See also:

  • Performance addons operator advanced configuration

  • A "typical baremetal" PerformanceProfile : https://github.com/openshift-kni/cnf-features-deploy/blob/master/feature-configs/typical-baremetal/performance/performance_profile.patch.yaml

Conclusion

In OpenShift 4, it is possible to use the DPDK libraries and attach a network interface (virtual function) directly to the pod.

To simplify the application-building process, you can leverage Red Hat's DPDK builder image from the Red Hat registry. This base image allows developers to build applications powered by DPDK.
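
Putting the pieces together, a DPDK pod might look like the sketch below; the network attachment name (sriov-dpdk-net), the VF resource name (openshift.io/dpdk_nics) and the image are assumptions that depend on your SriovNetwork / SriovNetworkNodePolicy objects:

# Sketch: a DPDK pod that receives an SR-IOV VF via a network attachment,
# plus huge pages and whole CPUs (Guaranteed QoS). Names are assumptions.
apiVersion: v1
kind: Pod
metadata:
  name: dpdk-app
  annotations:
    k8s.v1.cni.cncf.io/networks: sriov-dpdk-net       # SriovNetwork / net-attach-def name (assumption)
spec:
  containers:
  - name: dpdk
    image: registry.example.com/my-dpdk-app:latest    # e.g. built on a DPDK base image (placeholder)
    securityContext:
      capabilities:
        add: ["IPC_LOCK"]                             # DPDK apps typically need to lock huge page memory
    resources:
      requests:
        cpu: "4"
        memory: "1Gi"
        hugepages-1Gi: "4Gi"
        openshift.io/dpdk_nics: "1"                   # the VF resource exposed by the SR-IOV operator (assumption)
      limits:
        cpu: "4"
        memory: "1Gi"
        hugepages-1Gi: "4Gi"
        openshift.io/dpdk_nics: "1"
    volumeMounts:
    - name: hugepages
      mountPath: /dev/hugepages
  volumes:
  - name: hugepages
    emptyDir:
      medium: HugePages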

Another convenient simplification of performance-focused networking configuration is declarative node network configuration ("nmstate"), which abstracts away the "zoo" of networking configuration tools.
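
For a small flavour of nmstate, here is a hedged NodeNetworkConfigurationPolicy sketch that declaratively configures an interface on all worker nodes; the interface name and the MTU value are assumptions:

# Sketch: declaratively set up interface "ens3f0" (placeholder name)
# on worker nodes via kubernetes-nmstate.
apiVersion: nmstate.io/v1
kind: NodeNetworkConfigurationPolicy
metadata:
  name: ens3f0-policy
spec:
  nodeSelector:
    node-role.kubernetes.io/worker: ""
  desiredState:
    interfaces:
    - name: ens3f0                         # placeholder interface name
      type: ethernet
      state: up
      mtu: 9000                            # e.g. jumbo frames (assumption)
      ipv4:
        enabled: false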



You can certainly benefit from it with OpenShift as well. Example:

  • https://github.com/openshift-kni/baremetal-deploy/tree/master/features/kubernetes-nmstate

Thank you!

In 2021 I published an Udemy course about Kubernetes configuration.

In 2022 I got my "Professional Cloud Network Engineer" certification (GCP) out of curiosity about networking.

In 2023 I published this intro article about achieving performance with Kubernetes/OpenShift.

I hope you are now comfortable reading the following sentence :-)

"DPDK based network applications may require dedicated CPUs, huge page memory, and SR-IOV VF-s on the same NUMA node for optimal, low-latency execution."
