How AI/ML Combines with eBPF to Help Troubleshoot, Secure, and Monitor Linux Networking
At the recent Linux Plumbers Conference, there were at least 30 talks on eBPF (extended Berkeley Packet Filter), and its popularity has been increasing steadily for the past few years. It has quickly become not just an invaluable technology but also an in-demand skill. This blog is an attempt to share my understanding of eBPF, along with my perspectives and predictions on how this technology may evolve. I hope it inspires you to take a closer look at this technology and develop an appreciation for it. Special kudos to Isovalent's Liz Rice for her hands-on and easy-to-understand workshop[1].
My journey into operating system troubleshooting began fresh out of college when I landed a role with IBM, joining the celebrated TCP/IP team tasked with developing the next version of OS/2, named Aurora. Grateful for the opportunity, I eagerly absorbed knowledge, devouring TCP/IP Illustrated books alongside my tasks. My fascination was sparked by tcpdump, our sole tool at that time (probably dating myself :-)). As seen in Figure 1 below, packets can vanish at various points not tracked by tcpdump. eBPF comes to the rescue here, providing packet flow insight, enhancing security, and adding observability to network infrastructure.
Before we delve into interesting use cases, let's refresh our understanding of eBPF.
What Is BPF/eBPF, and Why Is It Important?
BPF (Berkeley Packet Filter) is a small virtual machine that runs programs injected from user space inside the kernel, without changing or recompiling the kernel code. It was first introduced in 1992 and became best known as the packet filter language behind tcpdump.
BPF evolved into what we call “extended BPF” or “eBPF” starting in Linux kernel version 3.18 in 2014.
eBPF is a revolutionary kernel technology that allows developers to write custom code that can be loaded into the kernel dynamically, changing the way the kernel behaves.
This enables a new generation of highly performant networking, observability, and security tools. And as you’ll see, if you want to instrument an app with these eBPF-based tools, you don’t need to modify or reconfigure the app in any way, thanks to eBPF’s vantage point: a powerful and privileged position within the kernel, as shown in Figure 2.
Just a few of the things you can do with eBPF include:
eBPF programs are event-driven and run when the kernel or an application passes a certain hook point. Predefined hooks include system calls, function entry/exit, kernel tracepoints, network events, and several others, as shown in Figure 3.
If a predefined hook does not exist for a particular need, it is possible to create a kernel probe (kprobe) or user probe (uprobe) to attach eBPF programs almost anywhere in the kernel or in user applications, as shown in Figure 4.
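To make the event-driven idea concrete, here is a minimal sketch in plain Python. It is only a user-space model, not an eBPF API: the `HookRegistry` class and the hook names are my own illustrative assumptions. It shows the one property that matters here: an attached program runs only when its hook point is hit.

```python
from collections import defaultdict
from typing import Callable, Dict, List

# Hypothetical user-space model of hook points; this is NOT a real eBPF API.
Handler = Callable[[dict], None]


class HookRegistry:
    def __init__(self) -> None:
        self._handlers: Dict[str, List[Handler]] = defaultdict(list)

    def attach(self, hook: str, handler: Handler) -> None:
        """Attach a handler (standing in for an eBPF program) to a named hook."""
        self._handlers[hook].append(handler)

    def fire(self, hook: str, event: dict) -> int:
        """Run every handler attached to `hook`; return how many ran."""
        handlers = self._handlers.get(hook, [])
        for handler in handlers:
            handler(event)
        return len(handlers)


registry = HookRegistry()
opened_files: List[str] = []

# Attach a "program" to a syscall-entry style hook (the hook name is illustrative).
registry.attach("sys_enter_openat", lambda ev: opened_files.append(ev["filename"]))

registry.fire("sys_enter_openat", {"pid": 42, "filename": "/etc/hosts"})
registry.fire("sys_enter_write", {"pid": 42, "fd": 1})  # nothing attached, nothing runs

print(opened_files)
```

In the real kernel the "registry" is the set of attach points the kernel exposes, and the handler is verified bytecode rather than a Python callable.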
How are eBPF programs written?
In many scenarios, eBPF is not used directly but indirectly, via projects like Cilium, bcc, or bpftrace. These provide an abstraction on top of eBPF: instead of writing programs yourself, you specify intent-based definitions, which the project then implements with eBPF.
If no higher-level abstraction exists, programs need to be written directly. The Linux kernel expects eBPF programs to be loaded in the form of bytecode. While it is of course possible to write bytecode directly, the more common development practice is to use a compiler suite like LLVM (Low Level Virtual Machine) to compile pseudo-C code into eBPF bytecode.
Maps
eBPF programs use eBPF maps to store, share, and retrieve data across kernel and user space, which enables both state storage and information sharing.
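As a rough mental model, a map is a key/value store that the kernel-side program writes and a user-space tool reads. The sketch below is plain Python, not the kernel map API, and the class names and `NUM_CPUS` value are my own assumptions. It contrasts a shared hash map with a per-CPU one, where each CPU updates its own slot and the reader sums the slots.

```python
from collections import Counter
from typing import Dict, List

NUM_CPUS = 4  # assumed CPU count for this model


class SharedHashMap:
    """One table shared by all CPUs (models a shared hash map)."""

    def __init__(self) -> None:
        self.table: Dict[str, int] = {}

    def increment(self, key: str, cpu: int) -> None:
        # In the kernel, concurrent updates to a shared slot need atomics.
        self.table[key] = self.table.get(key, 0) + 1

    def lookup(self, key: str) -> int:
        return self.table.get(key, 0)


class PerCpuHashMap:
    """One private table per CPU; the reader aggregates across CPUs."""

    def __init__(self, num_cpus: int) -> None:
        self.tables: List[Counter] = [Counter() for _ in range(num_cpus)]

    def increment(self, key: str, cpu: int) -> None:
        self.tables[cpu][key] += 1  # no contention: each CPU owns its slot

    def lookup(self, key: str) -> int:
        return sum(table[key] for table in self.tables)


# (cpu, source address) pairs standing in for packets seen on each CPU.
events = [(0, "10.0.0.1"), (1, "10.0.0.1"), (3, "10.0.0.2"), (1, "10.0.0.1")]

shared = SharedHashMap()
per_cpu = PerCpuHashMap(NUM_CPUS)
for cpu, src in events:
    shared.increment(src, cpu)
    per_cpu.increment(src, cpu)

print(shared.lookup("10.0.0.1"), per_cpu.lookup("10.0.0.1"))
```

Both variants report the same total; the per-CPU layout trades a little extra work on the reader side for contention-free updates on the hot path.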
The following is an incomplete list of supported map types to give an understanding of the diversity in data structures. For various map types, both a shared and a per-CPU variation are available.
In summary, eBPF programs allow safe and efficient access to kernel operations by:
eBPF for Networking
As the figure shows, an ingress packet destined for an application has to traverse the network stack on the host and then again in the pod; adding eBPF avoids this duplicate traversal.
eBPF-based XDP (eXpress Data Path) offers high-performance packet processing within the Linux kernel, ideal for tasks like distributed denial-of-service (DDoS) mitigation and packet monitoring. Kernel bypass techniques, such as the Data Plane Development Kit (DPDK), aim to achieve even lower latency and higher throughput by circumventing the kernel entirely. Both approaches have their strengths and trade-offs: XDP provides kernel-based efficiency and compatibility, while kernel bypass offers ultra-low latency at the expense of increased complexity in user space applications.
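To give a feel for what an XDP program decides, here is a small Python model of the verdict logic: look at the raw frame, check bounds, and return a drop or pass action. A real XDP program is written in restricted C and runs in the kernel; this is only a sketch. The `XDP_DROP` and `XDP_PASS` values match the kernel's action codes, while `xdp_filter`, `build_frame`, and the blocklist address are assumptions made up for illustration.

```python
import socket
import struct

# Action codes as defined by the kernel's XDP action enum.
XDP_DROP = 1
XDP_PASS = 2

ETH_HLEN = 14        # Ethernet header length in bytes
ETH_P_IP = 0x0800    # EtherType for IPv4
IPV4_MIN_HLEN = 20   # IPv4 header length without options

# Hypothetical blocklist; the address is from a documentation range.
BLOCKED_SOURCES = {"203.0.113.7"}


def xdp_filter(frame: bytes) -> int:
    """Return a verdict for one raw Ethernet frame."""
    # Bounds check before touching packet data, as an XDP program must do.
    if len(frame) < ETH_HLEN + IPV4_MIN_HLEN:
        return XDP_PASS
    (ethertype,) = struct.unpack_from("!H", frame, 12)
    if ethertype != ETH_P_IP:
        return XDP_PASS
    # The IPv4 source address sits 12 bytes into the IP header.
    src = socket.inet_ntoa(frame[ETH_HLEN + 12:ETH_HLEN + 16])
    return XDP_DROP if src in BLOCKED_SOURCES else XDP_PASS


def build_frame(src_ip: str, dst_ip: str) -> bytes:
    """Build a minimal Ethernet + IPv4 header pair for testing (no payload)."""
    eth = b"\x00" * 6 + b"\x00" * 6 + struct.pack("!H", ETH_P_IP)
    ip = struct.pack(
        "!BBHHHBBH4s4s",
        0x45, 0, IPV4_MIN_HLEN, 0, 0, 64, 6, 0,
        socket.inet_aton(src_ip), socket.inet_aton(dst_ip),
    )
    return eth + ip


print(xdp_filter(build_frame("203.0.113.7", "192.0.2.1")))
print(xdp_filter(build_frame("198.51.100.9", "192.0.2.1")))
```

Because the decision is made before the kernel allocates its usual per-packet structures, dropping unwanted traffic this early is cheap, which is why the DDoS use case fits so well.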
Now that we know about eBPF, let's explore how and where it is currently leveraged under the hood for network troubleshooting, security, and observability.
eBPF in Kubernetes
eBPF represents a more modern and capable approach, addressing many of iptables' inherent limitations. Currently, Kubernetes uses iptables for:
iptables is widely supported and is the default operating model for a new Kubernetes cluster. Unfortunately, it runs into a few problems:
Under heavy traffic or frequent changes, iptables causes unpredictable performance degradation due to its sequential rule evaluation and the need for consistent rule updates, leading to significant penalties at scale. For instance, updating iptables rules for a 20,000-service cluster could take up to five hours, as found by Huawei.
As shown in Figure 10, after replacing iptables with eBPF in Kubernetes networking, performance tests for throughput, CPU usage, and latency indicate that eBPF scales effectively, even with 1 million rules. In contrast, iptables does not scale as well, showing a considerable performance hit with even a low number of rules, such as 1k or 10k, compared to eBPF.
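The scaling gap comes down to how a packet is matched against the rules. The sketch below is a plain-Python model of the two lookup strategies only, not a benchmark of iptables or eBPF; the rule layout and function names are assumptions of mine. It counts how many rules a sequential scan has to inspect versus a single hash-table lookup of the kind an eBPF map provides.

```python
from typing import Dict, List, Optional, Tuple

Rule = Tuple[str, int, str]  # (service IP, port, backend name)


def make_rules(n: int) -> List[Rule]:
    """Generate n synthetic service rules with unique IPs."""
    return [
        (f"10.{i // 65536}.{(i // 256) % 256}.{i % 256}", 80, f"backend-{i}")
        for i in range(n)
    ]


def linear_match(rules: List[Rule], ip: str, port: int) -> Tuple[Optional[str], int]:
    """Sequential evaluation: walk the rules until one matches."""
    comparisons = 0
    for rule_ip, rule_port, backend in rules:
        comparisons += 1
        if rule_ip == ip and rule_port == port:
            return backend, comparisons
    return None, comparisons


def build_hash_table(rules: List[Rule]) -> Dict[Tuple[str, int], str]:
    """Hash-based evaluation: index the rules once by (ip, port)."""
    return {(ip, port): backend for ip, port, backend in rules}


rules = make_rules(10_000)
table = build_hash_table(rules)

last_ip = rules[-1][0]
backend, comparisons = linear_match(rules, last_ip, 80)
print(f"sequential scan: {comparisons} rules inspected -> {backend}")
print(f"hash lookup: 1 lookup -> {table[(last_ip, 80)]}")
```

In the worst case the sequential scan touches every rule, so the cost grows with the number of services, while the hash lookup stays roughly constant.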
eBPF for Security
The difference between a security tool and an observability tool that reports on events is that a security tool needs to be able to distinguish between events that are expected under normal circumstances and events that suggest malicious activity might be taking place. Policies have to take into account not just normal behavior when systems are fully functional, but also the expected error path behavior.
A security tool compares activity to a policy and takes some action when the activity is outside the policy, making it suspicious. That action would typically involve generating a security event log, which would usually get sent to a Security Information and Event Management (SIEM) platform. It might also result in an alert to a human who will be called on to investigate what happened.
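Here is a minimal sketch of that compare-and-act loop in Python. The policy format, workload names, and event fields are all hypothetical, chosen only to illustrate the idea: activity inside the policy produces nothing, while activity outside it produces a structured event that a real tool would forward to a SIEM.

```python
import json
from typing import Dict, Optional, Set

# Hypothetical policy: which binaries each workload is expected to execute.
POLICY: Dict[str, Set[str]] = {
    "web": {"/usr/sbin/nginx"},
    "db": {"/usr/bin/postgres", "/usr/bin/pg_dump"},
}


def check_exec(workload: str, binary: str, pid: int) -> Optional[dict]:
    """Return a security event if the exec is outside policy, else None."""
    allowed = POLICY.get(workload, set())
    if binary in allowed:
        return None
    return {
        "type": "exec_outside_policy",
        "workload": workload,
        "binary": binary,
        "pid": pid,
        "action": "alert",
    }


event = check_exec("web", "/bin/sh", pid=4242)
if event is not None:
    # A real tool would ship this record to a SIEM; here we just print it.
    print(json.dumps(event, sort_keys=True))
```

Note that an unknown workload has an empty allow-list in this sketch, so everything it executes is reported; whether that default is right depends on the environment.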
When an eBPF program is triggered at the entry point to a system call, it can access the arguments that user space has passed to that system call. If those arguments are pointers, the kernel will need to copy the pointed-to data into its own data structures before acting on that data. As illustrated in Figure 11, there is a window of opportunity for an attacker to modify this data after it has been inspected by the eBPF program but before the kernel copies it. Thus, the data being acted on might not be the same as what was captured by the eBPF program.
The Linux Security Module (LSM) interface provides a set of hooks that each occur just before the kernel is about to act on a kernel data structure. The function called by a hook can make a decision about whether to allow the action to go ahead.
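The race is easier to see when the steps are laid out in order. The following is a deterministic Python simulation, not kernel code; `UserMemory`, `policy_allows`, and `simulated_open` are hypothetical names of mine. A check at system call entry reads user memory, the attacker rewrites it, and only then does the kernel take its copy. A check placed where the LSM hooks sit inspects the kernel's copy instead.

```python
from typing import Callable, Optional, Tuple


class UserMemory:
    """Stands in for a user-space buffer that a syscall argument points to."""

    def __init__(self, data: bytes) -> None:
        self.buf = bytearray(data)


def policy_allows(path: bytes) -> bool:
    """Toy policy: anything under /etc/ is off limits."""
    return not path.startswith(b"/etc/")


def simulated_open(
    user_mem: UserMemory,
    attacker: Optional[Callable[[UserMemory], None]] = None,
) -> Tuple[bool, bool, bytes]:
    # 1. Check at syscall entry: reads the data still sitting in user memory.
    entry_verdict = policy_allows(bytes(user_mem.buf))
    # 2. Race window: another thread may rewrite the buffer here.
    if attacker is not None:
        attacker(user_mem)
    # 3. The kernel copies the argument into its own memory.
    kernel_copy = bytes(user_mem.buf)
    # 4. LSM-style check: inspects the copy the kernel will actually act on.
    lsm_verdict = policy_allows(kernel_copy)
    return entry_verdict, lsm_verdict, kernel_copy


def swap_path(mem: UserMemory) -> None:
    mem.buf[:] = b"/etc/shadow"


entry_ok, lsm_ok, acted_on = simulated_open(UserMemory(b"/tmp/harmless"), swap_path)
print(entry_ok, lsm_ok, acted_on)
```

The entry-time check approves a path the kernel never opens, while the later check sees the path the kernel will really use.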
Firewalling and DDoS protection are a natural fit for eBPF programs attached early in the ingress path for network packets. And with the possibility of XDP programs offloaded to hardware, malicious packets may never even reach the CPU.
For implementing more sophisticated network policies, such as Kubernetes policies determining which services are allowed to communicate with one another, eBPF programs that attach to points in the network stack can drop packets if they are determined to be out of policy.
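A service-to-service policy of this kind can be pictured as an identity lookup followed by an allow-list check. The sketch below is a plain-Python model with made-up addresses, identities, and function names, and it assumes a default-deny stance. It is not how any particular project stores its policy.

```python
from typing import Dict, Set, Tuple

# Hypothetical mapping from pod IP to service identity.
IDENTITY_BY_IP: Dict[str, str] = {
    "10.0.1.10": "frontend",
    "10.0.2.20": "backend",
    "10.0.3.30": "database",
}

# Hypothetical policy: (source service, destination service) pairs allowed to talk.
ALLOWED_FLOWS: Set[Tuple[str, str]] = {
    ("frontend", "backend"),
    ("backend", "database"),
}


def policy_verdict(src_ip: str, dst_ip: str) -> str:
    """Return "FORWARD" for an allowed flow and "DROP" otherwise (default deny)."""
    src = IDENTITY_BY_IP.get(src_ip)
    dst = IDENTITY_BY_IP.get(dst_ip)
    if src is None or dst is None:
        return "DROP"  # unknown endpoints are out of policy
    return "FORWARD" if (src, dst) in ALLOWED_FLOWS else "DROP"


print(policy_verdict("10.0.1.10", "10.0.2.20"))  # frontend to backend
print(policy_verdict("10.0.1.10", "10.0.3.30"))  # frontend to database
```

Keying the policy on service identity rather than raw IP addresses means it keeps working as pods are rescheduled and their addresses change.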
eBPF’s use in security has evolved from low-level checks on system calls to much more sophisticated use of eBPF programs for security policy checks, in-kernel event filtering, and runtime enforcement.
eBPF for Network Observability
eBPF provides an interesting tool that allows us to collect data that is otherwise not available in /proc or other static system representations.
Cilium is an open source project built on top of eBPF, replacing the need for iptables, to address the networking, security, and visibility requirements of container workloads.
Comprehensive connectivity observability requires insight across all layers, not just from a single layer or solely the application (limited to the L7 layer). Cilium, powered by eBPF, enables this, as shown in Figure 12.
Prometheus is an open source monitoring system and time series database hosted by the Cloud Native Computing Foundation. Prometheus’s third-party integrations are called “exporters”; they expose metrics from other systems so that tools like Grafana can plot them in various formats.
Grafana is an open-source data analytics and visualization web application created by Grafana Labs. It lets you visualize time series data by compiling them into charts, graphs, or maps and it even provides alerting when connected to supported data sources.
The combination of Prometheus and Grafana Agent gives you control over the metrics you want to report, where they come from, and where they’re going. Once the data is in Grafana, it can be stored in a Grafana Mimir database. Grafana dashboards consist of visualizations populated by data queried from the Prometheus data source. The PromQL query filters and aggregates the data to provide you the insight you need. With those steps, we’ve gone from raw numbers, generated by software, into Prometheus, delivered to Grafana, queried by PromQL, and visualized by Grafana as shown in Figure 13.
Successful eBPF Use Cases
Enhanced eBPF Capabilities with AI/ML Integration
Now that we understand eBPF, let's explore futuristic use cases where eBPF combines with the superpower of AI/ML.
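One simple way to picture the combination: eBPF supplies per-interval event counts, and a model flags the intervals that look unusual. The sketch below uses a basic z-score against a learned baseline as a stand-in for a trained model. The numbers, the threshold, and the function name are assumptions of mine, and a production system would likely use something far richer.

```python
import statistics
from typing import List


def flag_anomalies(
    baseline: List[float], observed: List[float], threshold: float = 3.0
) -> List[int]:
    """Return indexes of observed values more than `threshold` standard
    deviations away from the baseline mean."""
    mean = statistics.mean(baseline)
    stdev = statistics.stdev(baseline)
    if stdev == 0:
        # A perfectly flat baseline gives no scale to measure against.
        return [i for i, value in enumerate(observed) if value != mean]
    return [
        i for i, value in enumerate(observed)
        if abs(value - mean) / stdev > threshold
    ]


# Hypothetical packets-per-second counts, as an eBPF program might report them.
baseline = [100, 102, 98, 101, 99, 100, 103, 97, 100, 100]
observed = [101, 99, 950, 104]

print(flag_anomalies(baseline, observed))
```

The baseline is learned from a quiet period and kept separate from the values being scored, so a single large spike cannot inflate the statistics it is judged against.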
To summarise, here are the BCC (BPF Compiler Collection) and BPF tracing tools:
Further Reading
If you would like to learn more about eBPF, continue reading using the following additional materials:
Tutorials
Documentation
Talks
Generic
Deep Dives
Cilium
Books
Articles & Blogs
I'm deeply touched by the enthusiastic response to this eBPF blog, which has garnered over 10,000 impressions and vibrant community engagement. It's heartening to see its potential to inspire entrepreneurs, as noted in your offline and online comments. Immense gratitude for your support and the rich discussions that followed. Special thanks to Isovalent and Liz Rice for igniting my interest in eBPF with her talks on YouTube, books (referenced in the blog), and free workshops—highly recommended for those yet to explore.