How AI/ML Combines with eBPF to Help Troubleshoot, Secure, Monitor Linux Networking
Photo Credit: Aarna Sahu


At the recent Linux Plumbers Conference, there were at least 30 talks on eBPF (extended Berkeley Packet Filter), and its popularity has been increasing steadily for the past few years. It has quickly become not just an invaluable technology but also an in-demand skill. This blog is an attempt to share my understanding of eBPF, along with my perspectives and predictions on how this technology may evolve. I hope it inspires you to take a closer look at this technology and develop an appreciation for it. Special kudos to Isovalent's Liz Rice for her hands-on and easy-to-understand workshop[1].

My journey into operating system troubleshooting began fresh out of college when I landed a role with IBM, joining the celebrated TCP/IP team tasked with developing the next version of OS/2, named Aurora. Grateful for the opportunity, I eagerly absorbed knowledge, devouring TCP/IP Illustrated books alongside my tasks. My fascination was sparked by tcpdump, our sole tool at that time (probably dating myself :-)). As seen in Figure 1 below, packets can vanish at various points not tracked by tcpdump. eBPF comes to the rescue here, providing packet flow insight, enhancing security, and adding observability to network infrastructure.

Figure1. Troubleshooting beyond tcpdump (Photo Credit - Martynas Pumputis & Aditi Ghag)

Before we delve into interesting use cases, let's refresh our understanding of eBPF.

What Is BPF/eBPF, and Why Is It Important?

BPF (Berkeley Packet Filter) is a small virtual machine that runs programs injected from user space inside the kernel, without changing or recompiling the kernel code. It was first described in 1992 for BSD Unix, arrived in Linux in 1997 (kernel 2.1.75), and was best known as the packet filter language behind tcpdump.

BPF evolved to what we call “extended BPF” or “eBPF” starting in kernel version 3.18 in 2014.

eBPF is a revolutionary kernel technology that allows developers to write custom code that can be loaded into the kernel dynamically, changing the way the kernel behaves.

This enables a new generation of highly performant networking, observability, and security tools. And as you’ll see, if you want to instrument an app with these eBPF-based tools, you don’t need to modify or reconfigure the app in any way, thanks to eBPF’s vantage point: a powerful and privileged position within the kernel, as shown in Figure 2.

Just a few of the things you can do with eBPF include:

  • Performance tracing of pretty much any aspect of a system
  • High-performance networking, with built-in visibility
  • Detecting and (optionally) preventing malicious activity

Figure2. eBPF program is attached to events in kernel (Photo: Learning eBPF Book)

eBPF programs are event-driven and are run when the kernel or an application passes a certain hook point. Pre-defined hooks include system calls, function entry/exit, kernel tracepoints, network events, and several others, as shown in Figure3.

Figure3. eBPF with predefined hooks (like system call) (Photo Credit:

If a predefined hook does not exist for a particular need, it is possible to create a kernel probe (kprobe) or user probe (uprobe) to attach eBPF programs almost anywhere in kernel or user applications, as shown in Figure4.

Figure4. eBPF with user probe and kernel probe (Photo Credit:

How are eBPF programs written?

In many scenarios, eBPF is not used directly but through projects like Cilium, bcc, or bpftrace. These provide an abstraction on top of eBPF: instead of writing programs yourself, you specify intent-based definitions, which the project then implements with eBPF.

Figure5. Low Level Virtual Machine (LLVM) with Clang (compiler frontend) creates eBPF bytecode (Photo Credit:

If no higher-level abstraction exists, programs need to be written directly. The Linux kernel expects eBPF programs to be loaded in the form of bytecode. While it is of course possible to write bytecode directly, the more common development practice is to use a compiler suite like LLVM (Low Level Virtual Machine) to compile pseudo-C code into eBPF bytecode.
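To make "bytecode" a little more concrete, here is a minimal sketch that hand-assembles the smallest useful eBPF program (set r0 to 0, then exit) using the 8-byte instruction layout: an 8-bit opcode, two 4-bit register fields, a 16-bit offset, and a 32-bit immediate. The `encode` helper is my own name for illustration, and the sketch assumes a little-endian host. It only builds the bytes; it does not load anything into the kernel.

```python
import struct

# Opcode building blocks from the eBPF instruction set.
BPF_ALU64, BPF_JMP = 0x07, 0x05   # instruction classes
BPF_MOV, BPF_EXIT = 0xB0, 0x90    # operations
BPF_K = 0x00                      # operand source: immediate value


def encode(opcode, dst=0, src=0, off=0, imm=0):
    """Pack one eBPF instruction into its 8-byte form (little-endian host).

    Hypothetical helper for illustration. The second byte holds the
    destination register in the low 4 bits and the source register in
    the high 4 bits.
    """
    return struct.pack("<BBhi", opcode, (src << 4) | dst, off, imm)


# r0 = 0; exit  (r0 carries the program's return value)
program = encode(BPF_ALU64 | BPF_MOV | BPF_K, dst=0, imm=0) + encode(BPF_JMP | BPF_EXIT)

print(program.hex())
```

In practice nobody writes programs this way; Clang/LLVM emits these same 8-byte instructions from C source.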

Maps

eBPF programs use eBPF maps to store, share, and retrieve data across kernel and user space, enabling state storage and information sharing.

Figure7. eBPF maps for sharing data between kernel and user space (Photo Credit:

The following is an incomplete list of supported map types to give an understanding of the diversity in data structures. For various map types, both a shared and a per-CPU variation are available.

  • Hash tables, Arrays
  • LRU (Least Recently Used)
  • Ring Buffer
  • Stack Trace
  • LPM (Longest Prefix match)
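As an illustration of what one of these map types does, the sketch below models the semantics of an LPM (Longest Prefix Match) map in plain Python: among all stored prefixes that contain an address, the most specific one wins. The `LpmTable` class and its method names are hypothetical, and the linear scan is purely for readability; the kernel's LPM map uses a trie.

```python
import ipaddress


class LpmTable:
    """Toy model of longest-prefix-match lookup (not the kernel data structure)."""

    def __init__(self):
        self._entries = {}  # ip_network -> value

    def update(self, cidr, value):
        self._entries[ipaddress.ip_network(cidr)] = value

    def lookup(self, address):
        addr = ipaddress.ip_address(address)
        best = None
        for net, value in self._entries.items():
            # Only compare like with like (IPv4 vs IPv6).
            if addr.version != net.version or addr not in net:
                continue
            # Keep the match with the longest (most specific) prefix.
            if best is None or net.prefixlen > best[0].prefixlen:
                best = (net, value)
        return None if best is None else best[1]


table = LpmTable()
table.update("10.0.0.0/8", "allow")
table.update("10.1.0.0/16", "drop")

print(table.lookup("10.1.2.3"))     # matched by both prefixes; the /16 is more specific
print(table.lookup("10.2.0.1"))     # matched only by the /8
print(table.lookup("192.168.0.1"))  # no matching prefix
```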

In summary, eBPF programs allow safe and efficient access to kernel operations by:

  1. Providing built-in hooks for programs based on system calls, kernel functions, network events and other triggers
  2. Providing a mechanism for compiling and verifying code prior to running, which helps ensure security and stability of the system
  3. Offering a more straightforward way to enhance kernel functionality than is possible through LKMs (Linux Kernel Modules), thereby allowing even small teams to efficiently develop safe programs that run in kernel space

eBPF for Networking

Figure8. Bypassing iptables and conntrack processing with eBPF (Photo Credit -

As Figure 8 shows, an ingress packet destined for an application has to travel through the network stack on the host and then again through the pod's network stack. Adding eBPF avoids this duplicate traversal.

Figure9. eBPF based XDP(express Data Path) simplifies networking when compared to Kernel Bypass

eBPF-based XDP (eXpress Data Path) offers high-performance packet processing within the Linux kernel, ideal for tasks like Distributed Denial of Service (DDoS) mitigation and packet monitoring. Kernel bypass techniques, such as the Data Plane Development Kit (DPDK), aim to achieve even lower latency and higher throughput by circumventing the kernel entirely. Both approaches have their strengths and trade-offs, with XDP providing kernel-based efficiency and compatibility, while kernel bypass offers ultra-low latency at the expense of increased complexity in user space applications.
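The decision an XDP program makes can be modeled in a few lines. The sketch below is a plain-Python stand-in, not a real XDP program (those are written in C and run in the kernel): it inspects a raw Ethernet frame and returns a verdict. The verdict values match the kernel's `XDP_DROP` and `XDP_PASS` action codes; `xdp_filter`, `build_frame`, and the blocklisted address are assumptions made up for the example.

```python
import ipaddress
import struct

XDP_DROP, XDP_PASS = 1, 2          # kernel XDP action codes
ETH_P_IP = 0x0800                  # EtherType for IPv4
ETH_HLEN = 14                      # Ethernet header length
IPV4_SRC_OFFSET = ETH_HLEN + 12    # source address offset inside the frame


def xdp_filter(frame, blocklist):
    """Return a verdict for one frame: drop IPv4 packets from blocklisted sources."""
    # Bounds check first, as a real XDP program must do before reading headers.
    if len(frame) < ETH_HLEN + 20:
        return XDP_PASS
    (ethertype,) = struct.unpack_from("!H", frame, 12)
    if ethertype != ETH_P_IP:
        return XDP_PASS
    src = frame[IPV4_SRC_OFFSET:IPV4_SRC_OFFSET + 4]
    return XDP_DROP if src in blocklist else XDP_PASS


def build_frame(src_ip, dst_ip):
    """Hypothetical helper: a minimal Ethernet + IPv4 header for testing."""
    eth = bytes(6) + bytes(6) + struct.pack("!H", ETH_P_IP)
    ip = struct.pack("!BBHHHBBH4s4s", 0x45, 0, 20, 0, 0, 64, 6, 0,
                     ipaddress.ip_address(src_ip).packed,
                     ipaddress.ip_address(dst_ip).packed)
    return eth + ip


blocklist = {ipaddress.ip_address("203.0.113.7").packed}

print(xdp_filter(build_frame("203.0.113.7", "192.0.2.1"), blocklist))
print(xdp_filter(build_frame("198.51.100.1", "192.0.2.1"), blocklist))
```

The point of XDP is that this verdict is reached before the kernel allocates any per-packet structures, so a dropped packet costs very little.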

Now that we know about eBPF, let's explore how and where it is currently leveraged under the hood for network troubleshooting, security, and observability.

eBPF in Kubernetes

eBPF represents a more modern and capable approach, addressing many of iptables' inherent limitations. Currently, Kubernetes uses iptables for:

  1. kube-proxy: the component which implements services and load balancing by DNAT iptables rules
  2. CNI (Container Network Interface) plugins

iptables is widely supported and is the default operating model for a new Kubernetes cluster. Unfortunately, it runs into a few problems:

  • iptables updates are made by recreating and updating all rules in a single transaction.
  • iptables is implemented as a chain of rules in a linked list, so all operations are O(n).
  • iptables implements access control as a sequential list of rules (also O(n)).
  • Every time you have a new IP or port to match, rules need to be added and the chain changed.
  • It consumes a lot of resources on Kubernetes.

The shift from iptables to eBPF offers tangible benefits: improved application performance, simplified network operations, and enhanced security, leading to cost savings and better resource utilization, as shown in Figure 10.

Figure10. eBPF reducing time complexity for search/insert/delete to O(1)

Under heavy traffic or frequent changes, iptables causes unpredictable performance degradation due to its sequential rule evaluation and the need for consistent rule updates, leading to significant penalties at scale. For instance, updating iptables rules for a 20,000-service cluster could take up to five hours, as found by Huawei.

As shown in Figure 10, after replacing iptables with eBPF in Kubernetes networking, performance tests for throughput, CPU usage, and latency indicate that eBPF scales effectively, even with 1 million rules. In contrast, iptables does not scale as well, showing a considerable performance hit with even a low number of rules, such as 1k or 10k, compared to eBPF.
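The scaling difference comes down to data structures. The sketch below is a simplified model, not how either system is actually implemented: it contrasts walking a rule chain in order (iptables-style) with a single hash lookup (eBPF-map-style) for 10,000 made-up service endpoints. The function names and the rule format are assumptions for illustration.

```python
def linear_match(rules, key):
    """Walk the chain in order, as a sequential rule list does.

    Returns (action, rules_examined) so the cost is visible.
    """
    for examined, (rule_key, action) in enumerate(rules, start=1):
        if rule_key == key:
            return action, examined
    return "ACCEPT", len(rules)


def hash_match(table, key):
    """One hash lookup, independent of how many entries the table holds."""
    return table.get(key, "ACCEPT")


# 10,000 hypothetical (service IP, port) rules.
rules = [((f"10.0.{i // 256}.{i % 256}", 80), "DNAT") for i in range(10_000)]
table = dict(rules)

last_service = rules[-1][0]
action, examined = linear_match(rules, last_service)
print(action, examined)                  # the last rule costs a full chain walk
print(hash_match(table, last_service))   # same answer from a single lookup
```

The worst case for the chain grows with every service added, while the hash lookup stays roughly constant, which is the O(n) versus O(1) gap the figure illustrates.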

eBPF for Security

The difference between a security tool and an observability tool that reports on events is that a security tool needs to be able to distinguish between events that are expected under normal circumstances and events that suggest malicious activity might be taking place. Policies have to take into account not just normal behavior when systems are fully functional, but also the expected error path behavior.

A security tool compares activity to a policy and takes some action when the activity is outside the policy, making it suspicious. That action would typically involve generating a security event log, which would usually get sent to a Security Information and Event Management (SIEM) platform. It might also result in an alert to a human who will be called on to investigate what happened.

When an eBPF program is triggered at the entry point to a system call, it can access the arguments that user space has passed to that system call. If those arguments are pointers, the kernel will need to copy the pointed-to data into its own data structures before acting on that data. As illustrated in Figure 11, there is a window of opportunity for an attacker to modify this data after it has been inspected by the eBPF program but before the kernel copies it. Thus, the data being acted on might not be the same as what was captured by the eBPF program.

The Linux Security Module (LSM) interface provides a set of hooks that each occur just before the kernel is about to act on a kernel data structure. The function called by a hook can make a decision about whether to allow the action to go ahead.

Figure11. eBPF securing kernel (Photo: Learning eBPF Book)
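The race described above can be replayed deterministically in a toy model. In the sketch below, a `bytearray` stands in for user-space memory, and three steps run in order: a check at syscall entry, an attacker rewriting the buffer, and the kernel taking its own copy, which an LSM-style check then inspects. Everything here (`run_syscall`, `policy_allows`, the paths) is hypothetical and only mimics the ordering of events; no real system calls or eBPF hooks are involved.

```python
def policy_allows(path):
    """Example policy: anything except /etc/shadow is allowed."""
    return path != b"/etc/shadow"


def run_syscall(user_buffer, attacker=None):
    # 1. Check at syscall entry: inspects user-space memory as it is right now.
    entry_verdict = policy_allows(bytes(user_buffer))

    # 2. The window: another thread rewrites the buffer after the check.
    if attacker is not None:
        attacker(user_buffer)

    # 3. The kernel copies the argument into its own memory and acts on the copy.
    kernel_copy = bytes(user_buffer)

    # 4. An LSM-style check runs on the kernel's copy, so it sees what is acted on.
    lsm_verdict = policy_allows(kernel_copy)
    return entry_verdict, lsm_verdict, kernel_copy


def swap_path(buf):
    buf[:] = b"/etc/shadow"


entry_ok, lsm_ok, acted_on = run_syscall(bytearray(b"/tmp/harmless"), attacker=swap_path)
print(entry_ok, lsm_ok, acted_on)
```

The entry check approves a path the kernel never uses, while the check on the kernel's copy sees the real target. That is why enforcement belongs at LSM hooks rather than at syscall entry.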

Firewalling and DDoS protection are a natural fit for eBPF programs attached early in the ingress path for network packets. And with the possibility of XDP programs offloaded to hardware, malicious packets may never even reach the CPU.

For implementing more sophisticated network policies, such as Kubernetes policies determining which services are allowed to communicate with one another, eBPF programs that attach to points in the network stack can drop packets if they are determined to be out of policy.

eBPF’s use in security has evolved from low-level checks on system calls to much more sophisticated use of eBPF programs for security policy checks, in-kernel event filtering, and runtime enforcement.

eBPF for Network Observability

eBPF provides an interesting tool that allows us to collect data that is otherwise not available in /proc or other static system representations.

Cilium is an open source project built on top of eBPF that replaces the need for iptables and addresses the networking, security, and visibility requirements of container workloads.

Comprehensive connectivity observability requires insight across all layers, not just from a single layer or solely the application (limited to the L7 layer). Cilium, powered by eBPF, enables this, as shown in Figure 12.


Figure12. Monitoring across all layers

Prometheus is an open-source monitoring system and time series database hosted by the Cloud Native Computing Foundation. Its third-party integrations are called “exporters”; they expose metrics for Prometheus to scrape, which tools like Grafana can then plot in various formats.

Grafana is an open-source data analytics and visualization web application created by Grafana Labs. It lets you visualize time series data by compiling them into charts, graphs, or maps and it even provides alerting when connected to supported data sources.

The combination of Prometheus and Grafana Agent gives you control over the metrics you want to report, where they come from, and where they’re going. Once the data is in Grafana, it can be stored in a Grafana Mimir database. Grafana dashboards consist of visualizations populated by data queried from the Prometheus data source. The PromQL query filters and aggregates the data to provide you the insight you need. With those steps, we’ve gone from raw numbers, generated by software, into Prometheus, delivered to Grafana, queried by PromQL, and visualized by Grafana as shown in Figure 13.

Figure13. Raw data from Linux internals to visual insight with a Grafana dashboard (Photo:
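To show what the "raw numbers" look like on their way into Prometheus, here is a small sketch that renders a counter in the Prometheus text exposition format, which is what an exporter serves for scraping. The metric name `ebpf_dropped_packets_total`, its `reason` label, and the `render_counter` helper are invented for the example; a real exporter would normally use a Prometheus client library rather than format strings by hand.

```python
def render_counter(name, help_text, samples):
    """Render one counter metric in the Prometheus text exposition format.

    `samples` is a list of (labels_dict, value) pairs. Hypothetical helper.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in samples:
        label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines) + "\n"


# Values as they might be read out of an eBPF map by a user-space agent.
text = render_counter(
    "ebpf_dropped_packets_total",
    "Packets dropped, by reason.",
    [({"reason": "policy"}, 42), ({"reason": "blocklist"}, 7)],
)
print(text)

# A PromQL query over this metric, e.g. for a Grafana panel:
#   sum by (reason) (rate(ebpf_dropped_packets_total[5m]))
```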

Successful eBPF use cases

  1. Netflix uses eBPF at scale for network insights
  2. Apple uses eBPF through Falco for kernel security monitoring
  3. Ikea uses eBPF through Cilium for networking and load balancing in their private cloud
  4. Walmart uses eBPF for edge cloud load balancing
  5. Cruise uses eBPF to monitor GPU performance
  6. Sysdig uses eBPF to enable high-performance system call tracing, facilitate container-aware troubleshooting, conduct security auditing, and provide rich insights and data from the kernel
  7. Meta uses eBPF to process and load balance every packet coming into their data centers (see the Katran project)
  8. Bell Canada uses eBPF to improve Telco Networking with Segment Routing (SR)

Enhanced eBPF Capabilities with AI/ML Integration

Now that we understand eBPF, let's explore futuristic use cases where eBPF combines with the superpower of AI/ML.

  1. AI-Powered Network Security: eBPF can be used to monitor network traffic in real-time, allowing AI models to analyze patterns and detect anomalies or potential security threats, such as DDoS attacks or network intrusions. By providing detailed insights into packet flows and system calls, eBPF enables AI systems to make informed decisions on blocking malicious activities or alerting administrators.
  2. ML-Powered Performance Monitoring and Optimization: eBPF can collect detailed performance metrics from applications and the kernel, which can be fed into ML models to predict potential bottlenecks or failures. These insights can help in auto-tuning system parameters for optimal performance or in dynamically adjusting resource allocation to improve efficiency and reduce latency.
  3. AI Model Training Observability: Training AI models can be resource-intensive and time-consuming. eBPF can be used to observe and collect metrics on resource usage (CPU, memory, I/O) at a granular level during the training process. This data can help identify inefficiencies and optimize the training process, potentially reducing the time and resources required.
  4. AI in Fraud Detection: In financial services, eBPF can be used to monitor and log transactions in real-time. AI models can analyze this data to detect unusual patterns indicative of fraud. This approach allows for immediate detection and mitigation actions, significantly reducing the risk and impact of fraudulent activities.
  5. Predictive Maintenance Using ML: In IoT and industrial contexts, eBPF can collect data from various sensors and devices, which can be analyzed by ML models to predict equipment failures before they occur. This predictive maintenance can save costs and prevent downtime by scheduling repairs or replacements in advance.
  6. AI-Based Dynamic Load Balancing: For cloud services and distributed systems, eBPF can monitor traffic and system metrics, enabling AI algorithms to dynamically adjust load balancing strategies. This can ensure optimal resource utilization and improve user experience by reducing response times and avoiding bottlenecks.
  7. AI/ML-Powered Telemetry Root Cause Analysis: eBPF can provide detailed telemetry about system behavior and application performance. AI/ML models can analyze this data to automate root cause analysis of system issues, reducing the time needed to diagnose and resolve problems.
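Most of the ideas above share one pipeline: eBPF produces a stream of measurements, and a model flags what looks unusual. As a deliberately simple stand-in for the "AI/ML" part, the sketch below applies a z-score test to per-second packet counts of the kind an eBPF program could export. The numbers are made up, `find_anomalies` is a hypothetical helper, and a production system would use a far richer model than a mean and standard deviation.

```python
from statistics import mean, pstdev


def find_anomalies(baseline, observed, threshold=3.0):
    """Return indexes of observed values more than `threshold` standard
    deviations away from the baseline mean."""
    mu = mean(baseline)
    sigma = pstdev(baseline)
    if sigma == 0:
        # Flat baseline: any deviation at all is unusual.
        return [i for i, v in enumerate(observed) if v != mu]
    return [i for i, v in enumerate(observed) if abs(v - mu) / sigma > threshold]


# Packets per second during normal operation (illustrative values).
baseline = [100, 104, 98, 101, 97, 103, 99, 102]
# New measurements; the third one resembles a traffic flood.
observed = [101, 99, 950, 102]

print(find_anomalies(baseline, observed))
```

The division of labor is the interesting part: eBPF does cheap, in-kernel collection, while the heavier analysis runs in user space where models are easy to update.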

To summarize, here are the BCC (BPF Compiler Collection) and BPF tracing tools:

BPF Compiler tool (Photo credit

Further Reading

If you would like to learn more about eBPF, continue reading using the following additional materials:

Tutorials

  1. Liz Rice's hands-on, easy-to-follow workshop
  2. Learn eBPF Tracing: Tutorial and Examples, Brendan Gregg's Blog, Jan 2019

Documentation

  1. BPF & XDP Reference Guide, Cilium Documentation, Aug 2020
  2. eBPF use cases
  3. Cilium use cases

