- KubeCon is the largest cloud-native and Kubernetes conference, organized by The Linux Foundation and the Cloud Native Computing Foundation (CNCF). It is held three times a year, in North America, Europe, and China.
- KubeCon + CloudNativeCon + Open Source Summit + AI_dev China 2024 was held in Hong Kong for the first time, from August 21-23, 2024. There were more than 1,000 attendees from over 30 countries and regions. It was much smaller than the North America (NA) and Europe (EU) editions, and most attendees were Chinese developers and contributors. However, the atmosphere and engagement were great, as many participants were engineers, technologists, CNCF contributors, and maintainers.
- Notable participants included Jim Zemlin, Executive Director of the Linux Foundation; Priyanka Sharma, Executive Director of the Cloud Native Computing Foundation (CNCF); Stormy Peters, VP of Community at GitHub; Chris Aniszczyk, CTO of the Cloud Native Computing Foundation; Linus Torvalds, creator of Linux and Git; and Dirk Hohndel, Head of the Open Source Program Office at Verizon.
- Everyone wants AI, and discussions about AI and GPUs dominated the event. A major trend and hot topic was how to manage and use (NVIDIA) GPUs more efficiently and reliably for AI and large language model (LLM) workloads, for both training and inference. A relatively large number of sessions and projects focused on GPU resource management, sharing, scheduling, failure recovery, and fault tolerance.
- China has consistently been the second-largest contributor to CNCF and Kubernetes projects, right behind the US. According to GitHub's statistics, China has the second-highest number of open-source software developers: US (22.7%), China (9.67%).
- NVIDIA’s presence: NVIDIA had three sessions at this KubeCon, including a keynote, a regular session, and a panel. The community expects a lot more from NVIDIA and seeks greater collaboration, including use cases, documentation, APIs, and open-source implementation, especially in the Kubernetes context.
- Highlight: Linus Torvalds, creator of Linux and Git, attended the event, participating in a panel interview and a Linux kernel maintainers meeting.
- Keynote: Supporting Large-Scale and Reliability Testing in Kubernetes GPU Clusters using KWOK, Yuan Chen, NVIDIA & Shiming Zhang, DaoCloud. This talk focused on NVIDIA’s open-source contributions to the CNCF project KWOK, a popular Kubernetes testing toolkit. NVIDIA is working with the community on failure simulation in Kubernetes, which is essential for reliability and fault-tolerance testing of GPU clusters.
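  KWOK simulates nodes and pods without running real kubelets, which makes large-scale scheduling and failure tests cheap. As a minimal sketch (using the Kubernetes Python client; the node name, labels, and capacity values are illustrative, and KWOK is assumed to already be installed in the cluster), a fake 8-GPU node can be registered like this:

  ```python
  # Register a fake GPU node for the KWOK controller to manage.
  from kubernetes import client, config

  config.load_kube_config()
  v1 = client.CoreV1Api()

  # Illustrative capacity for a fake 8-GPU node.
  resources = {"cpu": "64", "memory": "512Gi", "pods": "110", "nvidia.com/gpu": "8"}

  node = client.V1Node(
      metadata=client.V1ObjectMeta(
          name="kwok-gpu-node-0",
          # This annotation marks the node as fake, so the KWOK controller
          # takes over its lifecycle and heartbeats.
          annotations={"kwok.x-k8s.io/node": "fake"},
          labels={"type": "kwok"},
      ),
      spec=client.V1NodeSpec(
          # Taint the node so only pods that explicitly tolerate KWOK land here.
          taints=[client.V1Taint(key="kwok.x-k8s.io/node", value="fake",
                                 effect="NoSchedule")],
      ),
      # Node status is honored on create (kubelets self-register the same way),
      # so the fake node advertises eight schedulable GPUs.
      status=client.V1NodeStatus(capacity=resources, allocatable=resources),
  )
  v1.create_node(node)
  ```

  From there, pods requesting `nvidia.com/gpu` can be scheduled onto the fake node, and failure scenarios (node not-ready, pod eviction) can be simulated without touching real hardware.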
- Kubernetes Community Panel: A Decade of Evolution and Future Trends. Klaus Ma of NVIDIA joined a panel of some of the CNCF community's most influential contributors and maintainers from China, celebrating the 10th anniversary of Kubernetes. China has the second-highest number of open-source developers and ranks second in contributions to CNCF/Kubernetes and GitHub, just behind the US.
- Simplify AI Infrastructure with Kubernetes Operators, Ganeshkumar Ashokavardhanan, Microsoft & Tariq Ibrahim, NVIDIA. This presentation demonstrated how Kubernetes operators simplify the lifecycle management of GPU and AI infrastructure.
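  For context, an operator such as the NVIDIA GPU Operator installs the driver and device plugin, after which each node advertises its GPUs as an extended resource. A quick sanity check of that state, sketched here with the Kubernetes Python client (cluster details assumed):

  ```python
  # List schedulable GPU capacity per node, as advertised by the device plugin
  # that the GPU operator deploys. Assumes kubeconfig points at the cluster.
  from kubernetes import client, config

  config.load_kube_config()
  v1 = client.CoreV1Api()

  for node in v1.list_node().items:
      gpus = (node.status.allocatable or {}).get("nvidia.com/gpu", "0")
      print(f"{node.metadata.name}: {gpus} x nvidia.com/gpu allocatable")
  ```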
- GPU Management: Unlocking Heterogeneous AI Infrastructure K8s Cluster: Leveraging the Power of HAMi, Xiao Zhang, DaoCloud & Mengxuan Li, The 4th Paradigm. Heterogeneous AI Computing Virtualization Middleware (HAMi), formerly known as k8s-vGPU-scheduler, is an "all-in-one" chart designed to manage heterogeneous AI computing devices in a k8s cluster. HAMi is a CNCF sandbox project. It implements dynamic NVIDIA GPU sharing and priority-based scheduling by intercepting CUDA calls, and plans to support other GPUs as well. HAMi shares many motivations and goals with Dynamic Resource Allocation (DRA), and it will be interesting to see how it connects and integrates with DRA.
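  From my reading of the HAMi README, a pod asks for a GPU slice through extended resources such as `nvidia.com/gpumem` (device memory in MiB) and `nvidia.com/gpucores` (percentage of compute), alongside `nvidia.com/gpu`. A sketch with the Kubernetes Python client, treating those resource names and units as assumptions:

  ```python
  # Sketch: request a fraction of a GPU via HAMi's extended resources.
  from kubernetes import client, config

  config.load_kube_config()

  pod = client.V1Pod(
      metadata=client.V1ObjectMeta(name="gpu-share-demo"),
      spec=client.V1PodSpec(
          restart_policy="Never",
          containers=[client.V1Container(
              name="cuda",
              image="nvidia/cuda:12.4.0-base-ubuntu22.04",
              command=["nvidia-smi"],
              resources=client.V1ResourceRequirements(limits={
                  "nvidia.com/gpu": "1",        # one (shared) GPU
                  "nvidia.com/gpumem": "4096",  # assumed: 4 GiB of device memory
                  "nvidia.com/gpucores": "30",  # assumed: ~30% of compute
              }),
          )],
      ),
  )
  client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)
  ```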
- Failure Recovery and Fault-Tolerance: Sit Back and Relax with Fault Awareness and Robust Instant Recovery for Large Scale AI Workloads, Fanshi Zhang & Kebe Liu, DaoCloud. This was one of the most interesting talks I attended. It introduced a series of mechanisms for failure identification, root cause analysis, and mitigation. The tool, kcover, is available at https://github.com/BaizeAI/kcover.
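  I haven't studied kcover's internals, so the following is only a generic illustration of the failure-identification idea, not kcover's actual logic: stream cluster events and flag ones that look like GPU or training failures (the keyword list is made up for the example).

  ```python
  # Generic failure-identification sketch (not kcover's implementation):
  # stream cluster events and flag GPU/training-failure look-alikes.
  from kubernetes import client, config, watch

  config.load_kube_config()
  v1 = client.CoreV1Api()

  SUSPICIOUS = ("Xid", "CUDA", "ECC", "NCCL", "OOMKilled")  # illustrative only

  w = watch.Watch()
  for event in w.stream(v1.list_event_for_all_namespaces):
      ev = event["object"]
      message = ev.message or ""
      if any(k in message for k in SUSPICIOUS):
          obj = ev.involved_object
          print(f"[{ev.reason}] {obj.kind}/{obj.name} in {obj.namespace}: {message}")
  ```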
- Failure Recovery and Fault-Tolerance: Detecting and Overcoming GPU Failures During ML Training, Ganeshkumar Ashokavardhanan, Microsoft & Sarah Belghiti, Wayve. This talk discusses how observability (leveraging tools like NVIDIA DCGM) enables proactive problem detection through GPU health checks and shares best practices for efficient identification, remediation, and prevention of GPU failures.
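  The talk centered on DCGM, but the shape of such a proactive health check can be sketched with plain NVML through the pynvml bindings (the thresholds here are arbitrary examples, not the speakers' recommendations):

  ```python
  # Minimal GPU health probe via NVML; the talk itself used DCGM, this is a
  # simplified stand-in. Thresholds are arbitrary examples.
  import pynvml

  pynvml.nvmlInit()
  try:
      for i in range(pynvml.nvmlDeviceGetCount()):
          h = pynvml.nvmlDeviceGetHandleByIndex(i)
          temp = pynvml.nvmlDeviceGetTemperature(h, pynvml.NVML_TEMPERATURE_GPU)
          try:
              # Volatile uncorrected ECC errors are a common early signal
              # of a failing GPU.
              ecc = pynvml.nvmlDeviceGetTotalEccErrors(
                  h, pynvml.NVML_MEMORY_ERROR_TYPE_UNCORRECTED,
                  pynvml.NVML_VOLATILE_ECC)
          except pynvml.NVMLError:
              ecc = 0  # ECC not supported on this GPU
          status = "SUSPECT" if temp > 85 or ecc > 0 else "OK"
          print(f"GPU{i}: {temp}C, uncorrected ECC={ecc} -> {status}")
  finally:
      pynvml.nvmlShutdown()
  ```

  In a cluster, a probe like this would typically run as a DaemonSet and cordon or drain the node when a GPU turns suspect.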
- Topology-aware Scheduling: Leverage Topology Modeling and Topology-Aware Scheduling to Accelerate LLM, William Wang, Huawei. William is the maintainer of Volcano, a popular Kubernetes batch system/scheduler. The current implementation supports NVLink but not NVSwitch yet, due to the unavailability of hardware. They plan to open-source this functionality as part of Volcano soon.
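  The Volcano work isn't open source yet, but the raw connectivity data a topology-aware scheduler consumes can be queried from NVML. A small pynvml sketch (output labels are illustrative):

  ```python
  # Query GPU interconnect topology from NVML; a topology-aware scheduler
  # uses this kind of data to co-locate workers on well-connected GPUs.
  import pynvml

  pynvml.nvmlInit()
  try:
      n = pynvml.nvmlDeviceGetCount()
      handles = [pynvml.nvmlDeviceGetHandleByIndex(i) for i in range(n)]

      # Count active NVLink links per GPU (unsupported GPUs raise NVMLError).
      for i, h in enumerate(handles):
          active = 0
          for link in range(pynvml.NVML_NVLINK_MAX_LINKS):
              try:
                  if (pynvml.nvmlDeviceGetNvLinkState(h, link)
                          == pynvml.NVML_FEATURE_ENABLED):
                      active += 1
              except pynvml.NVMLError:
                  break
          print(f"GPU{i}: {active} active NVLink links")

      # PCIe distance between GPU pairs; a lower common-ancestor level means
      # the pair is "closer" (e.g. same board or PCIe switch).
      for i in range(n):
          for j in range(i + 1, n):
              level = pynvml.nvmlDeviceGetTopologyCommonAncestor(
                  handles[i], handles[j])
              print(f"GPU{i} <-> GPU{j}: common ancestor level {level}")
  finally:
      pynvml.nvmlShutdown()
  ```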
- Efficient GPU Resource Management and Scheduling: Is Your GPU Really Working Efficiently in the Data Center? N Ways to Improve GPU Usage, Xiao Zhang, DaoCloud & Wu YingJun, China Mobile. The talk covered, at a high level, a range of techniques including checkpointing, GPU sharing, topology-aware scheduling, and elastic quotas.
- SLURM + Kubernetes: Breaking Boundaries: TACC as a Unified Cloud-Native Infra for AI + HPC, Peter Pan, DaoCloud & Kaiqiang Xu, Hong Kong University of Science and Technology. The most interesting part was co-hosting Kubernetes and SLURM on the same cluster (demo code).
- Model Openness Tool: https://isitopen.ai, a tool developed by the LF AI & Data Foundation for evaluating and classifying the completeness and openness of machine learning models. The framework assesses which components of the model development lifecycle are publicly released and under what licenses. It is constantly evolving.
- Linus Torvalds’ appearance at KubeCon created quite a buzz. It’s not surprising that he is cautious about AI. He emphasized that he is a kernel developer who doesn’t claim to understand AI or cloud computing: “I know Linux and the kernel. I don’t know Cloud (or AI).”
- Still, Linus is open to and optimistic about leveraging AI to improve Linux kernel development tools, for example by identifying good programming patterns, detecting issues and bugs, and assisting with code review and documentation generation.