Our Experience at Meta’s Networking@Scale Event
Co-authored by Tharuka Kodituwakku
Who We Are + Why We Attended
Omar's Opening Remarks
AI as an End-to-End Problem: AI requires seamless, real-time performance across the network. Whether users are interacting with Meta AI on their phones or using AI-powered chatbots, the network must deliver instant responses to ensure a smooth experience from data centers to devices.
Great Operations: Effective management of large AI clusters is crucial. Even minor network failures can disrupt AI training at Meta’s scale, so quickly detecting and resolving issues is essential to avoid delays and keep systems running efficiently.
Models and Networking Co-design: AI model developers and network engineers must collaborate to address the unique challenges of running models on thousands of GPUs. This coordination ensures both sides align to optimize performance and scalability for Meta’s AI-driven services.
Evolving Meta’s Edge Architecture
In their presentation, Shivkumar Chandrashekhar and Lee Hetherington discussed Meta’s transformation of its Edge CDN and Edge Cloud infrastructure to meet the rising demand for AI-generated, non-cacheable content. As AI and metaverse applications grow, Meta is evolving its architecture to support real-time, low-latency interactive experiences.
They traced the evolution of Meta’s content delivery from text-based media to immersive experiences, noting that AI-generated content adds complexity because it cannot be cached. Meta’s global Edge infrastructure includes Edge Metros, Meta Network Appliance (mNA) clusters, and an Edge Backbone that connects to its main data centers.
The speakers highlighted two challenges: the need for real-time processing of non-cacheable AI content and the ultra-low-latency requirements of metaverse and gaming applications. To address these, Meta is transitioning to a decentralized compute platform that brings computing power closer to users. This includes new hardware, like GPUs, to support diverse tasks.
With this re-architecture, security and management challenges arise. Meta is strengthening security by isolating Edge hosts and enhancing authentication protocols, while also improving fleet management with tools like a demand forecasting system and edge auto-scaler to dynamically allocate resources.
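Neither speaker walked through implementation details, but a minimal sketch of forecast-driven auto-scaling helped us internalize the idea. Everything below is our own illustration: the site name, per-host capacity, and headroom value are assumptions, not Meta's numbers.

```python
import math
from dataclasses import dataclass

@dataclass
class EdgeSite:
    name: str
    active_hosts: int
    max_hosts: int

def plan_capacity(site: EdgeSite, forecast_rps: float,
                  rps_per_host: float = 5000.0, headroom: float = 0.2) -> int:
    """Return how many hosts the site should run for the forecast peak load.

    forecast_rps: predicted peak requests/sec from a demand-forecasting system.
    headroom: spare capacity kept for unexpected spikes (illustrative value).
    """
    needed = forecast_rps * (1 + headroom) / rps_per_host
    return min(site.max_hosts, max(1, math.ceil(needed)))

site = EdgeSite("edge-metro-sea", active_hosts=40, max_hosts=120)
print(plan_capacity(site, forecast_rps=300_000))  # -> 72 hosts for this forecast
```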
Future Challenges for HPC Networking
In his presentation, Ron Brightwell from Sandia National Laboratories explored the evolution of high-performance computing (HPC) networks over the past 30 years, focusing on the convergence of HPC with cloud computing, hyperscalers, and AI/ML workloads. He discussed both advancements and future challenges in HPC networking hardware and software.
Ron traced the development of HPC node architectures, from the simple Intel Paragon of 30 years ago to the complex HPE Frontier systems of today. Early nodes had tightly integrated components, while modern nodes feature CPUs, GPUs, and various memory types but still face coherence issues between network interfaces and processors. The rise of Ethernet-compatible networks like HPE Slingshot is also notable.
With increasing node complexity, integrating GPUs and standardizing network communication across vendors like NVIDIA and AMD is challenging. The role of SmartNICs, effective in cloud environments but less suited for HPC’s message-based communication, was also discussed. Networking hardware trends include faster link speeds, with 400 Gbps networks introducing new processing challenges, and the potential of Ethernet and chiplet technology to customize and improve HPC networks.
On the software side, Ron highlighted a “semantic mismatch” between network capabilities and management protocols. While APIs like OpenFabrics Interfaces and UCX address these issues, they add complexity. Standardization is needed to simplify network programming and enhance portability.
Looking ahead, Ron identified several key challenges: reducing network latency through direct integration of interfaces with memory buses, managing increased parallelism in network communications, ensuring resilience and controlling congestion, and incorporating event-driven communication models to support dynamic workflows in HPC environments.
AI Impact on the Backbone
Presentation Overview:
In their talk, Jyotsna Sundaresan and Abishek Gopalan from Meta discussed how the surge in AI demand from Meta’s apps (Facebook, Instagram, WhatsApp) has impacted their global backbone network. They highlighted the rapid increase in traffic driven by AI workloads and Meta’s strategies to manage this growth.
Initially, Meta expected AI traffic to remain within data centers, but the need for data replication, freshness, and cross-region inference led to a 30–50% rise in backbone traffic. They examined the AI traffic lifecycle, revealing that more data moves across regions than anticipated, particularly due to cross-region inference.
To control traffic growth, Meta has implemented several strategies:
Efficient Workload Placement: Optimizing the placement of AI workloads and data storage to reduce cross-region traffic.
Scheduled Bulk Transfers: Timing large data transfers during off-peak hours to minimize peak traffic impact (see the sketch after this list).
Quality of Service (QoS) Initiatives: Prioritizing AI workloads to ensure critical services remain operational.
Building Larger Buffers: Increasing buffer capacity to handle unexpected traffic spikes and maintain consistent service.
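To make the bulk-transfer scheduling idea concrete for ourselves, here is a tiny sketch of deferring a non-urgent transfer to an off-peak window. The window boundaries are invented for illustration; Meta did not share implementation details.

```python
from datetime import datetime, time, timedelta

# Illustrative off-peak window for one backbone region (our assumption).
OFF_PEAK_START = time(1, 0)   # 01:00 local
OFF_PEAK_END = time(5, 0)     # 05:00 local

def next_off_peak_start(now: datetime) -> datetime:
    """Return the next moment a bulk (non-urgent) transfer may start."""
    today_start = now.replace(hour=OFF_PEAK_START.hour, minute=0, second=0, microsecond=0)
    today_end = now.replace(hour=OFF_PEAK_END.hour, minute=0, second=0, microsecond=0)
    if today_start <= now < today_end:
        return now                               # already inside the window
    if now < today_start:
        return today_start                       # later today
    return today_start + timedelta(days=1)       # tomorrow's window

print(next_off_peak_start(datetime(2024, 10, 1, 14, 30)))  # -> 2024-10-02 01:00:00
```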
Alibaba High-Performance Networking
In his presentation, Jiaqi Gao from Alibaba introduced the HPN 7.0 network architecture, tailored for the specific demands of large language model (LLM) training. He highlighted the unique challenges posed by LLMs compared to traditional cloud computing workloads and how HPN 7.0 addresses these issues to enhance AI infrastructure.
LLM training creates a few large, bursty data flows of up to 400 Gbps per host, in contrast to the numerous smaller flows typical of traditional cloud workloads. The need for global synchronization across the cluster and the low-entropy traffic from fewer hosts lead to inefficiencies and high network pressure.
HPN 7.0’s 2-tier, dual-plane design replaces the traditional 3-tier Clos architecture, connecting up to 15,000 GPUs within a single Pod. This design reduces complexity and efficiently manages large data flows while avoiding congestion. The dual-ToR (Top of Rack) setup ensures redundancy and prevents single points of failure.
Key features include a multi-rail topology that connects each accelerator to a dedicated switch, and a front-backend separation that optimizes CPU and accelerator traffic. The network supports up to 130,000 accelerators with 51.2 Tbps switches and 200/400 Gbps optical modules, allowing for extensive scalability.
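To put the switch radix in perspective, we did a quick back-of-the-envelope calculation. This is generic Clos arithmetic under our own simplifying assumptions, not a reconstruction of Alibaba's exact wiring, which the dual-plane, dual-ToR, multi-rail layout deliberately departs from.

```python
# Back-of-the-envelope port math for the switch radix mentioned in the talk.
SWITCH_CAPACITY_BPS = 51.2e12   # 51.2 Tbps switching ASIC
PORT_SPEED_BPS = 400e9          # 400 Gbps optics

ports_per_switch = int(SWITCH_CAPACITY_BPS / PORT_SPEED_BPS)
print(ports_per_switch)   # 128 x 400G ports per switch

# A plain, non-oversubscribed 2-tier leaf-spine with k-port switches tops out at
# k/2 downlinks per leaf times k leaves = k^2 / 2 endpoints per plane.
k = ports_per_switch
print(k * k // 2)         # 8,192 endpoints per plane at 400G

# HPN 7.0 reaches ~15,000 GPUs per pod by going beyond this generic bound with
# its dual-plane, dual-ToR, multi-rail design rather than a single leaf-spine plane.
```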
Alibaba’s custom hardware includes 51.2 Tbps switches and programmable NICs, while software innovations like the Collective Communication Library (Echo) and Solo Multipath Transport Protocol enhance communication and reduce congestion.
Operationally, HPN 7.0 boasts high reliability with its dual-plane design, fast failure detection, and efficient rerouting. It has achieved 96% linear scaling efficiency, 98.22% cluster availability, and improvements in training throughput and collective communication performance.
Solutions for High Network Reliability in FE/BE for Scalable AI Training
In their talk, Jose Leitao and Robert Colantuoni from Meta discussed how the company has improved the reliability and availability of its Frontend (FE) and Backend (BE) networks to support large-scale AI training. They outlined Meta’s advanced monitoring and automated repair strategies designed to maintain optimal network performance for AI workloads.
The Frontend (FE) network manages data ingestion and communication libraries, crucial for smooth AI training operations, while the Backend (BE) network facilitates low-latency, high-bandwidth GPU-to-GPU communication, enhancing real-time data sharing. Separating these networks allows for better management and optimization of GPU-intensive tasks.
Maintaining network reliability involves addressing challenges like packet loss and hardware failures, which can significantly affect AI job performance and completion. Reductions in backend network capacity, due to maintenance or failures, can also disrupt training operations.
To enhance monitoring and triage, Meta uses a dual approach with passive monitoring (collecting data from SNMP and vendor-specific counters) and active monitoring (measuring packet loss and latency). On-box agents provide real-time data collection, improving issue detection accuracy and reducing alert noise. Context-aware monitoring focuses on different network scopes for better event correlation and diagnostics.
For repairs, Meta has automated the triage and repair process, capturing detailed network snapshots for faster troubleshooting. Strict Service Level Objectives (SLOs) for repair times are in place to minimize downtime and performance setbacks. Safe capacity buffers ensure network stability during repairs, and SLOs drive the automation of repair processes, ensuring timely and efficient resolution.
Scheduler + Sharding Considerations for Network Efficiency
In their session, Weiwei Chu and Arnab Choudhury from Meta discussed how they have customized the Meta Advanced Scheduler Tool (MAST) to enhance network communication for training large language models (LLMs) like LLaMA 3. They highlighted the importance of sharding and network efficiency at scale, emphasizing how Meta optimizes scheduling and parallelism to improve GPU utilization and reduce training time.
The push towards Artificial General Intelligence (AGI) and the complexities of models like LLaMA 3 require substantial compute and networking resources. Meta addressed this by building a 24,000-GPU network for large-scale training. As models increase in complexity, challenges such as network latency and GPU utilization become critical. Meta’s optimization of the scheduler and network alignment significantly reduced LLaMA 3’s training time.
Parallelism techniques are crucial for efficiency. Fully Sharded Data Parallelism (FSDP) splits model weight matrices across GPUs, reducing memory demands and using pre-fetching to hide communication latency. Tensor Parallelism (TP), which requires gathering outputs across GPUs, introduces high communication overhead and needs high-bandwidth connections. Pipeline Parallelism (PP) and other methods are used based on model size and network needs, with correct parallelism configurations essential for minimizing network overhead.
For LLaMA 3, Meta strategically layered parallelism techniques. Tensor parallelism was placed in the innermost layer for high-bandwidth communication within the same rack, while FSDP was positioned in the outermost layer to manage cross-AI-zone communication. Improper layering would result in slower training and increased latency.
Effective scheduling is also vital. Rank assignment, which determines the communication proximity of GPUs, must align with network topology to reduce latency. A well-designed scheduler like MAST optimizes GPU placement by understanding network hierarchy, ensuring that communication remains within low-latency zones. Poor rank assignment can cause increased latency and slower training, while optimal placement enhances performance.
Meta also addresses scheduling challenges with dedicated GPU buffers for fault tolerance, ensuring jobs continue smoothly in case of GPU failures. Additionally, they have focused on minimizing scheduling overhead and recovery time from host failures to maintain efficient training operations.
Designing Scalable Networks for Large AI Clusters: Challenges + Key Insights
Jithin Jose from Microsoft discussed designing scalable networks for large AI clusters to support demanding applications like autonomous driving and medical imaging. As AI models grow in complexity, efficient, high-performance networks are crucial. Microsoft has made strides in scaling AI clusters, reaching milestones like managing 80,000 MPI jobs and securing top rankings in supercomputing.
Key challenges include designing network topologies that manage GPU communication efficiently and avoiding performance issues from high oversubscription. Microsoft explored various designs, such as multiple planes and hybrid models, to improve efficiency and scalability. They also focus on network flexibility to adapt to future workloads.
Validating large clusters without disrupting ongoing jobs is critical. Microsoft developed benchmarks for smooth integration of new segments. Efficient routing is essential for performance, and Microsoft optimized path selection and bandwidth utilization to address peak performance issues.
Communication libraries are fine-tuned to improve performance and adapt to network changes, especially for long-distance communication. To ensure network reliability, Microsoft addressed link failures with smart switches and managed network asymmetry by providing feedback to source nodes to prevent congestion.
The SuperBench tool helps validate clusters by identifying issues, and Microsoft uses different topologies for public and dedicated AI clusters, optimizing each for its specific needs. Additionally, a smart collective communication library (CCL) approach leverages real-time feedback to adapt to network conditions, enhancing overall performance.
Faster Than Fast: Networking/Communication Optimizations for LLaMA 3
In their session, Pavan Balaji and Adi Gangidi from Meta discussed optimizing network and communication infrastructure to support large-scale generative AI models like LLaMA 3. They highlighted the need for infrastructure to handle the massive compute and network demands of such models, detailing optimizations for network latency, communication libraries, and routing to enhance performance for LLaMA 3’s training and serving.
Meta developed new clusters with 24,000 GPUs to support LLaMA 3, addressing challenges from previous models. Key issues included network latency sensitivity, load balancing inefficiencies, and the lack of network topology awareness in scheduling, which led to bottlenecks. To improve performance, Meta implemented flow multiplexing, optimized communication libraries by prioritizing critical control messages, and adjusted channel buffers to balance latency and bandwidth.
For LLaMA 3, Meta achieved a 10% boost in all-gather collective operations and focused on optimizing time to first token and time to incremental token for serving. They used flat algorithms for all-reduce communication to reduce latency, aiming to balance bandwidth and latency for optimal user experience.
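To see why a flat (fewer-step) algorithm helps latency-bound serving traffic, we wrote down the standard alpha-beta cost model for all-reduce. The per-step latency and link bandwidth values are illustrative assumptions, not Meta's measurements.

```python
def allreduce_time_us(n_bytes: int, p: int, bw_gbps: float = 400.0,
                      alpha_us: float = 5.0, algo: str = "ring") -> float:
    """Alpha-beta cost model for an all-reduce over p ranks.

    alpha_us: per-step latency; bw_gbps: per-rank link bandwidth.
    A ring pays 2*(p-1) latency steps; a 'flat' (direct) algorithm pays ~2 steps,
    at the cost of every rank talking to every other rank at once.
    """
    bytes_per_us = bw_gbps * 1e9 / 8 / 1e6                 # bytes transferable per microsecond
    bw_term = 2 * (p - 1) / p * n_bytes / bytes_per_us
    steps = 2 * (p - 1) if algo == "ring" else 2
    return steps * alpha_us + bw_term

# Small messages (typical when serving tokens, latency-bound): flat wins decisively.
print(allreduce_time_us(64 * 1024, p=8, algo="ring"))  # ~72 us
print(allreduce_time_us(64 * 1024, p=8, algo="flat"))  # ~12 us
```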
Conversations with Industry Professionals: Learning Beyond the Talks
Engaging with Experts:
Throughout the event, both of us seized the opportunity to connect with various professionals across the networking and AI infrastructure space. As college students eager to learn, we wanted to gain deeper insights into the real-world challenges and opportunities in these fields. Approaching industry experts allowed us to understand not just their career paths but also the day-to-day responsibilities and skills required to succeed.
We had insightful conversations with individuals such as Arihant Jain from Arista Networks and Hany Morsy from Meta, who were working together on a joint project. They explained how companies collaborate, each bringing their expertise to solve complex problems. They stressed that while certifications have their place, hands-on projects are a far better demonstration of one’s capabilities, especially when breaking into a career in networking and AI infrastructure.
Career Advice and Insights:
Both of us were advised to focus less on piling up certifications and more on practical experiences, like building and deploying real-world projects. This was a recurring piece of advice, whether we were speaking with Arihant and Hany or Kalimani Venkatesan and Manish Aggarwal from Capgemini. They emphasized that the real value lies in understanding the application of networking concepts, particularly in modern environments like Kubernetes, which is widely used today for network orchestration. Rather than sticking to older networking fundamentals, they encouraged us to dive into deploying networks using Kubernetes and other cloud-based solutions.
When we spoke to Adi Gangidi from Meta and Vishal Gadiya from Infinera, they reiterated the importance of keeping up with general networking concepts while also integrating AI. They suggested that the ability to combine fundamental networking skills with emerging AI technologies would open doors in both fields, offering a unique edge in the job market.
Further Career Guidance:
Talking to Masiuddin Mohammed from Cisco Systems offered a different perspective. He shared his journey from starting in networking to eventually moving into sales, highlighting the importance of exploring different roles and career paths. His advice to "try everything" resonated with us, as he suggested that finding the right fit often requires experimentation with various technical and non-technical roles.
We also gained valuable insights from Omar Baldonado from Meta, who stressed the importance of mastering foundational computer science concepts like operating systems, data structures, and software engineering practices. He mentioned that having a solid technical base is crucial, but with the current market shift, it’s important to have projects that combine AI and data science to stay relevant.
Conclusion
Attending the event provided invaluable insights into the complexities of networking and AI infrastructure, particularly at scale. The presentations from industry leaders like Meta and Microsoft highlighted cutting-edge techniques used to optimize massive AI models such as LLaMA 3. Key takeaways included how companies manage network latency, GPU utilization, and sharding techniques to enhance performance in large-scale AI training. Innovations like layered parallelism, network-aware scheduling, and flow multiplexing illustrated the delicate balance required to push AI limits while maintaining a robust infrastructure. These talks demonstrated how both hardware and software optimizations directly drive AI advancements and accelerate training times.
Conversations with industry professionals enriched our understanding by bridging the gap between theoretical learning and practical application. Speaking with experts like Arihant, Hany, Omar, Kalimani, Manish, Adi, Vishal, and Masiuddin provided valuable career advice. Their emphasis on gaining hands-on experience over certifications highlighted the growing demand for real-world problem-solving skills in networking and AI infrastructure. These interactions gave us a deeper understanding of how collaboration and practical knowledge are key to solving complex technical challenges in the industry, offering clarity on the skills and expertise needed to thrive.
The event emphasized the critical role that practical experience plays in mastering networking and AI infrastructure. The discussions on optimizing parallelism, network design, and fault tolerance underscored the importance of applying theoretical knowledge to real-world scenarios. Moving forward, our focus will be on hands-on projects that explore these optimizations, particularly in areas like scheduling, parallelism techniques, and infrastructure design. The insights gained from both the talks and conversations with industry professionals have inspired us to delve deeper into emerging technologies such as Kubernetes and AI-driven networking solutions, helping us better align our academic and career pursuits with the evolving demands of the industry.