Our Experience at Meta’s Networking@Scale Event

Co-authored by Tharuka Kodituwakku

Who We Are + Why We Attended

  • Hey, I’m Satvik, a student at UC San Diego studying Math and Computer Science. While most of my experience has been with machine learning and full-stack development, I’ve also had some exposure to networking through my coursework, though I haven’t had the chance to dive deep into it before. That’s why attending events like this is so exciting for me: it’s a perfect opportunity to explore new areas and see firsthand how networking plays a crucial role in AI infrastructure. Since this is my first time covering an event, I thought it would be a great idea to start a blog and share what I’ve learned. Connecting with professionals here has been incredibly valuable, and I’m excited to keep growing and exploring more areas of tech!
  • Hi there, I’m Tharuka! I’m a student at UC Santa Cruz studying Computer Engineering. My primary interests are in robotics and autonomous systems, and I also have a hobbyist interest in networking, with several smart-home and self-hosted data management projects built for my family. I attended this event because I love meeting people who are experts in the field and passionate about their work in the computing industry, and I wanted to learn more about the innovation happening across every facet of the industry. Advances in networking technology are of paramount importance in keeping up with the growing demand for AI/ML and its fast-improving capabilities, and the best way to learn about that innovation is to connect with experts who have an intimate knowledge of the subject and a thirst to explore more. I can’t wait to dig into other areas of tech that are innovating the way the networking industry is.

Omar Baldonado's Opening Remarks

AI as an End-to-End Problem: AI requires seamless, real-time performance across the network. Whether users are interacting with Meta AI on their phones or using AI-powered chatbots, the network must deliver instant responses to ensure a smooth experience from data centers to devices.

Great Operations: Effective management of large AI clusters is crucial. Even minor network failures can disrupt AI training at Meta’s scale, so quickly detecting and resolving issues is essential to avoid delays and keep systems running efficiently.

Models and Networking Co-design: AI model developers and network engineers must collaborate to address the unique challenges of running models on thousands of GPUs. This coordination ensures both sides align to optimize performance and scalability for Meta’s AI-driven services.

Evolving Meta’s Edge Architecture

  • Speakers: Shivkumar Chandrashekhar, Lee Hetherington
  • Written by: Tharuka Kodituwakku
  • Presentation Overview:

In their presentation, Shivkumar Chandrashekhar and Lee Hetherington discussed Meta’s transformation of its Edge CDN and Edge Cloud infrastructure to meet the rising demand for AI-generated, non-cacheable content. As AI and metaverse applications grow, Meta is evolving its architecture to support real-time, low-latency interactive experiences.

They traced the evolution of Meta’s content delivery from text-based media to immersive experiences, noting that AI-generated content adds complexity because it cannot be cached. Meta’s global Edge infrastructure includes Edge Metros, Meta Network Appliance (mNA) clusters, and an Edge Backbone that connects to its main data centers.

The speakers highlighted two challenges: the need for real-time processing of non-cacheable AI content and the ultra-low-latency requirements of metaverse and gaming applications. To address these, Meta is transitioning to a decentralized compute platform that brings computing power closer to users. This includes new hardware, like GPUs, to support diverse tasks.

With this re-architecture, security and management challenges arise. Meta is strengthening security by isolating Edge hosts and enhancing authentication protocols, while also improving fleet management with tools like a demand forecasting system and edge auto-scaler to dynamically allocate resources.

  • Takeaways: This talk provided a clear understanding of how Meta is rethinking its edge infrastructure to accommodate the unique demands of AI and the metaverse. The key takeaway was the importance of bringing compute resources closer to the user to reduce latency and improve real-time interactivity. Additionally, the speakers highlighted the complex security and resource management issues that arise in a globally distributed network.
  • Personal Reflections: The presentation underscored the fascinating intersection of networking, AI, and user experience. As someone interested in robotics and AI, I found the discussion particularly inspiring, as it opened my eyes to the potential of edge computing to enhance real-time applications. The challenges faced by Meta in this space sparked my curiosity to learn more about how edge architectures can support the future of AI and immersive experiences.

Future Challenges for HPC Networking

  • Speaker: Ron Brightwell
  • Presentation Overview:

In his presentation, Ron Brightwell from Sandia National Laboratories explored the evolution of high-performance computing (HPC) networks over the past 30 years, focusing on the convergence of HPC with cloud computing, hyperscalers, and AI/ML workloads. He discussed both advancements and future challenges in HPC networking hardware and software.

Ron traced the development of HPC node architectures, from the simple Intel Paragon of 30 years ago to the complex HPE Frontier systems of today. Early nodes had tightly integrated components, while modern nodes feature CPUs, GPUs, and various memory types but still face coherence issues between network interfaces and processors. The rise of Ethernet-compatible networks like HPE Slingshot is also notable.

As node complexity increases, integrating GPUs and standardizing network communication across vendors like NVIDIA and AMD become challenging. Ron also discussed the role of SmartNICs, which are effective in cloud environments but less suited to HPC’s message-based communication. Networking hardware trends include faster link speeds, with 400 Gbps networks introducing new processing challenges, as well as the potential of Ethernet and chiplet technology to customize and improve HPC networks.

On the software side, Ron highlighted a “semantic mismatch” between network capabilities and management protocols. While APIs like OpenFabrics Interfaces and UCX address these issues, they add complexity. Standardization is needed to simplify network programming and enhance portability.

Looking ahead, Ron identified several key challenges: reducing network latency through direct integration of interfaces with memory buses, managing increased parallelism in network communications, ensuring resilience and controlling congestion, and incorporating event-driven communication models to support dynamic workflows in HPC environments.

  • Takeaways: During the talk, I gained a deeper understanding of how high-performance computing (HPC) networking challenges are increasingly relevant to large-scale AI and machine learning workloads. The insights into managing vast data exchanges and ensuring low latency highlighted the importance of scalable and efficient networking solutions. Additionally, I recognized that both hardware and software advancements are critical to meeting future demands, particularly in areas like network design, integration with accelerators, and the development of more efficient APIs.
  • Personal Reflections: Before this talk, I hadn’t explored much about networking, especially in the context of high-performance computing. However, hearing about the real challenges in managing AI infrastructure made me realize how critical networking is to the performance and scalability of these systems. The discussion opened my eyes to the complexities involved, and it left me interested in learning more about how these technical innovations tie into large-scale AI projects. It was a solid introduction to an area I hadn’t considered before but now find intriguing.

AI Impact on the Backbone

  • Speakers: Jyotsna Sundaresan, Abishek Gopalan
  • Presentation Overview:

In their talk, Jyotsna Sundaresan and Abishek Gopalan from Meta discussed how the surge in AI demand from Meta’s apps (Facebook, Instagram, WhatsApp) has impacted their global backbone network. They highlighted the rapid increase in traffic driven by AI workloads and Meta’s strategies to manage this growth.

Initially, Meta expected AI traffic to remain within data centers, but the need for data replication, freshness, and cross-region inference led to a 30–50% rise in backbone traffic. They examined the AI traffic lifecycle, revealing that more data moves across regions than anticipated, particularly due to cross-region inference.

To control traffic growth, Meta has implemented several strategies:

Efficient Workload Placement: Optimizing the placement of AI workloads and data storage to reduce cross-region traffic.

Scheduled Bulk Transfers: Timing large data transfers during off-peak hours to minimize peak traffic impact (a minimal sketch of this idea appears after this list).

Quality of Service (QoS) Initiatives: Prioritizing AI workloads to ensure critical services remain operational.

Building Larger Buffers: Increasing buffer capacity to handle unexpected traffic spikes and maintain consistent service.
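
To make the scheduled-bulk-transfer idea concrete, here is a minimal sketch of how an off-peak window might gate bulk replication while letting latency-sensitive traffic through immediately. The window bounds, priority labels, and function names are our own illustrative assumptions, not Meta's implementation.

```python
from datetime import datetime, time

# Hypothetical off-peak window (assumption for illustration): backbone
# demand is lowest between 01:00 and 05:00 in the source region.
OFF_PEAK_START = time(1, 0)
OFF_PEAK_END = time(5, 0)

def in_off_peak_window(now: datetime) -> bool:
    """Return True if `now` falls inside the off-peak transfer window."""
    return OFF_PEAK_START <= now.time() < OFF_PEAK_END

def release_transfers(pending: list[dict], now: datetime) -> list[dict]:
    """Release interactive transfers immediately; hold bulk replication
    until the off-peak window to flatten the backbone's peak."""
    return [t for t in pending
            if t["priority"] == "interactive" or in_off_peak_window(now)]

# Usage: a 2 TB checkpoint replication queued at noon is held; at 02:00
# the same job is released onto the backbone.
jobs = [{"name": "model-ckpt-replication", "bytes": 2 * 10**12, "priority": "bulk"}]
print(release_transfers(jobs, datetime(2024, 9, 1, 12, 0)))  # []
print(release_transfers(jobs, datetime(2024, 9, 1, 2, 0)))   # [the job]
```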

  • Takeaways: Meta’s backbone has experienced significant growth due to AI-driven traffic, requiring new strategies to handle the demand for data replication, freshness, and inference across regions. Through workload optimization, scheduled data transfers, and quality of service improvements, Meta has managed to control the rapid growth of AI traffic while preparing for future increases.
  • Personal Reflections: This talk provided a clear understanding of how AI impacts networking infrastructure on a large scale. The focus on optimizing workload placement and data movement aligns with real-world challenges that arise in fields like robotics, where data must be efficiently processed across distributed systems.

Alibaba High-Performance Networking

  • Speaker: Jiaqi Gao
  • Written by: Satvik Matta
  • Presentation Overview:

In his presentation, Jiaqi Gao from Alibaba introduced the HPN 7.0 network architecture, tailored for the specific demands of large language model (LLM) training. He highlighted the unique challenges posed by LLMs compared to traditional cloud computing workloads and how HPN 7.0 addresses these issues to enhance AI infrastructure.

LLM training generates a few large, bursty data flows of up to 400 Gbps per host, in contrast to the numerous smaller flows typical of traditional cloud workloads. The need for global synchronization across the cluster, combined with low-entropy traffic from fewer hosts, leads to inefficiencies and high network pressure.
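
The low-entropy problem can be made concrete with a toy ECMP model: hash-based load balancing spreads thousands of small flows evenly but routinely collides a handful of elephant flows onto the same link. This sketch is our own simplification of the effect, not Alibaba's hashing scheme.

```python
import random
from collections import Counter

def avg_max_flows_per_link(num_flows: int, num_links: int, trials: int = 10_000) -> float:
    """Average number of flows on the busiest link, modeling ECMP as a
    uniform random hash of each flow onto one of `num_links` equal-cost links."""
    total = 0
    for _ in range(trials):
        loads = Counter(random.randrange(num_links) for _ in range(num_flows))
        total += max(loads.values())
    return total / trials

# Traditional cloud: thousands of small flows spread almost evenly over 8 links.
print(avg_max_flows_per_link(4096, 8))  # ~545 vs. a perfect 512: within ~8%
# LLM training: 8 elephant flows over 8 links routinely collide, so one link
# carries 2-3 full 400 Gbps flows while another sits idle (low entropy).
print(avg_max_flows_per_link(8, 8))     # typically ~2.5
```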

HPN 7.0’s 2-tier, dual-plane design replaces the traditional 3-tier Clos architecture, connecting up to 15,000 GPUs within a single Pod. This design reduces complexity and efficiently manages large data flows while avoiding congestion. The dual-ToR (Top of Rack) setup ensures redundancy and prevents single points of failure.

Key features include a multi-rail topology that connects each accelerator to a dedicated switch, and a front-backend separation that optimizes CPU and accelerator traffic. The network supports up to 130,000 accelerators with 51.2 Tbps switches and 200/400 Gbps optical modules, allowing for extensive scalability.
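
As a rough back-of-envelope check (our own arithmetic from the quoted figures, not Alibaba's published design math), a 51.2 Tbps ASIC yields 128 ports at 400 Gbps, and two tiers of such switches land in the same ballpark as the pod size quoted above.

```python
# Back-of-envelope radix math from the quoted figures (our own illustration).
asic_gbps = 51_200            # 51.2 Tbps switch ASIC
port_gbps = 400               # 400 Gbps optical modules

radix = asic_gbps // port_gbps
print(radix)                  # 128 ports per switch

# Generic 2-tier (leaf/spine) scale, splitting each leaf's radix evenly
# between host-facing and spine-facing ports:
hosts_per_plane = (radix // 2) * radix
print(hosts_per_plane)        # 64 * 128 = 8192 host ports per plane

# With the dual-plane design roughly doubling host-facing capacity, two
# tiers reach the order of magnitude of the ~15,000 GPUs per pod quoted
# in the talk (exact numbers depend on design choices not shown here).
```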

Alibaba’s custom hardware includes 51.2 Tbps switches and programmable NICs, while software innovations like the Collective Communication Library (Echo) and Solo Multipath Transport Protocol enhance communication and reduce congestion.

Operationally, HPN 7.0 boasts high reliability with its dual-plane design, fast failure detection, and efficient rerouting. It has achieved 96% linear scaling efficiency, 98.22% cluster availability, and improvements in training throughput and collective communication performance.
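
For reference, "linear scaling efficiency" conventionally means achieved throughput relative to perfect linear speedup over a single GPU; the numbers below are made up purely to show the arithmetic behind a figure like 96%.

```python
def linear_scaling_efficiency(throughput_n: float, throughput_1: float, n: int) -> float:
    """Achieved throughput on n GPUs divided by perfect linear scaling."""
    return throughput_n / (n * throughput_1)

# Made-up example: 1 GPU trains 100 samples/s; 1024 GPUs train 98,300 samples/s.
print(linear_scaling_efficiency(98_300, 100, 1024))  # ~0.96, i.e. 96% efficiency
```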

  • Takeaways: Alibaba’s HPN 7.0 network architecture addresses the unique demands of large language model training, particularly the need to handle periodic, bursty traffic and maintain high availability during long-running jobs. By introducing a 2-tier, dual-plane design and custom hardware solutions, HPN 7.0 delivers consistent high performance, scalability, and fault tolerance for AI training workloads at exascale.
  • Personal Reflections: This talk highlighted how LLM training introduces unique networking challenges compared to traditional workloads, requiring specialized infrastructure to manage large data flows and ensure synchronization across clusters. It was fascinating to see how Alibaba’s HPN 7.0 architecture tackles these problems, especially the dual-plane design and custom hardware solutions. This level of detail shows the complexity of building data center networks capable of supporting AI at scale, which has broad implications for the future of AI infrastructure development.

Solutions for High Network Reliability in FE/BE for Scalable AI Training

  • Speakers: Jose Leitao, Robert Colantuoni
  • Written by: Tharuka Kodituwakku
  • Presentation Overview:

In their talk, Jose Leitao and Robert Colantuoni from Meta discussed how the company has improved the reliability and availability of its Frontend (FE) and Backend (BE) networks to support large-scale AI training. They outlined Meta’s advanced monitoring and automated repair strategies designed to maintain optimal network performance for AI workloads.

The Frontend (FE) network manages data ingestion and communication libraries, crucial for smooth AI training operations, while the Backend (BE) network facilitates low-latency, high-bandwidth GPU-to-GPU communication, enhancing real-time data sharing. Separating these networks allows for better management and optimization of GPU-intensive tasks.

Maintaining network reliability involves addressing challenges like packet loss and hardware failures, which can significantly affect AI job performance and completion. Reductions in backend network capacity, due to maintenance or failures, can also disrupt training operations.

To enhance monitoring and triage, Meta uses a dual approach with passive monitoring (collecting data from SNMP and vendor-specific counters) and active monitoring (measuring packet loss and latency). On-box agents provide real-time data collection, improving issue detection accuracy and reducing alert noise. Context-aware monitoring focuses on different network scopes for better event correlation and diagnostics.
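
A toy version of the passive-monitoring idea is sketched below: poll interface counters, diff them between samples, and alert only when the loss rate crosses a threshold, which is what cuts alert noise compared to alerting on raw counters. The counter names, threshold, and stub data source are our own assumptions, not Meta's (or any vendor's) actual schema.

```python
import random

# Stub counter source (illustrative): in production these values would come
# from SNMP or vendor-specific on-box telemetry, not random increments.
_state = {"tx_packets": 0, "tx_dropped": 0}

def read_counters(interface: str) -> dict:
    _state["tx_packets"] += random.randint(900_000, 1_100_000)
    _state["tx_dropped"] += random.randint(0, 5)
    return dict(_state)

LOSS_ALERT_THRESHOLD = 1e-6  # alert above ~1 dropped packet per million sent

def poll_interface(interface: str, samples: int = 3) -> None:
    """Diff counters between samples; alert only on a meaningful loss rate."""
    prev = read_counters(interface)
    for _ in range(samples):
        cur = read_counters(interface)
        sent = cur["tx_packets"] - prev["tx_packets"]
        dropped = cur["tx_dropped"] - prev["tx_dropped"]
        if sent and dropped / sent > LOSS_ALERT_THRESHOLD:
            print(f"{interface}: loss rate {dropped / sent:.2e}")
        prev = cur

poll_interface("eth0")
```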

For repairs, Meta has automated the triage and repair process, capturing detailed network snapshots for faster troubleshooting. Strict Service Level Objectives (SLOs) for repair times are in place to minimize downtime and performance setbacks. Safe capacity buffers ensure network stability during repairs, and SLOs drive the automation of repair processes, ensuring timely and efficient resolution.

  • Takeaways: By enhancing both frontend and backend networks, Meta has successfully built a robust infrastructure capable of supporting large-scale AI training. The use of dual monitoring strategies, automated repair processes, and proactive capacity management ensures high availability and performance for AI workloads. Meta’s emphasis on automation and precise monitoring has transformed network failures from major setbacks into manageable events, helping maintain consistent performance for GPU-intensive operations.
  • Personal Reflections: This talk provided valuable insights into the complexities of managing large-scale networks for AI training. The dual monitoring strategy and automated repair processes were particularly interesting, showcasing how proactive measures can ensure network stability even under heavy loads. It reinforced the importance of reliability and performance in AI infrastructure, especially in scenarios involving real-time data sharing across GPUs.

Scheduler + Sharding Considerations for Network Efficiency

  • Speakers: Weiwei Chu, Arnab Choudhury
  • Written by: Satvik Matta
  • Presentation Overview:

In their session, Weiwei Chu and Arnab Choudhury from Meta discussed how they have customized the Meta Advanced Scheduler Tool (MAST) to enhance network communication for training large language models (LLMs) like LLaMA 3. They highlighted the importance of sharding and network efficiency at scale, emphasizing how Meta optimizes scheduling and parallelism to improve GPU utilization and reduce training time.

The push towards Artificial General Intelligence (AGI) and the complexities of models like LLaMA 3 require substantial compute and networking resources. Meta addressed this by building a 24,000-GPU network for large-scale training. As models increase in complexity, challenges such as network latency and GPU utilization become critical. Meta’s optimization of the scheduler and network alignment significantly reduced LLaMA 3’s training time.

Parallelism techniques are crucial for efficiency. Fully Sharded Data Parallelism (FSDP) splits model weight matrices across GPUs, reducing memory demands and using pre-fetching to hide communication latency. Tensor Parallelism (TP), which requires gathering outputs across GPUs, introduces high communication overhead and needs high-bandwidth connections. Pipeline Parallelism (PP) and other methods are used based on model size and network needs, with correct parallelism configurations essential for minimizing network overhead.
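
Because every GPU belongs to exactly one group of each kind, the parallelism degrees must multiply out to the total GPU count, which makes choosing a configuration a factorization problem. The degrees below are made up for illustration, not Meta's LLaMA 3 settings.

```python
# Illustrative factorization of a training job into parallelism degrees.
# These specific numbers are our own example, not Meta's LLaMA 3 config.
world_size = 16_384           # hypothetical total GPU count

tp = 8                        # tensor parallel: chattiest, needs fastest links
pp = 16                       # pipeline parallel: point-to-point between stages
dp = world_size // (tp * pp)  # FSDP/data parallel absorbs the remainder

assert tp * pp * dp == world_size
print(f"TP={tp} x PP={pp} x DP={dp} = {tp * pp * dp} GPUs")
# TP groups exchange activations every layer, so they belong on the fastest
# scope (one host/rack); FSDP communicates once per step, so it tolerates
# slower cross-AI-zone hops. That asymmetry motivates the layering described next.
```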

For LLaMA 3, Meta strategically layered parallelism techniques. Tensor parallelism was placed in the innermost layer for high-bandwidth communication within the same rack, while FSDP was positioned in the outermost layer to manage cross-AI-zone communication. Improper layering would result in slower training and increased latency.

Effective scheduling is also vital. Rank assignment, which determines the communication proximity of GPUs, must align with network topology to reduce latency. A well-designed scheduler like MAST optimizes GPU placement by understanding network hierarchy, ensuring that communication remains within low-latency zones. Poor rank assignment can cause increased latency and slower training, while optimal placement enhances performance.
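
To make rank assignment concrete, here is a minimal topology-aware sketch under an assumed host/rack/zone hierarchy. MAST's actual placement logic is far richer; the field names and sort key are our own illustration.

```python
from dataclasses import dataclass

@dataclass
class Gpu:
    host: str
    rack: str
    zone: str

def assign_ranks(gpus: list[Gpu]) -> list[tuple[int, Gpu]]:
    """Assign ranks so consecutive ranks share a rack where possible.

    Collective libraries typically build tensor-parallel groups from
    consecutive ranks, so sorting by (zone, rack, host) keeps those
    high-bandwidth groups on the lowest-latency network scope."""
    ordered = sorted(gpus, key=lambda g: (g.zone, g.rack, g.host))
    return list(enumerate(ordered))

# Usage: GPUs arriving in arbitrary order end up rack-contiguous by rank.
gpus = [
    Gpu("h3", rack="r2", zone="z1"), Gpu("h1", rack="r1", zone="z1"),
    Gpu("h4", rack="r2", zone="z1"), Gpu("h2", rack="r1", zone="z1"),
]
for rank, g in assign_ranks(gpus):
    print(rank, g.host, g.rack)   # ranks 0-1 land on r1, ranks 2-3 on r2
```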

Meta also addresses scheduling challenges with dedicated GPU buffers for fault tolerance, ensuring jobs continue smoothly in case of GPU failures. Additionally, they have focused on minimizing scheduling overhead and recovery time from host failures to maintain efficient training operations.

  • Takeaways: Meta significantly enhanced large-scale training efficiency for LLaMA 3 by implementing network-aware parallelism and scheduling. By carefully tuning these aspects, Meta achieved over 40% GPU utilization. Topology-aware scheduling ensured that ranks were assigned based on network topology, optimizing tensor parallelism and reducing communication overhead. Additionally, Meta incorporated fault tolerance measures to handle hardware failures efficiently, ensuring that training jobs remain resilient and continue smoothly despite infrastructure challenges.
  • Personal Reflections: This talk highlighted the importance of network-aware scheduling and parallelism for training large models like LLaMA 3. The technical deep dive into how Meta optimizes its Job Scheduler (MAST) for efficient communication across GPUs was insightful, demonstrating how infrastructure decisions impact AI training performance at scale.

Designing Scalable Networks for Large AI Clusters: Challenges + Key Insights

  • Speaker: Jithin Jose
  • Written by: Tharuka Kodituwakku
  • Presentation Overview:

Jithin Jose from Microsoft discussed designing scalable networks for large AI clusters to support demanding applications like autonomous driving and medical imaging. As AI models grow in complexity, efficient, high-performance networks are crucial. Microsoft has made strides in scaling AI clusters, reaching milestones like managing 80,000 MPI jobs and securing top rankings in supercomputing.

Key challenges include designing network topologies that manage GPU communication efficiently and avoiding performance issues from high oversubscription. Microsoft explored various designs, such as multiple planes and hybrid models, to improve efficiency and scalability. They also focus on network flexibility to adapt to future workloads.
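
Oversubscription, one of the issues raised here, is simply the ratio of host-facing to upstream bandwidth at a switch tier; above 1:1, some traffic patterns cannot run at full line rate. The numbers below are illustrative, not Microsoft's design points.

```python
# Oversubscription ratio at a leaf switch (illustrative numbers only).
down_ports, down_speed_gbps = 32, 400   # toward GPUs/hosts
up_ports, up_speed_gbps = 16, 400       # toward the spine tier

ratio = (down_ports * down_speed_gbps) / (up_ports * up_speed_gbps)
print(f"{ratio:.0f}:1 oversubscribed")
# 2:1 means all-to-all traffic crossing the spine gets at most half line rate,
# which is why high oversubscription hurts collective-heavy AI workloads.
```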

Validating large clusters without disrupting ongoing jobs is critical. Microsoft developed benchmarks for smooth integration of new segments. Efficient routing is essential for performance, and Microsoft optimized path selection and bandwidth utilization to address peak performance issues.

Communication libraries are fine-tuned to improve performance and adapt to network changes, especially for long-distance communication. To ensure network reliability, Microsoft addressed link failures with smart switches and managed network asymmetry by providing feedback to source nodes to prevent congestion.

The SuperBench tool helps validate clusters by identifying issues, and Microsoft uses different topologies for public and dedicated AI clusters, optimizing each for its specific needs. Additionally, the smart collective communication library (CCL) approach leverages real-time feedback to adapt to network conditions, enhancing overall performance.

  • Takeaways: Network design is crucial for large-scale AI training, requiring efficient topology and routing to minimize latency, handle vast data flows, and adapt to failures. Tuning communication libraries for specific network topologies significantly impacts training performance, especially as clusters expand and face new challenges like long-distance communication. Cluster validation is an ongoing process; tools like Microsoft’s SuperBench help in targeting problematic nodes and links to ensure performance. Flexibility and adaptability in network design and communication libraries are essential as AI models and workloads evolve, ensuring that clusters remain scalable and efficient.
  • Personal Reflections: This talk provided valuable insights into the challenges of designing and scaling networks for large AI training clusters. The level of detail on routing, communication libraries, and network reliability highlighted the complexity of managing large-scale distributed systems. Microsoft's approach to building flexible, high-performance networks that can handle the demands of AI workloads was impressive, offering a roadmap for future developments in this space.

Faster Than Fast: Networking/Communication Optimizations for LLaMA 3

  • Speakers: Pavan Balaji, Adi Gangidi
  • Presentation Overview:

In their session, Pavan Balaji and Adi Gangidi from Meta discussed optimizing network and communication infrastructure to support large-scale generative AI models like LLaMA 3. They highlighted the need for infrastructure to handle the massive compute and network demands of such models, detailing optimizations for network latency, communication libraries, and routing to enhance performance for LLaMA 3’s training and serving.

Meta developed new clusters with 24,000 GPUs to support LLaMA 3, addressing challenges from previous models. Key issues included network latency sensitivity, load balancing inefficiencies, and the lack of network topology awareness in scheduling, which led to bottlenecks. To improve performance, Meta implemented flow multiplexing, optimized communication libraries by prioritizing critical control messages, and adjusted channel buffers to balance latency and bandwidth.
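
Flow multiplexing attacks the same low-entropy problem described in the Alibaba section: splitting one elephant flow across several source ports gives ECMP more 5-tuples to hash, spreading the flow over parallel paths. The sketch below is schematic, not Meta's implementation.

```python
import hashlib

def ecmp_path(src_port: int, dst_port: int, num_paths: int) -> int:
    """Deterministic stand-in for a switch's 5-tuple ECMP hash."""
    digest = hashlib.sha256(f"{src_port}->{dst_port}".encode()).digest()
    return digest[0] % num_paths

NUM_PATHS = 8
RDMA_PORT = 4791  # standard RoCEv2 UDP destination port

# A single flow hashes to exactly one path, however large the flow is.
print(ecmp_path(40_000, RDMA_PORT, NUM_PATHS))

# Multiplexed: the same logical transfer split across 8 source ports lands
# on several distinct paths, spreading one elephant flow over parallel links.
paths = {ecmp_path(40_000 + i, RDMA_PORT, NUM_PATHS) for i in range(8)}
print(sorted(paths))
```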

For LLaMA 3, Meta achieved a 10% boost in all-gather collective operations and focused on optimizing time to first token and time to incremental token for serving. They used flat algorithms for all-reduce communication to reduce latency, aiming to balance bandwidth and latency for optimal user experience.
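
The bandwidth/latency trade-off behind choosing flat algorithms shows up clearly in the standard alpha-beta cost model. "Flat" below means a direct one-hop reduce-scatter plus all-gather (our reading of the term, assuming full-bisection bandwidth), and all constants are illustrative assumptions.

```python
# Alpha-beta cost model for all-reduce (our own illustration, not Meta's code).
# alpha: per-message latency (s); beta: seconds per byte (inverse bandwidth).

def ring_allreduce_time(n_bytes: float, p: int, alpha: float, beta: float) -> float:
    # Ring: 2(p-1) sequential steps, each moving n/p bytes.
    return 2 * (p - 1) * alpha + 2 * (p - 1) / p * n_bytes * beta

def flat_allreduce_time(n_bytes: float, p: int, alpha: float, beta: float) -> float:
    # Flat: direct reduce-scatter + all-gather, one hop each, so the number
    # of latency terms is constant instead of growing with p.
    return 2 * alpha + 2 * (p - 1) / p * n_bytes * beta

alpha = 5e-6       # 5 us per message (assumed)
beta = 1 / 50e9    # ~50 GB/s effective per-GPU bandwidth (assumed)
p = 64

for n in (32 * 1024, 256 * 1024**2):  # 32 KB (serving-ish) vs 256 MB (training-ish)
    print(f"{n:>12} B  ring={ring_allreduce_time(n, p, alpha, beta)*1e6:8.1f} us"
          f"  flat={flat_allreduce_time(n, p, alpha, beta)*1e6:8.1f} us")
# Small messages are dominated by the 2(p-1) latency terms, which is why flat
# algorithms help most for latency-sensitive serving (time to first token).
```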

  • Takeaways: Meta aims to scale up to even larger AI models, which will present new challenges such as deciding between maintaining a lossless network or adopting lossy protocols due to increased network latency. As domain sizes grow, there will be a need for shared memory-based networks with enhanced reliability and a focus on optimizing network performance for smaller message sizes, particularly for inference workloads. Ongoing work involves optimizing networking and communication libraries for both training and inference, with innovations in network topology, flow multiplexing, and latency reduction preparing Meta for the demands of future, larger models.
  • Personal Reflections: This talk provided a deep dive into the complexities of optimizing network infrastructure for generative AI models like LLaMA 3. The emphasis on balancing both network bandwidth and latency to achieve high performance in both training and serving was particularly insightful. As AI models continue to grow in size and complexity, the innovations discussed in this talk will be critical for enabling the next generation of AI infrastructure.

Conversations with Industry Professionals: Learning Beyond the Talks

Engaging with Experts:

Throughout the event, both of us seized the opportunity to connect with various professionals across the networking and AI infrastructure space. As college students eager to learn, we wanted to gain deeper insights into the real-world challenges and opportunities in these fields. Approaching industry experts allowed us to understand not just their career paths but also the day-to-day responsibilities and skills required to succeed.

We had insightful conversations with individuals such as Arihant Jain from Arista Networks and Hany Morsy from Meta, who were working together on a joint project. They explained how companies collaborate, each bringing their expertise to solve complex problems. They stressed that while certifications have their place, hands-on projects are a far better demonstration of one’s capabilities, especially when breaking into a career in networking and AI infrastructure.

Career Advice and Insights:

Both of us were advised to focus less on piling up certifications and more on practical experiences, like building and deploying real-world projects. This was a recurring piece of advice, whether we were speaking with Arihant and Hany or Kalimani Venkatesan and Manish Aggarwal from Capgemini. They emphasized that the real value lies in understanding the application of networking concepts, particularly in modern environments like Kubernetes, which is widely used today for network orchestration. Rather than sticking to older networking fundamentals, they encouraged us to dive into deploying networks using Kubernetes and other cloud-based solutions.

When we spoke to Adi Gangidi from Meta and Vishal Gadiya from Infinera, they reiterated the importance of keeping up with general networking concepts while also integrating AI. They suggested that the ability to combine fundamental networking skills with emerging AI technologies would open doors in both fields, offering a unique edge in the job market.

Further Career Guidance:

Talking to Masiuddin Mohammed from Cisco Systems offered a different perspective. He shared his journey from starting in networking to eventually moving into sales, highlighting the importance of exploring different roles and career paths. His advice to "try everything" resonated with us, as he suggested that finding the right fit often requires experimentation with various technical and non-technical roles.

We also gained valuable insights from Omar Baldonado from Meta, who stressed the importance of mastering foundational computer science concepts like operating systems, data structures, and software engineering practices. He mentioned that having a solid technical base is crucial, but with the current market shift, it’s important to have projects that combine AI and data science to stay relevant.

Conclusion

Attending the event provided invaluable insights into the complexities of networking and AI infrastructure, particularly at scale. The presentations from industry leaders like Meta and Microsoft highlighted cutting-edge techniques used to optimize massive AI models such as LLaMA 3. Key takeaways included how companies manage network latency, GPU utilization, and sharding techniques to enhance performance in large-scale AI training. Innovations like layered parallelism, network-aware scheduling, and flow multiplexing illustrated the delicate balance required to push AI limits while maintaining a robust infrastructure. These talks demonstrated how both hardware and software optimizations directly drive AI advancements and accelerate training times.

Conversations with industry professionals enriched our understanding by bridging the gap between theoretical learning and practical application. Speaking with experts like Arihant, Hany, Omar, Kalimani, Manish, Adi, Vishal, and Masiuddin provided valuable career advice. Their emphasis on gaining hands-on experience over certifications highlighted the growing demand for real-world problem-solving skills in networking and AI infrastructure. These interactions gave us a deeper understanding of how collaboration and practical knowledge are key to solving complex technical challenges in the industry, offering clarity on the skills and expertise needed to thrive.

The event emphasized the critical role that practical experience plays in mastering networking and AI infrastructure. The discussions on optimizing parallelism, network design, and fault tolerance underscored the importance of applying theoretical knowledge to real-world scenarios. Moving forward, our focus will be on hands-on projects that explore these optimizations, particularly in areas like scheduling, parallelism techniques, and infrastructure design. The insights gained from both the talks and conversations with industry professionals have inspired us to delve deeper into emerging technologies such as Kubernetes and AI-driven networking solutions, helping us better align our academic and career pursuits with the evolving demands of the industry.
