Transforming the Data Center with AI

Reflecting on my career, I feel extremely lucky to have lived through several key inflection points in networking. From landing at Stanford in the fall of 1995, right around the Netscape IPO that many believe marked the beginning of the Internet generation, to the launch of the smartphone generation in the mid-2000s, to right now, when AI is fundamentally changing how we all live, work and play. Every few years, a new trend has emerged that has made the network that much more critical to our everyday lives. This meant the network had to keep scaling exponentially in all dimensions – bandwidth, geographical coverage and number of connections. Today, the network is more capable, and more critical, than at any time in the past.

Juniper’s Evolution

Juniper Networks has played an important role in the evolution of networking. It was founded in 1996 to deliver the routing infrastructure that service providers needed to support the fast-growing demands of the Internet. Since then, the company has established itself as a premier networking provider.

Along the way, Juniper’s path merged with two emerging startups, which expanded Juniper’s capabilities in several key areas.

The first one was Mist Systems, a company founded to bring the simplicity and scale of AI operations (AIOps) via the cloud to wireless networking. Mist AI was revolutionary, helping enterprises of all sizes deliver game-changing agility, automation and assurance for better wireless experiences. They showed that AI and the cloud go hand in hand: the former requires enormous data collection and processing, and the latter delivers an elastic infrastructure for handling that.

After Mist was acquired by Juniper in 2019, the benefits of Mist were extended to the entire Juniper campus and branch portfolio. For the first time, wired access, SD-WAN and NAC were all unified under a common Mist AI and cloud umbrella, delivering exceptional end-to-end experiences. Fast forward to today: Mist has thousands of customers and has collected terabytes of data, which means it will continue to benefit from its multi-year lead over competing AI-based infrastructure solutions.

The second company was the one I co-founded, Apstra. At Apstra, we focused on a different market, the data center. We wanted to give IT departments software that could help them run their data center networks at the speed of business, with the highest levels of reliability and the lowest operational costs. In order to deliver on this goal, we had to invent a solution from the ground up, which we called intent-based networking.

With Apstra’s intent-based networking software, network topologies, components, protocols and more are all modeled across any vendor’s platforms. Using these models, the Apstra solution deterministically, and with the utmost reliability, automates all aspects of designing, building, deploying, and operating a data center. This has numerous benefits, with some customers seeing up to 85% improvement in deployment time, a 90% reduction in data center networking operating expenses, and 10x better reliability. As the only multivendor solution for intent-based networking, Apstra also avoids expensive vendor lock-in.
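
To make this concrete, here is a minimal, hypothetical sketch of what an intent model and its validation might look like. The class names, fields and topology rule are illustrative assumptions only; this is not Apstra’s actual data model or API.

```python
# Minimal, hypothetical sketch of intent-based validation for a leaf-spine fabric.
# Class names, fields and rules are illustrative only; not Apstra's data model or API.
from dataclasses import dataclass

@dataclass(frozen=True)
class FabricIntent:
    spines: int            # number of spine switches
    leaves: int            # number of leaf switches
    link_speed_gbps: int   # declared speed of every leaf-spine link

    def expected_links(self) -> set[tuple[str, str]]:
        """In a leaf-spine design, every leaf connects to every spine."""
        return {(f"leaf{l}", f"spine{s}")
                for l in range(self.leaves) for s in range(self.spines)}

def validate(intent: FabricIntent, discovered: set[tuple[str, str]]) -> list[str]:
    """Compare declared intent against discovered state and report any drift."""
    expected = intent.expected_links()
    drift = [f"missing link {a}-{b}" for a, b in sorted(expected - discovered)]
    drift += [f"unexpected link {a}-{b}" for a, b in sorted(discovered - expected)]
    return drift

intent = FabricIntent(spines=2, leaves=4, link_speed_gbps=400)
discovered = intent.expected_links() - {("leaf3", "spine1")}   # simulate one failed cable
print(validate(intent, discovered))                            # ['missing link leaf3-spine1']
```

The key idea is that the operator declares the desired end state once, and the software both renders the configuration and continuously checks reality against that declaration.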

Like Mist Systems, Apstra was also acquired by Juniper (in 2020). While Mist AI served as the foundational pillar for Juniper’s campus and branch domain, Apstra with its intent-based networking software became the foundation for Juniper’s automated, secure data center solution. Together, the two solutions enabled Juniper to focus on its “true north” mission of delivering the best end-to-end user and operator experiences, a concept referred to as “experience-first”.

Bringing Mist and Apstra Together to Deliver Industry-First Transformative Outcomes

The Mist and Apstra approaches, while different, are also quite complementary. I wrote a blog some time ago about the difference between deterministic and probabilistic approaches to automation – citing intent-based networking as a prime example of deterministic automation. I contrasted deterministic approaches with AI, which is probabilistic in nature.

To recap that discussion, deterministic approaches are required when current actions must be completely determined by previously existing causes, as with an autopilot on a plane, or when running a data center network with five-nines (99.999%) reliability.

On the other hand, probabilistic approaches are required when one wants to use insights from the historical past to help predict the future, for example to predict the probability of a component like a transceiver failing, or to help place access points (APs) and utilize wireless channels in a way that maximizes the probability of a good user experience across the board. In addition, probabilistic insights have proven to be far superior when it comes to developing large language models (LLMs) for Generative AI.
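
To illustrate the probabilistic side, here is a toy sketch that estimates the probability of a transceiver failing from historical telemetry, using synthetic data and an off-the-shelf scikit-learn model. The features, thresholds and model choice are assumptions for illustration, not how Mist or Marvis actually works.

```python
# Toy, hypothetical sketch of a probabilistic insight: estimating the chance that an
# optical transceiver fails, from synthetic telemetry. Not Mist's or Marvis's models.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Features per optic: [temperature_celsius, rx_power_dbm]; label: 1 = failed within 30 days.
n = 500
temp = rng.normal(55, 8, n)
rx_power = rng.normal(-3.0, 1.5, n)
# Synthetic ground truth: hotter optics with weaker receive power fail more often.
p_fail = 1 / (1 + np.exp(-(0.15 * (temp - 60) - 0.8 * (rx_power + 5))))
failed = rng.random(n) < p_fail

X = np.column_stack([temp, rx_power])
model = LogisticRegression().fit(X, failed)

# Score a currently healthy optic that is running hot with degraded receive power.
candidate = np.array([[68.0, -7.5]])
print(f"estimated failure probability: {model.predict_proba(candidate)[0, 1]:.2f}")
```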

As I argued in my last blog, probabilistic and deterministic insights, while very different, can be highly complementary. For example, while operating a network with intent-based networking, we’re collecting masses of data that are used to validate whether the network indeed meets intent, and whether a change took place as expected. This data can be pushed to a centralized, cloud-based data lake and contribute to the cumulative wealth of data that is fed to the AI models.

As proof of the synergies in these models, today we’re announcing Juniper is bringing AIOps to the data center by integrating Apstra with our beloved Marvis virtual network assistant (VNA). More specifically, we are extending Apstra’s rich streaming telemetry and real-time monitoring into the Marvis VNA dashboard to provide proactive troubleshooting and analytics across operational domains. This integration is a fundamental first step to provide future AI-native actionable insights for multivendor data centers. Also, campus and branch operators using Marvis can now see data center issues on the same dashboard.

In addition, we’re bringing Marvis’ conversational interface (driven by Generative AI) to the data center – a fitting addition to the Apstra intent-based approach. Indeed, intent-based networking’s mantra is “the network operator tells the software what to do, and the software goes ahead and does it.” Why not tell it in plain English? This is our long-term goal with Marvis VNA for the data center.
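
As a rough illustration of the idea, here is a toy sketch that turns a narrow class of plain-English requests into a structured intent. A hard-coded pattern match stands in for a real LLM, and none of this reflects how Marvis is implemented.

```python
# Toy sketch of translating a plain-English request into a structured change intent.
# A hard-coded pattern match stands in for a real LLM; this is not how Marvis works.
import re

def english_to_intent(request: str) -> dict:
    """Map a narrow class of English requests to a structured change intent."""
    m = re.search(r"add (\d+) (?:more )?leaf(?:s| switches)? to (\S+)", request.lower())
    if not m:
        raise ValueError("request not understood by this toy parser")
    return {
        "action": "expand_fabric",
        "fabric": m.group(2),
        "add_leaves": int(m.group(1)),
    }

print(english_to_intent("Add 4 leaf switches to pod-west"))
# -> {'action': 'expand_fabric', 'fabric': 'pod-west', 'add_leaves': 4}
```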

For the first time ever, as part of Juniper’s AI-Native Networking Platform, we’re bringing together AI, networking, and intent-based networking. I believe this is a significant first in the industry, a step that is sure to create transformative outcomes for everyone who operates and depends on our critical networks.

The Latest Inflection Point – AI Data Centers

In 2022, AI took the world by storm with the introduction of ChatGPT. As the possibilities of AI became clear to all, a new market for AI data centers seemingly appeared almost overnight. The numbers for this space are staggering. According to the 650 Group, AI workloads are expected to be responsible for $2.4B in data center networking spend in 2024, out of a $22B market. But even more importantly, the industry expects a 52% CAGR over the next 5 years as every business invests in building AI capabilities that it foresees will become critical to its competitiveness in the market. Customers today have one topic that is top of mind: artificial intelligence (AI). And in talking to our customers, it’s clear that they’re well beyond the exploratory phase. They are investing meaningfully in AI applications, with initiatives tied to corporate performance.
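
For perspective on how quickly that growth rate compounds, here is a quick back-of-the-envelope calculation using only the figures cited above:

```python
# Back-of-the-envelope compounding of the figures cited above (52% CAGR on $2.4B).
base_spend_busd = 2.4   # AI data center networking spend in 2024, per 650 Group
cagr = 0.52             # cited 5-year compound annual growth rate

for year in range(1, 6):
    projected = base_spend_busd * (1 + cagr) ** year
    print(f"2024 + {year} years: ~${projected:.1f}B")
# Mechanically compounding the cited CAGR implies roughly $19-20B of annual
# AI networking spend after five years.
```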

Given the investments and expertise Juniper has built in both networking and AI, I can’t think of a better vendor to deliver the networking solutions that make up the core of infrastructure supporting AI workloads. But, as the product leader for Juniper in this space, I am also acutely aware of the challenges. Reassuringly, we’ve designed our AI data center offering to uniquely address them. Here’s how:

1. An operations-first approach makes it easier to operate AI clusters with fewer resources

Graphics Processing Units (GPUs) are expensive, and AI clusters are complex and expensive to build.

Optimizing GPU efficiency for the lowest job completion time (JCT) is critical to containing AI costs, but these expensive resources are typically underutilized. Last year, for example, Meta found that 33% of the elapsed time in AI/ML training was spent waiting for the network.
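
A rough, hypothetical illustration of why that matters economically, using the 33% figure above together with invented cluster and cost numbers:

```python
# Rough illustration of how network wait time inflates job completion time (JCT) and cost.
# The 33% figure is the Meta observation above; cluster size and GPU cost are hypothetical.
gpus = 1024                    # hypothetical training cluster size
gpu_hour_cost = 3.00           # hypothetical fully loaded $ per GPU-hour
compute_hours_needed = 200.0   # hours of pure computation the job requires

for network_wait_fraction in (0.0, 0.33):
    jct_hours = compute_hours_needed / (1 - network_wait_fraction)
    cost = jct_hours * gpus * gpu_hour_cost
    print(f"network wait {network_wait_fraction:.0%}: JCT = {jct_hours:.0f} h, "
          f"cluster cost = ${cost:,.0f}")
# With a third of elapsed time spent waiting on the network, the same job takes roughly
# 50% longer and the cluster bill grows proportionally.
```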

It is indeed difficult to extract the full performance from AI clusters. GPU horsepower, CPU horsepower, memory, storage and bandwidth need to be optimized carefully and dynamically to avoid bottlenecks. Once created, bottlenecks can have devastating effects on run completion times, which could delay AI training by days, if not weeks.

Similarly for inference, any bottleneck can create delays in returning results to end users, severely affecting user experience.

In order to control completion times and inference delays, dynamic parameter tuning, deep observability and debuggability are critical capabilities. This is why at Juniper we’re taking an operations-first approach, leveraging the full power of Apstra and, optionally, Marvis VNA to run these clusters optimally.

An operations-first approach provides simple and seamless operator experiences that save time and money without vendor lock-in. Apstra is the only open, intent-based, multivendor DC automation platform for Day 0, Day 1, and Day 2+ lifecycle management. With templated blueprints and IBN programming, Apstra delivers reliability, consistency and repeatability that translates into accelerated time to deployment, simplified operations, and an average 320% ROI for customers.

Apstra intent-based networking ensures the cluster network is designed properly. It ensures that network parameters, including QoS, load balancing and congestion control, are set up properly to avoid bottlenecks. Apstra also collects fine-grained telemetry, including real-time queue depth, to continually validate performance, and it has the ability to react and take corrective action through its intent-based analytics tests.
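
To give a concrete flavor of this kind of check, here is a minimal, hypothetical sketch of a rolling-window test on queue-depth telemetry. The threshold, record format and remediation hook are invented; this is not Apstra’s intent-based analytics API.

```python
# Minimal, hypothetical sketch of an intent-based-analytics-style check on queue-depth
# telemetry. Threshold and record format are invented; not Apstra's IBA probe API.
from collections import deque
from statistics import mean

QUEUE_DEPTH_INTENT_BYTES = 2_000_000   # hypothetical "acceptable" average queue depth

def detect_sustained_congestion(samples_bytes, window=10, threshold=QUEUE_DEPTH_INTENT_BYTES):
    """Flag an interface whose rolling average queue depth violates the declared intent."""
    recent = deque(maxlen=window)
    for sample in samples_bytes:
        recent.append(sample)
        if len(recent) == window and mean(recent) > threshold:
            return True   # sustained violation: raise an anomaly / trigger remediation
    return False

# Usage: a short burst that drains quickly passes; sustained buildup is flagged.
print(detect_sustained_congestion([500_000] * 20))                    # False
print(detect_sustained_congestion([500_000] * 5 + [3_500_000] * 15))  # True
```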

2. Open, AI-Optimized Ethernet avoids expensive lock-in

Soaring demand for powerful GPUs and single-sourced, proprietary InfiniBand networking has fueled steep pricing and supply chain bottlenecks. Nvidia’s innovations and dominant AI market leadership are to be applauded, but to fuel innovation while driving down costs, the industry needs competition in order to reach the mass market.

To ease the reliance on Nvidia, the industry has made significant moves to foster an open, competitive market with GPU diversity and the most widely deployed L2 technology in the world, Ethernet. Those efforts are paying off. Reaching an important milestone, the release of PyTorch 2.0, the leading AI developer framework, extends PyTorch’s vast AI ecosystem to AMD, Intel, and other GPU vendors, enabling AI users to design GPU-agnostic systems. This changes the game when it comes to both cost and supply. It also makes it that much more compelling to use standard, widespread technologies like Ethernet instead of InfiniBand.
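
For a sense of what GPU-agnostic looks like in practice, here is a small PyTorch 2.x sketch: the same model and forward pass run on an NVIDIA or AMD GPU (both surface through the torch.cuda API in their respective builds) or fall back to CPU. The model itself is a throwaway example.

```python
# Small sketch of GPU-agnostic PyTorch 2.x code: the same script runs on NVIDIA CUDA,
# on AMD ROCm (exposed through the same torch.cuda API), or falls back to CPU.
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 1024)).to(device)
model = torch.compile(model)   # PyTorch 2.0 compiler; backends exist for multiple vendors

x = torch.randn(32, 1024, device=device)
y = model(x)
print(f"ran forward pass on: {device}, output shape: {tuple(y.shape)}")
```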

At Juniper, we have always supported open solutions. We’ve always been active in standards such as EVPN-VXLAN and MPLS. More recently, Juniper joined the Ultra Ethernet Consortium (UEC) to accelerate the development of a common, high-performance Ethernet architecture for multivendor AI networks. Further, Apstra’s multivendor support for the most popular data center switching vendors eliminates vendor lock-in, which gives our customers powerful strategic flexibility.

Finally, enabling the Ethernet ecosystem, and therefore clearing the path towards GPU- and vendor-agnostic AI clusters, Juniper has delivered high-performance Ethernet switch platforms. We are the first OEM to announce a platform based on the latest Broadcom Tomahawk 5 ASIC: the 800G QFX 5240 platform with its 51.2 Tbps of bandwidth. We expect this switch to become a critical building block for most AI cluster designs.

Juniper’s PTX Series data center platform has also been extended with the new high-density 800G PTX 10002-36QDD fixed router (available in 1H24) and new 800G line cards for the PTX 10K chassis. Built on our own Express 5 custom silicon, our largest PTX 10K chassis now supports up to 576 x 800G ports for high-radix spine and super-spine architectures. Through silicon diversity, customers have leaf and spine switching design options to optimize for factors such as power efficiency, buffer size, and scale.

Building on our 400G market leadership, Juniper's new AI-optimized, high-radix 800G leaf and spine data center fabrics provide high-capacity, lossless connectivity for AI data centers. With a runway to 1.6 Tbps, Ethernet will continue to get faster, driving down costs while leveraging the vast Ethernet ecosystem to assure innovation at the speed of an industry versus a single vendor.

To get the most out of Ethernet, Juniper has also added advanced traffic management capabilities across our leaf and spine portfolio to assure high-bandwidth, lossless, low-latency, and scalable performance for the highest GPU efficiency and lowest job completion time (JCT). Support for RoCEv2 makes Juniper’s Ethernet fabrics the preferred solution for backend AI training and frontend inference models. Based on Juniper financial modeling, Ethernet has 50% lower TCO than InfiniBand, and given its ubiquity, we believe Ethernet will get better and cheaper faster than InfiniBand.

3. Turn-key validated solutions ensure confidence and expedite deployment times

At Juniper, we don’t want our customers to have to choose between open solutions that save time and money and closed solutions that have assured interoperability. You can have it all.

To that end, we’ve invested in our own GPUs and built our own AI clusters, operated by Apstra. We’ve performed our own tests and tuned the clusters ourselves to achieve optimal performance. We’ve found that in most cases, we can achieve similar or better performance with Ethernet than with InfiniBand.

To simplify AI deployments for our customers, Juniper has also introduced several new Juniper Validated Designs (JVDs). JVDs offer customers prescribed, rail-optimized or multi-layered Clos fabrics for AI clusters to quickly plan and deploy stable production environments. They include Apstra, and they cover the full range of AI use cases, from small-radix AI training clusters to large-radix ones, and from centralized training clusters to distributed inference clusters. They encompass backend, frontend, storage and edge networks.

JVD “extensions” include integrations with Juniper’s broad DC security portfolio and third-party vendors, such as VMware. Tested by Juniper’s lab professionals, JVDs can be used as out-of-the-box designs or as a set of guidelines to reduce risk and properly size and budget for your AI cluster.
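
To make the rail-optimized fabric idea more tangible, here is a rough sizing sketch. The 64-port figure follows from 51.2 Tbps at 800G, but the cluster size, oversubscription ratio and design rules are simplified assumptions for illustration, not a JVD.

```python
# Rough, hypothetical sizing sketch for a rail-optimized AI backend fabric.
# Port counts follow from a 64 x 800G (51.2 Tbps) switch; everything else is a
# simplifying assumption for illustration, not a Juniper Validated Design.
import math

gpus = 2048                 # hypothetical cluster size
gpus_per_server = 8         # one 800G NIC per GPU, one rail per GPU position
ports_per_switch = 64       # 51.2 Tbps / 800 Gbps = 64 ports
downlinks_per_leaf = ports_per_switch // 2   # 1:1 oversubscription: half down, half up

servers = gpus // gpus_per_server
rails = gpus_per_server      # each GPU position gets its own "rail" of leaf switches

# Each rail leaf terminates one NIC from up to `downlinks_per_leaf` servers.
leaves_per_rail = math.ceil(servers / downlinks_per_leaf)
total_leaves = leaves_per_rail * rails

# Spine layer sized so every leaf can land all of its uplinks (simplified non-blocking Clos).
total_uplinks = total_leaves * downlinks_per_leaf
spines = math.ceil(total_uplinks / ports_per_switch)

print(f"{gpus} GPUs -> {servers} servers, {rails} rails")
print(f"{total_leaves} leaf switches ({leaves_per_rail} per rail), {spines} spine switches")
```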

Juniper’s AI-Native Networking Platform Was Designed to Address Your Unique Challenges

The announcement this week of Juniper’s AI-Native Networking Platform is an important step to embed AI into all aspects of networking and ultimately bring choice, flexibility, and lower costs to all of our customers. I believe that only Juniper has the products, the vision and the commitment to offer a comprehensive AI-Native Networking portfolio.

With “AI for Networking”, Juniper is expanding its leading AIOps solution with the new Marvis VNA for the data center (with Apstra integration). With “Networking for AI”, Juniper has a new class of AI Data Center Networking solutions that are the easiest to manage, the most flexible to design and the fastest to deploy. With both “AI for Networking” and “Networking for AI”, Juniper is helping our customers and partners take full advantage of the AI revolution.

I am excited to be part of such a profound movement, which will have a substantial impact for many years to come.
