Integrated SDN Automation or BGP EVPN for Data Center VXLAN Fabrics?
Customers that reach out to Pluribus are typically searching for a data center networking solution which can bring major simplification to their network operations, and greatly increase their ability to automate advanced services in a distributed cloud network environment. During initial customer meetings, one of the most common requests is to compare Pluribus' Unified Cloud Fabric (UCF) architecture to alternative fabric solutions which are based on BGP EVPN paired with an external network management system, which is typically offered to automate some of the configuration complexity of BGP EVPN VXLAN fabrics.
Ultimately, the answer you will see at the end of this blog is that there are four major areas where there are substantial differences between the Unified Cloud Fabric approach and a traditional BGP EVPN VXLAN fabric automated by an external management system:
- Stronger security and better availability
- Greater operational simplification
- Multi-site deployment flexibility
- Lower operational and capital cost
In this two-part blog I am going to explain in detail why these four critical areas are significantly improved with an SDN control plane integrated into the network operating system itself. In part 1, I describe the integrated SDN architectural model of the Pluribus Networks' Unified Cloud Fabric architecture. In part 2, I compare the UCF architecture with a BGP EVPN VXLAN networking infrastructure supplemented by an external orchestration tool.
Part 1: Unified Cloud Fabric – Understanding the Architecture
To understand the value of integrated SDN automation, we need to establish a baseline understanding of the Unified Cloud Fabric architecture and where it fits into the software stack of the Netvisor ONE network operating system (NOS).
The Netvisor ONE NOS is a Linux-based, containerized NOS that leverages open source software and is designed to run on open networking switches and SmartNICs/DPUs. As shown in the diagram below, the UCF is a distributed application, an "SDN fabric control plane" integrated directly into Netvisor ONE and built on top of the TCP/IP stack, all of which runs on each switch or DPU (node) participating in the fabric. The layer of standard IP routing and bridging protocols integrated in the Netvisor ONE OS provides reachability across the network. So as long as the nodes can establish IP reachability across a distributed network, the UCF control plane can federate the nodes into a unified administration domain where all of the nodes in the fabric are seen as one logical and programmable entity.
In summary, we can view the UCF as a distributed application with its own fabric state database in each node, akin to a distributed database cluster, built on top of a standard TLS/TCP/IP network layer.
By relying on standard TCP/IP transport, the UCF distributed architecture has four important properties which make it the most flexible solution in the market for deploying distributed data center and metro networking applications:
Unified Cloud Fabric: Integrated Automation and Orchestration
The cloud networking model of the UCF software has three primary built-in automation and orchestration functionalities:
- A cloud services object abstraction model
- End-to-end automated provisioning with a distributed transaction database
- Centralized management without external controllers or management systems
The Cloud Services Object Abstraction Model
The UCF networking model is predicated on a clear separation between underlay and overlay, where the fabric underlay operates as a transport layer for the cloud services built on top of the fabric overlay. We use the term "cloud services object abstraction model" to describe a networking model where the underlying physical connections and devices are abstracted and paired with a protocol-free control plane to simplify network operations for the cloud user, very similar to the experience of programming network services in public cloud environments.
The fabric underlay includes the physical connections between switches and DPUs and is tasked with routing traffic between the VXLAN Virtual Tunnel Endpoints (VTEPs) on each switch and DPU in the fabric; as such, it is based on standard L2 and L3 protocols so it can interoperate with any network in any topology. The Pluribus UCF also automates the deployment of the underlay (e.g. BGP unnumbered) and unifies the underlay and overlay, but this will not be the focus of this blog since BGP EVPN VXLAN only provides an overlay function.
The fabric overlay is a virtual network, leveraging VXLAN, abstracted from the underlay, defined in software but accelerated with switch and DPU hardware. The overlay is where the action is: a set of advanced cloud services (including L2 and L3 segmentation, distributed logical routing, distributed security, and service chaining policies) is implemented with an abstracted object model, which enables the end user to deploy these services with a single command, end to end, in seconds, irrespective of the number of nodes in the fabric.
The fabric overlay presents an abstraction layer for both the physical topology and the individual devices, and it allows the programming of east-west services within the cloud with a protocol-free, logical object model similar to a public cloud environment.
Under the hood, the UCF SDN control plane orchestrates the programming of every device in the fabric, recording each change with a transaction ID to ensure 100% consistent configuration across all nodes. Moreover, the UCF control plane is also responsible for automatically translating the fabric overlay services into BGP EVPN constructs, enabling interoperability with third-party BGP EVPN VXLAN fabrics.
To demonstrate the simplicity of the UCF overlay object model, let us create an L3VPN service consisting of a tenant VRF with a single subnet, stretched across the fabric. The fabric below has 5 leaf switches, but the same configuration would apply to a fabric with 10, 50 or 100+ nodes.
Step 1.
Connect to any node of the fabric (via SSH or HTTPS) and, with a single command, create a distributed VRF (dVRF) overlay object. Note in the command below I specify the scope of the object as "fabric", which signals the UCF control plane that it needs to automate the creation of the dVRF object across all the nodes of the fabric.
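The original post shows this step as a screenshot. As an illustrative sketch only (the name DEMO-dVRF comes from step 3's reference to this object; the exact Netvisor ONE syntax may differ), the command looks something like this:

    CLI (network-admin@Leaf1) > vrf-create name DEMO-dVRF scope fabric

One invocation from any node, and the UCF control plane instantiates the dVRF object on every node in the fabric.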
Step 2.
With a single command, create a distributed VLAN overlay object, which is required before one creates the subnet. Note in the command below I specify the scope of the object as "fabric", which signals the UCF control plane that it needs to create the VLAN object across all the nodes of the fabric. Moreover, note the "auto-vxlan" qualifier, which instructs the UCF control plane to automatically program VNI 10009 across all the tunnels and all the VTEPs in the fabric.
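Sketching this step as well (VNI 10009 comes from the example; the VLAN ID is hypothetical, and exact syntax may vary):

    CLI (network-admin@Leaf1) > vlan-create id 1009 scope fabric vxlan 10009 auto-vxlan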
Step 3.
With a single command, create a distributed subnet overlay object (with an anycast gateway for distributed routing). Note how I attach the subnet object to the overlay VLAN (VNI 10009) created in step 2 and associate it with the distributed VRF (DEMO-dVRF) created in step 1.
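A hypothetical rendering of this step (the subnet prefix and anycast gateway address are invented for illustration; the VNI and dVRF name come from the earlier steps, and exact syntax may vary):

    CLI (network-admin@Leaf1) > subnet-create vxlan 10009 network 172.16.1.0/24 vrf DEMO-dVRF anycast-gw-ip 172.16.1.1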
As mentioned above, these 3 commands issued over a single SSH (or HTTPS) session to any node in the fabric are all it takes to create this L3VPN service, even across 100 or more nodes. By contrast, if I were to create a similar basic L3VPN service with a BGP EVPN protocol approach across 100 nodes, it would take 100 SSH sessions and about 2,000 lines of configuration box-by-box, whether implemented manually, via scripting or using an external automation solution.
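For a sense of the contrast, here is a representative, abbreviated fragment of the per-switch configuration an equivalent service typically requires with BGP EVPN VXLAN (generic NX-OS-style syntax for illustration; every leaf needs its own copy of roughly this, on top of the BGP/EVPN session and NVE interface configuration):

    vlan 1009
      vn-segment 10009
    interface Vlan1009
      vrf member DEMO-dVRF
      ip address 172.16.1.1/24
      fabric forwarding mode anycast-gateway
    vrf context DEMO-dVRF
      vni 50000
      rd auto
      address-family ipv4 unicast
        route-target both auto evpn
    evpn
      vni 10009 l2
        rd auto
        route-target import auto
        route-target export auto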
Scale is also improved by this overlay object approach: it is possible to build highly scalable multi-tenancy environments with thousands of dVRFs using data center-class ToR switches at their maximum hardware capacity. Contrast this with the BGP EVPN VXLAN approach where, on the same class of switches, the maximum VRF/tenant scalability is typically on the order of ~250 due to the load the BGP EVPN control plane puts on the switch.
Furthermore, in cloud networks it is necessary to share services (firewalls, gateways, NAT appliances, DNS etc.) among multiple tenants. With traditional protocols the complexity of coordinating the routing and security policies at scale across a fabric grows with the number of tenants and number of nodes in the fabric. Conversely with the UCF cloud services object abstraction model, designing a shared service in a multi-tenant environment only takes a few commands, independent of the number of nodes in the fabric. By combining logical objects like the distributed VRF object (shown in the example above) with abstracted policy objects, it is possible to create granular routing and security policies across the fabric with just a handful of global commands.
End-to-end Automated Provisioning with the Distributed Transaction Database
A major difference between the distributed architecture of the UCF and a traditional box-by-box NOS is the distributed transaction database, which enables consistent, automated provisioning of global objects across the fabric with full confidence. Each command to create, modify or delete distributed objects is represented as a transaction and stored in a global database distributed across all the nodes. Each transaction is associated with a transaction identifier (tid), a number sequentially incremented with every transaction. The UCF control plane ensures this distributed object database is programmed consistently across the nodes of the fabric (i.e. each node is configured with the same global objects).
To commit a transaction into the global database, the UCF control plane utilizes a distributed three-phase commit algorithm popular in distributed database systems. In the picture below, note that the fab-tid is the same for all the nodes, indicating that all of the nodes have the same exact copy of all the commands associated with global objects.
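The referenced picture is not reproduced here; as a purely hypothetical illustration (node names invented, tid value taken from the example that follows, field names approximate), a fabric-wide node listing would show every node at the same fab-tid:

    CLI (network-admin@Leaf1) > fabric-node-show
    name   fab-name  fab-tid  state
    ----   --------  -------  ------
    Leaf1  DEMO-FAB  1302     online
    Leaf2  DEMO-FAB  1302     online
    Leaf3  DEMO-FAB  1302     online
    Leaf4  DEMO-FAB  1302     online
    Leaf5  DEMO-FAB  1302     online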
The transaction database allows the fabric operator to roll back the entire fabric with a single command by specifying the transaction ID number (tid) to roll back to.
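As a sketch of what this could look like (the command name and parameters below are assumptions for illustration, not verified Netvisor ONE syntax):

    CLI (network-admin@Leaf1) > fabric-rollback to-tid 1300    (hypothetical syntax: revert the whole fabric to transaction 1300)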
Similarly, if a new node joins the fabric, it automatically rolls forward to the latest fabric transaction (tid 1302 in the example above), therefore guaranteeing that any new node is automatically configured with the same fabric-wide objects as the rest of the fabric.
Achieving similar functionality with a traditional NOS is possible only with the aid of a sophisticated external orchestrator, which comes with flexibility and cost tradeoffs I will discuss in part 2 of this blog.
Centralized Management without External Controllers or Management Systems
The last of the UCF functionalities I discuss in this blog is the ability to provision and monitor an entire fabric from any node. This applies to all the main methods of orchestrating the fabric, such as the CLI and the RESTful API.
To demonstrate the flexibility of the "multi-node" CLI and REST API, I will use a few examples with routing commands over a single SSH session from a single node of the fabric. In the fabric, commands can be executed with a "global", "local" or "group" context. By default, commands are executed with a fabric-wide context; however, it is possible to switch the context to either a single switch or a group of switches.
First, I show a command with “local” context. In the example below I am connected to Leaf1 and execute a command to show the BGP neighbors on Leaf4:
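A sketch of what this looks like in the Netvisor ONE CLI (output omitted; to my understanding the "switch <name>" prefix scopes the command to a single node, though exact syntax may differ):

    CLI (network-admin@Leaf1) > switch Leaf4 vrouter-bgp-neighbor-show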
Similarly, I can issue a command with a context of multiple switches:
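For example, listing multiple node names after the "switch" prefix (a sketch under the same syntax assumptions as above):

    CLI (network-admin@Leaf1) > switch Leaf3,Leaf4 vrouter-bgp-neighbor-show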
It is also possible to create “group” aliases and switch the context using the group alias. In the example below Spine1 and Spine2 are part of the “SPINES” group:
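A sketch of how the group could be defined and then used as a context (command names reflect my understanding of the Netvisor ONE CLI and may differ):

    CLI (network-admin@Leaf1) > switch-group-create name SPINES
    CLI (network-admin@Leaf1) > switch-group-member-add name SPINES member Spine1
    CLI (network-admin@Leaf1) > switch-group-member-add name SPINES member Spine2
    CLI (network-admin@Leaf1) > switch SPINES vrouter-bgp-neighbor-show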
As noted, most commands default to a global context. In the example below I look up a specific route across the entire fabric:
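For instance, with no "switch" prefix the command runs fabric-wide and returns matches from every node (a sketch; the route shown is the subnet from the earlier example):

    CLI (network-admin@Leaf1) > vrouter-routes-show network 172.16.1.0/24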
What is unique about the centralized management model of the fabric is that it is integrated into the distributed control plane of the fabric. No external controllers or management systems are required, reducing the cost and complexity of managing the fabric. In contrast, all other automation solutions require multiple servers, incurring capital costs for the hardware and software licenses along with the operational costs of space, power, installation and management. Furthermore, these server-based solutions introduce complexity, especially when trying to span multiple locations. With UCF, any switch or DPU has the same ability to control and monitor the entire fabric.
Part 2: UCF Integrated SDN Automation vs. BGP EVPN + External Automation for VXLAN Fabrics
The complexity of operating a BGP EVPN VXLAN infrastructure is widely acknowledged. As a result, many vendors offer automation solutions for BGP EVPN based on external management solutions. These management systems aim to reduce the complexity of provisioning and operating a BGP EVPN VXLAN fabric. At first glance, most of these solutions appear to offer orchestration and automation properties similar to Pluribus UCF fabric automation. Thus, a typical question I receive from customers is "what are the advantages of your UCF compared to a BGP EVPN VXLAN solution automated with an external management solution?"
We can identify four major areas where there are substantial differences between the UCF cloud networking model and a BGP EVPN VXLAN fabric automated by an external management system:
- Stronger security and better availability
- Greater operational simplification
- Multi-site deployment flexibility
- Lower operational and capital cost
Let’s examine each of these points one by one.
Stronger Security and Better Availability
Network management or automation solutions are required to communicate with every single switch or router they manage via an out-of-band management network. As a result, each network device is now required to expose a set of TCP/UDP services to enable this communication with the management stations. First, then, this approach expands the overall attack surface of the infrastructure to every single network device that needs to communicate with the management system.
Second, the communication between management stations and network devices occurs on a physically non-redundant out-of-band management network. As a result, the management network represents a single point of failure, whether by equipment failure or cyberattack, and if it is compromised the overall availability of the network operation is reduced.
On the other hand, the distributed UCF architecture provides single-point-of-management capabilities for the entire fabric, but only one node (or two for redundancy) is required to expose services (e.g. SSH or HTTPS) in the network. In other words, a single node can act as a "gateway" to manage an entire fabric, and only that node is required to communicate with the rest of the network resources. This approach reduces the overall attack surface, making the network architecture inherently more secure.
Furthermore, with UCF the use of out-of-band management is optional. For fabric control/management plane communication, most customers opt to use the in-band network, which leverages the multiple connections of the fabric between nodes for physical redundancy and high availability. This approach removes the out-of-band network as a single point of failure, making the network architecture inherently more available.
Finally, Pluribus UCF is inherently able to offer per-flow telemetry for every flow traversing the fabric, and this capability is included with the fabric license at no extra cost. Not only does this eliminate the need for a separate overlay visibility fabric (licenses, TAPs, TAP aggregation switches, etc.), but it also speeds troubleshooting, helping the IT team with service assurance and rapid fault resolution, and provides one more tool to detect and mitigate attacks such as DDoS attacks.
Greater Operational Simplification
A typical network management system provides the user with an input schema that is translated into a series of protocol configuration commands for each device and pushes these configurations to multiple boxes in the network one-by-one. The overall goal of most network management systems is to eliminate the complexity of manual provisioning. However, the focus is still very much on provisioning protocols on each individual device box-by-box, and neither the protocols nor the physical devices are abstracted away from the user. Thus, any IT admin who needs to manage a BGP EVPN VXLAN overlay, even one automated by an external solution, is required to deeply understand the internals of the deployed protocol and the overall topology of the network.
A good example is shown in this Aruba Composer video, which demonstrates how to provision BGP EVPN. Fast forward to minute 2:30 and you will see that this tool helps to provision import and export Route Targets, Route Distinguishers and L2 VNIs. These are all protocol constructs and parameters required on each box. The user is still required to understand in depth the constructs of the BGP EVPN protocol and be able to use all its knobs to provision and monitor complex network services in a multi-tenant infrastructure. Management tools certainly simplify the provisioning process of BGP EVPN, but knowing how to operate and troubleshoot the protocol is still up to the user.
In contrast, as explained in part 1, the UCF operates the overlay with a set of logical service objects in a protocol-free, truly abstracted and simplified overlay. Remember the example of the 3 commands to provision an L3VPN across the entire fabric vs. the 2,000+ commands required by BGP EVPN? This is achieved through the power of the cloud services object abstraction model.
The user experience of the UCF resembles that of a public cloud: the user focuses on the construction of virtual networks based on abstracted objects representing logical services (e.g. VRFs, subnets, gateways, etc.). This level of abstraction provides greater operational simplification versus understanding how to deploy, operate and troubleshoot services with a routing protocol managed via an external management solution. Ultimately this means the IT team can focus their time and energy on strategy, outcomes and delighting their customers by moving at the speed of cloud.
I would be remiss if I did not mention here that UCF also integrates a BGP EVPN control plane. The primary use case for BGP EVPN within UCF is to provide overlay services interoperability with other third-party fabrics. UCF translates the fabric overlay objects into EVPN routes that can be exchanged with third-party BGP EVPN clouds, which means that even BGP EVPN itself is highly abstracted and simplified in the Pluribus implementation. That said, while the UCF control plane completely abstracts the BGP EVPN configuration away from the user, the user is still required to understand how to monitor and troubleshoot BGP EVPN, just like in any BGP EVPN VXLAN environment.
Multi-site Deployment Flexibility
One of the Achilles' heels of external centralized management solutions is their dependency on an out-of-band management network. Most out-of-band management networks are either impossible or very difficult to extend across multiple sites, which makes it difficult for an external management platform physically deployed in one location to control devices scattered across remote locations.
For example, Cisco ACI in a multi-site environment requires dedicating a group of three APIC controllers to each site and then a second tier of Multi-Site Orchestrator controllers to coordinate the communication among the controllers in each site. Thus, this solution comes with significant capex and opex costs, complexity, and scalability limitations (e.g. a maximum of 12 sites).
Other solutions rely on deploying the management software in the public cloud in order to reach multiple sites. This approach is marketed as an advantage but actually also introduces cost and complexity: license costs for the instances of the NMS solution running in the cloud, plus the cloud costs themselves. Most solutions simply give up and don't allow the management of a geo-distributed topology from a single location.
The UCF architecture is distributed by design and built for multi-site distributed deployments. As explained in part 1, as long as the underlay (IP core or WAN) can provide reachability among the management IP addresses of the nodes of the fabric, the fabric can be formed across any number of sites interconnected by any number of third-party networks. This includes sites across continents; for example, UCF was recently deployed with 2 sites in India and one in Mexico, all in a single fabric.
This capability is built into the Netvisor NOS and does not require any external servers or out-of-band management networks. Furthermore, because the UCF intelligence is typically deployed on the ToRs, the topology is completely flexible, including the ability to deploy spineless topologies, leaf/spine with 3rd-party spines, rings, and other topologies to support flexible data center interconnect across sites. The UCF is the most flexible, most cost-effective solution for managing multiple data centers as a single logical fabric.
Lower Operational and Capital Cost
I raise cost as the last issue here because it is always important to cloud data center operators. First and foremost, the operational savings of the Pluribus UCF model are significant. Deploy a service by declaring intent, such as "create VLAN ID 110 scope-fabric", and the SDN control plane takes care of deploying the VLAN service across all nodes in seconds with a 100% guarantee of consistency, including support for automated rollback and automated configuration of new nodes inserted into the fabric. Teams don't have to know every knob of BGP EVPN and can spend their time focused on outcomes.
Furthermore, integrated per-flow telemetry removes the huge cost of a parallel visibility fabric typically required in data center deployments. Finally, by eliminating numerous servers and other external devices, power, space and cooling requirements are reduced for additional opex savings.
In terms of capital expense, in a few situations I have seen customers evaluate competitive solutions where their network costs increased by as much as 100% when factoring in the expense of the hardware, software and support required to implement an external management system for a BGP EVPN VXLAN fabric. With UCF there is no extra tax to add centralized management and automation of the overlay services: the distributed automation and visibility functionalities come integrated into the fabric NOS at no extra cost.
Summary
Sometimes it is difficult to parse through the marketing buzzwords when all vendor claims look and sound the same. This is why I thought it was worth the time to step through our architecture and the resulting benefits of an integrated SDN control plane that runs on each fabric node, without the need for the external servers that are required to automate more traditional BGP EVPN VXLAN fabrics.
To recap, this approach provides the following benefits:
- Stronger security and better availability
- Greater operational simplification
- Multi-site deployment flexibility
- Lower operational and capital cost
If you want to learn more take a spin around our website or click?here?to set up a demo and we’ll get one of our systems engineers to demonstrate the power of the Unified Cloud Fabric.