Why does EVPN play a smaller role in Cisco ACI?
Vahid Nazari, Data Center Consulting Engineer


Ethernet VPN, or EVPN, is one of the best-known protocols in both service provider and data center fabrics. It extends BGP with an L2VPN address family, making it possible to carry endpoint reachability information such as MAC and IP addresses. This combination of BGP and EVPN results in a strong control plane for VXLAN, which is why the comprehensive solution is named "VXLAN MP-BGP EVPN". BGP EVPN provides significant enhancements for VXLAN, such as ARP suppression, the distributed IP anycast gateway, endpoint mobility, Virtual Port-Channel (vPC) support, and so on.
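To make this concrete, here is a minimal Python sketch of the kind of information an EVPN Route Type 2 (MAC/IP advertisement) carries, and of the fabric-wide distribution behavior that BGP EVPN performs. The class and function names are illustrative assumptions for this article, not a real BGP or NX-OS data structure.

```python
# Conceptual sketch only: an EVPN Type-2 route and its fabric-wide distribution.
from dataclasses import dataclass

@dataclass(frozen=True)
class EvpnType2Route:
    mac: str            # endpoint MAC address
    ip: str             # endpoint IP address (optional in real EVPN)
    l2vni: int          # Layer 2 VNI the endpoint belongs to
    next_hop_vtep: str  # VTEP (leaf) behind which the endpoint lives

def advertise_to_all_leaves(route, leaf_tables):
    """BGP EVPN distributes the route fabric-wide: every other leaf installs it,
    whether or not it ever talks to this endpoint."""
    for leaf, table in leaf_tables.items():
        if leaf != route.next_hop_vtep:
            table.append(route)

# Example: one endpoint behind Leaf101 ends up in every other leaf's table.
tables = {f"Leaf{n}": [] for n in range(101, 105)}
ep1 = EvpnType2Route(mac="0000.1111.aaaa", ip="10.1.1.10",
                     l2vni=30001, next_hop_vtep="Leaf101")
advertise_to_all_leaves(ep1, tables)
print({leaf: len(entries) for leaf, entries in tables.items()})
# {'Leaf101': 0, 'Leaf102': 1, 'Leaf103': 1, 'Leaf104': 1}
```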

Although Cisco ACI leverages VXLAN to build its infrastructure, once you dig into the technology you realize that some roles have changed, which may seem strange at first. Pay attention: EVPN is still used as part of the overlay control plane in some ACI solutions in order to exchange endpoint reachability information across Pods or Sites (for instance, ACI Multi-Pod, ACI Multi-Site, and ACI Remote Leaf). However, it is no longer utilized within any individual Pod. At first, you might think it has simply been replaced by another protocol, but there is more to it than that. Endpoint learning in ACI has taken a fundamentally different path compared to VXLAN BGP EVPN. We've heard a lot about BGP EVPN's features, so why did these changes happen?

To make a long story short, Cisco ACI is not supposed to do the same thing that VXLAN does! Rather, this technology has given rise to larger ambitions and dreams.

Generally speaking, comparing Cisco ACI with VXLAN BGP EVPN is basically wrong, since they are not meant to do the same thing. It is obvious that some enhancements may be required to accommodate new features and greater goals, and of course those enhancements don't involve only EVPN.

How is endpoint learning done in VXLAN BGP EVPN?

In VXLAN BGP EVPN, all leaf nodes within a single fabric advertise, learn, and store all endpoint information, even if no endpoint behind a given switch ever needs that data. This is one of the behaviors that Cisco ACI set out to improve.

  • To begin with, Cisco ACI is more than simply a large network switch. ACI provides a zero-trust network, which effectively takes the shape of a massive firewall. Just as a firewall may hold a vast number of security policies, Cisco ACI holds zoning rules that are programmed on the leaf nodes, and those rules consume hardware resources. At the same time, the technology has to be capable of supporting very large infrastructures. As a preliminary conclusion, the resources have to be used more efficiently.

The hardware resource savings are a huge advantage for a scalable fabric.

  • In line with the above, leaf nodes in ACI don't have to consume their hardware resources to store information about all endpoints. Rather, they store only the entries for remote endpoints with which the leaf is actively communicating.
  • Cisco ACI endpoint learning provides scalable forwarding within the fabric. For instance, in BGP+EVPN each movement of an endpoint triggers a new update message to all leaf nodes. In ACI, on the other hand, bounce entries ensure that only three components need to be updated during each move, regardless of how many leaf switches the fabric contains (see the sketch after this list).
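The following sketch illustrates that idea. It is a conceptual model of what the article describes, not ACI internals: on a move, only the new leaf, the old leaf (which installs a bounce entry), and the spine COOP database are touched; other leaves simply relearn from the data plane later.

```python
# Conceptual sketch (illustrative only) of why an endpoint move in ACI touches a
# fixed number of components, independent of the number of leaf switches.

def move_endpoint(ep, old_leaf, new_leaf, coop_db):
    """Return how many components are updated for one endpoint move."""
    updates = 0

    # 1) The new leaf learns the endpoint locally from the data plane.
    new_leaf[ep] = "local"
    updates += 1

    # 2) The new leaf reports the move, updating the spine COOP database.
    coop_db[ep] = "behind-new-leaf"
    updates += 1

    # 3) The old leaf replaces its local entry with a bounce entry pointing
    #    to the new location, so stale traffic is redirected, not dropped.
    old_leaf[ep] = "bounce-to-new-leaf"
    updates += 1

    return updates

old_leaf, new_leaf, coop_db = {"EP1": "local"}, {}, {"EP1": "behind-old-leaf"}
print(move_endpoint("EP1", old_leaf, new_leaf, coop_db))  # 3, whether the fabric has 4 leaves or 400
```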

Furthermore, VXLAN BGP EVPN has no appropriate option for a stretched fabric. Of course, there is VXLAN Multipod, along with the Multi-Fabric and Multi-Site solutions. But due to some critical shortcomings in VXLAN Multipod, it is essentially not recommended for multi-pod or active-active data centers that are geographically dispersed (I'll go over its issues later). This means that even for multi-data-center environments that logically belong to the same stretched-fabric topology, we have to choose the VXLAN Multi-Site solution, which assumes several separate VXLAN fabrics interconnected together.

Of course, VXLAN Multi-Site is recognized as a brilliant technology that provides both Layer 2 and Layer 3 interconnection for completely independent VXLAN fabrics, but its main use case is DCI. There is no end-to-end VXLAN tunneling in this solution; that is, for sending and receiving just a single packet between data centers, VXLAN encapsulation is performed six times. This could potentially be problematic for mission-critical applications such as high-frequency trading, virtual reality over networks, peak banking-transaction loads, and so on.

  • Cisco ACI has also introduced a Multi-Site solution in which both Layer 2 and Layer 3 communication across ACI fabrics is possible. Nevertheless, ACI has concentrated heavily on the stretched-fabric topology through its Multi-Pod solution, which brings significant enhancements in terms of endpoint learning mechanisms and failure-domain isolation.

That's how Cisco has attempted to provide one well-fitting ACI solution for every scenario and requirement.

  • In general, 'Cisco ACI Multi-Pod', 'ACI Remote Leaf', and 'ACI vPod' are all part of the ACI stretched-fabric family, configured and maintained through a single APIC cluster. Each solution, however, targets a specific condition, so that even the simplest and smallest infrastructures are covered.
  • For instance, ACI Multi-Pod is introduced either for a very large infrastructure in which it may not be possible to deploy a single leaf-and-spine architecture, or for active/active sites that are geographically dispersed (a single AZ in AWS terminology). On the flip side, there may be a small remote site where it is not possible or desirable to deploy a full ACI Pod (with leaf and spine nodes); ACI Remote Leaf would be a solid option there. In addition, there may be a data center with a very small footprint that only stores backups and provides no services; for such sites we can leverage a software-only extension (ACI vPod), where there are basically no physical leaf and spine nodes.

How is endpoint learning done in Cisco ACI?

In contrast, Cisco ACI learns endpoint information in the data plane during packet forwarding, so there is no MP-BGP+EVPN up and running inside each ACI Pod.

Keep in mind that MP-BGP with the VPNv4 address family still exists in the overlay-1 VRF inside the infra tenant; it is used to distribute external routes from the border leaf switches to the other leaf switches.

Cisco ACI relies on the resources of the spine switches, rather than the leaf switches, to collect and store all endpoint information.

It sounds more efficient.

ACI actually uses the Council of Oracle Protocol (COOP) database located on each spine switch, which is known as an "oracle". Since hosts are directly connected to the leaf switches, each leaf, known as a "citizen", is responsible for reporting its local endpoints to the COOP database. As a result, all endpoint information in the ACI fabric is stored in the spine COOP database. Consequently, a leaf switch does not need to already hold remote endpoint information, because it can simply forward packets to the spine whenever it doesn't know about a particular remote endpoint. This forwarding behavior is called "hardware proxy" or "spine proxy".
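Here is a rough Python sketch of that citizen/oracle relationship. The class names and the report format are assumptions made purely for illustration; they are not the actual COOP implementation.

```python
# Conceptual sketch of COOP reporting (illustrative only, not the real protocol):
# each leaf ("citizen") reports only its locally learned endpoints to the
# spines ("oracles"), so the full endpoint table lives on the spines.

class Oracle:                      # a spine switch
    def __init__(self):
        self.coop_db = {}          # endpoint -> owning leaf (VTEP)

    def receive_report(self, leaf, endpoints):
        for ep in endpoints:
            self.coop_db[ep] = leaf

class Citizen:                     # a leaf switch
    def __init__(self, name, local_endpoints):
        self.name = name
        self.local_endpoints = local_endpoints

    def report_to(self, oracle):
        oracle.receive_report(self.name, self.local_endpoints)

spine = Oracle()
for leaf in (Citizen("Leaf101", ["EP1", "EP2"]), Citizen("Leaf102", ["EP3"])):
    leaf.report_to(spine)

print(spine.coop_db)   # {'EP1': 'Leaf101', 'EP2': 'Leaf101', 'EP3': 'Leaf102'}
```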

A key point: in BGP+EVPN a leaf switch already has the endpoint information, but in ACI is there anything special beyond simply transmitting traffic to a spine? Hardware proxy essentially means: "Hey, dear leaf. You have no idea about the destination? It's OK, don't bother yourself. Just keep forwarding; I know what to do." Some engineers I've spoken with believed that leaf switches query the spine, fetch the information, and then forward the traffic. That's incorrect; hardware proxy doesn't work like that.
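The sketch below, under the same illustrative assumptions as before, shows the difference from a query/response model: on a miss the leaf never asks and waits, it simply encapsulates toward the spine-proxy anycast VTEP and the spine completes the delivery.

```python
# Conceptual sketch of hardware (spine) proxy forwarding. Illustrative only:
# the leaf does NOT query the spine and pause for an answer; it forwards the
# packet toward the spine-proxy anycast VTEP and the spine finishes the job.

SPINE_PROXY_VTEP = "10.0.0.65"      # assumed anycast proxy address

def leaf_forward(dst_ep, local_cache):
    if dst_ep in local_cache:                   # endpoint already known
        return f"encap to {local_cache[dst_ep]}"
    return f"encap to {SPINE_PROXY_VTEP}"       # miss: hand it to the proxy, no lookup pause

def spine_forward(dst_ep, coop_db):
    return f"re-encap to {coop_db[dst_ep]}"     # the spine always knows the owner

coop_db = {"EP3": "Leaf102-VTEP"}
leaf_cache = {}                                 # Leaf101 has never talked to EP3
hop1 = leaf_forward("EP3", leaf_cache)          # 'encap to 10.0.0.65'
hop2 = spine_forward("EP3", coop_db)            # 're-encap to Leaf102-VTEP'
print(hop1, "->", hop2)
```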

Don't worry about silent hosts, endpoint mobility, or even the movement of an IP address to a new MAC. The solutions are already provided.

Now let's drill down into the VXLAN Multipod solution and go over its issues.

VXLAN BGP EVPN Multi-Pod

The first solution for extending a VXLAN fabric across more than one location is illustrated below and is known as VXLAN Multipod (by now more or less deprecated). The diagram shows the most practical way of implementing it, in which the control plane protocols are isolated between the two pods.

VXLAN Multi-Pod

Even though the control plane protocols, including the underlay IGP and the overlay BGP, are separated from each other, the same VXLAN EVPN fabric is extended across the different locations, which makes the whole infrastructure function like a single VXLAN fabric. In this situation, all endpoint information, including MAC and IP addresses, is shared and advertised between the two pods. With that said, for each movement of an endpoint across leaf nodes in Pod 1, a new control plane update is sent towards Pod 2, since that is the default behavior of BGP+EVPN in a single VXLAN fabric (end-to-end EVPN updates). The failure domain stays extended across all the pods, and the scalability limits remain those of a single VXLAN fabric, because that is still exactly what it is. Eventually, as I mentioned before, all leaf switches have to learn all the endpoint information. Because of these shortcomings, this solution is not recommended for active/active geographically dispersed sites.

Cisco ACI Multi-Pod, meanwhile, is a completely different story!

  • ACI Multi-Pod is also a single ACI fabric; however, in this solution separate instances of the IS-IS, COOP, and MP-BGP protocols run locally inside each Pod. This makes end-to-end VXLAN tunneling possible while isolating the failure domains between Pods as much as possible.
  • Local endpoint information belonging to each Pod is never advertised to remote leaf switches beforehand. Instead, ACI leaf switches learn remote MAC and IP information during packet forwarding in the data plane (ideally via hardware-proxy-based forwarding), just as happens within a single Pod.

Why not outsource the responsibility for keeping endpoint information to the spines? Don't forget that the COOP database within each Pod already contains all the local endpoint information, and these databases are synchronized across Pods through MP-BGP EVPN.

  • Each Pod has a dedicated anycast VTEP address that is available on all the spine nodes deployed in it. The first time endpoint EP1 in Pod 1 is added to the local COOP database, an MP-BGP EVPN update is sent to the remote spine nodes in Pod 2. The receiving spine adds the information to its COOP database and synchronizes it with all the other local spines. EP1 is now associated in Pod 2 with the anycast VTEP address of Pod 1, which is used as the next-hop address for hardware proxy. In this way, EP1 can move freely among the leaves of Pod 1 without any new updates being sent to Pod 2 (see the sketch below).
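The following sketch, using purely illustrative names and addresses, shows why an intra-pod move stays invisible to the other Pod: Pod 2's COOP entry points to Pod 1's anycast proxy VTEP rather than to a specific leaf, so nothing Pod 2 has stored needs to change.

```python
# Conceptual sketch (illustrative values) of inter-pod endpoint state in ACI
# Multi-Pod: the remote pod maps the endpoint to the source pod's anycast
# proxy VTEP, so intra-pod moves do not generate new inter-pod updates.

POD1_ANYCAST_VTEP = "10.1.255.1"     # assumed anycast address shared by Pod 1 spines

pod1_coop = {"EP1": "Pod1-Leaf101"}          # exact local location
pod2_coop = {"EP1": POD1_ANYCAST_VTEP}       # only "somewhere in Pod 1"

def move_within_pod1(new_leaf):
    pod1_coop["EP1"] = new_leaf              # Pod 1 updates its own COOP database
    # No MP-BGP EVPN update leaves the pod: Pod 2's entry is already correct.

before = dict(pod2_coop)
move_within_pod1("Pod1-Leaf103")
assert pod2_coop == before                   # Pod 2 state unchanged
print(pod1_coop, pod2_coop)
```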

VXLAN Multipod has been superseded by more effective solutions, such as VXLAN Multi-Fabric and, especially, VXLAN Multi-Site, which is one of the most appropriate solutions for traditional multi-data-center environments.

VXLAN BGP EVPN Multi-Site

EVPN Multi-Site technology is based on IETF draft-sharma-multi-site-evpn. In this solution, two or more completely independent VXLAN fabrics are interconnected through a VXLAN BGP EVPN Layer 2 and Layer 3 overlay. This overlay network is also known as the "site-external network". Thus, unlike the previous Multipod architecture, there is neither a shared EVPN fabric nor an extended underlay across the different sites.

VXLAN EVPN Multi-Site

VXLAN EVPN Multi-Site can be used for scaling a large intra-DC network, for Data Center Interconnect (DCI), and also for integration with legacy networks. As you can see in the picture above, the key functional components of this architecture are the Border Gateways (BGWs).


Briefly, the BGWs separate the VXLAN fabric side from the site-external network and mask the site-internal VTEPs. What does this mean? As shown in the following picture, the border gateway re-encapsulates traffic and rewrites the outer source and destination addresses. That is, for each send-and-receive exchange of traffic, VXLAN encapsulation is performed six times, as the steps below and the short sketch after them illustrate.


  1. Leaf10 to BGW11
  2. BGW11 to the BGW of the next site (BGW22)
  3. BGW22 to Leaf20
  4. Leaf20 to BGW21
  5. BGW21 to the BGW of the first site (BGW11)
  6. BGW11 to Leaf10
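A trivial sketch of the arithmetic behind that claim, using the hypothetical node names from the list above: the request and its reply each cross two BGWs, and every hop in the list is a fresh VXLAN encapsulation.

```python
# Counting the VXLAN encapsulation operations for one request/reply exchange
# across an EVPN Multi-Site boundary (node names are hypothetical).

request_path = ["Leaf10", "BGW11", "BGW22", "Leaf20"]   # site 1 -> site 2
reply_path   = ["Leaf20", "BGW21", "BGW11", "Leaf10"]   # site 2 -> site 1

def encap_operations(path):
    # Each sending node on the path encapsulates the frame once before
    # handing it to the next VXLAN tunnel segment.
    return len(path) - 1

total = encap_operations(request_path) + encap_operations(reply_path)
print(total)   # 6
```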

Of course, the forwarding enhancements are not limited to EVPN. There is also no multicast PIM running to handle BUM traffic within the ACI fabric; instead, ACI relies on the FTAG mechanism to build multicast-tree-like forwarding paths among the leaf and spine nodes, as the sketch below illustrates.
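Here is a conceptual sketch of the FTAG idea; the tree IDs, root assignments, and hash inputs are assumptions made for illustration only. The point is simply that BUM traffic is mapped onto one of several pre-built forwarding trees rooted at the spines, instead of relying on PIM-signalled multicast trees.

```python
# Conceptual sketch of FTAG-style BUM forwarding (illustrative only): the
# fabric pre-builds a small set of loop-free trees rooted at spine nodes, and
# each BUM flow is hashed onto one tree instead of using PIM-built multicast.

import hashlib

# Assumed example: a few FTAG trees, each identified by an ID and a root spine.
FTAG_TREES = {0: "Spine201", 1: "Spine202", 2: "Spine201", 3: "Spine202"}

def select_ftag(src_vtep, bd_vnid):
    """Hash the flow onto one of the available trees (hash inputs are assumptions)."""
    digest = hashlib.md5(f"{src_vtep}-{bd_vnid}".encode()).hexdigest()
    return int(digest, 16) % len(FTAG_TREES)

ftag = select_ftag("Leaf101-VTEP", 15794150)
print(f"BUM traffic for this BD is flooded along FTAG {ftag}, rooted at {FTAG_TREES[ftag]}")
```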

