IIGR: Duet: Cloud Scale Load Balancing with Hardware and Software
Michael Orr
Who: Microsoft, Purdue, Yale
Where: SIGCOMM 2014
Original article: ACM Library/Duet (slides: SIGCOMM 14/Duet Slides)
TL;DR for the genius-in-a-hurry:
The problem:
Data-center services at scale are implemented by running many service instances on multiple hosts for each service. Stateful load-balancers are needed to connect clients with service instances. Dedicated HW load-balancers are expensive and hard to modify, scale, and move. SW load-balancers are cheap and flexible, but they add significant latency and can each handle only a few Gbps of traffic, so handling Tbps of traffic may require thousands of them.
DUET Solution
Duet takes an open SW load-balancer's (Ananta) architecture and implements it using free table entries of existing data-center switches. Since these switches don't have enough table memory for all the services and service instances needed, Duet installs the more bandwidth- and latency-sensitive services on the switches, and the rest are left on SW load-balancers. This means far fewer SW load-balancers are needed, and the important services get HW handling.
Load-balancing in the DUET scheme is done by assigning a "virtual IP" address to represent each service (a "VIP") and a "direct IP" address to each service instance (a "DIP"). Load-balancers receive traffic sent to the VIPs (i.e. to the service in general) and distribute it across the service instances by forwarding it to the DIPs.
DUET does this as follows:
(1) The already existing commodity DC switches are programmed to hold the VIP->DIP mappings. The switch is programmed to see VIPs as directly attached hosts (/32 entries in IPv4 terms).
(2) Switches then announce themselves over BGP as being the best route to the VIPs they hold. This causes all incoming traffic for a given VIP to be sent to the switch holding that VIP.
(3) The switch is programmed to see a VIP routing entry as pointing to a set of equal-cost paths to the destination DIPs, so it will use ECMP to load-balance between these paths. Each ECMP next-hop entry is then programmed to be an IP-in-IP tunnel start-point, so the packet is encapsulated into an IP-in-IP packet aimed at the DIP, at wire speed and latency.
(4) Since ECMP uses a hash of the incoming packet's header to decide which next-hop to use, this load-balancing sends all traffic from a given client to the same DIP/service instance, thus satisfying the "stateful" requirement.
Duet handles the question of which VIP should go on which switch, and handles faults and changes (e.g. adding or moving service instances, etc.).
The result is that high-bandwidth services can be handled with far fewer SW load-balancers (so cheaper) and with HW-scale latency.
Article Summary
Large-scale services in data-centers (e.g. a web-search engine that needs to handle millions of clients) are implemented by having service instances on thousands of servers. Clients, therefore, need to be directed to a specific server that handles their particular service instance, and once a client has started a session with a given server, all further interaction from that client must be sent to the same service instance on the same server.
This requires implementing Stateful Load-balancers that can handle the scale of traffic – the number and volume of incoming requests for the service.
Each [DC] service is a set of servers that work together as a single entity. Each server in the set has a unique direct IP (DIP) address. Each service exposes one or more virtual IP (VIP) outside the service boundary. The load balancer forwards the traffic destined to a VIP to one of DIPs for that VIP. Even services within the same DC use VIPs to communicate with each other…
A typical DC supports thousands of services, each of which has at least one VIP and many DIPs associated with it. All incoming Internet traffic to these services and most inter-service traffic go through the load balancer. We observe that almost 70% of the total VIP traffic is generated within DC, and the rest is from the Internet. The load balancer design must not only scale to handle this workload but also minimize the processing latency. This is because to fulfill a single user request, multiple back-end services often need to communicate with each other — traversing the load balancer multiple times. Any extra delay imposed by the load balancer could have a negative impact on end-to-end user experience. Besides that, the load balancer design must also ensure high service availability in face of failures of VIPs, DIPs or network devices.
Dedicated hardware-based load-balancers exist, but they are expensive, hard to update with new functionality, and hard to scale or move on demand. Software load-balancers (SLBs), on the other hand, are implemented on commodity servers and are relatively cheap to scale, move, and update, but they suffer high latency and low bandwidth per instance, making it necessary to use many of them, and even then making it hard to satisfy services that need low latency or high single-flow bandwidth.
As an example of the limitations of SW load-balancers, the article cites the performance of Ananta, whose latency is a few hundred microseconds and whose maximum throughput is about 300K PPS (see drawing).
At this level handling 15Tbps of traffic would require 4000 SLB instances, making the solution expensive.
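As a quick sanity check of that number (my arithmetic; the paper only gives the ~300K PPS figure, and the average packet size below is my own assumption):

```python
# Back-of-the-envelope check of "~4,000 SLBs for 15 Tbps".
# The 300K PPS limit is from the article; the packet size is my assumption
# (roughly MTU-sized packets).
PPS_PER_SMUX = 300_000        # packets/sec one Ananta SMUX handles
AVG_PKT_BYTES = 1_500         # assumed average packet size
TARGET_TBPS = 15              # total VIP traffic to be load-balanced

gbps_per_smux = PPS_PER_SMUX * AVG_PKT_BYTES * 8 / 1e9   # ~3.6 Gbps
num_smuxes = TARGET_TBPS * 1_000 / gbps_per_smux          # ~4,200

print(f"{gbps_per_smux:.1f} Gbps per SMUX -> ~{num_smuxes:.0f} SMUX instances")
```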
DUET proposes implementing Stateful load-balancing by
- Configuring existing Network Switches to do load balancing at wire speed in addition to their normal operation and
- Combining this with a small deployment of software Load-balancers to handle the cases where Switch table/memory capacity is exhausted and there are more flows to handle.
Note: Load balancers are referred to in the article as MUXes (multiplexers, though really they should be called de-multiplexers), with software load-balancers termed SMUXes and hardware-based load-balancers termed HMUXes.
Stateful Load Balancing: From SW to Hardware
The authors model their load-balancer on the architecture of Ananta:
(1) Each load-balancer instance has all of the VIP->DIP mappings.
(2) Each load-balancer uses BGP to announce itself as having the VIPs as directly attached hosts (in effect telling the routers, "if you need to send traffic to this VIP, send it to me"). Since ALL load-balancers do this, they are seen by routers as equal-cost paths to the VIPs, and the routers will use their built-in ECMP to load-balance traffic between all the SW load-balancers.
(3) Each load-balancer then encapsulates traffic arriving for a given VIP into an IP-in-IP packet aimed at one of the DIPs where an instance of the service is hosted, thus load-balancing traffic for each VIP across the DIPs that actually implement it. Return traffic does not pass through the load-balancer, since a SW "agent" on the DIP host sends responses back directly.
The central element of a load-balancer is thus a VIP-to-DIP mapping table. When instances of the service are added, removed, or relocated, this table must be updated in all load-balancers.
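To make the Ananta-style data path concrete, here is a minimal per-packet sketch (my simplification, not Ananta's code; the addresses, field names, and hash are illustrative):

```python
import hashlib

# Illustrative VIP -> DIP table; in Ananta every SMUX holds ALL mappings.
VIP_TO_DIPS = {
    "20.0.0.1": ["100.0.0.1", "100.0.0.2"],   # made-up addresses
}

def pick_dip(vip, five_tuple):
    """Hash the flow's 5-tuple so every packet of a flow goes to the same DIP."""
    dips = VIP_TO_DIPS[vip]
    h = int(hashlib.sha256(repr(five_tuple).encode()).hexdigest(), 16)
    return dips[h % len(dips)]

def forward(packet):
    """Encapsulate a VIP-bound packet in IP-in-IP toward the chosen DIP."""
    flow = (packet["src"], packet["dst"], packet["proto"],
            packet["sport"], packet["dport"])
    dip = pick_dip(packet["dst"], flow)
    # Outer header targets the DIP; the inner (original) packet keeps the VIP,
    # so the host agent can hand it to the service and reply directly.
    return {"outer_dst": dip, "inner": packet}

print(forward({"src": "198.51.100.9", "dst": "20.0.0.1",
               "proto": 6, "sport": 12345, "dport": 80}))
```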
DUET implements the “Ananta” ideas as follows:
(1) The already existing commodity DC switches are programmed to hold the VIP->DIP mappings. The switch is programmed to see VIPs as directly attached hosts (/32 entries in IPv4 terms).
(2) Switches then naturally announce themselves over BGP as being the best route to the VIPs they hold. This causes all incoming traffic for a given VIP to be sent to the switch holding that VIP.
(3) The switch is programmed to see a VIP routing entry as pointing to a set of equal-cost paths to the destination DIPs, so it will naturally use ECMP to load-balance between these paths. Each ECMP next-hop entry is then programmed to be an IP-in-IP tunnel start-point, so the packet is encapsulated into an IP-in-IP packet aimed at the DIP, exactly as Ananta does in SW, but at wire speed and latency. Since ECMP uses a hash of the incoming packet's header to decide which next-hop to use, this load-balancing sends all traffic from a given client to the same DIP/service instance, thus satisfying the "stateful" requirement.
Figure 2 illustrates this: incoming traffic is aimed at VIP 10.0.0.0. The switch sees this as a /32 routing entry, which points at two equal-cost paths. ECMP selects one of these two possible "paths", but the ECMP entries point to a tunneling table that tells the switch to encapsulate the packet toward either 100.0.0.1 or 100.0.0.2.
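A rough sketch of how the three switch tables chain together for this Figure 2 example (plain data structures only; programming a real switch goes through the vendor's SDK, which I am not reproducing):

```python
# Host-route table: the VIP looks like a directly attached /32 host whose
# "next hop" is an ECMP group rather than a single port.
host_routes = {"10.0.0.0/32": {"ecmp_group": 1}}

# ECMP table: the group fans out over members that each point into the
# tunneling table instead of a plain egress port.
ecmp_groups = {1: [{"tunnel_id": 1}, {"tunnel_id": 2}]}

# Tunneling table: each entry is an IP-in-IP tunnel start-point toward a DIP.
tunnels = {1: {"encap": "ip-in-ip", "outer_dst": "100.0.0.1"},
           2: {"encap": "ip-in-ip", "outer_dst": "100.0.0.2"}}

def lookup(vip, flow_hash):
    """Follow host-route -> ECMP -> tunnel, as the switch pipeline would."""
    group = host_routes[f"{vip}/32"]["ecmp_group"]
    members = ecmp_groups[group]
    member = members[flow_hash % len(members)]   # the hash keeps a flow sticky
    return tunnels[member["tunnel_id"]]

print(lookup("10.0.0.0", flow_hash=7))   # -> encapsulate toward 100.0.0.2
```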
Note that these configurations are added to the switch in addition to its normal functioning, and it still continues to forward all other traffic (not aimed at a VIP) normally.
Thus, at the expense of some entries in the host forwarding, ECMP and tunneling tables, we can build a load balancer using commodity switches.
Handling Switch Limits
Commodity switches have relatively small host, ECMP, and tunneling tables (the switches used by the authors have 16K host-forwarding entries, 4K ECMP entries, and only 512 tunnel-start entries). VIPs consume host-routing entries, while DIPs consume both ECMP and tunneling entries, so the number of DIPs a switch can support is limited by the smaller of those two tables' capacities.
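Plugging in the article's numbers makes the bottleneck explicit (trivial arithmetic, included only for clarity):

```python
# Table sizes quoted in the article for the authors' commodity switches.
HOST_ROUTE_ENTRIES = 16_000   # one /32 host route per VIP
ECMP_ENTRIES = 4_000
TUNNEL_ENTRIES = 512

# Each DIP needs both an ECMP member entry and a tunnel start-point entry,
# so the smaller of those two tables bounds the DIPs one switch can host;
# the host-route table bounds the VIPs.
max_vips_per_switch = HOST_ROUTE_ENTRIES
max_dips_per_switch = min(ECMP_ENTRIES, TUNNEL_ENTRIES)
print(max_vips_per_switch, max_dips_per_switch)   # 16000 512
```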
DUET solves this by
- Each switch hosts only a small subset of the VIPs, and the DIPs assigned to these VIPs.
- A few software load-balancers are used, each of which holds ALL of the VIP-to-DIP mappings, as before.
- BGP is used to announce the switches as the best route to the VIPs they hold, and the SW load-balancers as the NEXT-best route to the same VIPs, as well as to any VIP not installed on any switch because switch table entries ran out.
This causes traffic for a VIP to be sent preferentially to a switch holding that VIP, if one exists and is alive; if not, traffic will naturally be sent to the next-best route, which delivers it to one of the SW load-balancers.
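A toy model of this "switch first, SMUX as backstop" behavior (the real mechanism is BGP best-path selection; the preference values and names below are made up):

```python
# Routes a core router might know for one VIP: the switch hosting it (HMUX)
# and the software MUXes, which announce every VIP as a fallback.
routes = {
    "20.0.0.1": [
        {"next_hop": "switch-C2", "preference": 200, "alive": True},
        {"next_hop": "smux-1",    "preference": 100, "alive": True},
        {"next_hop": "smux-2",    "preference": 100, "alive": True},
    ],
}

def best_routes(vip):
    """Return the most-preferred live route(s); equal winners get ECMP'd."""
    live = [r for r in routes[vip] if r["alive"]]
    top = max(r["preference"] for r in live)
    return [r["next_hop"] for r in live if r["preference"] == top]

print(best_routes("20.0.0.1"))            # ['switch-C2']  (HW path wins)
routes["20.0.0.1"][0]["alive"] = False    # switch fails or withdraws the VIP
print(best_routes("20.0.0.1"))            # ['smux-1', 'smux-2']  (SW backstop)
```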
Figure 3 illustrates this approach. VIP1 has two DIPs (D1 and D2), whereas VIP2 has one (D3). We assign VIP1 and VIP2 to switches C2 and A6 respectively, and flood the routing information in the network. Thus, when a source S1 sends a packet to VIP1, it is routed to switch C2, which then encapsulates the packet with either D1 or D2, and forwards the packet.
This makes it possible to do several things:
- We can handle an essentially unlimited number of flows: when we run out of space to install VIP-to-DIP mappings on switches, the rest are installed on the SW load-balancers.
- We can install on the switches first the VIPs for "elephant" services or services that are more latency-sensitive, and leave the rest, which are less latency-sensitive or less bandwidth-hungry, to be handled by SW load-balancing.
- When a switch fails, traffic falls back to the SW load-balancers smoothly and naturally, since they are the next-best route seen by all routers.
- We can scale high-demand services by building a hierarchy of HW load-balancers, with a "middle layer" of switches using a "TIP" to represent the service, and the top-layer switches having routing entries that direct traffic for a given VIP to these TIPs as equal-cost next-hops, as illustrated (a toy sketch follows this list).
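A toy sketch of the two-level hierarchy mentioned in the last bullet (my own model; the addresses and TIP values are illustrative):

```python
# Two-level HW hierarchy (toy model): a top-layer switch spreads a VIP over
# "TIP" addresses, and each TIP is hosted by a middle-layer switch that
# spreads it over the real DIPs. All addresses are made up.
top_layer    = {"20.0.0.1": ["30.0.0.1", "30.0.0.2"]}        # VIP -> TIPs
middle_layer = {"30.0.0.1": ["100.0.0.1", "100.0.0.2"],      # TIP -> DIPs
                "30.0.0.2": ["100.0.0.3", "100.0.0.4"]}

def two_level_lookup(vip, flow_hash):
    """ECMP-style hashing at each level; in this toy model the flow stays
    sticky because both levels hash the same flow identifier."""
    tips = top_layer[vip]
    tip = tips[flow_hash % len(tips)]
    dips = middle_layer[tip]
    return dips[flow_hash % len(dips)]

print(two_level_lookup("20.0.0.1", flow_hash=11))   # -> 100.0.0.4
```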
DUET Host-agent
It is easy to see that load-balancing in this manner causes hosts to receive IP-in-IP packets, with the outer IP aimed at the DIP and the inner one at the service VIP. A host must be able to deal with this traffic correctly, and so needs a host agent whose tasks are to decapsulate incoming traffic and deliver the inner packet to the service-instance application, and, on transmission, to send return traffic directly back to the client with the source IP address set correctly (to the VIP), without passing through the load-balancer.
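A sketch of the two host-agent duties, using Scapy only to make the packet layering concrete (this is not Duet's actual host agent, which runs in the host's data path; the addresses are made up):

```python
from scapy.all import IP, TCP   # Scapy is used only to show the layering

VIP = "10.0.0.0"                # the service VIP (from the Figure 2 example)
DIP = "100.0.0.1"               # this host's DIP

def decap(outer_pkt):
    """Receive side: strip the outer header and hand the inner, VIP-addressed
    packet to the local service instance."""
    return outer_pkt.getlayer(IP, 2)      # second IP layer = original packet

def direct_return(client_ip, client_port, service_port, payload):
    """Send side (direct return): reply straight to the client with the VIP
    as the source address, so the reply never traverses a MUX."""
    return (IP(src=VIP, dst=client_ip)
            / TCP(sport=service_port, dport=client_port) / payload)

# Example: an IP-in-IP packet as it arrives at the DIP host.
encapsulated = (IP(dst=DIP)                               # outer: aimed at DIP
                / IP(src="198.51.100.9", dst=VIP)         # inner: aimed at VIP
                / TCP(sport=12345, dport=80))
print(decap(encapsulated).dst)                            # -> 10.0.0.0
print(direct_return("198.51.100.9", 12345, 80, b"hello").summary())
```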
Additional DUET elements
The above explains the principle of implementing a hybrid SW/HW load-balancing system. In reality, DUET has to deal with several practical issues, and the authors present a solution to each.
Which VIP should go on which Switch?
- You can't install a VIP on a switch if the switch does not have enough free memory (table space) to hold the entire VIP-to-DIP mapping (host-route entries, ECMP entries, and tunneling entries).
- You should not install a VIP on a switch if the traffic for that service is likely to overload the switch's links.
- A VIP should be installed with the network topology and DIP locations as an important consideration: it should go on a switch that is relatively close to its DIPs in terms of the data-center network topology, or traffic to the load-balancing switch, and then from there to the DIPs, can "trombone" and add latency and congestion.
- If we can't install all VIPs on switches and have to leave some VIPs to be handled by SW load-balancers, which VIPs should go on switches and which can stay in SW? Etc.
The authors present an algorithm for finding a suitable VIP-to-switch allocation. They note the problem is NP-complete, so this algorithm is not "optimal", but it works in practice. It uses knowledge of the topology, the expected traffic volume for each VIP (based on historical monitoring), etc.
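For intuition only, here is a much-simplified greedy sketch of the kind of decision involved (this is NOT the paper's algorithm, which also models per-link traffic on the real topology; the inputs and numbers below are invented):

```python
# Hypothetical inputs: per-VIP traffic and DIP count, and per-switch free
# table space and link headroom. Real Duet also checks the paths to the DIPs.
vips = [
    {"name": "VIP-A", "gbps": 40, "dips": 8},
    {"name": "VIP-B", "gbps": 25, "dips": 300},
    {"name": "VIP-C", "gbps": 5,  "dips": 2},
]
switches = [
    {"name": "C2", "free_tunnels": 512, "free_gbps": 60},
    {"name": "A6", "free_tunnels": 256, "free_gbps": 30},
]

assignment = {}
# Place the "heaviest" VIPs first so they get hardware handling.
for vip in sorted(vips, key=lambda v: v["gbps"], reverse=True):
    # First switch with room for all the VIP's DIPs and its expected traffic;
    # anything that doesn't fit falls back to the software MUXes.
    for sw in switches:
        if sw["free_tunnels"] >= vip["dips"] and sw["free_gbps"] >= vip["gbps"]:
            assignment[vip["name"]] = sw["name"]
            sw["free_tunnels"] -= vip["dips"]
            sw["free_gbps"] -= vip["gbps"]
            break
    else:
        assignment[vip["name"]] = "SMUX (software fallback)"

print(assignment)
```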
Handling Faults and Other Changes
Switches and links in a DC fail. Machines hosting DIPs fail, or a DIP can be moved to a new location. Services and service instances can be added or removed (so the DC will have additional/fewer VIPs, or additional/fewer DIPs for a VIP). All of these mean that we have to update which switch holds which VIP, and the VIP-to-DIP mappings in both switches and SW load-balancer instances. These changes have to be done without breaking in-progress sessions.
This is done by using the SW load-balancers as an intermediate step.
As seen above, if a switch dies, traffic naturally goes to a SW load-balancer, since it is seen as the next-best route to the VIPs the switch used to hold. This is also used to move VIPs to a new location: the old switch stops declaring itself the best route to the relevant VIP, traffic shifts to the SW load-balancers, and the new switch to which the VIP was moved starts declaring itself the best route, causing traffic for that VIP to go to the new location.
Similarly, to add DIPs, we add them to a SW load-balancer first, then "take down" the switch's announcement for however long it takes to update the switch with the new set of DIPs; the switch then once again declares itself the best route to the VIP, and traffic goes back to being handled in HW.
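The migration choreography, written out as a sketch (the announce/withdraw/install calls are placeholder prints standing in for the real control plane; only the ordering reflects the paper's description):

```python
import time

# Placeholder control-plane hooks; the real ones would talk to the switches'
# BGP stack and table-programming API.
def withdraw(device, vip):
    print(f"{device}: withdraw BGP route for {vip}")

def announce(device, vip):
    print(f"{device}: announce BGP route for {vip}")

def install(device, vip, dips):
    print(f"{device}: program {vip} -> {dips}")

def migrate_vip(vip, old_switch, new_switch, dips, smuxes):
    # 1. Make sure every SMUX (the always-on backstop) has the current mapping.
    for smux in smuxes:
        install(smux, vip, dips)
    # 2. The old switch stops being the best route; traffic falls to the SMUXes.
    withdraw(old_switch, vip)
    time.sleep(1)                 # let routing converge (illustrative delay)
    # 3. Program the new switch while it carries no VIP traffic, then announce
    #    to pull the traffic back into hardware.
    install(new_switch, vip, dips)
    announce(new_switch, vip)

migrate_vip("20.0.0.1", "switch-C2", "switch-A6",
            ["100.0.0.1", "100.0.0.2"], ["smux-1", "smux-2"])
```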
Handling Virtualization
The discussion above assumes a service instance is hosted on a server that is assigned a DIP. If the data center uses VMs as service instances, and especially if a service instance belongs to a tenant, then the service-instance VM may have its own DIP, which is different from the IP address of the host on which the VM runs. In such a case, the load-balancers (both SW and HW) must do double encapsulation: IP-in-IP-in-IP, with the outer layer aimed at the host, the middle layer aimed at the VM's IP address, and the inner one aimed at the VIP of the service.
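Again using Scapy only to make the layering concrete (all addresses are made up):

```python
from scapy.all import IP, TCP

HOST_IP = "10.1.2.3"      # physical host running the VM (made-up address)
VM_DIP  = "192.168.5.7"   # the tenant VM's own DIP
VIP     = "20.0.0.1"      # the service VIP the client actually targeted

# What the client sent:
original = IP(src="198.51.100.9", dst=VIP) / TCP(dport=443)

# What the MUX would emit in the virtualized case: host, then VM, then the
# original VIP-addressed packet (IP-in-IP-in-IP).
double_encap = IP(dst=HOST_IP) / IP(dst=VM_DIP) / original

double_encap.show()       # prints the three nested IP layers
```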
Results
- DUET handles more traffic using an order of magnitude fewer SW load-balancers (and hence at lower cost)
- DUET achieves much lower latency for all services handled in HW
- DUET handles failures and changes correctly and efficiently
Notes for Figure 16
- For 10 Tbps of traffic, Duet uses about 100 SW load-balancers (as backstops), as opposed to needing several thousand. Even for a modest 1.25 Tbps, Duet needs only tens of SW load-balancers, where without it hundreds are needed.
- The results labeled "Ananta" and "Duet" assume each SW load-balancer can handle about 300K PPS, as shown in Figure 1 above. The "10G" bars assume this performance can be improved to support about 10G of traffic (about 1M PPS).
Final Thoughts & comments
- I became interested in DUET because its ideas on how to turn a switch into a load-balancer are used in "EverFlow", an article about using switches to create a good network telemetry and debugging tool.
- The authors explain that the total number of DIPs a switch can support is the lower of the capacities of the switch's ECMP and tunneling tables. This also means that this is the maximum number of DIPs DUET supports for a single VIP.
- The authors note, but do not elaborate on, the possibility of installing the same VIP on several switches. I see no reason this would not work: since they will all declare themselves the best route to these VIPs, upstream routers will send traffic TO these switches using ECMP themselves, thus load-balancing between the load-balancers. Similarly, we can add additional VIPs or additional DIPs to scale a service's "point of presence" in the data center to whatever size is needed.
- DUET assumes separate tables for host routes and longest-prefix-match routes, and separate ECMP and tunneling tables. Many recent switches tend to have unified table memory, with the switch configuration free to change the allocation of the shared memory between the different tables. This can impact Duet implementations, but intuitively, in many cases DUET should still work and be beneficial; indeed, it can allow some switches to support larger DUET-relevant tables depending on their network role, rather than being fixed by the switch model.
- DUET takes care to use only free entries of the switch's tables, so the load-balancing is done in addition to the switch's normal operation, allowing use of switches already installed and in their current positions. This is a good thing, since switches are naturally in line with the traffic that needs to be load-balanced, whereas using a SW load-balancer program running on a server necessarily means traffic going from a switch to that server, and then from the server to the next-hop switch (usually the same ToR switch). As one implication of this approach, DUET avoids using the switch's longest-prefix-match table. It occurs to me it might be worthwhile to build a sort of DIY load-balancer appliance out of switch hardware, NOT expecting it to do normal duty, JUST using it as a load-balancer, including using its usually-much-larger LPM table. While this "appliance" would sit at a fixed location in the network topology, it would be a high-capacity/low-latency load-balancer, probably worth exploring in some use-cases. If we use an OCP-compliant switch and use the ECOMP socket to install a high-power CPU, we might even be able to install a switch-resident "server module" large enough to serve as an instance of the SW load-balancer backstop, for an all-in-one fast AND flexible solution.