IIGR: Duet: Cloud Scale Load Balancing with Hardware and Software
Michael Orr
Who: Microsoft, Purdue, Yale
Where: SIGCOMM 2014
Original article: ACM Library/Duet (slides: SIGCOMM 14/Duet Slides)
TL;DR for the genius-in-a-hurry:
The problem:
Data-center services at scale are implemented by running many service instances on multiple hosts for each service. Stateful load-balancers are needed to connect clients with service instances. Dedicated HW load-balancers are expensive and hard to modify, scale, and move. SW load-balancers are cheap and flexible, but they add significant latency and can each handle only a few Gbps of traffic, so handling Tbps of traffic may require thousands of them.
DUET Solution
Duet takes an open SW load-balancer's (Ananta) architecture and implements it using free table entries of existing data-center switches. Since these switches don't have enough table memory for all the services and service instances needed, Duet installs the more bandwidth- and latency-sensitive services on the switches, and the rest are left on SW load-balancers. This means far fewer SW load-balancers are needed, and the important services get HW handling.
Load-balancing in the DUET scheme is done by assigning a "virtual IP" address to represent each service (a "VIP") and a "direct IP" address to each service instance (a "DIP"). Load-balancers receive traffic sent to the VIPs (i.e. to the service in general) and distribute it across the service instances by forwarding it to the DIPs.
DUET does this as follows:
(1) The already existing commodity DC switches are programmed to hold the VIP->DIP mappings. The switch is programmed to see VIPs as directly attached hosts (/32 entries in IPv4 terms).
(2) Switches then announce themselves over BGP as being the best route to the VIPs they hold. This causes all incoming traffic for a given VIP to be sent to the switch holding that VIP.
(3) The switch is programmed to see a VIP routing entry as pointing to a set of equal-cost paths to the destination DIPs, so it will use ECMP to load-balance between these paths. Each ECMP next-hop entry is then programmed to be an IP-in-IP tunnel start-point, so the packet is encapsulated into an IP-in-IP packet aimed at the DIP, at wire speed and latency.
(4) Since ECMP uses a hash of the incoming packet's header to decide which next-hop to use, this load-balancing sends all traffic from a given client to the same DIP/service instance, thus satisfying the "stateful" requirement.
Duet handles the question of which VIP should go on which switch, and handles faults and changes (e.g. adding or moving service instances, etc.).
The result is that high-bandwidth services can be handled with far fewer SW load-balancers (so cheaper) and with HW-scale latency.
Article Summary
Large-scale services in data-centers (e.g. a web-search engine that needs to handle millions of clients) are implemented by having service instances on thousands of servers. Clients, therefore, need to be directed to a specific server that handles their particular service instance, and once a client has started a session with a given server, all further interaction from that client must be sent to the same service instance on the same server.
This requires implementing Stateful Load-balancers that can handle the scale of traffic – the number and volume of incoming requests for the service.
Each [DC] service is a set of servers that work together as a single entity. Each server in the set has a unique direct IP (DIP) address. Each service exposes one or more virtual IP (VIP) outside the service boundary. The load balancer forwards the traffic destined to a VIP to one of DIPs for that VIP. Even services within the same DC use VIPs to communicate with each other…
A typical DC supports thousands of services, each of which has at least one VIP and many DIPs associated with it. All incoming Internet traffic to these services and most inter-service traffic go through the load balancer. We observe that almost 70% of the total VIP traffic is generated within DC, and the rest is from the Internet. The load balancer design must not only scale to handle this workload but also minimize the processing latency. This is because to fulfill a single user request, multiple back-end services often need to communicate with each other — traversing the load balancer multiple times. Any extra delay imposed by the load balancer could have a negative impact on end-to-end user experience. Besides that, the load balancer design must also ensure high service availability in face of failures of VIPs, DIPs or network devices.
Dedicated hardware-based load-balancers exist, but they are expensive, hard to update with new functionality, and hard to scale or move on demand. Software load-balancers (SLBs), on the other hand, are implemented on commodity servers and are relatively cheap to scale, move, and update, but they suffer high latency and low bandwidth per instance, making it necessary to use many of them, and even then making it hard to satisfy services that need low latency or high single-flow bandwidth.
As an example of the limitations of SW load-balancers, the article cites the performance of Ananta, whose latency is a few hundred microseconds and whose maximum throughput is about 300K PPS (see drawing).
At this level handling 15Tbps of traffic would require 4000 SLB instances, making the solution expensive.
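As a quick sanity check of that number (my arithmetic; the paper only gives the ~300K PPS figure, and the average packet size below is my own assumption):

```python
# Back-of-the-envelope check of "~4,000 SLBs for 15 Tbps".
# The 300K PPS limit is from the article; the packet size is my assumption
# (roughly MTU-sized packets).
PPS_PER_SMUX = 300_000        # packets/sec one Ananta SMUX handles
AVG_PKT_BYTES = 1_500         # assumed average packet size
TARGET_TBPS = 15              # total VIP traffic to be load-balanced

gbps_per_smux = PPS_PER_SMUX * AVG_PKT_BYTES * 8 / 1e9   # ~3.6 Gbps
num_smuxes = TARGET_TBPS * 1_000 / gbps_per_smux          # ~4,200

print(f"{gbps_per_smux:.1f} Gbps per SMUX -> ~{num_smuxes:.0f} SMUX instances")
```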
DUET proposes implementing Stateful load-balancing by
- Configuring existing Network Switches to do load balancing at wire speed in addition to their normal operation and
- Combining this with a small deployment of software Load-balancers to handle the cases where Switch table/memory capacity is exhausted and there are more flows to handle.
Note: Load balancers are referred to in the article as MUXes (multiplexers, though really they should be called de-multiplexers), with software load-balancers termed SMUXes and hardware-based load-balancers termed HMUXes.
Stateful Load Balancing: From SW to Hardware
The authors model their load-balancer on the architecture of Ananta:
(1) Each load-balancer instance has all of the VIP->DIP mappings.
(2) Each load-balancer uses BGP to announce itself as having the VIPs as directly attached hosts (in effect telling the routers, "if you need to send traffic to this VIP, send it to me"). Since ALL load-balancers do this, they are seen by routers as equal-cost paths to the VIPs, and the routers will use their built-in ECMP to load-balance traffic between all the SW load-balancers.
(3) Each load-balancer then encapsulates traffic arriving for a given VIP into an IP-in-IP packet aimed at one of the DIPs where an instance of the service is hosted, thus load-balancing traffic for each VIP across the DIPs that actually implement it. Return traffic does not pass through the load-balancer, since a SW "agent" on the DIP host sends responses back directly.
The central element of a load-balancer is thus a VIP-to-DIP mapping table. When instances of the service are added, removed, or relocated, this table must be updated in all load-balancers.
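To make the Ananta-style data path concrete, here is a minimal per-packet sketch (my simplification, not Ananta's code; the addresses, field names, and hash are illustrative):

```python
import hashlib

# Illustrative VIP -> DIP table; in Ananta every SMUX holds ALL mappings.
VIP_TO_DIPS = {
    "20.0.0.1": ["100.0.0.1", "100.0.0.2"],   # made-up addresses
}

def pick_dip(vip, five_tuple):
    """Hash the flow's 5-tuple so every packet of a flow goes to the same DIP."""
    dips = VIP_TO_DIPS[vip]
    h = int(hashlib.sha256(repr(five_tuple).encode()).hexdigest(), 16)
    return dips[h % len(dips)]

def forward(packet):
    """Encapsulate a VIP-bound packet in IP-in-IP toward the chosen DIP."""
    flow = (packet["src"], packet["dst"], packet["proto"],
            packet["sport"], packet["dport"])
    dip = pick_dip(packet["dst"], flow)
    # Outer header targets the DIP; the inner (original) packet keeps the VIP,
    # so the host agent can hand it to the service and reply directly.
    return {"outer_dst": dip, "inner": packet}

print(forward({"src": "198.51.100.9", "dst": "20.0.0.1",
               "proto": 6, "sport": 12345, "dport": 80}))
```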
DUET implements the “Ananta” ideas as follows:
(1) The already existing commodity DC switches are programmed to hold the VIP->DIP mappings. The switch is programmed to see VIPs as directly attached hosts (/32 entries in IPv4 terms).
(2) Switches then naturally announce themselves over BGP as being the best route to the VIPs they hold. This causes all incoming traffic for a given VIP to be sent to the switch holding that VIP.
(3) The switch is programmed to see a VIP routing entry as pointing to a set of equal-cost paths to the destination DIPs, so it will naturally use ECMP to load-balance between these paths. Each ECMP next-hop entry is then programmed to be an IP-in-IP tunnel start-point, so the packet is encapsulated into an IP-in-IP packet aimed at the DIP, exactly as Ananta does in SW, but at wire speed and latency. Since ECMP uses a hash of the incoming packet's header to decide which next-hop to use, this load-balancing sends all traffic from a given client to the same DIP/service instance, thus satisfying the "stateful" requirement.
Figure 2 illustrates this: incoming traffic is aimed at VIP 10.0.0.0. The switch sees this as a /32 routing entry, which points at two equal-cost paths. ECMP selects one of these two possible "paths", but the ECMP entries point to a tunneling table that tells the switch to encapsulate the packet toward either 100.0.0.1 or 100.0.0.2.
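A rough sketch of how the three switch tables chain together for this Figure 2 example (plain data structures only; programming a real switch goes through the vendor's SDK, which I am not reproducing):

```python
# Host-route table: the VIP looks like a directly attached /32 host whose
# "next hop" is an ECMP group rather than a single port.
host_routes = {"10.0.0.0/32": {"ecmp_group": 1}}

# ECMP table: the group fans out over members that each point into the
# tunneling table instead of a plain egress port.
ecmp_groups = {1: [{"tunnel_id": 1}, {"tunnel_id": 2}]}

# Tunneling table: each entry is an IP-in-IP tunnel start-point toward a DIP.
tunnels = {1: {"encap": "ip-in-ip", "outer_dst": "100.0.0.1"},
           2: {"encap": "ip-in-ip", "outer_dst": "100.0.0.2"}}

def lookup(vip, flow_hash):
    """Follow host-route -> ECMP -> tunnel, as the switch pipeline would."""
    group = host_routes[f"{vip}/32"]["ecmp_group"]
    members = ecmp_groups[group]
    member = members[flow_hash % len(members)]   # the hash keeps a flow sticky
    return tunnels[member["tunnel_id"]]

print(lookup("10.0.0.0", flow_hash=7))   # -> encapsulate toward 100.0.0.2
```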
Note that these configurations are added to the switch in addition to its normal functioning, and it still continues to forward all other traffic (not aimed at a VIP) normally.
Thus, at the expense of some entries in the host forwarding, ECMP and tunneling tables, we can build a load balancer using commodity switches.
Handling Switch Limits
Commodity switches have relatively small host, ECMP, and tunneling tables (the switches used by the authors have 16K host-forwarding entries, 4K ECMP entries, and only 512 tunnel-start entries). VIPs consume host-routing entries, while DIPs consume both ECMP and tunneling entries, so the number of DIPs a switch can support is limited by the smaller of those two tables' capacities.
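Plugging in the article's numbers makes the bottleneck explicit (trivial arithmetic, included only for clarity):

```python
# Table sizes quoted in the article for the authors' commodity switches.
HOST_ROUTE_ENTRIES = 16_000   # one /32 host route per VIP
ECMP_ENTRIES = 4_000
TUNNEL_ENTRIES = 512

# Each DIP needs both an ECMP member entry and a tunnel start-point entry,
# so the smaller of those two tables bounds the DIPs one switch can host;
# the host-route table bounds the VIPs.
max_vips_per_switch = HOST_ROUTE_ENTRIES
max_dips_per_switch = min(ECMP_ENTRIES, TUNNEL_ENTRIES)
print(max_vips_per_switch, max_dips_per_switch)   # 16000 512
```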
DUET solves this by
- Each switch hosts only a small subset of the VIPs, and the DIPs assigned to these VIPs.
- A few software load-balancers are used, each of which holds ALL of the VIP-to-DIP mappings, as before.
- BGP is used to announce the switches as the best route to the VIPs they hold, and the SW load-balancers as the NEXT-best route to the same VIPs, as well as to any VIP not installed on any switch because switch table entries ran out.
This causes traffic for a VIP to be sent preferentially to a switch holding that VIP, if one exists and is alive; if not, traffic will naturally be sent to the next-best route, which delivers it to one of the SW load-balancers.
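A toy model of this "switch first, SMUX as backstop" behavior (the real mechanism is BGP best-path selection; the preference values and names below are made up):

```python
# Routes a core router might know for one VIP: the switch hosting it (HMUX)
# and the software MUXes, which announce every VIP as a fallback.
routes = {
    "20.0.0.1": [
        {"next_hop": "switch-C2", "preference": 200, "alive": True},
        {"next_hop": "smux-1",    "preference": 100, "alive": True},
        {"next_hop": "smux-2",    "preference": 100, "alive": True},
    ],
}

def best_routes(vip):
    """Return the most-preferred live route(s); equal winners get ECMP'd."""
    live = [r for r in routes[vip] if r["alive"]]
    top = max(r["preference"] for r in live)
    return [r["next_hop"] for r in live if r["preference"] == top]

print(best_routes("20.0.0.1"))            # ['switch-C2']  (HW path wins)
routes["20.0.0.1"][0]["alive"] = False    # switch fails or withdraws the VIP
print(best_routes("20.0.0.1"))            # ['smux-1', 'smux-2']  (SW backstop)
```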
Figure 3 illustrates this approach. VIP1 has two DIPs (D1 and D2), whereas VIP2 has one (D3). We assign VIP1 and VIP2 to switches C2 and A6 respectively, and flood the routing information in the network. Thus, when a source S1 sends a packet to VIP1, it is routed to switch C2, which then encapsulates the packet with either D1 or D2, and forwards the packet.
This makes it possible to do several things:
- We can handle an essentially unlimited number of flows: when we run out of space to install VIP-to-DIP mappings on switches, the rest are installed on the SW load-balancers.
- We can install on the switches first the VIPs for "elephant" services or services that are more latency-sensitive, and leave the rest, which are less latency-sensitive or less bandwidth-hungry, to be handled by SW load-balancing.
- When a switch fails, traffic falls back to the SW load-balancers smoothly and naturally, since they are the next-best route seen by all routers.
- We can scale high-demand services by building a hierarchy of HW load-balancers, with a "middle layer" of switches using a "TIP" to represent the service, and the top-layer switches having routing entries that direct traffic for a given VIP to these TIPs as equal-cost next-hops, as illustrated (a toy sketch follows this list).
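A toy sketch of the two-level hierarchy mentioned in the last bullet (my own model; the addresses and TIP values are illustrative):

```python
# Two-level HW hierarchy (toy model): a top-layer switch spreads a VIP over
# "TIP" addresses, and each TIP is hosted by a middle-layer switch that
# spreads it over the real DIPs. All addresses are made up.
top_layer    = {"20.0.0.1": ["30.0.0.1", "30.0.0.2"]}        # VIP -> TIPs
middle_layer = {"30.0.0.1": ["100.0.0.1", "100.0.0.2"],      # TIP -> DIPs
                "30.0.0.2": ["100.0.0.3", "100.0.0.4"]}

def two_level_lookup(vip, flow_hash):
    """ECMP-style hashing at each level; in this toy model the flow stays
    sticky because both levels hash the same flow identifier."""
    tips = top_layer[vip]
    tip = tips[flow_hash % len(tips)]
    dips = middle_layer[tip]
    return dips[flow_hash % len(dips)]

print(two_level_lookup("20.0.0.1", flow_hash=11))   # -> 100.0.0.4
```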
DUET Host-agent
It is easy to see that load-balancing in this manner causes hosts to receive IP-in-IP packets, with the outer IP aimed at the DIP and the inner one at the service VIP. A host must be able to deal with this traffic correctly, and so needs a host agent whose tasks are to decapsulate incoming traffic and deliver the inner packet to the service-instance application, and, on transmission, to send return traffic directly back to the client with the source IP address set correctly (to the VIP), without passing through the load-balancer.
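A sketch of the two host-agent duties, using Scapy only to make the packet layering concrete (this is not Duet's actual host agent, which runs in the host's data path; the addresses are made up):

```python
from scapy.all import IP, TCP   # Scapy is used only to show the layering

VIP = "10.0.0.0"                # the service VIP (from the Figure 2 example)
DIP = "100.0.0.1"               # this host's DIP

def decap(outer_pkt):
    """Receive side: strip the outer header and hand the inner, VIP-addressed
    packet to the local service instance."""
    return outer_pkt.getlayer(IP, 2)      # second IP layer = original packet

def direct_return(client_ip, client_port, service_port, payload):
    """Send side (direct return): reply straight to the client with the VIP
    as the source address, so the reply never traverses a MUX."""
    return (IP(src=VIP, dst=client_ip)
            / TCP(sport=service_port, dport=client_port) / payload)

# Example: an IP-in-IP packet as it arrives at the DIP host.
encapsulated = (IP(dst=DIP)                               # outer: aimed at DIP
                / IP(src="198.51.100.9", dst=VIP)         # inner: aimed at VIP
                / TCP(sport=12345, dport=80))
print(decap(encapsulated).dst)                            # -> 10.0.0.0
print(direct_return("198.51.100.9", 12345, 80, b"hello").summary())
```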
Additional DUET elements
The above explains the principle of implementing a hybrid SW/HW load-balancing system. In reality, DUET has to deal with several practical issues, and the authors present a solution to each.
Which VIP should go on which Switch?
- You can't install a VIP on a switch if the switch does not have enough free memory (table space) to hold the entire VIP-to-DIP mapping (host-route entries, ECMP entries, and tunneling entries).
- You should not install a VIP on a switch if the traffic for that service is likely to overload the switch's links.
- A VIP should be installed with the network topology and DIP locations as an important consideration: it should go on a switch that is relatively close to its DIPs in terms of the data-center network topology, or traffic to the load-balancing switch, and then from there to the DIPs, can "trombone" and add latency and congestion.
- If we can't install all VIPs on switches and have to leave some VIPs to be handled by SW load-balancers, which VIPs should go on switches and which can stay in SW? Etc.
The authors present an algorithm for finding a suitable VIP-to-switch allocation. They note the problem is NP-complete, so this algorithm is not "optimal", but it works in practice. It uses knowledge of the topology, the expected traffic volume for each VIP (based on historical monitoring), etc.
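For intuition only, here is a much-simplified greedy sketch of the kind of decision involved (this is NOT the paper's algorithm, which also models per-link traffic on the real topology; the inputs and numbers below are invented):

```python
# Hypothetical inputs: per-VIP traffic and DIP count, and per-switch free
# table space and link headroom. Real Duet also checks the paths to the DIPs.
vips = [
    {"name": "VIP-A", "gbps": 40, "dips": 8},
    {"name": "VIP-B", "gbps": 25, "dips": 300},
    {"name": "VIP-C", "gbps": 5,  "dips": 2},
]
switches = [
    {"name": "C2", "free_tunnels": 512, "free_gbps": 60},
    {"name": "A6", "free_tunnels": 256, "free_gbps": 30},
]

assignment = {}
# Place the "heaviest" VIPs first so they get hardware handling.
for vip in sorted(vips, key=lambda v: v["gbps"], reverse=True):
    # First switch with room for all the VIP's DIPs and its expected traffic;
    # anything that doesn't fit falls back to the software MUXes.
    for sw in switches:
        if sw["free_tunnels"] >= vip["dips"] and sw["free_gbps"] >= vip["gbps"]:
            assignment[vip["name"]] = sw["name"]
            sw["free_tunnels"] -= vip["dips"]
            sw["free_gbps"] -= vip["gbps"]
            break
    else:
        assignment[vip["name"]] = "SMUX (software fallback)"

print(assignment)
```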
Handling Faults and Other Changes
Switches and links in a DC fail. Machines hosting DIPs fail, or a DIP can be moved to a new location. Services and service instances can be added or removed (so the DC will have additional/fewer VIPs, or additional/fewer DIPs for a VIP). All of these mean that we have to update which switch holds which VIP, and the VIP-to-DIP mappings in both switches and SW load-balancer instances. These changes have to be done without breaking in-progress sessions.
This is done by using the SW load-balancers as an intermediate step.
As seen above, if a switch dies, traffic naturally goes to a SW load-balancer, since it is seen as the next-best route to the VIPs the switch used to hold. This is also used to move VIPs to a new location: the old switch stops declaring itself the best route to the relevant VIP, traffic shifts to the SW load-balancers, and the new switch to which the VIP was moved starts declaring itself the best route, causing traffic for that VIP to go to the new location.
Similarly, to add DIPs, we add them to a SW load-balancer first, then "take down" the switch's announcement for however long it takes to update the switch with the new set of DIPs; the switch then once again declares itself the best route to the VIP, and traffic goes back to being handled in HW.
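The migration choreography, written out as a sketch (the announce/withdraw/install calls are placeholder prints standing in for the real control plane; only the ordering reflects the paper's description):

```python
import time

# Placeholder control-plane hooks; the real ones would talk to the switches'
# BGP stack and table-programming API.
def withdraw(device, vip):
    print(f"{device}: withdraw BGP route for {vip}")

def announce(device, vip):
    print(f"{device}: announce BGP route for {vip}")

def install(device, vip, dips):
    print(f"{device}: program {vip} -> {dips}")

def migrate_vip(vip, old_switch, new_switch, dips, smuxes):
    # 1. Make sure every SMUX (the always-on backstop) has the current mapping.
    for smux in smuxes:
        install(smux, vip, dips)
    # 2. The old switch stops being the best route; traffic falls to the SMUXes.
    withdraw(old_switch, vip)
    time.sleep(1)                 # let routing converge (illustrative delay)
    # 3. Program the new switch while it carries no VIP traffic, then announce
    #    to pull the traffic back into hardware.
    install(new_switch, vip, dips)
    announce(new_switch, vip)

migrate_vip("20.0.0.1", "switch-C2", "switch-A6",
            ["100.0.0.1", "100.0.0.2"], ["smux-1", "smux-2"])
```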
Handling Virtualization
The discussion above assumes a service instance is hosted on a server that is assigned a DIP. If the data center uses VMs as service instances, and especially if a service instance belongs to a tenant, then the service-instance VM may have its own DIP, which is different from the IP address of the host on which the VM runs. In such a case, the load-balancers (both SW and HW) must do double encapsulation: IP-in-IP-in-IP, with the outer layer aimed at the host, the middle layer aimed at the VM's IP address, and the inner one aimed at the VIP of the service.
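Again using Scapy only to make the layering concrete (all addresses are made up):

```python
from scapy.all import IP, TCP

HOST_IP = "10.1.2.3"      # physical host running the VM (made-up address)
VM_DIP  = "192.168.5.7"   # the tenant VM's own DIP
VIP     = "20.0.0.1"      # the service VIP the client actually targeted

# What the client sent:
original = IP(src="198.51.100.9", dst=VIP) / TCP(dport=443)

# What the MUX would emit in the virtualized case: host, then VM, then the
# original VIP-addressed packet (IP-in-IP-in-IP).
double_encap = IP(dst=HOST_IP) / IP(dst=VM_DIP) / original

double_encap.show()       # prints the three nested IP layers
```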
Results
- DUET handles more traffic using an order of magnitude fewer SW load-balancers (and hence at lower cost)
- DUET achieves much lower latency for all services handled in HW
- DUET handles failures and changes correctly and efficiently
Notes for Figure 16
- For 10 Tbps of traffic, Duet uses about 100 SW load-balancers (as backstops), as opposed to needing several thousand. Even for a modest 1.25 Tbps, Duet needs only tens of SW load-balancers, where without it hundreds are needed.
- The results labeled "Ananta" and "Duet" assume each SW load-balancer can handle about 300K PPS, as shown in Figure 1 above. The "10G" bars assume this performance can be improved to support about 10G of traffic (about 1M PPS).
Final Thoughts & comments
- I became interested in DUET because its ideas on how to turn a switch into a load-balancer are used in "EverFlow", an article about using switches to create a good network telemetry and debugging tool.
- The authors explain that the total number of DIPs a switch can support is the lower of the capacities of the switch's ECMP and tunneling tables. This also means that this is the maximum number of DIPs DUET supports for a single VIP.
- The authors note, but do not elaborate on, the possibility of installing the same VIP on several switches. I see no reason this would not work: since they will all declare themselves the best route to these VIPs, upstream routers will send traffic TO these switches using ECMP themselves, thus load-balancing between the load-balancers. Similarly, we can add additional VIPs or additional DIPs to scale a service's "point of presence" in the data center to whatever size is needed.
- DUET assumes separate tables for host routes and longest-prefix-match routes, and separate ECMP and tunneling tables. Many recent switches tend to have unified table memory, with the switch configuration free to change the allocation of the shared memory between the different tables. This can impact Duet implementations, but intuitively, in many cases DUET should still work and be beneficial; indeed, it can allow some switches to support larger DUET-relevant tables depending on their network role, rather than being fixed by the switch model.
- DUET takes care to use only free entries of the switch's tables, so the load-balancing is done in addition to the switch's normal operation, allowing use of switches already installed and in their current positions. This is a good thing, since switches are naturally in line with the traffic that needs to be load-balanced, whereas using a SW load-balancer program running on a server necessarily means traffic going from a switch to that server, and then from the server to the next-hop switch (usually the same ToR switch). As one implication of this approach, DUET avoids using the switch's longest-prefix-match table. It occurs to me it might be worthwhile to build a sort of DIY load-balancer appliance out of switch hardware, NOT expecting it to do normal duty, JUST using it as a load-balancer, including using its usually-much-larger LPM table. While this "appliance" would sit at a fixed location in the network topology, it would be a high-capacity/low-latency load-balancer, probably worth exploring in some use-cases. If we use an OCP-compliant switch and use the ECOMP socket to install a high-power CPU, we might even be able to install a switch-resident "server module" large enough to serve as an instance of the SW load-balancer backstop, for an all-in-one fast AND flexible solution.