IIGR: "CONGA" - Distributed Congestion-Aware Load Balancing for Datacenters
Michael Orr
Computer Networking Architect/Product Manager/Customer Liaison, Due-Diligence, Tech Scout/Ambassador/evangelist, Inventor (40+ Patents), Dad, Martial Arts practitioner (Karate, Aikido), Stock & Option Trader
Who & Where : Microsoft & Cisco, SIGCOMM 2014
Original articles here : Semantic Scholar/Conga
TL;DR for the genius-in-a-hurry:
Load-balancing of traffic in 2-tier CLOS datacenter networks is done by splitting flows into “flowlets” whenever there is a long-enough gap in the sequence of packets of a given flow. “Long enough” means there is a low chance of out-of-order packets due to later packets in the flow taking a different, faster path. The path for each flowlet is selected by a real-time estimate of the congestion on each path, based on congestion information exchanged among leaf switches (i.e. leaf-to-leaf feedback).
CONGA works on overlay networks, utilizing some form of tunneling (e.g. VXLAN) to carry this congestion information. With VXLAN, the VTEP switches are considered the “leaves” for the purposes of CONGA. Each switch along the path adds in-transit congestion information to the VXLAN headers, and the target leaf sends back the congestion seen per path.
CONGA does better than ECMP and MPTCP, especially with “asymmetric” paths – when the network has link failures or significant congestion differences among paths, ECMP will keep sending packets over the impacted paths, while CONGA will shift traffic to better ones.
CONGA also deals well with incast situations.
CONGA results are impressive:
“CONGA has 5× better flow completion times than ECMP even with a single link failure and achieves 2–8× better throughput than MPTCP in Incast scenarios”
Note that CONGA is widely cited in many later load-balancing papers as a near-optimal load-balancing scheme, with the main issues preventing its wider adoption being the need for flowlet support in switches and the relatively complex scheme needed to collect and use congestion feedback at scale.
Article Detailed Summary
Background - why ECMP and MPTCP are not enough
ECMP
The de-facto standard for load-balancing in networks is ECMP – Equal-Cost Multi-Path. ECMP is simple to implement, but has the following undesirable attributes:
- All packets of a given flow travel the same path. If flow volumes are very different, path utilization may be very different (e.g. Heavy flows may overwhelm some paths, even while low-volume flows leave spare capacity in other paths).
- ECMP sends traffic to paths regardless of their state, which it does not check. So if a path becomes congested, ECMP will keep using it regardless. If a path suffers a link failure, ECMP will only stop using it after the routing protocols converge and tell the relevant switch that the path is no longer available. (A minimal sketch of hash-based path selection follows this list.)
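To make the first point concrete, here is a minimal sketch (mine, not from the paper) of ECMP-style hash-based path selection; the field names and the CRC32 hash are illustrative stand-ins for whatever a real switch ASIC uses.

```python
import zlib

def ecmp_select_uplink(five_tuple, uplinks):
    """Pick an uplink purely from a hash of the flow's 5-tuple.

    The choice is static for the life of the flow and ignores how loaded
    (or even broken) the chosen path currently is."""
    key = "|".join(str(f) for f in five_tuple).encode()
    return uplinks[zlib.crc32(key) % len(uplinks)]

# Every packet of this flow goes to the same uplink, regardless of congestion.
flow = ("10.0.0.1", "10.0.1.2", 6, 34567, 80)   # src, dst, proto, sport, dport
print(ecmp_select_uplink(flow, uplinks=["S0", "S1", "S2", "S3"]))
```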
MPTCP
Another load-balancing option is MPTCP, which splits a TCP flow into several sub-flows at the sending host, with a dedicated congestion-control algorithm. The problems are that MPTCP (1) requires changing the hosts – you need to be able to replace TCP with MPTCP at the end-hosts – and (2) can actually increase congestion in incast use-cases; its congestion algorithm works best when most flows are steady-state and long-lived, while real DC traffic has lots of transient flows.
CONGA Summary
The authors are trying to implement a better load-balancing solution, and want to avoid having to modify the transport layer.
They present the following Load-balancing Taxonomy, showing where CONGA fits.
They prove that using LOCAL congestion knowledge (i.e. congestion of the ports of the switch doing the load-balancing path selection) is not enough to deal with downstream asymmetry, and may, in fact, make things worse:
“Static ECMP splits the flows equally, achieving a throughput of 90Gbps because the flows on the lower path are bottlenecked at the 40Gbps link (S1, L1). Local congestion-aware load balancing is actually worse with a throughput of 80Gbps. This is because as TCP slows down the flows on the lower path, the link (L0, S1) appears less congested. Hence, paradoxically, the local scheme shifts more traffic to the lower link until the throughput on the upper link is also 40 Gbps.”
Leaf-to-Leaf congestion Measurement & notification
CONGA uses leaf-to-leaf congestion notification, where “leaf” is the location where overlay tunneling (VXLAN, GENEVE, etc.) is done. The source leaf knows the target leaf (as opposed to normal routing, where each hop knows only the next hop), and the tunneling headers are extended to carry congestion information. Each switch along the way adds the local congestion it sees for the relevant path, and the target leaf receives the congestion measured along all paths, aggregates it, and sends it back to the source leaf by piggybacking it on traffic in the reverse direction.
This Leaf-to-Leaf scheme is provably near-optimal in 2-tier CLOS topologies.
Flowlet Switching rationale
Flowlet switching for Internet (WAN) paths has been shown to be effective. But will it work in DCs? On the one hand, we’d expect the short RTTs and high-speed paths to present very few in-flow gaps; on the other, DC traffic is known to be very bursty.
They measure to show it works for DC traffic too. On a real-life cluster running lots of production applications, they measure the distribution of data bytes per “flowlet” depending on the gap between successive sequences of packets in the same flow, analyzing more than 150GB of compressed production traffic traces.
The 250ms gap is three orders of magnitude larger than the other two and effectively represents whole flows, while the other two cases show what would be detected as flowlets when 100 or 500 microseconds of inactivity in a flow is used as the criterion.
“Even with an inactivity gap of 500μs, which is quite large and poses little risk of packet reordering in datacenters, we see nearly two orders of magnitude reduction in the size of transfers that cover most of the data: 50% of the bytes are in flows larger than ~30MB, but this number reduces to ~500KB for “Flowlet (500μs)”.
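As an aside, here is a minimal sketch (my own, not the authors’ tooling) of the kind of trace analysis described above: given (timestamp, flow-id, bytes) packet records, split each flow into flowlets at a chosen inactivity gap and collect the flowlet sizes. The record format is an assumption.

```python
from collections import defaultdict

def flowlet_sizes(packets, gap_seconds):
    """packets: iterable of (timestamp, flow_id, size_bytes) sorted by timestamp.
    Returns the list of flowlet sizes (bytes) for a given inactivity gap."""
    last_seen = {}               # flow_id -> timestamp of the previous packet
    current = defaultdict(int)   # flow_id -> bytes in the currently open flowlet
    sizes = []
    for ts, flow_id, size in packets:
        prev = last_seen.get(flow_id)
        if prev is not None and ts - prev > gap_seconds:
            sizes.append(current[flow_id])   # gap long enough: close the flowlet
            current[flow_id] = 0
        current[flow_id] += size
        last_seen[flow_id] = ts
    sizes.extend(current.values())           # close whatever is still open
    return sizes

# e.g. compare flowlet sizes at 100us vs. 500us inactivity gaps:
# sizes_100us = flowlet_sizes(trace, 100e-6)
# sizes_500us = flowlet_sizes(trace, 500e-6)
```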
Next, they try to estimate the number of concurrent flowlets that need to be supported in a switch. They do this by looking at the number of distinct 5-tuples present concurrently in the production traffic they analyzed. They find the number to be small, with a median of 130 and a maximum under 300. Thus, they estimate that even for a heavily loaded leaf switch with 400Gbps of traffic, the number of concurrent flows is under 8K; in fact, later on, they implement per-switch flowlet tables of 64K entries.
Conga Design & Implementation
- “The source leaf makes load balancing decisions based on per uplink congestion metrics, derived by taking the maximum of the local congestion at the uplink and the remote congestion for the path (or paths) to the destination leaf that originate at the uplink”
- “The remote metrics are obtained via feedback from the destination leaf switch, which opportunistically piggybacks values in its Congestion-From-Leaf Table back to the source leaf”
- “Load balancing decisions are made on the first packet of each flowlet. Subsequent packets use the same uplink as long as the flowlet remains active (there is not a sufficiently long gap)”
- Congestion estimation is done per link at each spine, and the result is marked on packets sent out that link. The article shows how they modify/extend the VXLAN header to carry congestion information. They detail a 4-bit LBTag field that partially identifies the relevant path, a 3-bit CE field conveying the level of congestion, and a 4-bit FB_LBTag and 3-bit FB_Metric used by destination leaves to piggyback congestion information back to source leaves (a bit-packing sketch follows this list).
- Note later on they state that LBTag can only carry 12 distinct values, not 16.
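The article does not give the exact bit layout inside the VXLAN header (see my note near the end), but as a purely illustrative sketch, the four fields above could be packed into 14 bits like this; the bit order is my own choice, not the actual header layout.

```python
def pack_conga_fields(lbtag, ce, fb_lbtag, fb_metric):
    """Pack the four CONGA overlay fields (4 + 3 + 4 + 3 = 14 bits) into one integer.
    The bit order here is an arbitrary illustration, not the actual header layout."""
    assert 0 <= lbtag < 16 and 0 <= fb_lbtag < 16
    assert 0 <= ce < 8 and 0 <= fb_metric < 8
    return (lbtag << 10) | (ce << 7) | (fb_lbtag << 3) | fb_metric

def unpack_conga_fields(value):
    return ((value >> 10) & 0xF, (value >> 7) & 0x7, (value >> 3) & 0xF, value & 0x7)

# Round-trip check.
assert unpack_conga_fields(pack_conga_fields(5, 3, 2, 7)) == (5, 3, 2, 7)
```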
The sender sets LBTag to the uplink port used, and sets CE to 0. Each switch along the way updates CE if its egress link congestion is higher than already marked in CE, so when the packet arrives at the destination leaf switch (really, the overlay tunnel termination point), CE carries the maximum congestion along the path.
Target Leaf stores this value in its “Congestion-from-leaf” table, and waits for an opportunity to piggyback this information on a packet going back to the relevant source leaf-switch.
When a packet IS going in the reverse direction, the target leaf switch inserts ONE value from its “congestion-from-leaf” table, selected in round-robin fashion. This RR algorithm favors (as a heuristic) metrics whose value has changed since the last time a metric was sent back to that source-leaf. In this way, the source leaf gets per-path max-congestion values, one at a time, piggybacked on returning packets. (Note the metric sent back is for a different path – in the other direction and possibly using another port - than the one the reverse-direction packet is actually traveling on!)
Every packet carries SIMULTANEOUSLY both forward and reverse congestion information. They decided they do not need to generate dedicated feedback packets, because they only need a few packets to carry all relevant information. They claim that all metrics between a pair of leaf switches can be carried in at most 12 packets (because there are only 12 distinct LBTag values). These metrics change at network round-trip timescales.
Since they don’t have explicit feedback packets, metrics can become stale if not enough packets for piggybacking are found, so metrics for remote leaves are aged if not updated for a long time (e.g. 10ms); this is done by allowing the metric to gradually decay to 0.
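Here is a rough sketch, under my reading of the text, of the bookkeeping at the destination leaf: one congestion value per (source leaf, LBTag), and a round-robin choice of which value to piggyback next, preferring metrics that changed since they were last reported. All names and the exact data structure are mine, not the authors’.

```python
class CongestionFromLeafTable:
    """At the destination leaf: remembers the max path congestion (CE) reported by
    each source leaf for each LBTag, and picks one value at a time to piggyback
    on reverse-direction traffic."""

    NUM_LBTAGS = 12   # the article notes only 12 distinct LBTag values are used

    def __init__(self):
        self.metric = {}      # (src_leaf, lbtag) -> last CE value seen
        self.changed = set()  # entries not yet reported since they last changed
        self.rr = 0           # round-robin cursor

    def on_packet_from(self, src_leaf, lbtag, ce):
        key = (src_leaf, lbtag)
        if self.metric.get(key) != ce:
            self.changed.add(key)
        self.metric[key] = ce

    def pick_feedback_for(self, src_leaf):
        """Choose one (FB_LBTag, FB_Metric) pair to piggyback toward src_leaf,
        preferring metrics that changed since they were last reported."""
        keys = [(src_leaf, t) for t in range(self.NUM_LBTAGS) if (src_leaf, t) in self.metric]
        if not keys:
            return None
        changed = [k for k in keys if k in self.changed]
        pool = changed if changed else keys
        key = pool[self.rr % len(pool)]
        self.rr += 1
        self.changed.discard(key)
        return key[1], self.metric[key]
```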
Congestion is measured at each link using a Discounting Rate Estimator (DRE).
“DRE maintains a register, X, which is incremented for each packet sent over the link by the packet size in bytes, and is decremented periodically (every Tdre) with a multiplicative factor α between 0 and 1:
X ← X × (1 − α).
It is easy to show that X is proportional to the rate of traffic over the link; more precisely, if the traffic rate is R, then X ≈ R · τ, where τ = Tdre/α. The DRE algorithm is essentially a first-order low pass filter applied to packet arrivals, with a (1 − e⁻¹) rise time of τ. The congestion metric for the link is obtained by quantizing X/(Cτ) to 3 bits (C is the link speed)”
DRE is similar to calculating an exponential moving average, but is cheaper to implement, requiring only one register, and reacts faster to traffic bursts (because X is incremented immediately as bytes are seen).
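A minimal sketch of a DRE as quoted above: a single register incremented on every packet, decayed multiplicatively every Tdre, and quantized against C·τ. The split of the default τ = 160μs into α = 0.1 and Tdre = 16μs is my own illustrative choice; the article only gives τ.

```python
class DRE:
    """Discounting Rate Estimator for a single link (sketch)."""

    def __init__(self, link_speed_bps, alpha=0.1, t_dre=16e-6, q_bits=3):
        # alpha / t_dre values are illustrative; only tau = t_dre / alpha = 160us
        # is given as a default in the article.
        self.x = 0.0                                   # the single register, in bytes
        self.alpha = alpha
        self.tau = t_dre / alpha                       # effective time constant
        self.cap = (link_speed_bps / 8.0) * self.tau   # C * tau, in bytes
        self.q_max = (1 << q_bits) - 1

    def on_packet(self, size_bytes):
        """Called for every packet sent over the link."""
        self.x += size_bytes

    def on_timer(self):
        """Called every t_dre seconds: multiplicative decay X <- X * (1 - alpha)."""
        self.x *= (1.0 - self.alpha)

    def congestion_metric(self):
        """Quantize X / (C * tau) to q_bits (0 .. 2^q_bits - 1)."""
        return min(self.q_max, int(self.q_max * self.x / self.cap))
```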
Flowlet detection
Flowlets are detected and tracked by the sender. The flowlet table contains a port number, a valid bit, and an age bit.
When a packet arrives, a 5-tuple hash determines a flowlet-table slot. If that entry is valid, the flowlet is active and all packets of that flowlet (i.e. having the same hash result) will be sent on the indicated port.
If the entry is not valid (aged out or never filled), a new flowlet is started: a load-balancing decision selects an outgoing port (e.g. the one with the least congestion), and the flowlet-table slot is marked valid and set to the resulting egress port.
Flowlet entries time out using the AGE bit. Each incoming packet for a given slot resets the age bit. A timer periodically goes over the table and sets to 1 any item whose age bit is 0. If the age bit is found to be already set (meaning no packet has been seen for this slot since the last time the aging mechanism was run) the entry is invalidated. So, if the timer period is Tfl, the mechanism allows detecting gaps between Tfl and 2xTfl. This is less accurate than using full time-stamps, but requires far fewer bits, and is hence cheaper to implement.
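A sketch of the flowlet table and the valid-bit/age-bit aging sweep described above, with my own naming; the CRC32 hash and the way the load-balancing decision is passed in are illustrative choices.

```python
import zlib

class FlowletTable:
    """Hash-indexed flowlet table: each slot holds a valid bit, a port, and an age bit."""

    def __init__(self, size=64 * 1024):
        self.size = size
        self.slots = [{"valid": False, "port": None, "age": 0} for _ in range(size)]

    def _index(self, five_tuple):
        return zlib.crc32("|".join(map(str, five_tuple)).encode()) % self.size

    def lookup_or_start(self, five_tuple, choose_port):
        """Return the egress port for this packet, starting a new flowlet if needed."""
        slot = self.slots[self._index(five_tuple)]
        if not slot["valid"]:
            slot["valid"] = True
            slot["port"] = choose_port()   # load-balancing decision (next section)
        slot["age"] = 0                    # any packet marks the flowlet as active
        return slot["port"]

    def aging_sweep(self):
        """Run every Tfl: invalidate slots that saw no packet for a whole period."""
        for slot in self.slots:
            if slot["valid"] and slot["age"] == 1:
                slot["valid"] = False      # a gap between Tfl and 2*Tfl was detected
            slot["age"] = 1
```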
Load-balancing decisions
Load-balancing decisions are made for the first packet of a new flowlet. The sender picks the uplink port that minimizes the maximum of the local metric (from the local DRE values) and the remote metric (from the Congestion-To-Leaf table). If several ports have the same value, a random one is chosen, while trying to make sure a flowlet only moves if there is a better uplink than the one its last packet took (i.e. if there was a flowlet gap and the flowlet-table entry is invalid, but no better link is seen, the new flowlet is sent along the same path as before).
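A sketch of this decision rule as I understand it; the dictionaries mapping uplinks to 3-bit congestion values and the tie-breaking details are my own framing.

```python
import random

def choose_uplink(local_dre, remote_metric, previous_port=None):
    """local_dre / remote_metric: dicts mapping uplink -> 3-bit congestion value.
    Picks the uplink minimizing max(local, remote) for a new flowlet."""
    score = {u: max(local_dre[u], remote_metric.get(u, 0)) for u in local_dre}
    best = min(score.values())
    # Stickiness: keep the previous uplink unless a strictly better one exists.
    if previous_port is not None and score.get(previous_port) == best:
        return previous_port
    return random.choice([u for u, s in score.items() if s == best])

# Example: "S1" wins because max(1, 3) = 3 beats max(5, 2) = 5 on "S0".
print(choose_uplink({"S0": 5, "S1": 1}, {"S0": 2, "S1": 3}, previous_port="S0"))
```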
Parameter Choices
CONGA has 3 main parameters
1. Q – the number of bits for quantizing congestion metrics
A large value improves congestion metric accuracy, but if too large it can make leaf switches over-react to minor differences in congestion, and cause oscillatory behavior.
2. τ - the DRE time constant given by Tdre/α
A small value makes DRE respond faster, but also makes it susceptible to noise from transient traffic bursts. Intuitively, it should be set larger than Network RTT to filter the sub-RTT traffic bursts typical of TCP.
3. Tfl - Flowlet inactivity timeout
This can be set to the maximum possible Leaf-to-leaf latency to guarantee no packet reordering. If we take worst-case queueing delays into account, that can be a pretty large value (e.g. in their testbed this came to 13ms). Reducing this value presents a compromise between more packet reordering (or at least risk of packet reordering) and less congestion due to better load balancing.
In practice they found CONGA performance to be robust with Q = 3 to 6, τ = 100μs to 500μs, and Tfl = 300μs to 1ms. Default parameter values were set at Q = 3, τ = 160μs, and Tfl = 500μs.
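For reference, the defaults above expressed as constants that could feed the sketches earlier in this summary (the factoring of τ into Tdre and α is my assumption; the article only gives τ):

```python
Q_BITS = 3       # congestion metric quantization (default Q)
TAU    = 160e-6  # DRE time constant tau = Tdre / alpha (seconds)
T_FL   = 500e-6  # flowlet inactivity timeout (seconds)

# One possible (assumed) factoring of tau into DRE timer period and decay factor:
ALPHA = 0.1
T_DRE = TAU * ALPHA   # = 16 microseconds
```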
Results
They compared CONGA at two settings of the flowlet timeout to MPTCP and ECMP. One version of CONGA used default values, and one, “CONGA-Flow”, used a flowlet timeout greater than the maximum network latency (including queueing) so that it is guaranteed that no packet reordering will occur. This means one load-balancing decision per flow, as in ECMP, but in CONGA-Flow the choice is informed by the congestion metrics of each possible path (so, in effect, paths are not really “equal cost”, but are weighted by their congestion values).
This was tested on a small testbed (64 servers, 4 switches), with synthetic benchmarks, incast, and actual application traffic. In addition, they also performed extensive simulations.
The “baseline” topology is symmetric, and then they introduce a link fault to induce an asymmetric topology.
For actual application traffic workloads (on the testbed) they show the following results
This clearly shows that CONGA is usually the best performer, that even CONGA-Flow (no flowlets) is better than ECMP on a symmetric network topology (i.e. no faults), and that MPTCP does not fare well for smaller flows.
When a link-fault is introduced, the results are these:
Here we can clearly see ECMP’s drawback: being unaware of the link fault, it continues to send 50% of traffic to the now-impacted spine switch (instead of 33%). CONGA’s advantage is clear, since it shifts traffic to the less congested paths. Note there is a small but real advantage of CONGA over CONGA-Flow.
Note CONGA-Flow is still better than MPTCP, because it proactively detects congestion and avoids it at the start of each flow, while MPTCP is reactive.
They test incast using a very small experiment, with one server repeatedly asking for a 10MB file striped across N other servers (N = 1..63), with each source sending back 10MB/N bytes. Since they only used one requester, this tests incast, but does so while the network fabric is at very low utilization, limited by the bandwidth of the requesting server’s NIC. MPTCP fares badly in this experiment.
They then tested with an HDFS Map-reduce workload which writes a 1TB file into HDFS with 3-way replication.
They also ran some larger-scale simulations, simulating a cluster of 384 servers, interconnected with 8 leaf and 12 spine-switches, running a web-search workload. They varied over-subscription ratios from 1:1 to 5:1. They also used the simulations to vary the link speeds, and the amount of asymmetry.
Results of simulations were qualitatively the same as for the test bed.
Conclusions
- CONGA’s improvement over ECMP shows up at lower load levels the closer the access link speed (at the server) is to the network fabric link speed (because if the server link is slower, each fabric link can carry more flows in parallel before becoming congested).
- CONGA achieves near-optimal load balancing across multiple random-failure combinations.
Some additional notes
- The article notes that the fact that each switch makes its own “selfish” decision leads to a global result that may not be optimal. They analyze this “Price of Anarchy” and show that the worst-case result is that CONGA leads to a global result that is 50% of the optimum.
- They test the effect of the workload’s typical flow sizes, and conclude that ECMP works well for smaller flows, while CONGA (and flowlet usage in general) mostly benefits heavier workloads with more large flows.
- Since CONGA reacts well to asymmetry, it follows that it can be deployed gradually – that is, it works well even if CONGA is only applied to a subset of datacenter traffic. Traffic that CONGA does not control simply creates bandwidth asymmetry – congested links that CONGA will route around.
Main conclusions of the paper authors
- Load balancing for datacenters is best done in the network, not at the transport level
- Datacenter load balancing needs global congestion awareness, and local-only information is not enough to handle asymmetry
- CONGA works better than ECMP and MPTCP, especially with heavier workloads and any network asymmetry
Final Thoughts/Notes about this paper
- Flowlet switching by itself is not a new idea, and dates back at least to 2004 (see “Harnessing TCP’s Burstiness with Flowlet Switching”, HotNets 2004, and “Dynamic Load Balancing Without Packet Reordering”, SIGCOMM Comput. Commun. Rev., 37(2):51–62, Mar. 2007). This article, and several others (e.g. Clove, LetFlow), all seem to show that flowlet-based path selection leads to better results – and yet flowlets do not seem to be in widespread use.
- It is interesting to compare the load-balancing taxonomy presented here with the one presented in "Clove" (IIGR: Clove - Congestion Aware Load balancing at the Virtual edge)
- A consideration that may limit the utility of flowlet switching, and explain why it is not universally adopted, may be the size of the flowlet table relative to the number of flows at scale. If we hash 1M flows passing through a switch into a 64K-entry flowlet table, we get roughly 16 flows per entry, so to detect a flowlet gap there needs to be a long-enough gap in the transmission of all 16 flows hashing to the same entry at once. The more flows, the lower the chance this will happen, with the result tending to behave exactly like ECMP. Note that even then, while we may not see any benefit from flowlet switching, we will be no worse off than ECMP.
- An issue to consider is also the cost of out-of-order packets. If we have end-systems (software stacks or the NICs in use) that can effectively deal with out-of-order packets, the flowlet gap timer could be set shorter. At the extreme, a switch might potentially see each packet as a flowlet, and decide the path for each packet independently.
- For TCP in particular, an occasional out-of-order packet might be a non-event. If the time difference between arrivals of the out-of-order packets is short enough (less than the time needed to generate 2-3 duplicate ACKs), then TCP will (usually) take this in stride.
- The article does not detail WHERE in the VXLAN header they put the path-ID and path congestion metric bits exchanged between source and destination leaf switches. If they follow VXLAN’s RFC guidance about bits that MUST be set to 0, it follows they put these bits as part of the VNI field, which would reduce the number of overlay networks supported per domain from 2^24 to 128K (and in fact, I am guessing they might be using an 8th bit to denote direction, which would leave 64K instances – probably good enough for a given DC sub-domain).
- Note CONGA requires both changes in switch hardware (to measure per-link congestion using DRE and to mark packets with the results) and some non-standard use of overlay headers, to carry the LBTag and CE metric values.
- It seems they COULD have tested incast in conjunction with a high network load by using many servers as requesters – one is left to wonder why this was not done, or at least not reported in the article. Similarly, one wonders why incast was tested with a 10MB file while HDFS was tested with a 1TB file – the discrepancy is jarring.