IIGR: Let it Flow: Resilient Asymmetric Load Balancing with Flowlet Switching
Michael Orr
Computer Networking Architect/Product Manager/Customer Liaison, Due-Diligence, Tech Scout/Ambassador/evangelist, Inventor (40+ Patents), Dad, Martial Arts practitioner (Karate, Aikido), Stock & Option Trader
Who & Where: MIT, Cisco (note the overlap with the authors of CONGA), NSDI 2017
Original article here: https://people.csail.mit.edu/alizadeh/publications.html
TL;DR for the genius-in-a-hurry:
LetFlow is a simplified variant of CONGA, published 3 years later. Both use flowlet switching, but where CONGA collects and communicates leaf-switch-to-leaf-switch congestion data to select the path for each flowlet, LetFlow simply selects a new path for each flowlet at random from the available paths.
CONGA:
1. Switches along the way mark passing packets to indicate the level of congestion
2. The target leaf switch aggregates the congestion information in received traffic and opportunistically sends it back to the source leaf switch, piggy-backed on traffic going that way
3. The sending leaf switch uses the congestion information it got to decide on the path for the next flowlet (after the next gap)

LetFlow:
1. (nothing similar)
2. (nothing similar)
3. The sending leaf switch randomly selects a path for the next flowlet (after the next gap)
Surprisingly, this gets you 50-85% of the benefits of CONGA (which many consider near-optimal load balancing) at a much lower implementation cost.
LetFlow’s advantage over ECMP, like CONGA’s, increases for an asymmetric topology (e.g. one with a failed link, or with especially heavy congestion on some paths).
The authors show that flowlets change size – they become shorter on congested paths and longer on non-congested ones – and they show this is a key insight behind LetFlow’s performance.
I think this is a different aspect of an observation made in the “Clove” article: even when flowlets are sent randomly, they are still indirectly congestion-sensitive. Packets sent along a congested path arrive more slowly, and hence slow down ACKs/responses. This tends to create more flowlet gaps, making the flowlet more likely to be re-routed to a new, less congested path.
Article Summary
This article came out 3 years after CONGA, and it is worthwhile to note the overlap between the authors of the two articles, notably Tom Edsall and Mohammad Alizadeh.
LetFlow is a simplified version of CONGA, and the authors show that for a much lower implementation cost you get the lion’s share of the benefits of CONGA. The article investigates why this unintuitive result happens.
The main idea/difference here is that path-selection for each flowlet is done at random, as opposed to CONGA which worked hard to collect congestion information along each path, communicated it from Edge-switch to Edge-switch, and used this information when doing path selection.
LetFlow
One of the goals of a DC network is to load-balance traffic across available paths, so as to avoid having some paths overloaded while other paths stand under-utilized. The de-facto standard for doing that is ECMP. ECMP sends all packets of a flow (i.e. packets that get the same result from hashing their 4-tuple or 5-tuple header values) along the same downstream path at any switching point.
ECMP guarantees no out-of-order packets (because all packets of a given flow are sent along the same path), but has well-known issues, especially when networks are “asymmetric” – either because of some network fault (e.g. a failed link or switch) or because of very different loading of paths. ECMP can easily send several “Heavy” flows to a given path, while sending “light” flows along another path, thus CREATING imbalance in network usage.
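To make the hashing idea concrete, here is a minimal sketch of per-flow ECMP path selection. The field names, the use of MD5, and the path list are purely illustrative (real switches use hardware hash functions such as CRC over the header fields):

```python
import hashlib

def ecmp_path(five_tuple, paths):
    """Pick a path by hashing the packet's 5-tuple (illustrative sketch only)."""
    # All packets of a flow share the same 5-tuple, so they always land on the
    # same path: no reordering, but also no reaction to congestion or failures.
    key = "|".join(str(field) for field in five_tuple).encode()
    digest = int(hashlib.md5(key).hexdigest(), 16)
    return paths[digest % len(paths)]

paths = ["spine0", "spine1", "spine2", "spine3"]
flow = ("10.0.0.1", "10.0.1.5", 6, 34712, 80)  # src IP, dst IP, proto, sport, dport
print(ecmp_path(flow, paths))  # the same flow always maps to the same path
```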
There are LOTS of papers on alternatives to ECMP for traffic load balancing, each with different attributes. The authors present this taxonomy:
Experience shows that dealing with asymmetry in the network – not all paths are equally available – is a major consideration, because (1) this is a common occurrence in hyper-scale data centers and (2) when it occurs, it has major performance implications. For these reasons the table above calls it out specifically.
The article refers a lot to CONGA, and in general, many articles about load-balancing traffic in DC networks refer to CONGA and present it as a near-optimal solution. For example, Figure 2 of the article shows flow-completion time for ECMP (per-flow load balancing), CONGA (per-flowlet) and, as an extreme case, what happens if you make a path-selection decision per packet, with static weights (i.e. they don’t even have to be “equal weights”, just statically assigned, say in proportion to the path bandwidth).
The problem of handling asymmetry is that the correct allocation of traffic to paths generally depends on real-time traffic demands and congestion, and so ANY static traffic-to-path weighting is generally “wrong”.
“In summary, existing load balancing designs either explicitly use global path congestion information to make decisions or do poorly with asymmetry. Also, static weights based on the topology or coarse traffic estimates are inadequate in dynamic settings.”
The main Idea of LetFlow is to simplify CONGA.
CONGA proposed the following key building blocks
- Do load-balancing (i.e. make path-selection decisions) by flowlets in switches
- All switches along the way mark passing packets to indicate the level of congestion at the outgoing Interface
- Target Leaf-switch opportunistically piggy-backs congestion information it received and sends it back to the source leaf-switch
- Sending switch uses the feedback about congestion level on each path to do the path selection.
LetFlow essentially does only step #1, and replaces steps 2..4 by randomly selecting a path per flowlet from the available paths towards the target.
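Here is a minimal sketch of that per-flowlet decision as I read it; the flowlet-table layout, the timeout value, and the function name are my own illustration, not taken from the paper:

```python
import random
import time

FLOWLET_TIMEOUT = 500e-6  # seconds; illustrative, in the range the paper explores

# Flowlet table: flow-hash -> (path chosen for the current flowlet, last-seen time)
flowlet_table = {}

def letflow_select_path(flow_hash, available_paths):
    """Per-packet decision: if the gap since this flow's last packet exceeds the
    flowlet timeout, start a new flowlet on a uniformly random path; otherwise
    keep using the current flowlet's path."""
    now = time.monotonic()
    entry = flowlet_table.get(flow_hash)
    if entry is None or now - entry[1] > FLOWLET_TIMEOUT:
        path = random.choice(available_paths)  # new flowlet -> pick a path at random
    else:
        path = entry[0]                        # same flowlet -> stay on the same path
    flowlet_table[flow_hash] = (path, now)     # refresh the last-seen timestamp
    return path
```

Note that everything the switch needs here is local: there is no congestion marking and no feedback from the far end, which is exactly why the implementation cost is so much lower than CONGA’s.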
The results, on a small test-bench and a simulation of a larger topology, show that
- LetFlow is always better than ECMP
- LetFlow achieves most of the benefits of CONGA
- While CONGA was designed for a 2-tier network, LetFlow works also for Multi-Tier network topologies
- LetFlow also plays well with various transport protocols, and its benefits are relatively larger when switches are shallow-buffered – which occurs more frequently as link speeds increase
Flowlets are GOOD
A flowlet, for both CONGA and LetFlow, is a set of packets that belong to a given flow, but are separated from previous packets of the same flow by a long-enough gap in time, so that if we send the packets after the gap along a different, possibly shorter/faster path, we will still not get out-of-order packets at the target.
Flowlets arise for many reasons. TCP’s normal operation is to send a group of packets and then stop and wait for an ACK to come back, which leads to flowlets; but more generally, many/most applications tend to have gaps in communication while they calculate the next payload to send, or deal with the payloads received in the current batch of packets.
Flowlet detection is done by setting a timer to measure how long it has been since the switch forwarded a packet in a given flow, and setting a threshold value – the flowlet timeout – to decide that the gap is “long enough” and it is safe to declare the next packet in this flow to be the beginning of a new flowlet. Intuitively, a timeout that is larger than the maximum one-way delay (including queuing time) will be 100% safe – guaranteed to never cause out-of-order delivery.
“More precisely, the gap between two flowlets must exceed a threshold (the flowlet timeout) that is larger than the difference in latency among the paths; this ensures that packets are received in order”.
A timeout set VERY short will cause us to send each packet along a different path (as in Figure 2 above). A timeout that is very long will cause us in effect to do ECMP and send ALL packets of a given flow along the same path.
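As a back-of-the-envelope illustration (my own numbers, not from the paper): if the fastest path through the fabric delivers a packet in 40μs and the slowest, including queuing, takes 340μs, the latency difference is 300μs. Any flowlet gap larger than that guarantees the last packet sent on the old path has already arrived before the first packet of the new flowlet can overtake it, so re-routing the new flowlet cannot cause reordering. This is consistent with the few-hundred-microsecond timeout values explored later in the article.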
An important result of using flowlets is that flowlet sizes (i.e. the number of packets between flowlet gaps) change to adapt to path congestion. On two paths with relative capacities of 100% and 50%, ideal load balancing would send traffic at a 2:1 ratio. If for any reason flows are allocated wrongly (with ECMP because it does not know the current congestion level, with LetFlow due to random selection), flowlet sizes change dramatically to compensate – and this happens “automatically”.
“If the load balancer uses the correct weights (66% for S0, 33% for S1) then the flowlet sizes on the two paths are roughly the same. But if the weights are incorrect, the flowlet sizes adapt to keep the actual traffic balanced. For example, even if only 2% of flowlets are sent via S0, the flowlets on the S0 path grow ~ 65× larger than those on the S1 path, to keep the traffic reasonably balanced at a 57% to 43% split”.
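A quick sanity check of those numbers: if 2% of flowlets go to S0 but each S0 flowlet is about 65× larger than an S1 flowlet, then S0’s share of the traffic is roughly (0.02 × 65) / (0.02 × 65 + 0.98 × 1) = 1.3 / 2.28 ≈ 57%, which matches the 57%/43% split the authors report.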
In another experiment, when available bandwidth is 4:1, as in one path at 40Gbps and the other at 10Gbps, LetFlow arrives (after some delay) at a very good proportional load-balancing state:
Flowlet Time-out Trigger value
It is clear that the Flowlet-timeout (Δ) value is critical to any flowlet-based load-balancing scheme. A very-large timeout value will cause the load-balancer to behave like ECMP, and a very small one will cause the load-balancer to re-route every packet, potentially causing a lot of Out-of-Order packets to be delivered.
The authors both test empirically which value should be used as the flowlet-detection gap, and develop a formula for it based on a Markov-chain analysis. The main results from this analysis are as follows:
When sending 100 similar flows along two paths of 10G and 40G, the ideal load-balanced distribution is 4:1, with 80 flows on the 40G link and 20 on the 10G link. They test the effect of changing the timeout value from 50μs to 300μs.
As can be seen, if the timeout value is either too small or too large, you get 50%/50% or 65%/35% splits, but for a middle range of 300μs-500μs you get very close to the ideal distribution.
They also wanted to see how robust this choice is, and ran simulations to obtain Δmin and Δmax, the range of values for which load balancing is close to ideal, as a function of the average flow size. As can be seen, good results are obtained at the 500μs value.
Final Thoughts/Notes
- Once again, random selection proves surprisingly effective compared to sophisticated efforts (e.g. as was shown in Jellyfish: Networking Data Centers Randomly, where randomly wiring servers in a data center proved as efficient as carefully designed topologies). Another related example can be seen in CLOVE, where “edge-flowlet” performed quite well (in CLOVE, the sending node load-balances by flowlets between available paths, using a clever hack of changing the value of the source TCP/UDP port to make switches send the flowlet along different paths)
- The authors note that LetFlow (like CONGA before it) WAS implemented in silicon for a major datacenter (one can guess at the identity of the datacenter in question from the authors of this article and of CONGA before it). Given that this is available in silicon, and the clear benefits shown here (and in many other articles about flowlet switching), I have to wonder why flowlet-based load-balancing is not the de-facto standard.
- Flow-hash collisions and flowlet tables. If a LARGE number of flows passes through a switch (or, in the case of CLOVE, through the sending shim), then, since flows and flowlets are identified by hashing header values, we may well get hash collisions. For example, if we have a 64K-entry flowlet table and we are switching 1M flows, then on average 16 flows will hash to the same flowlet-table entry (note that the same thing happens with ECMP). This means that a flowlet gap will be detected only if all 16 flows hashing to the same entry have a gap between successive packets longer than the flowlet-gap timeout value AT ONCE (see the rough numbers sketched below). This will reduce, and perhaps eliminate, the benefits of flowlet-based load balancing. Note that it will not make things worse – the switch will basically behave as ECMP, since no flowlet gaps will be detected. This can be mitigated in many ways, either by making the flowlet table bigger or by sending only a SUBSET of the flows to the flowlet table. For example, if we do any form of “Elephant/Mice” flow classification, we could send all mice flows using ECMP and apply flowlet switching only to large flows. As per the measurements reported by Google in their “Andromeda” paper, there are far fewer “elephant” flows, and so the number of hash collisions will go down dramatically. Another option, of course, is to assign priorities (by application, by sender, etc.) and apply flowlet switching only to high-priority flows.
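To put rough numbers on the collision argument, here is a small back-of-the-envelope sketch (my own, using the illustrative 64K-entry table and 1M flows from the bullet above, with a Poisson approximation for balls-in-bins):

```python
import math

def flows_per_entry(num_flows, table_entries):
    """Average number of active flows that hash to a single flowlet-table entry."""
    return num_flows / table_entries

def prob_entry_unshared(num_flows, table_entries):
    """Probability that a given flow does NOT share its flowlet-table entry with
    any other flow (Poisson approximation), i.e. its flowlet-gap detection is
    unaffected by hash collisions."""
    other_flows_per_entry = (num_flows - 1) / table_entries
    return math.exp(-other_flows_per_entry)

print(flows_per_entry(1_000_000, 64 * 1024))      # ~15.3 flows per entry on average
print(prob_entry_unshared(1_000_000, 64 * 1024))  # ~2e-7: essentially every entry is shared
```

With essentially every entry shared by ~16 flows, the odds that all of them pause at once for longer than the flowlet timeout are tiny, so almost no gaps are detected and the behavior degrades towards ECMP, exactly as argued above.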