IIGR: "Clove" - Congestion-Aware Load Balancing at the Virtual Edge

Who: Salesforce, Princeton, VMware & Barefoot.

Original Article here: ACM Library/Clove

TL;DR For the Busy Genius

1.      The sending vSwitch sets different values in the L4 source port, so that switches doing their normal ECMP compute different hash results and send packets along different routes. The sending VM has no idea this is happening, and the sending vSwitch does not know/control/care how ECMP in the switches is configured. (A toy illustration of the hash trick follows this list.)

2.      The sender sends several traceroute sets with different L4 source-port values to each destination, builds a map of all paths to that destination from the switches' responses, and then selects the port values that yield maximum path diversity.

3.      Data is sent in flowlets, each flowlet with a different L4 source port so that it takes a different path. (Flowlets are bursts of packets within a flow, separated by a time gap long enough that later packets sent along a different path cannot arrive out of order.)

4.      The path for each new flowlet is either selected randomly ("Edge-Flowlet"), or ECN or even INT is used to figure out which paths are congested, reducing the likelihood of using the L4 source-port values that make ECMP send flowlets along congested paths. The ECN and INT variants require using reserved header bits in a proprietary way (they show this done in STT headers).

5.      Compared against CONGA as a near-ideal baseline, this gets you most of the benefit with existing switches and no change to guest VM stacks. The INT variant does best, ECN is next, but even random selection does much better than ECMP.
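
To make the trick concrete, here is a toy Python illustration (mine, not the paper's code): ECMP switches typically hash the 5-tuple and pick an uplink by hash modulo the number of equal-cost paths, so changing only the L4 source port usually lands the flow on a different uplink. The CRC32 hash and field layout here are assumptions; real switches use vendor-specific hash functions.

```python
import zlib

def ecmp_next_hop(src_ip, dst_ip, proto, src_port, dst_port, n_paths):
    """Toy ECMP: hash the 5-tuple and pick one of n equal-cost uplinks.
    CRC32 is just a stand-in for the switch's real hash function."""
    key = f"{src_ip},{dst_ip},{proto},{src_port},{dst_port}".encode()
    return zlib.crc32(key) % n_paths

# Same flow, different L4 source ports -> (usually) different uplinks.
for sport in (49152, 49153, 49154, 49155):
    path = ecmp_next_hop("10.0.0.1", "10.0.1.2", 6, sport, 80, n_paths=4)
    print(f"src_port={sport} -> uplink {path}")
```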

Summary

Key Idea:

… Since ECMP relies on static hashing, the virtual switch at the source hypervisor can change the packet header to influence the path that each packet takes in the ECMP-based physical network.

Details

1)     All the work is done by a bump-in-the-wire software shim in the sending vSwitch. The sending/receiving VMs are not involved/modified in any way.

2)     The sending vSwitch runs traceroute to each destination, repeating with different L4 source-port values, and builds a map of which value leads to which path from the responses coming back. This is essentially reverse-engineering what the ECMP in the switches does. (A sketch of the probing idea follows.)
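
A minimal sketch of this probing idea, assuming Scapy is available (and root privileges to send raw packets); the port range, probe count, and addresses are illustrative guesses, not Clove's actual implementation.

```python
from scapy.all import IP, UDP, ICMP, sr1  # requires scapy and root privileges

def discover_path(dst_ip, src_port, max_ttl=8):
    """Traceroute with a chosen L4 source port: each TTL-expiring probe
    makes one hop answer with ICMP Time Exceeded, revealing the route
    that ECMP picked for this particular 5-tuple."""
    hops = []
    for ttl in range(1, max_ttl + 1):
        reply = sr1(IP(dst=dst_ip, ttl=ttl) / UDP(sport=src_port, dport=33434),
                    timeout=1, verbose=False)
        if reply is None:
            hops.append(None)          # silent hop
            continue
        hops.append(reply[IP].src)
        if reply.haslayer(ICMP) and reply[ICMP].type == 3:
            break                      # Destination Unreachable: we arrived
    return tuple(hops)

# Probe the same destination with many source ports and keep one port per
# distinct path: these ports become per-destination "path handles".
paths = {}
for sport in range(49152, 49252):
    paths.setdefault(discover_path("10.0.1.2", sport), sport)
print(f"{len(paths)} distinct paths found")
```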

3)   The sending vSwitch uses flowlets, and sends each flowlet with a new L4 source port, hence along a different path. The flowlet timeout is set to 2×RTT. (A sketch of the flowlet logic follows the note below.)

  • Note that other flowlet-based load-balancing schemes (e.g. CONGA, LetFlow) reach other conclusions as to what the correct flowlet timeout should be. CONGA in particular uses 500 microseconds as its default.
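
A sketch of what the flowlet bookkeeping in the sending vSwitch might look like; the per-flow table and the 500 µs RTT figure are my assumptions for illustration, not Clove's actual OVS code.

```python
import random
import time

RTT = 0.0005               # assumed ~500 us datacenter RTT (illustrative)
FLOWLET_TIMEOUT = 2 * RTT  # the paper sets the flowlet gap to 2 x RTT

class FlowletRouter:
    """Tracks per-flow state (last packet time, current source port).
    A gap longer than FLOWLET_TIMEOUT starts a new flowlet, which can
    safely take a new path without reordering the earlier packets."""
    def __init__(self, path_ports):
        self.path_ports = path_ports  # source ports mapped to distinct paths
        self.flows = {}               # 5-tuple -> (last_seen, src_port)

    def pick_port(self, flow_key, choose=random.choice):
        now = time.monotonic()
        last_seen, port = self.flows.get(flow_key, (None, None))
        if last_seen is None or now - last_seen > FLOWLET_TIMEOUT:
            # New flowlet: re-select the path. Plain random.choice gives
            # Edge-Flowlet; a congestion-weighted chooser gives Clove-ECN/INT.
            port = choose(self.path_ports)
        self.flows[flow_key] = (now, port)
        return port
```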

4)   They either choose randomly which path to use for each flowlet, or try to find out which paths are congested and reduce the probability of using them the more congested they appear.

a)   When the path is chosen randomly, they call it "Edge-Flowlet"

An important observation is that even though selection is random, it is still indirectly congestion-sensitive. When a flowlet goes onto a congested path, ACKs or responses from the receiver take longer to come back, so the sender is more likely to hit the flowlet timeout and reroute to another path. On less congested paths, the sender is less likely to see the timeout and less likely to move. Thus congested paths tend to be less utilized "naturally".

 

b)   To discover congested paths, they either use ECN and wait for the receiver to reflect ECN notices back to the sender, or use INT to collect congestion along each path and have the receiver tell the sender the maximum congestion value seen along each path.

  • ECN: The receiver must be able to tell the sender WHICH path was congested, by sending back the value of the L4 source port that was used. They can't carry this value as the L4 destination port, since that would be taken as "the application", so they have to put it somewhere else. They suggest using reserved bits in tunnel headers, and show an example using some of the 64-bit "context" field of STT. The receiver does not need to respond with ECN to EVERY arriving packet, only at an "ECN-relay frequency"; the receiver should send ECN echoes at a higher frequency than the sender does path selection (i.e. the flowlet timeout), and they recommend RTT/2 for this frequency. The article calls this variant "Clove-ECN".
  • INT (In-band Network Telemetry): INT-capable switches write the congestion level into packets, and the receiver returns the maximum value to the sender. This tells the source not only that there WAS congestion (as with ECN) but also how much. This also requires reserved bits to indicate WHICH path is involved. The article calls this variant "Clove-INT". (A sketch of how this feedback can drive path weights follows below.)

An alternative congestion-discovery method they suggest, but did not try, is to use measured path latency as the path-selection criterion (e.g. send time-stamped packets and have the receiver return the measured latency to the sender).
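
To make the feedback loop concrete, here is a sketch of how ECN/INT echoes might drive per-path weights. The one-third reduction factor and the even redistribution are illustrative assumptions, not necessarily the paper's exact update rule.

```python
import random

class PathWeights:
    """Clove-ECN-style weighted path choice: each discovered path (keyed
    by its L4 source port) starts with equal weight; an ECN echo for a
    path shrinks its weight and redistributes the cut to the others."""
    def __init__(self, path_ports):
        self.weights = {p: 1.0 / len(path_ports) for p in path_ports}

    def on_congestion(self, port, factor=1.0 / 3):
        # factor is an illustrative assumption; for Clove-INT it could
        # scale with the congestion level the receiver reports.
        cut = self.weights[port] * factor
        self.weights[port] -= cut
        others = [p for p in self.weights if p != port]
        for p in others:
            self.weights[p] += cut / len(others)  # redistribute evenly

    def choose(self, _ports=None):
        ports = list(self.weights)
        return random.choices(ports,
                              weights=[self.weights[p] for p in ports])[0]
```

Passing pw.choose as the choose argument of the FlowletRouter sketch above would turn the Edge-Flowlet behavior into the Clove-ECN behavior.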

Parameters and their tuning effect:

  • A low flowlet time-gap increases packet reordering at the receiver, while a large flowlet time-gap leads to coarse-grained flowlets, increasing the possibility of congestion.
  • At a low ECN-relay frequency, Clove makes suboptimal choices based on stale ECN information, while if it is too high, it incurs high overhead for processing ECN information in the software data path.


  • Clove is robust to small shifts in the Flowlet time-gap and the ECN relay frequency (both between 1-5 RTTs), but is more sensitive to the ECN threshold.
  • They noticed that if the ECN threshold is set a few segments above the 20-packet limit, Clove reacts very slowly to elephant-flow collisions. However, if the threshold is set lower, Clove over-reacts to the typically bursty traffic sent by TCP segmentation offload.


Results :

They tested against ECMP (and against a modified PRESTO) on a small 2-tier leaf-spine PoC network (32 servers, 2 spines, 2 leaves), and against CONGA in simulations of the same topology.

They claim to sustain 40 Gbps per hypervisor (unclear how many cores/threads per hypervisor were used).

… our edge-based schemes help improve upon ECMP in terms of average and 99th-percentile flow completion time, and that their performance gains get increasingly close to those of hardware-based CONGA. Specifically,
  1. Edge-Flowlet captures some 40% of the performance gained by CONGA over ECMP
  2. Clove-ECN captures 80%
  3. Clove-INT captures 95%, coming closest to CONGA's performance


My take:

  • They do all the work in software, in the hypervisor. This is good for VMware, who are used to forcing their clients to use VMware's software. For all others, it seems this could be/should be offloaded to the NIC, still in bump-in-the-wire form.
  • Even just doing random selection (Edge-Flowlet) seems worthwhile over ECMP
  • Even without the traceroute probes, just mapping flowlets to random L4 source-port values would probably show benefit (especially if you choose values with the maximum number of bit differences between them). The sender could also snoop the time ACKs take to come back for different L4 source-port values, and indirectly deduce which values cause the greatest path diversity
  • Their untried idea, inferring congestion from end-to-end latency, sounds useful and interesting to try
  • It is interesting to compare the load-balancing taxonomy presented in this article to the one presented by the authors of CONGA: IIGR: CONGA
