IIGR: "Clove" - Congestion-Aware Load Balancing at the Virtual Edge

Who: Salesforce, Princeton, VMware & Barefoot.

Original Article here: ACM Library/Clove

TL;DR For the Busy Genius

1.      The sending vSwitch sets different values in the L4 source port, so that switches doing their normal ECMP compute different hash results and send packets along different routes. The sending VM has no idea this is happening, and the sending vSwitch does not know/control/care how ECMP in the switches is configured. (A toy illustration of the hash trick follows this list.)

2.      The sender sends several traceroute sets with different L4 source-port values to each destination, builds a map of all paths to that destination from the switches' responses, and then selects the port values that yield maximum path diversity.

3.      Data is sent in flowlets, each flowlet with a different L4 source port so that it takes a different path. (Flowlets are bursts of packets within a flow, separated by a time gap long enough that later packets sent along a different path cannot arrive out of order.)

4.      The path for each new flowlet is either selected randomly ("Edge-Flowlet"), or ECN or even INT is used to figure out which paths are congested, reducing the likelihood of using the L4 source-port values that make ECMP send flowlets along congested paths. The ECN and INT variants require using reserved header bits in a proprietary way (they show this done in STT headers).

5.      Compared against CONGA as a near-ideal baseline, this gets you most of the benefit with existing switches and no change to guest VM stacks. The INT variant does best, ECN is next, but even random selection does much better than ECMP.
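
To make the trick concrete, here is a toy Python illustration (mine, not the paper's code): ECMP switches typically hash the 5-tuple and pick an uplink by hash modulo the number of equal-cost paths, so changing only the L4 source port usually lands the flow on a different uplink. The CRC32 hash and field layout here are assumptions; real switches use vendor-specific hash functions.

```python
import zlib

def ecmp_next_hop(src_ip, dst_ip, proto, src_port, dst_port, n_paths):
    """Toy ECMP: hash the 5-tuple and pick one of n equal-cost uplinks.
    CRC32 is just a stand-in for the switch's real hash function."""
    key = f"{src_ip},{dst_ip},{proto},{src_port},{dst_port}".encode()
    return zlib.crc32(key) % n_paths

# Same flow, different L4 source ports -> (usually) different uplinks.
for sport in (49152, 49153, 49154, 49155):
    path = ecmp_next_hop("10.0.0.1", "10.0.1.2", 6, sport, 80, n_paths=4)
    print(f"src_port={sport} -> uplink {path}")
```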

Summary

Key Idea:

… Since ECMP relies on static hashing, the virtual switch at the source hypervisor can change the packet header to influence the path that each packet takes in the ECMP-based physical network.

Details

1)     All the work is done by a bump-in-the-wire software shim in the sending vSwitch. The sending/receiving VMs are not involved/modified in any way.

2)     The sending vSwitch runs traceroute to each destination, repeating with different L4 source-port values, and builds a map of which value leads to which path from the responses coming back. This is essentially reverse-engineering what the ECMP in the switches does. (A sketch of the probing idea follows.)
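
A minimal sketch of this probing idea, assuming Scapy is available (and root privileges to send raw packets); the port range, probe count, and addresses are illustrative guesses, not Clove's actual implementation.

```python
from scapy.all import IP, UDP, ICMP, sr1  # requires scapy and root privileges

def discover_path(dst_ip, src_port, max_ttl=8):
    """Traceroute with a chosen L4 source port: each TTL-expiring probe
    makes one hop answer with ICMP Time Exceeded, revealing the route
    that ECMP picked for this particular 5-tuple."""
    hops = []
    for ttl in range(1, max_ttl + 1):
        reply = sr1(IP(dst=dst_ip, ttl=ttl) / UDP(sport=src_port, dport=33434),
                    timeout=1, verbose=False)
        if reply is None:
            hops.append(None)          # silent hop
            continue
        hops.append(reply[IP].src)
        if reply.haslayer(ICMP) and reply[ICMP].type == 3:
            break                      # Destination Unreachable: we arrived
    return tuple(hops)

# Probe the same destination with many source ports and keep one port per
# distinct path: these ports become per-destination "path handles".
paths = {}
for sport in range(49152, 49252):
    paths.setdefault(discover_path("10.0.1.2", sport), sport)
print(f"{len(paths)} distinct paths found")
```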

3)   The sending vSwitch uses flowlets, and sends each flowlet with a new L4 source port, hence along a different path. The flowlet timeout is set to 2×RTT. (A sketch of the flowlet logic follows the note below.)

  • Note that other flowlet-based load-balancing schemes (e.g. CONGA, LetFlow) reach other conclusions as to what the correct flowlet timeout should be. CONGA in particular uses 500 microseconds as its default.
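
A sketch of what the flowlet bookkeeping in the sending vSwitch might look like; the per-flow table and the 500 µs RTT figure are my assumptions for illustration, not Clove's actual OVS code.

```python
import random
import time

RTT = 0.0005               # assumed ~500 us datacenter RTT (illustrative)
FLOWLET_TIMEOUT = 2 * RTT  # the paper sets the flowlet gap to 2 x RTT

class FlowletRouter:
    """Tracks per-flow state (last packet time, current source port).
    A gap longer than FLOWLET_TIMEOUT starts a new flowlet, which can
    safely take a new path without reordering the earlier packets."""
    def __init__(self, path_ports):
        self.path_ports = path_ports  # source ports mapped to distinct paths
        self.flows = {}               # 5-tuple -> (last_seen, src_port)

    def pick_port(self, flow_key, choose=random.choice):
        now = time.monotonic()
        last_seen, port = self.flows.get(flow_key, (None, None))
        if last_seen is None or now - last_seen > FLOWLET_TIMEOUT:
            # New flowlet: re-select the path. Plain random.choice gives
            # Edge-Flowlet; a congestion-weighted chooser gives Clove-ECN/INT.
            port = choose(self.path_ports)
        self.flows[flow_key] = (now, port)
        return port
```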

4)   They either choose randomly which path to use for each flowlet, or try to find out which paths are congested and reduce the probability of using them the more congested they appear.

a)   When the path is chosen randomly, they call it "Edge-Flowlet"

An important observation is that even though selection is random, it is still indirectly congestion-sensitive. When a flowlet goes onto a congested path, ACKs or responses from the receiver take longer to come back, so the sender is more likely to hit the flowlet timeout and reroute to another path. On less congested paths, the sender is less likely to see the timeout and less likely to move. Thus congested paths tend to be less utilized "naturally".

 

b)   To discover congested paths, they either use ECN and wait for the receiver to reflect ECN notices back to the sender, or use INT to collect congestion along each path and have the receiver tell the sender the maximum congestion value seen along each path.

  • ECN: The receiver must be able to tell the sender WHICH path was congested, by sending back the value of the L4 source port that was used. They can't carry this value as the L4 destination port, since that would be taken as "the application", so they have to put it somewhere else. They suggest using reserved bits in tunnel headers, and show an example using some of the 64-bit "context" field of STT. The receiver does not need to respond with ECN to EVERY arriving packet, only at an "ECN-relay frequency"; the receiver should send ECN echoes at a higher frequency than the sender does path selection (i.e. the flowlet timeout), and they recommend RTT/2 for this frequency. The article calls this variant "Clove-ECN".
  • INT (In-band Network Telemetry): INT-capable switches write the congestion level into packets, and the receiver returns the maximum value to the sender. This tells the source not only that there WAS congestion (as with ECN) but also how much. This also requires reserved bits to indicate WHICH path is involved. The article calls this variant "Clove-INT". (A sketch of how this feedback can drive path weights follows below.)

An alternative congestion-discovery method they suggest, but did not try, is to use measured path latency as the path-selection criterion (e.g. send time-stamped packets and have the receiver return the measured latency to the sender).
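
To make the feedback loop concrete, here is a sketch of how ECN/INT echoes might drive per-path weights. The one-third reduction factor and the even redistribution are illustrative assumptions, not necessarily the paper's exact update rule.

```python
import random

class PathWeights:
    """Clove-ECN-style weighted path choice: each discovered path (keyed
    by its L4 source port) starts with equal weight; an ECN echo for a
    path shrinks its weight and redistributes the cut to the others."""
    def __init__(self, path_ports):
        self.weights = {p: 1.0 / len(path_ports) for p in path_ports}

    def on_congestion(self, port, factor=1.0 / 3):
        # factor is an illustrative assumption; for Clove-INT it could
        # scale with the congestion level the receiver reports.
        cut = self.weights[port] * factor
        self.weights[port] -= cut
        others = [p for p in self.weights if p != port]
        for p in others:
            self.weights[p] += cut / len(others)  # redistribute evenly

    def choose(self, _ports=None):
        ports = list(self.weights)
        return random.choices(ports,
                              weights=[self.weights[p] for p in ports])[0]
```

Passing pw.choose as the choose argument of the FlowletRouter sketch above would turn the Edge-Flowlet behavior into the Clove-ECN behavior.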

Parameters and their tuning effect:

  • A low flowlet time-gap increases packet reordering at the receiver, while a large flowlet time-gap leads to coarse-grained flowlets, increasing the possibility of congestion.
  • At a low ECN-relay frequency, Clove makes suboptimal choices based on stale ECN information, while if it is too high, it incurs high overhead for processing ECN information in the software data path.


  • Clove is robust to small shifts in the Flowlet time-gap and the ECN relay frequency (both between 1-5 RTTs), but is more sensitive to the ECN threshold.
  • They noticed that if the ECN threshold is set a few segments above the 20-packet limit, Clove reacts very slowly to elephant-flow collisions. However, if the threshold is set lower, Clove over-reacts to the typically bursty traffic sent by TCP segmentation offload.


Results :

They tested against ECMP (and against a modified PRESTO) on a small 2-tier leaf-spine PoC network (32 servers, 2 spines, 2 leaves), and against CONGA in simulations of the same topology.

They claim to sustain 40 Gbps per hypervisor (unclear how many cores/threads per hypervisor were used).

… our edge-based schemes help improve upon ECMP in terms of average and 99th-percentile flow completion time, and that their performance gains get increasingly close to those of hardware-based CONGA. Specifically,
  1. Edge-Flowlet captures some 40% of the performance gained by CONGA over ECMP
  2. Clove-ECN captures 80%
  3. Clove-INT captures 95%, coming closest to CONGA's performance


My take:

  • They do all the work in software, in the hypervisor. This is good for VMware, who are used to forcing their clients to use VMware's software. For all others, it seems this could be/should be offloaded to the NIC, still in bump-in-the-wire form.
  • Even just doing random selection (Edge-Flowlet) seems worthwhile over ECMP
  • Even without the traceroute probes, just mapping flowlets to random L4 source-port values would probably show benefit (especially if you choose values with the maximum number of bit differences between them). The sender could also snoop the time ACKs take to come back for different L4 source-port values, and indirectly deduce which values cause the greatest path diversity
  • Their untried idea, inferring congestion from end-to-end latency, sounds useful and interesting to try
  • It is interesting to compare the load-balancing taxonomy presented in this article to the one presented by the authors of CONGA: IIGR: CONGA
