IIGR: Packet-Level Telemetry in Large DC Networks (“Everflow”)

Who: Microsoft, UCSB, Princeton

Where: SIGCOMM 2015

Original article here: MS Research/Everflow

TL;DR for the genius-in-a-hurry:

The authors make the case that datacenter networks require packet-level telemetry, and that static traces are not enough.

Everflow uses standard commodity-switch features (Mirror-on-match, IP-in-IP and IP-in-GRE encapsulation/decapsulation and ECMP Load-balancing) to implement Packet-level telemetry/debugging at Large-DC scales.

The result is a good approximation of the benefits of INT (In-band Network Telemetry), but using standard, unmodified commodity switches.

The Everflow implementation uses the following elements:

  1. Switches along paths are installed with rules that match “interesting” packets and mirror them (or at least their first 64 bytes) to remote analyzers. Mirrored packets are sent in GRE “envelopes”. Packets to be traced (mirrored) are matched either on header/payload values or on a “debug me” flag that re-purposes one of the DSCP bits.
  2. “Probe” packets, with their “debug me” flag turned on, can be injected at a desired start-of-path switch by encapsulating them as IP-in-IP addressed TO the switch, and configuring a rule on that switch to decapsulate the packet and inject it into normal processing. This way they can send packets engineered to match desired use-cases (with any desired header values), starting at any given point, and have them mirrored along the path at each switch hop.
  3. It is even possible to use multiple layers of IP-in-IP-in-IP etc. to do a form of source-routing, so that a probe packet can be sent along a pre-selected set of switch hops.
  4. In order to be able to scale this system to Tbps, the mirrored packets are sent to switches configured as load-balancers in front of a set of analyzer programs running on servers. This allows scaling out by adding load-balancer switches and/or analyzer instances as needed. Since all such load-balancing is based on the 5-tuple values, all packets of a given flow are sent to the same analyzer, allowing analysis of a packet at multiple points along its path, each of which did its “match and mirror” independently.
  5. Since even the small percentage of traced packets sent to the analyzers can still be too large to save completely, the system saves packet traces that display an issue (e.g. packet drops, routing loops) and calculates various cumulative counters for the rest (e.g. link latency and usage stats); only the counter values are periodically saved.

Suitable applications make use of the saved packet traces and the collected/calculated metadata to allow debugging of network issues such as silent drops, various “gray” faults, mis-performing ECMP, low RDMA throughput, PFC storms, etc. The system is easily extensible to trap any packets of interest, and so collect or calculate additional metadata and flow-cumulative attributes.

Possible Issues with Everflow (not discussed in the article) are its re-use of a DSCP bit, and the fact that it uses TCAM rules for its fundamental match-and-mirror operation, which can create problems with other switch features that rely on TCAM rules, due to the “first and only match” nature of TCAMs. 

Article Summary

“Effective diagnosis requires intelligently tracing small subsets of packets over the network, as well as the ability to search the packet traces based on sophisticated query patterns, e.g., protocol headers, sources and destinations, or even devices along the path”

The authors cite some experience-based examples of network issues that occur in data-center networks, are hard to debug with conventional means, and require packet-level tracing/telemetry to resolve:

  • Silent packet drops – where packets are dropped but drop counters do not reflect this (due to SW or HW bugs)
  • Silent Black holes – A routing black hole that does not show up in routing tables (e.g. due to corrupted TCAM entry)
  • Inflated Latency for a flow – SYMPTOM is easily detected at end-hosts, but how to find which switch causes this?
  • Routing loops due to "middle boxes" – when a middle box wrongly modifies headers, causing a routing loop. All switch routing tables are correct, so this is hard to figure out without packet tracing
  • Load imbalance – how to tell if flows were very different in volume, or the switch ECMP hash worked poorly?
  • Protocol Bugs that induce network issues (BGP, PFC, RDMA)

Scale

They aim at a system to deal with a large-scale DC network – 100K servers, 100 Tbps of traffic, hundreds of switches.

They note that even if traced packets are truncated to 64 bytes you can’t trace ALL packets (as was proposed by one of the articles they cite), because the trace packets alone will consume 32 Tbps, which by itself will disrupt the network, and analyzing 32 Tbps of traffic will require lots of compute power (e.g. if every analysis core can handle 10 Gbps, you’ll need 3200 cores).
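Here is a quick back-of-envelope check of those numbers in Python (the average packet size is my own assumption, chosen so the paper's figures work out):

```python
# Back-of-envelope check of the "trace everything" cost (assumptions noted in comments).
TOTAL_TRAFFIC_BPS = 100e12   # ~100 Tbps of DC traffic (the scale target above)
AVG_PKT_BYTES     = 200      # assumed average packet size; not stated in this summary
MIRROR_BYTES      = 64       # mirrored copies are truncated to 64 bytes
CORE_RATE_BPS     = 10e9     # assume one analysis core handles ~10 Gbps

mirror_fraction = MIRROR_BYTES / AVG_PKT_BYTES            # 0.32
mirror_traffic  = TOTAL_TRAFFIC_BPS * mirror_fraction     # ~32 Tbps of mirrored traffic
cores_needed    = mirror_traffic / CORE_RATE_BPS          # ~3200 analysis cores

print(f"Mirrored traffic: {mirror_traffic/1e12:.0f} Tbps, cores needed: {cores_needed:.0f}")
```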

Key Ideas

Match-and-Mirror

They rely on the already-existing capability of commodity switches (most likely based on Broadcom silicon) to match packet headers and carry out a MIRROR operation when a match is found, where the original packet is forwarded normally, and a copy is generated (and optionally truncated to 64 bytes) and sent to a desired destination IP address.

They design two ways to use this capability:

  1. Install rules that match header values, e.g. match TCP packets with SYN, FIN or RST flags so as to capture the start and end of TCP sessions, or match on specific L4 port numbers to debug traffic of a specific application or to trace packets of a specific protocol (e.g. BGP, RDMA control-plane); see the sketch after this list
  2. Mark packets at a source point with a special “Trace-me” Bit and match on this bit at all downstream switches (see section about “Guided Probes” below) 
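A minimal Python sketch of the match-and-mirror idea; the rule format, field names, and the DSCP bit position are my own illustrative choices, not actual switch TCAM syntax:

```python
# Toy model of "match and mirror" rules (structure and field names are illustrative only).
DEBUG_DSCP_BIT = 0x01  # assumed position of the re-purposed "trace me" DSCP bit

RULES = [
    # Capture TCP session start/end: FIN (0x01), SYN (0x02) or RST (0x04) flags set.
    {"name": "tcp-ctrl", "proto": 6, "tcp_flags_any": 0x01 | 0x02 | 0x04},
    # Capture BGP traffic (TCP port 179) to debug routing-protocol issues.
    {"name": "bgp",      "proto": 6, "l4_port": 179},
    # Capture anything carrying the "debug me" bit (guided probes).
    {"name": "probe",    "dscp_mask": DEBUG_DSCP_BIT},
]

def should_mirror(pkt: dict) -> bool:
    """Return True if any rule matches; a real switch would also truncate and GRE-encap the copy."""
    for r in RULES:
        if "proto" in r and pkt.get("proto") != r["proto"]:
            continue
        if "tcp_flags_any" in r and not (pkt.get("tcp_flags", 0) & r["tcp_flags_any"]):
            continue
        if "l4_port" in r and r["l4_port"] not in (pkt.get("sport"), pkt.get("dport")):
            continue
        if "dscp_mask" in r and not (pkt.get("dscp", 0) & r["dscp_mask"]):
            continue
        return True
    return False

# Example: a TCP SYN to port 80 is mirrored; a plain ACK data packet is not.
print(should_mirror({"proto": 6, "tcp_flags": 0x02, "sport": 1234, "dport": 80}))  # True
print(should_mirror({"proto": 6, "tcp_flags": 0x10, "sport": 1234, "dport": 80}))  # False
```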

Load-balancing and “reshuffling” of mirrored packets

Even though Everflow will only mirror a very small, selected subset of packets (and in most cases only the first 64 bytes of each), at large data-center scale it might still generate a significant amount of traffic, more than a single analyzer can handle. So:

  1. Everflow needs to be able to load-balance the mirrored traffic it generates to a set of Analyzers.
  2. At the same time, Everflow needs to send all mirrored traffic of a given traced flow to the same analyzer, from multiple separate switches along the packet’s path.


This is done by leveraging the ideas in DUET, an article from SIGCOMM 2014 (one year earlier than Everflow), which combine switches' existing ECMP-based load-balancing capability with their ability to encapsulate packets, at wire speed, into IP-in-IP tunnels.

Each analyzer instance is given a Direct IP address (DIP). A switch is configured as a MUX (as per the ideas in DUET) and a Virtual IP address (VIP) is defined on it to represent the analysis service. All traced traffic is addressed to this VIP, which means it will get to the switch configured as a MUX, and that switch will select an actual analyzer and send the packet there, load-balancing between all analyzer instances. Since the load-balancing is ECMP-based, all packets having the same hash result (and so all packets of the same flow) will be sent to the same analyzer.
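A minimal sketch of the flow-affinity property; the hash function and DIP addresses are illustrative (a real switch uses its own hardware ECMP hash), but the point is that the selection depends only on the original packet's 5-tuple, not on which switch produced the mirrored copy:

```python
import hashlib

# Analyzer instances behind the reshuffler VIP (illustrative DIPs).
ANALYZER_DIPS = ["10.1.0.11", "10.1.0.12", "10.1.0.13", "10.1.0.14"]

def pick_analyzer(src_ip, dst_ip, proto, sport, dport):
    """ECMP-style selection: a deterministic hash of the 5-tuple of the *original* packet,
    so every mirrored copy of a flow lands on the same analyzer, regardless of which
    switch along the path mirrored it."""
    key = f"{src_ip}|{dst_ip}|{proto}|{sport}|{dport}".encode()
    h = int.from_bytes(hashlib.sha1(key).digest()[:4], "big")
    return ANALYZER_DIPS[h % len(ANALYZER_DIPS)]

# Mirrored copies from two different switches, same flow -> same analyzer.
print(pick_analyzer("10.0.1.5", "10.0.9.7", 6, 44321, 443))
print(pick_analyzer("10.0.1.5", "10.0.9.7", 6, 44321, 443))  # identical result
```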

This allows scalable analysis – we can add switches-configured-as-MUX, which can all use the same VIP (and then traffic TO the VIP will be load-balanced across them), and/or we can add analyzer instances as needed.

This means traced packets can scale to Tbps of traffic, and load-balancing them to multiple analyzers does not add significant latency – typically just a single Switch-transit time for a (usually) minimum-sized packet. 

Handling already-tunneled packets

If a packet is already tunneled (e.g. for load-balancing or network virtualization), then the MUXing rules in switches that serve as traced-traffic load-balancers can be installed to match on the destination IP of the inner “passenger” packet, as opposed to the outer “envelope”.

Guided Probes

Everflow allows the user to inject a packet with headers that match the flow to be debugged, but with a “trace-me” bit in the header, which will cause any properly-configured switch along the path to trace that packet. This way the user can “replay” what treatment packets of a given flow got at each switch [or try out a new configuration, to verify it WILL do the correct thing!] 

By injecting multiple probe packets with the same header values it is possible to look for periodic or intermittent issues and to differentiate between transient and persistent effects.

By injecting probe packets with varying header values, we can test if issues are specific to a given flow/application or more general/random.

They even extend probe-packet handling to implement a system of source-routing for probe packets, so they can be sent (and traced) along a desired path. This can be used in many ways, and an example they give is to get a good approximation of a link’s latency from switches that do not support time-stamping, as in Figure 2. A probe packet traverses the S1-to-S2 link twice and is traced each time – which is to say a copy is sent to the same analyzer when the packet is switched through S1 towards S2, and again when it is switched by S1 coming FROM S2. Since the path from S1 to the analyzer is the same in both cases, the time difference between the two mirrored packets is a good approximation of the S1->S2 link’s RTT.
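A minimal sketch of the timestamp arithmetic, assuming the analyzer records an arrival time for each mirrored copy (all names and numbers below are made up):

```python
# The same probe is mirrored by S1 twice: once going toward S2 and once coming back from S2.
# Both copies travel the same S1 -> analyzer path, so the difference in their arrival times
# at the analyzer approximates the S1<->S2 round-trip time.
mirrored_copies = [
    {"switch": "S1", "direction": "to_S2",   "analyzer_ts": 1_000_000.000150},  # seconds
    {"switch": "S1", "direction": "from_S2", "analyzer_ts": 1_000_000.000275},
]

t_out  = next(c["analyzer_ts"] for c in mirrored_copies if c["direction"] == "to_S2")
t_back = next(c["analyzer_ts"] for c in mirrored_copies if c["direction"] == "from_S2")
print(f"Estimated S1<->S2 RTT: {(t_back - t_out) * 1e6:.0f} us")  # ~125 us in this made-up example
```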

EverFlow re-purposes one of the DSCP bits as the "trace me" Debug bit.

Note that while this is not discussed in the article, re-purposing a DSCP bit may mean the network QoS policies will have to be adjusted not to make use of this bit for its usually intended role.



EverFlow Implementation

EverFlow has the following components

1.      Applications

·      Programs that use the packet-level information produced by EverFlow as a whole to debug traffic patterns and issues. For example: Latency profiler, Packet-drop debugger, Loop debugger, ECMP profiler, RoCEv2/RDMA debugger

 

2.      Analyzers

A set of Standard Compute servers, each of which receives a portion of traced traffic to process. See below for details



3.      Reshuffler (Load-balancer)

Load-balances traced traffic among analyzers while ensuring all traffic of a given flow (all traffic with the same 5-tuple) will be sent to the same analyzer, from all points in the network. Does this by programming switches to behave as MUXes, as described above.



4.      Controller

Coordinates the other components and exposes an API for applications to use, to guide which flows are of interest, inject guided-probe packets, etc. The controller configures switches with the needed match-and-mirror rules to send mirrored packets to the reshufflers (which then send them on to the analyzers).

5.      Storage

Keeps data from analyzers that needs to be stored long-term (see below)


Analyzer operations

 

·      Packet Traces – Analyzers keep an in-progress table of packet traces. For each traced packet one full copy is kept, plus per-hop metadata from each mirroring point; the trace is terminated when no new data arrives for 1 second. A trace is identified by the 5-tuple and IP ID of the original packet. Metadata collected is the IP of the switch where the mirroring occurred, timestamp, TTL, source MAC address (to identify the previous hop), and DSCP/ECN bits. An analyzer uses this information to check for routing loops and for packet drops (a minimal sketch of this appears after the list below). If packets were encapsulated by a software load balancer, analyzers can extract and use the inner packet’s headers.

·      Summary Counters – even though only the subset of packets matched by mirroring rules is sent to the analyzers, and load-balancing makes it possible to use as many analyzers as needed, this can still be too much to keep packet traces for all. For this reason, analyzers keep most packet traces on a temporary basis. Traces are saved to Storage only if they either reflect an issue (routing loops or packet drops), have the debug bit set (i.e. are guided probes), or are protocol traffic (e.g. BGP packets, RDMA control-plane packets). For all the rest, analyzers periodically summarize their contents into a set of counters, write only the counter values to storage, and discard the packet trace itself. The counters kept are:

1.      Link load counters – from packet traces that crossed a given link, analyzers can count total load (number of packets, number of bytes, number of flows, etc.) and maintain finer-grained counters as needed (e.g. load generated by a given application, or by a specific source or destination subnet). Analyzers can be dynamically directed as to which counters to keep as summaries for discarded packet traces.

2.      Latency counters – Analyzers calculate latency for each link from the timestamps of packets at each mirror point (and when they arrived at the reshufflers and analyzers).

3.      Dropped mirrored packets – it is possible that mirrored packets will themselves be dropped. This can usually be inferred when mirrored packets reach the analyzer from some switches along the path but not from others. This may be due to congestion near a reshuffler or analyzer, and this counter value can be used to decide to add or move analyzers and/or reshufflers.
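A minimal sketch of the per-trace bookkeeping and the loop/drop checks described above; the data layout, the timeout handling, and the "expected hop count" heuristic are my simplifications, not the paper's exact logic:

```python
from collections import defaultdict

# In-progress trace table keyed by (5-tuple, IP ID); values are the per-hop mirror records.
traces = defaultdict(list)

def add_mirrored_copy(five_tuple, ip_id, switch_ip, ts, ttl):
    """Called for every mirrored copy that reaches this analyzer."""
    traces[(five_tuple, ip_id)].append({"switch": switch_ip, "ts": ts, "ttl": ttl})

def analyze_trace(key):
    """Very rough versions of the loop/drop checks (my simplifications)."""
    hops = sorted(traces[key], key=lambda h: h["ts"])
    switches = [h["switch"] for h in hops]
    has_loop = len(switches) != len(set(switches))      # same switch seen twice -> loop
    # A drop is suspected if the trace stops before the destination is reached,
    # approximated here simply by "fewer hops than expected".
    expected_hops = 5                                    # assumed path length for this example
    suspected_drop = not has_loop and len(hops) < expected_hops
    return {"hops": switches, "loop": has_loop, "suspected_drop": suspected_drop}

flow = ("10.0.1.5", "10.0.9.7", 6, 44321, 443)
for i, sw in enumerate(["T0-a", "T1-c", "T2-b"]):        # trace ends early -> suspected drop
    add_mirrored_copy(flow, ip_id=0x1234, switch_ip=sw, ts=i * 1e-5, ttl=64 - i)
print(analyze_trace((flow, 0x1234)))
```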

Implementation notes

1.      They use one of the DSCP bits as the “trace me” Debug bit. This means the Data-Center QoS operations must be made to not use this bit, which may require adjustments

2.      They use bits from the IP ID field to create their own sampling system. For example, they claim that by matching on a specific value of 10 bits from the IP ID field, they effectively sample one out of 1024 packets. Since all switches are configured with the same value, they will all sample the same packets, thus tracing the sampled packets all along the path (see the sketch below). This assumes IP ID values are either randomly assigned (which would make the sampling random) or a counter value (in which case sampling will be triggered on every Nth packet). In general, IP ID values set by hosts can deviate from this assumption (e.g. see "A closer look at IP-ID behavior in the Wild"), but in a data-center context the DC operator may have enough control over hosts to enforce the desired behavior.

Note that, strictly speaking, using the IP ID field in this way violates RFC 6864 (Updated Specification of the IPv4 ID Field), which states that the IP ID field should only be used for de-fragmentation and packet de-duplication, though such uses of the IP ID field are not uncommon.
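A minimal sketch of the consistent-sampling idea; the number of bits follows the article's 1-in-1024 example, while the chosen match value is arbitrary:

```python
import random

SAMPLE_BITS  = 10            # match on 10 bits of the IP ID field -> 1-in-1024 sampling
SAMPLE_VALUE = 0x155         # arbitrary value all switches are configured to match on
MASK         = (1 << SAMPLE_BITS) - 1

def is_sampled(ip_id: int) -> bool:
    """Every switch applies the same test, so the *same* packets are traced at every hop."""
    return (ip_id & MASK) == SAMPLE_VALUE

# With uniformly random IP IDs, roughly 1 packet in 1024 is sampled.
random.seed(0)
hits = sum(is_sampled(random.randrange(1 << 16)) for _ in range(100_000))
print(hits, "sampled out of 100000 (expect ~98)")
```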

3.      Encapsulated (Tunneled) packets are handled by configuring rules to allow matching on the TCP/IP headers of the inner “passenger” packet. This means that only switches that support this capability (which is common, but not universal) can be used.

4.      Everflow uses TCAM-installed rules to configure switches to do the mirror-on-match. Given the “first and only match” nature of most TCAM implementations, this means that any packet matched by Everflow can’t be matched by any other TCAM rule. While the authors do not discuss this, it may need to be mitigated, especially since EverFlow, by default (and naturally), tries to match and mirror the more “important” packets – TCP start/end of session, routing and RDMA control packets, etc.

I assume this is, or at least can be, mitigated in various ways:

·      EverFlow might be given a dedicated TCAM matching stage, since many switches support 2 or 3 rounds of TCAM matching

·      If a TCAM rule is installed by any other system (e.g. a security ACL, or a QoS policy), the switch might need to merge the rules, so that upon a match both the Everflow “mirror and encapsulate” action and the actions imposed by the rule’s other purposes are applied to the packet. This merging would have to account for interactions (e.g. if a packet is to be dropped by an ACL and also matches an EverFlow rule – should it be mirrored? Or suppose a policy induces changes in the packet, as would be the case for encapsulating it for network virtualization or for QoS re-marking – should the packet be mirrored before or after these changes?)

5.      Everflow sends mirrored packets to the reshufflers encapsulated into GRE, using the mirrored-packet format defined in the paper.
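A rough Scapy sketch of what a mirrored copy might look like on the wire; the addresses, and the choice to carry the truncated original as an opaque GRE payload, are my assumptions (the paper defines the exact format):

```python
from scapy.all import IP, TCP, GRE, Raw  # requires scapy

# The original packet that matched a mirror rule (illustrative values).
original = IP(src="10.0.1.5", dst="10.0.9.7", id=0x1234) / TCP(sport=44321, dport=443, flags="S")

# The switch truncates the copy to its first 64 bytes and wraps it in GRE,
# addressed to the reshuffler VIP; its own loopback is the outer source.
truncated = bytes(original)[:64]
mirrored  = (IP(src="10.200.0.1", dst="10.100.0.100", proto=47)   # switch loopback -> reshuffler VIP
             / GRE()
             / Raw(load=truncated))

mirrored.show()   # inspect the layered packet; actually sending it would need raw-socket privileges
```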

Guided Probe packets

1.      To inject a probe packet starting at a desired switch, the packet is encapsulated with the target switch’s loopback address as the destination. The target switch is configured with a rule to match such incoming packets and decapsulate them, and the inner probe “passenger” (which will have its “debug” flag on) is then processed normally.

2.      To send a packet along a desired route, it is encapsulated multiple times, with each layer specifying the loopback IP address of the next-hop switch in the path (see the sketch after this list).

3.      In order to ensure Probe packets do not interfere with any Server operations, their TCP/UDP checksums are deliberately set to the wrong values, so that even if they are not discarded by the last-hop switch, they will be discarded by servers.
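A rough Scapy sketch of guided-probe construction combining points 1-3 above; the DSCP bit position, switch loopback addresses, and port numbers are my own illustrative choices:

```python
from scapy.all import IP, UDP  # requires scapy

DEBUG_DSCP_BIT = 0x01                      # assumed re-purposed DSCP bit ("trace me")
TOS_DEBUG      = DEBUG_DSCP_BIT << 2       # DSCP occupies the top 6 bits of the TOS byte

# Inner probe: headers chosen to match the flow under debug, "debug" bit set,
# and a deliberately wrong L4 checksum so end hosts will discard it.
probe = (IP(src="10.0.1.5", dst="10.0.9.7", tos=TOS_DEBUG)
         / UDP(sport=44321, dport=4791, chksum=0xdead))   # chksum fixed, so Scapy won't recompute

# Source-routing by nested IP-in-IP: the outermost layer targets the first switch's loopback;
# each switch decapsulates one layer and forwards what is left.
path = ["10.200.0.1", "10.200.0.2"]        # loopbacks of the switches to visit, in order
wrapped = probe
for hop in reversed(path):
    wrapped = IP(dst=hop, proto=4) / wrapped   # proto 4 = IP-in-IP

wrapped.show()   # layered probe; injecting it would normally be done by the Everflow controller
```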

Everflow Usage Experience

When the article was published, Everflow was in use on a pilot basis at a Microsoft data center, on about 500 switches. It is unknown if it is still being used. They cite several debugging examples from this deployment.

Final Thoughts/Notes 

1.      My hat is off to Everflow’s team. They managed to achieve a huge advance in network visibility, a reasonable approximation of what you can get with In-band Network Telemetry (INT), but using the already-installed, non-INT-capable switches.

2.      I can only wonder why I have not heard of Everflow (and similar) systems being in widespread usage.

Precision telemetry in the WAN needs to be done outside the core routers; this approach is very applicable when router-integrated telemetry isn’t possible.
