Gaining Visibility to Discern Server I/O vs. Network Issues: A Guide for the Server Team
Introduction
In data center infrastructure, the lack of full network visibility down to the server itself creates serious limitations for IT management when troubleshooting problems. Is the problem in the server, the NIC, the ToR switch, or something deeper in the data center network?
This document outlines the path customers can take to quickly identify whether the server or the network is the root cause of I/O performance issues. In general, enterprise data centers are adopting turn-key technologies to drive agility and cost efficiency. The network layer is typically the least agile part of data center infrastructure to design, configure, and operate, especially when compared to compute and virtualization infrastructure. When performance issues and bottlenecks are diagnosed, network problems such as congestion and packet errors often become points of contention between the network operations team and the teams running servers or hyper-converged platforms such as Nutanix, VMware, Cisco, and HPE.
A typical challenge for a server administrator is quickly demonstrating that a packet-loss issue is not caused by the server I/O components (e.g., driver, Network Interface Card hardware) in the path of a flow. As an example, a troubleshooting ticket is about to be opened due to poor application performance, and data needs to be collected:
● The server team needs to quickly identify where the workloads exist across the data center floor.
● Are the workloads virtual machines, bare metal, containers, or a mixture of all of them?
● What physical servers do the workloads involved currently reside on?
● What top-of-rack switches do the servers connect to?
● Quickly grab a bi-directional full packet capture for all flows involved in the transaction.
● Identify if there is packet loss happening:
- at the PF or VF, to/from the PCIe bus,
- within the NIC, or
- on the uplinks connected to the top-of-rack switches
...for all servers in the path of a business transaction
● Present data to the network team demonstrating that it is not an I/O issue in the servers involved, so the network team can dig into the network performance issue causing the problem
The time spent gathering this data directly affects mean time to resolution (MTTR) and undermines predictive maintenance when finger-pointing ensues after a ticket is opened for poor performance or an outage. With the Pensando Distributed Services Platform (DSP), connecting the dots at the edge becomes fast and painless for a server administrator. The platform can be introduced gradually in an enterprise without a significant upfront investment, both in financial terms and in learning about the services offered.
Pensando’s Distributed Services Platform
The Pensando platform enables various network, security, and visibility services, implemented by a set of Distributed Services Cards (DSCs) and centrally managed and monitored by a Policy and Services Manager (PSM). DSCs are PCIe host adapters that are deployed in the data center servers. The PSM is a centralized management platform, leveraging an intent-based model that delivers pervasive visibility, network, and security policy to DSCs for services implementation at the edge. The PSM provides both a secure API and a GUI management framework.
The remainder of this document describes the journey by which the DSP can be used to solve visibility and troubleshooting challenges in x86 environments.
From the Host, DSCs Look Like Just Another NIC
The first step to introduce the Pensando solution in an Enterprise network consists of deploying DSCs in place of legacy NICs when ordering x86 servers. Once the server is powered on, and the driver for the DSC is installed, the host can be deployed and managed like any other server. Pensando offers drivers for all major modern x86 operating systems.
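As a quick sanity check on a Linux host, a short Python script like the following (a minimal sketch that enumerates interfaces via sysfs) can list each network interface and the kernel driver bound to it, confirming that the DSC shows up just like any other NIC once its driver is installed. The upstream Linux driver for Pensando DSC Ethernet devices is named "ionic".

import os

SYS_NET = "/sys/class/net"

# List every network interface and the kernel driver bound to it.
# On a host with a DSC installed, its ports appear here like any other
# NIC (the upstream Linux driver for Pensando DSC Ethernet devices is "ionic").
for iface in sorted(os.listdir(SYS_NET)):
    driver_link = os.path.join(SYS_NET, iface, "device", "driver")
    if os.path.islink(driver_link):
        driver = os.path.basename(os.readlink(driver_link))
    else:
        driver = "n/a (virtual interface)"
    print(f"{iface:12s} driver={driver}")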
Introducing the PSM and Gaining Visibility
The next step toward reaping more benefits from the platform is to deploy the PSM to centrally control the DSCs and unlock additional functionality.
Once a PSM is deployed as the central manager, each DSC boots from the network and discovers the PSM it is assigned to. Among other capabilities, the PSM provides full lifecycle management of the DSCs, including firmware upgrades, health monitoring, centralized event and alarm reporting, and display of a robust set of metrics that help with troubleshooting and provide pervasive visibility. The administrator can use the PSM to access telemetry data collected by the DSCs (“Fields” in the figures below), organized into various categories (“Measurement” in the figures below).
Server and Network administrators can use this powerful distributed monitoring capability to take the pulse of the network, identify potential performance bottlenecks, and remediate them even before they become a problem.
Figure A below displays the first step in getting details about the DSC and its connections to the network and the operating system. The initial output is very similar to what you would see with a “show interface Ethernet 1/1” command on a typical top-of-rack switch. You can quickly identify whether the DSC is passing bi-directional traffic through its uplink, and whether the interface is experiencing any drops. If the physical network is not experiencing any drops, and the DSC shows no drops either at its connections to the network or toward the PCIe bus, this output can help narrow the root cause of packet drops down to a kernel function within the workload’s OS.
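To complement the DSC uplink view in Figure A, a server administrator can pull the host-side counters for the same interface directly from Linux sysfs and compare them with what the PSM reports. The Python snippet below is a minimal sketch; the interface name is an assumption and should be replaced with the DSC port name on your host.

import pathlib

IFACE = "eth0"  # assumption: replace with the DSC interface name on your host
STATS = pathlib.Path(f"/sys/class/net/{IFACE}/statistics")

# Standard Linux per-interface counters. Comparing these with the DSC
# uplink statistics in the PSM helps show whether drops are occurring
# in the host OS, in the DSC, or out on the network.
for counter in ("rx_packets", "tx_packets", "rx_errors", "tx_errors",
                "rx_dropped", "tx_dropped"):
    value = int((STATS / counter).read_text())
    print(f"{counter:12s} {value}")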
Each DSC collects a broad set of metrics organized into various categories. A subset of these metrics can be selected to create custom graphs on the PSM (such as those shown in the figures below) that display the latest values, offering a dashboard of the health and performance of the hosts and of the platform itself. Detailed performance and error metrics can be built into user-customized charts that aggregate over time.
As an example, in Figure B below you can identify on a per-DSC basis which node is experiencing drops related to TCP RSTs, TCP window anomalies (e.g., zero window size), and similar conditions.
● Session: connections per second (CPS), TCP/UDP/ICMP session counts, TCP RSTs, zero window size, etc.
● Interface: tx/rx unicast, multicast, and broadcast counters, error statistics, etc.
In a future software release, the PSM will allow users to configure a policy that defines metrics-based alerting behavior for their DSCs, enabling customers to set a percentage-increase threshold per metric. As an example, upon exceeding a configured threshold for an error metric, the PSM operator may request a full packet capture for that DSC node and archive it to an NFS or other network share.
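Conceptually, such an alerting policy is a percentage-increase check per metric between two samples. The Python sketch below illustrates the idea only; the metric names and thresholds are hypothetical and do not reflect the PSM configuration schema.

# Hypothetical illustration of a percentage-increase alert per metric.
# Neither the metric names nor the thresholds come from the PSM schema;
# they only show the kind of rule such an alert policy would describe.
THRESHOLDS = {
    "tcp_rst_sent": 20.0,   # alert if RSTs grow by more than 20% between samples
    "rx_dropped": 10.0,
}

def percent_increase(previous: float, current: float) -> float:
    if previous == 0:
        return float("inf") if current > 0 else 0.0
    return (current - previous) / previous * 100.0

def check_metrics(previous: dict, current: dict) -> list:
    """Return the metrics whose growth exceeds the configured threshold."""
    alerts = []
    for name, limit in THRESHOLDS.items():
        if percent_increase(previous.get(name, 0), current.get(name, 0)) > limit:
            alerts.append(name)
    return alerts

# Example: on an alert, the operator could request a full packet capture
# on the affected DSC and archive it to an NFS share, as described above.
print(check_metrics({"tcp_rst_sent": 100, "rx_dropped": 50},
                    {"tcp_rst_sent": 150, "rx_dropped": 52}))
# -> ['tcp_rst_sent']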
In Figure C below, the measurement “Uplink Interface Packet Statistics” provides rich detail about interface statistics. You can create custom metrics to monitor specific details about one or both interfaces on the DSC, whether connected to the host operating system or to the network you are monitoring and/or troubleshooting. The display allows you to chart and monitor these specific metrics over any specified interval: hour(s), day(s), week(s), or month(s).
Figure D represents metrics that summarize specific DSC sessions such as TCP RST sent, TCP sessions, Half-Open TCP sessions, and Drops.
Figure E below presents valuable metrics relating to ASIC health on the DSC.
Figure F below presents the connections per second maintained by a DSC or group of DSCs.
Packet Mirroring
If the server I/O looks clean from the DSC metrics reviewed above, an administrator can use the DSC as a virtual TAP (Test Access Point) for full packet captures. Dynamic Flow Mirroring provides the ability to inspect packet content at line rate, isolating and extracting application-specific traffic for delivery to the appropriate tools for further processing. It replicates and spans traffic to a production network or packet-broker network, which delivers the packets to an analytics engine that performs content inspection and correlation, for example for compliance.
Wire-rate, bi-directional, complete packet captures can be enabled on the fly when deeper and richer visibility is needed during troubleshooting. A DSC can make a copy of each packet matching a given mirroring policy and send it to a collector (e.g., Splunk, ELK, Wireshark) using ERSPAN (Encapsulated Remote Switched Port Analyzer) encapsulation.
The administrator can choose whether to send all of the traffic leaving or entering one interface of a DSC (called interface-based mirroring, or bidirectional ERSPAN), or only the packets that match mirroring policies configured on the PSM. As shown in the figure below, mirroring policies are defined per flow, where a flow is identified by a 5-tuple: source and destination IP addresses, transport protocol, and source and destination transport ports.
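To make the policy model concrete, the Python sketch below represents a flow-based mirroring policy and the 5-tuple match it implies. The field names are illustrative only and are not the PSM’s actual policy schema.

# Hypothetical representation of a flow-based mirroring policy and the
# 5-tuple match it implies. Field names are illustrative, not the PSM schema.
mirror_policy = {
    "match": {                 # the 5-tuple identifying the flow
        "src_ip": "10.1.1.10",
        "dst_ip": "10.1.2.20",
        "protocol": "tcp",
        "src_port": None,      # None means "any"
        "dst_port": 3306,
    },
    "collector": "10.9.9.9",   # ERSPAN destination, e.g. a host running Wireshark
    "direction": "both",       # mirror Tx and Rx for a bi-directional capture
}

def matches(policy, pkt):
    """Return True if a packet's 5-tuple matches the mirroring policy."""
    return all(wanted is None or pkt.get(field) == wanted
               for field, wanted in policy["match"].items())

pkt = {"src_ip": "10.1.1.10", "dst_ip": "10.1.2.20", "protocol": "tcp",
       "src_port": 49152, "dst_port": 3306}
print(matches(mirror_policy, pkt))   # True: this packet would be mirrored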
Flow Level Visibility
An administrator can also enable flow visibility. Flow exporters on the Pensando DSC can be enabled to export flow information and statistics to flow monitors on remote NetFlow collectors in IPFIX (NetFlow v10) format. Centralized configuration allows tuning of export timeouts, export intervals, and export formats, and output can be sent to more than one collector. In addition to the standard set of NetFlow fields, the DSC also exports the flow start and last-seen times, maximum segment size, and flow state, offering a wealth of information for troubleshooting use cases. As shown in the figure below, each flow is identified by the 5-tuple: source and destination IP addresses, transport protocol, and source and destination transport ports. In a future software release, flows will also be definable by labels ingested from the VM or container orchestrator.
Moreover, the PSM administrator can specify the collector (or target) that should receive the flow information in IPFIX format and the transport protocol.
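As a concrete illustration of what arrives at the collector, the Python sketch below models a flow record keyed by its 5-tuple and carrying the extra fields the DSC exports (flow start, last-seen time, maximum segment size, and state). The class and field names are illustrative and are not the IPFIX template the DSC actually emits.

from dataclasses import dataclass

@dataclass
class FlowRecord:
    """Illustrative flow record; field names are not the DSC's IPFIX template."""
    # The 5-tuple identifying the flow
    src_ip: str
    dst_ip: str
    protocol: str
    src_port: int
    dst_port: int
    # Extra fields the DSC exports beyond the standard NetFlow set
    flow_start: float        # epoch seconds when the flow was first seen
    last_seen: float         # epoch seconds of the most recent packet
    max_segment_size: int
    state: str               # e.g. "established", "fin-wait", "closed"
    packets: int = 0
    bytes: int = 0

# Example record as it might appear at an IPFIX (NetFlow v10) collector
record = FlowRecord("10.1.1.10", "10.1.2.20", "tcp", 49152, 3306,
                    flow_start=1700000000.0, last_seen=1700000042.5,
                    max_segment_size=1460, state="established",
                    packets=1200, bytes=1450000)
print(record.state, record.last_seen - record.flow_start, "seconds active")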
It is worth noting that the administrator does not have to worry about which DSC should collect the flow data. The PSM and DSCs work together to ensure that the collection and export policy is available to all DSCs, and the nodes involved in a flow automatically collect and export the relevant information. If the workload generating a flow is relocated on the network (for example, because a VM moves through vMotion or a server is migrated to a different hardware host), the Pensando platform detects this and continues to seamlessly collect information without any intervention by the administrator.
Through these visibility functions, administrators can get insight into the traffic patterns in their enterprise data centers to troubleshoot performance issues or identify performance bottlenecks even before they become an issue.
Flow Logging and Tracking
The Pensando DSC supports full TCP connection state tracking, providing application-centric deep information inspection and telemetry. This function is usually offered only by high-end stateful firewalls.
Once connection tracking is enabled, the flow/session entries for the following protocols are validated in the pipeline (a simplified sketch of these checks follows the list):
● TCP state and connection tracking
- Perform TCP SYN validation, evaluating security policy prior to pipeline flow programming.
- Validate that TCP sequence and acknowledgment numbers fall within the expected TCP window, for all packets.
- Track session-closing events (FIN/RST) and adjust the TCP state to closing.
● ICMP request and response tracking
- Using ICMP ID and sequence numbers, invalid ICMP responses can be filtered, and requests and responses can be correlated.
- ICMP sessions can be aggressively aged out instead of waiting for the inactivity period to expire.
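The Python sketch below is a simplified, conceptual model of the TCP checks listed above: a SYN is admitted only if security policy allows it, subsequent packets must fall within the expected window, and FIN/RST moves the session to a closing state. It illustrates the logic only and is not the DSC pipeline implementation.

# Conceptual model of TCP connection tracking, not the DSC pipeline itself.
class TcpTracker:
    def __init__(self, window: int = 65535):
        self.window = window
        self.state = "closed"
        self.next_expected_seq = None

    def accept(self, flags: set, seq: int, policy_allows: bool) -> bool:
        """Return True if the packet is valid for the current connection state."""
        if self.state == "closed":
            # SYN validation: only a permitted SYN may create a new flow entry
            if "SYN" in flags and policy_allows:
                self.state = "established"
                self.next_expected_seq = seq + 1
                return True
            return False
        # Sequence numbers must fall within the expected TCP window
        if not (self.next_expected_seq <= seq < self.next_expected_seq + self.window):
            return False
        if "FIN" in flags or "RST" in flags:
            self.state = "closing"       # track session-closing state
        self.next_expected_seq = seq + 1
        return True

tracker = TcpTracker()
print(tracker.accept({"SYN"}, seq=1000, policy_allows=True))       # True: flow programmed
print(tracker.accept(set(), seq=1001, policy_allows=True))         # True: in window
print(tracker.accept(set(), seq=999999999, policy_allows=True))    # False: out of window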
The Policy and Services Manager collects flow logs that record the source and destination IPs and ports, the action taken, the rule ID, and the direction, making it easy for users to correlate and search.
Conclusions
The place to understand application behavior and identify the sources of application outages is at the server/appliance edge: closer to the workloads, and directly in line with network traffic. Server admins who typically capture pcaps directly on the host OS for troubleshooting can offload this function to the DSC as a consumable service available to both network and server teams. Server administrators now have the data at hand, at scale, to quickly identify whether an issue is a server I/O problem, or to prove that it is not. Network admins are no longer blind beyond the switchport connecting to the server; the demarcation between network and server is extended all the way to the server’s PCIe bus. The Pensando platform will reduce MTTR and reduce the risk of negatively impacting business traffic in highly shared environments.
The key benefits are:
● Zero performance hit on production traffic
● Application location-awareness
● Operational simplicity
● Complete visibility, capturing bi-directional data (Tx/Rx)
A federation of DSCs managed under the “single pane of glass” of the Policy and Services Manager (PSM) helps eliminate the lack of visibility and the network administration challenges found in prior monitoring platforms and full-stack solutions. Pensando’s DSC-derived metrics turn on the lights for server and network administrators to understand network statistics at the server edge, and Pensando’s Dynamic Flow Mirroring reduces the need for costly network TAP appliances in data center infrastructure, eliminating the need to reconfigure top-of-rack switches and apply traffic-spanning sessions.
Below is a visual representation, in blue, of the additional points of visibility for each host.