Tuning 10Gb Network Cards on Linux: A Detailed Practical Guide

A basic introduction to the concepts used to tune fast network cards

Based on the paper by Breno Henrique Leitao, IBM: https://www.kernel.org/doc/ols/2009/ols2009-pages-169-184.pdf


Abstract

The evolution of Ethernet, from 10 Mbit/s to 10 Gb/s, has placed significant pressure on CPU and system I/O performance. Default Linux settings are not optimized for high-speed networking, and without proper tuning a 10 Gb link may underperform. This guide explains the theory behind the key tuning techniques and provides working commands for configuring a Linux system for maximum throughput on 10 Gb network adapters.


1. Introduction

High-speed networks require more than just fast hardware; the operating system must be carefully configured to handle increased TCP/IP processing, interrupt management, and data buffering. In this article, we walk through practical steps—from enabling jumbo frames to configuring kernel parameters—to get your Linux server operating at near wire speed.


2. Key Network Concepts and Terminology

Before diving into configurations, it’s important to understand some fundamental terms:

  • Throughput: The effective rate of data transfer (useful bits transmitted) across a network.
  • Round Trip Time (RTT): The time it takes for a packet to travel from source to destination and back.
  • Bandwidth-Delay Product (BDP): Calculated as bandwidth (in bits/s) × RTT (in seconds), it approximates the maximum amount of data “in flight” on a network (see the worked example after this list).
  • Jumbo Frames: Ethernet frames with a payload larger than the standard 1500 bytes (commonly set to 9000 bytes) that reduce per-packet overhead.
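
As a quick worked example with assumed values (a 10 Gb/s link with a 1 ms RTT), the BDP comes out to about 1.25 MB, which sets the lower bound for the socket buffer sizes tuned later in this guide:

# BDP for an assumed 10 Gb/s link with a 1 ms RTT:
# 10,000,000,000 bits/s x 0.001 s = 10,000,000 bits = 1,250,000 bytes
echo $(( 10000000000 / 1000 / 8 ))   # prints 1250000 (bytes in flight)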


3. Data Link Layer Optimization

3.1 Jumbo Frames

Increasing the MTU on your interfaces reduces header overhead and boosts performance on high-speed networks. Ensure that all devices in the same broadcast domain are configured with the same MTU value.

Command Example:

# Set jumbo frames (9000 bytes) on interface eth0
sudo ifconfig eth0 mtu 9000
        

Note: Modern systems may use ip instead of ifconfig:

sudo ip link set dev eth0 mtu 9000
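
A simple way to verify that jumbo frames actually pass end to end is to ping the peer with the don't-fragment flag and a payload just under the MTU (8972 bytes of data plus 20 bytes of IP and 8 bytes of ICMP header equals 9000); 192.168.200.2 is the example peer used in the benchmarking section:

# Verify that a 9000-byte frame crosses the path without fragmentation
ping -M do -s 8972 -c 3 192.168.200.2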
        

3.2 Multi-Streaming and Transmission Queues

To fully utilize 10 Gb throughput, it is often necessary to use multiple streams (sockets) rather than relying on a single connection. Additionally, the default transmission queue length may need adjustment.

Commands to Check/Set Queue Length:

# Check current TX queue length
ifconfig eth0 | grep TX

# Set transmission queue length to 3000
sudo ifconfig eth0 txqueuelen 3000
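
On systems without ifconfig, the same queue length can be set with ip:

# Set transmission queue length to 3000 using ip
sudo ip link set dev eth0 txqueuelen 3000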
        

4. CPU and Interrupt Optimization

4.1 SMP IRQ Affinity

On multi-core systems, binding device interrupts to a specific CPU can significantly reduce cache misses and improve performance.

Step 1: Identify the Interrupt Number

cat /proc/interrupts | grep eth0
        

Step 2: Check the current affinity

cat /proc/irq/<IRQ_NUMBER>/smp_affinity
        

Step 3: Set the affinity manually (for example, binding to CPU0):

# Replace <IRQ_NUMBER> with the actual number, and 1 (hex) means CPU0
echo 1 | sudo tee /proc/irq/<IRQ_NUMBER>/smp_affinity
        

Tip: Disable the irqbalance daemon if it interferes with manual settings:

sudo service irqbalance stop
        

Or, if you prefer to disable IRQ balancing for specific CPUs, edit /etc/sysconfig/irqbalance (or the equivalent file on your distro) and set the IRQBALANCE_BANNED_CPUS option accordingly.
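
For multi-queue adapters, spreading each queue's interrupt over its own core usually beats binding everything to CPU0. The loop below is a sketch that assumes the interrupt names in /proc/interrupts contain "eth0" and that four CPUs (0-3) are available:

#!/bin/bash
# Distribute eth0 IRQs across CPUs 0-3 in round-robin fashion
cpu=0
for irq in $(grep eth0 /proc/interrupts | awk -F: '{print $1}' | tr -d ' '); do
    mask=$(printf "%x" $((1 << cpu)))
    echo $mask | sudo tee /proc/irq/$irq/smp_affinity
    cpu=$(( (cpu + 1) % 4 ))
done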

4.2 Taskset Affinity

Binding processes to specific CPUs using taskset minimizes task migration and cache invalidation.

Command Example:

# Bind process with PID 4767 to CPU0 (0x1)
sudo taskset -p 0x1 4767
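
taskset can also launch a process already pinned to a set of CPUs; the binary name below is only a placeholder:

# Start a (hypothetical) server bound to CPUs 0 and 1
sudo taskset -c 0,1 ./my_server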
        

4.3 Interrupt Coalescence and NAPI

Modern NICs use interrupt coalescence to reduce the number of interrupts by grouping multiple packets into a single interrupt. Many drivers also support NAPI (New API), which switches to polling under high load.

To Configure via ethtool:

# Enable RX interrupt coalescence on eth0 (specific parameters vary by NIC)
sudo ethtool -C eth0 rx-usecs 125

# NAPI is implemented in the driver itself; where supported it is enabled automatically and needs no user-space switch.
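
Before changing these values, the current coalescing parameters can be inspected with the lowercase -c option:

# Show the current interrupt coalescing settings for eth0
sudo ethtool -c eth0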
        

5. Offload Features and NIC-Specific Optimizations

Hardware offload features reduce CPU load by shifting work (e.g., checksum calculation, segmentation) to the NIC.
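
The lowercase -k option lists which offloads the driver supports and which are currently enabled, and is worth checking before toggling the individual features below:

# List the current offload settings for eth0
sudo ethtool -k eth0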

5.1 Checksum Offloads

Commands to Enable RX/TX Checksum Offload:

# Enable RX checksum offload
sudo ethtool -K eth0 rx on

# Enable TX checksum offload
sudo ethtool -K eth0 tx on
        

5.2 Scatter-Gather (SG) and Page Flipping

Scatter-Gather reduces the overhead of copying data by allowing non-contiguous buffers for DMA.

# Enable Scatter-Gather
sudo ethtool -K eth0 sg on
        

5.3 TCP Segmentation Offload (TSO)

TSO lets the NIC handle segmentation of large packets, significantly reducing CPU overhead.

# Enable TSO
sudo ethtool -K eth0 tso on
        

5.4 Large Receive Offload (LRO) and Generic Segmentation Offload (GSO)

LRO aggregates multiple incoming packets; GSO postpones segmentation until just before transmission.

# For LRO, depending on your driver, you may need to enable it via module parameters.
# Enable GSO using ethtool:
sudo ethtool -K eth0 gso on
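
On drivers that expose LRO through ethtool it can also be toggled directly; note that LRO should generally stay off on hosts that route or bridge traffic:

# Enable LRO where the driver supports it via ethtool
sudo ethtool -K eth0 lro on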
        

6. Kernel TCP/IP Tuning

Linux’s default TCP/IP parameters are often too conservative for 10 Gb networks. Adjust the following sysctl settings for better performance.

6.1 TCP Window Scaling

Ensure window scaling is enabled to support buffers larger than 64 KB.

# Check window scaling status
sysctl net.ipv4.tcp_window_scaling

# (Typically enabled by default; if not, add to /etc/sysctl.conf)
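
If the check reports 0, window scaling can be enabled at runtime and then persisted in /etc/sysctl.conf:

# Enable TCP window scaling
sudo sysctl -w net.ipv4.tcp_window_scaling=1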
        

6.2 TCP Timestamps

While TCP timestamps help in RTT calculation, they add overhead. Disable them if latency is critical.

sudo sysctl -w net.ipv4.tcp_timestamps=0
        

6.3 TCP FIN Timeout

Lower the FIN timeout to free resources faster on busy servers.

# Set FIN timeout to 15 seconds
sudo sysctl -w net.ipv4.tcp_fin_timeout=15
        

6.4 TCP SACK and Nagle Algorithm

On low-loss, high-bandwidth links, disabling selective acknowledgements can reduce per-packet processing overhead; the Nagle algorithm is controlled per socket by the application.

# Disable TCP selective acknowledgements
sudo sysctl -w net.ipv4.tcp_sack=0

# Applications can disable Nagle by setting TCP_NODELAY; this is done in the application code.
        

6.5 Memory Buffer Settings

Tune buffer sizes so that the maximum buffer exceeds the BDP.

# Set receive and send buffers
sudo sysctl -w net.ipv4.tcp_rmem="4096 87380 3526656"
sudo sysctl -w net.ipv4.tcp_wmem="4096 16384 4194304"

# Increase core socket buffer limits
sudo sysctl -w net.core.rmem_max=16777216
sudo sysctl -w net.core.wmem_max=16777216
        

To make these changes permanent, add the settings to /etc/sysctl.conf and reload with:

sudo sysctl -p /etc/sysctl.conf
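
For reference, a fragment of /etc/sysctl.conf carrying the values used above might look like this:

# /etc/sysctl.conf excerpt for 10Gb tuning (same values as the examples above)
net.ipv4.tcp_rmem = 4096 87380 3526656
net.ipv4.tcp_wmem = 4096 16384 4194304
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216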
        

7. Hardware Bus Considerations

7.1 PCI, PCI-X, and PCI Express

The bus architecture can be a bottleneck. A legacy PCI bus may deliver only around 350 MB/s, whereas PCIe (especially version 2.0 or later) is required for 10 Gb throughput.

Check your PCI devices:

lspci | grep "10 Gigabit"
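
For PCIe adapters, the negotiated link speed and width determine the usable bus bandwidth. Both can be read from the LnkSta line of the device capabilities; replace the address below (the same example slot used in section 7.2) with your NIC's actual slot:

# Check the negotiated PCIe link speed and width for the NIC
sudo lspci -vv -s 0002:01:00.0 | grep -i "LnkSta"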
        

7.2 Message-Signalled Interrupts (MSI/MSI-X)

MSI and MSI-X improve interrupt handling by allowing multiple, directed interrupts.

Verify MSI Settings:

# Find your NIC’s PCI address (from lspci) and then check its MSI capabilities.
# Older lspci versions print "Message Signalled Interrupts"; newer ones print "MSI:" and "MSI-X:".
lspci -vv -s 0002:01:00.0 | grep -i msi
        

7.3 Memory Bus and RDMA

Ensure that your memory subsystem can sustain high throughput. For ultra-low latency between servers, consider RDMA:

  • RDMA: Bypasses the kernel for direct memory-to-memory transfers. Open-source implementations like OpenFabrics Enterprise Distribution (OFED) support RDMA for applications such as NFS over RDMA.


8. TCP Congestion Control Algorithms

Different congestion control algorithms have pros and cons in high-speed networks.

8.1 Reno, CUBIC, and FAST

  • Reno: Traditional and TCP-friendly but may underutilize high-speed links.
  • CUBIC: Default on modern Linux kernels (from 2.6.19 onward), optimized for high-speed, high-latency environments.
  • FAST: A delay-based algorithm that maintains a steady queue; less aggressive than loss-based methods.

Switching Algorithms:

Check available algorithms:

cat /proc/sys/net/ipv4/tcp_available_congestion_control
        

Check the current algorithm:

cat /proc/sys/net/ipv4/tcp_congestion_control
        

Change to a new algorithm (e.g., cubic):

echo cubic | sudo tee /proc/sys/net/ipv4/tcp_congestion_control
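
The same change can be made through sysctl, which also makes it easy to persist in /etc/sysctl.conf:

# Switch the congestion control algorithm via sysctl
sudo sysctl -w net.ipv4.tcp_congestion_control=cubic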
        

9. Benchmarking and Performance Testing

Validating your tuning efforts is critical. Use the following tools:

9.1 Netperf

Netperf measures both throughput and transaction performance.

Examples:

# TCP stream test for 10 seconds against host 192.168.200.2
netperf -t TCP_STREAM -H 192.168.200.2 -l 10

# Request/Response test
netperf -H 192.168.200.2 -t TCP_RR -l 10
        

For multi-stream testing, you can use a script similar to this:

#!/bin/bash
# Runs NUMBER bidirectional netperf pairs (TCP_STREAM + TCP_MAERTS) against PEER.
# The peer must be running netserver on PORT (e.g. "netserver -p 12895").
NUMBER=8
TMPFILE=$(mktemp)
PORT=12895
DURATION=10
PEER="192.168.200.2"

for i in $(seq $NUMBER); do
  netperf -H $PEER -p $PORT -t TCP_MAERTS -P 0 -c -l $DURATION -- -m 32K -M 32K -s 256K -S 256K >> $TMPFILE &
  netperf -H $PEER -p $PORT -t TCP_STREAM -P 0 -c -l $DURATION -- -m 32K -M 32K -s 256K -S 256K >> $TMPFILE &
done

wait   # let every netperf instance finish before summing the results
echo -n "Total throughput (10^6 bits/s): "
awk '{sum += $5} END {print sum}' $TMPFILE
rm -f $TMPFILE
        

9.2 Pktgen

Pktgen is a kernel module for generating high-speed synthetic traffic to stress-test NICs. For more details, see the kernel documentation file (e.g., /usr/src/linux/Documentation/networking/pktgen.txt).
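
The snippet below is a minimal sketch of driving pktgen through its /proc interface. The destination IP reuses the example peer from the netperf tests, and the destination MAC is a placeholder that must be replaced with the peer's (or gateway's) real address:

# Minimal pktgen run on eth0 (requires the pktgen module; run as root)
sudo modprobe pktgen
echo "rem_device_all"  | sudo tee /proc/net/pktgen/kpktgend_0
echo "add_device eth0" | sudo tee /proc/net/pktgen/kpktgend_0
echo "count 1000000"   | sudo tee /proc/net/pktgen/eth0
echo "pkt_size 1500"   | sudo tee /proc/net/pktgen/eth0
echo "dst 192.168.200.2"         | sudo tee /proc/net/pktgen/eth0
echo "dst_mac 00:11:22:33:44:55" | sudo tee /proc/net/pktgen/eth0
echo "start" | sudo tee /proc/net/pktgen/pgctrl   # blocks until the run completes
cat /proc/net/pktgen/eth0                         # results are appended here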

9.3 Mpstat

Use Mpstat from the sysstat package to monitor per-CPU load and check that your IRQ affinity settings are effective.

mpstat -P ALL 1
        

10. Conclusion

Achieving near wire-speed performance on 10 Gb network cards in Linux requires coordinated tuning at multiple layers. From adjusting NIC offload features and CPU interrupt affinity to refining TCP/IP kernel parameters and ensuring adequate bus bandwidth, every configuration step counts. By applying the detailed commands and methodologies in this guide, Linux administrators and DevOps engineers can optimize their systems for high-throughput, low-latency environments.





