Tuning 10Gb network cards on Linux
Reza Bojnordi
Site Reliability Engineer @ BCW Group | Solutions Architect Google Cloud and OpenStack and Ceph Storage
A basic introduction to concepts used to tune fast network cards
Tuning 10Gb Network Cards on Linux: A Detailed Practical Guide
Breno Henrique Leitao, IBM [email protected]
Abstract
The evolution of Ethernet from 10 Mbit/s to 10 Gb/s has placed significant pressure on CPU and system I/O performance. Default Linux settings are not optimized for high-speed networking, and without proper tuning a 10 Gb link may underperform. This guide explains the theory behind the key tuning techniques and provides working commands for configuring your Linux system for maximum throughput on 10 Gb network adapters.
1. Introduction
High-speed networks require more than just fast hardware; the operating system must be carefully configured to handle increased TCP/IP processing, interrupt management, and data buffering. In this article, we walk through practical steps—from enabling jumbo frames to configuring kernel parameters—to get your Linux server operating at near wire speed.
2. Key Network Concepts and Terminology
Before diving into configurations, it is important to understand a few fundamental terms that appear throughout this guide: MTU (the maximum transmission unit, i.e. the largest frame an interface will send), BDP (the bandwidth-delay product of a path, which dictates how large TCP buffers need to be), IRQ affinity (which CPU services a device's interrupts), interrupt coalescence (grouping several packets into a single interrupt), and offloading (moving work such as checksumming and segmentation from the CPU to the NIC).
3. Data Link Layer Optimization
3.1 Jumbo Frames
Increasing the MTU on your interfaces reduces header overhead and boosts performance on high-speed networks. Ensure that all devices in the same broadcast domain are configured with the same MTU value.
Command Example:
# Set jumbo frames (9000 bytes) on interface eth0
sudo ifconfig eth0 mtu 9000
Note: Modern systems may use ip instead of ifconfig:
sudo ip link set dev eth0 mtu 9000
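To confirm that jumbo frames actually make it across the path, you can send a non-fragmentable ping sized just below the new MTU (8972 bytes of ICMP payload plus 28 bytes of IPv4/ICMP headers equals 9000); the peer address is just an example:
# Verify that 9000-byte frames cross the network without fragmentation
ping -M do -s 8972 192.168.200.2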
3.2 Multi-Streaming and Transmission Queues
To fully utilize 10 Gb throughput, it is often necessary to use multiple streams (sockets) rather than relying on a single connection. Additionally, the default transmission queue length may need adjustment.
Commands to Check/Set Queue Length:
# Check current TX queue length
ifconfig eth0 | grep -i txqueuelen
# Set transmission queue length to 3000
sudo ifconfig eth0 txqueuelen 3000
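On distributions where ifconfig is deprecated, the iproute2 equivalents below should produce the same result:
# Check the current queue length (reported as "qlen")
ip link show dev eth0
# Set the transmission queue length to 3000
sudo ip link set dev eth0 txqueuelen 3000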
4. CPU and Interrupt Optimization
4.1 SMP IRQ Affinity
On multi-core systems, binding device interrupts to a specific CPU can significantly reduce cache misses and improve performance.
Step 1: Identify the Interrupt Number
cat /proc/interrupts | grep eth0
Step 2: Check the current affinity
cat /proc/irq/<IRQ_NUMBER>/smp_affinity
Step 3: Set the affinity manually (for example, binding to CPU0):
# Replace <IRQ_NUMBER> with the actual number, and 1 (hex) means CPU0
echo 1 | sudo tee /proc/irq/<IRQ_NUMBER>/smp_affinity
Tip: Disable the irqbalance daemon if it interferes with manual settings:
sudo service irqbalance stop
Or, if you prefer to disable IRQ balancing for specific CPUs, edit /etc/sysconfig/irqbalance (or the equivalent file on your distro) and set the IRQBALANCE_BANNED_CPUS option accordingly.
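Multi-queue 10 Gb NICs usually register one interrupt per RX/TX queue. A minimal sketch that spreads those interrupts round-robin across all cores, assuming irqbalance is stopped and the driver names its IRQs after eth0 in /proc/interrupts, could look like this:
#!/bin/bash
# Distribute each eth0 queue interrupt across the available CPUs, round robin
CPUS=$(nproc)
i=0
for irq in $(grep eth0 /proc/interrupts | awk -F: '{print $1}' | tr -d ' '); do
    cpu=$((i % CPUS))
    # smp_affinity_list accepts a plain CPU list instead of a hex mask
    echo $cpu | sudo tee /proc/irq/$irq/smp_affinity_list > /dev/null
    echo "IRQ $irq -> CPU $cpu"
    i=$((i + 1))
done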
4.2 Taskset Affinity
Binding processes to specific CPUs using taskset minimizes task migration and cache invalidation.
Command Example:
# Bind process with PID 4767 to CPU0 (0x1)
sudo taskset -p 0x1 4767
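taskset can also launch a program already pinned to a CPU set. For example, to start a benchmark client (netperf, covered in Section 9) on the same CPU that services the NIC interrupt:
# Start netperf pinned to CPU0 (adjust the CPU list to match your IRQ affinity)
taskset -c 0 netperf -H 192.168.200.2 -t TCP_STREAM -l 10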
4.3 Interrupt Coalescence and NAPI
Modern NICs use interrupt coalescence to reduce the number of interrupts by grouping multiple packets into a single interrupt. Many drivers also support NAPI (New API), which switches to polling under high load.
To Configure via ethtool:
# Enable RX interrupt coalescence on eth0 (specific parameters vary by NIC)
sudo ethtool -C eth0 rx-usecs 125
# NAPI is implemented by the NIC driver itself; modern drivers enable it automatically.
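Before changing anything, it is worth checking what the driver currently reports; the supported parameters vary by NIC, and adaptive coalescence is only available where the driver implements it:
# Show the current interrupt coalescence settings
ethtool -c eth0
# Enable adaptive RX coalescence, if the driver supports it
sudo ethtool -C eth0 adaptive-rx on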
5. Offload Features and NIC-Specific Optimizations
Hardware offload features reduce CPU load by shifting work (e.g., checksum calculation, segmentation) to the NIC.
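A quick way to see which offload features your NIC driver exposes, and which are currently active, is the lowercase -k option of ethtool:
# List all offload features and their current state
ethtool -k eth0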
5.1 Checksum Offloads
Commands to Enable RX/TX Checksum Offload:
# Enable RX checksum offload
sudo ethtool -K eth0 rx on
# Enable TX checksum offload
sudo ethtool -K eth0 tx on
5.2 Scatter-Gather (SG) and Page Flipping
Scatter-Gather reduces the overhead of copying data by allowing non-contiguous buffers for DMA.
# Enable Scatter-Gather
sudo ethtool -K eth0 sg on
5.3 TCP Segmentation Offload (TSO)
TSO lets the NIC handle segmentation of large packets, significantly reducing CPU overhead.
# Enable TSO
sudo ethtool -K eth0 tso on
5.4 Large Receive Offload (LRO) and Generic Segmentation Offload (GSO)
LRO aggregates multiple incoming packets; GSO postpones segmentation until just before transmission.
# For LRO, depending on your driver, you may need to enable it via module parameters.
# Enable GSO using ethtool:
sudo ethtool -K eth0 gso on
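On drivers that expose LRO as a standard ethtool feature (not all do), it can be toggled like the other offloads; keep in mind that LRO is generally unsuitable on hosts that route or bridge traffic:
# Enable LRO where the driver exposes it as an ethtool feature
sudo ethtool -K eth0 lro on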
6. Kernel TCP/IP Tuning
Linux’s default TCP/IP parameters are often too conservative for 10 Gb networks. Adjust the following sysctl settings for better performance.
6.1 TCP Window Scaling
Ensure window scaling is enabled to support buffers larger than 64 KB.
# Check window scaling status
sysctl net.ipv4.tcp_window_scaling
# (Typically enabled by default; if not, add to /etc/sysctl.conf)
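If it turns out to be disabled, it can be switched on at runtime:
# Enable TCP window scaling
sudo sysctl -w net.ipv4.tcp_window_scaling=1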
6.2 TCP Timestamps
While TCP timestamps help in RTT calculation, they add overhead. Disable them if latency is critical.
sudo sysctl -w net.ipv4.tcp_timestamps=0
6.3 TCP FIN Timeout
Lower the FIN timeout to free resources faster on busy servers.
# Set FIN timeout to 15 seconds
sudo sysctl -w net.ipv4.tcp_fin_timeout=15
6.4 TCP SACK and Nagle Algorithm
Disable SACK on reliable networks to improve throughput; adjust Nagle as needed.
# Disable TCP selective acknowledgements
sudo sysctl -w net.ipv4.tcp_sack=0
# Applications can disable Nagle by setting TCP_NODELAY; this is done in the application code.
6.5 Memory Buffer Settings
Tune buffer sizes so that the maximum buffer exceeds the bandwidth-delay product (BDP) of the path.
# Set receive and send buffers
sudo sysctl -w net.ipv4.tcp_rmem="4096 87380 3526656"
sudo sysctl -w net.ipv4.tcp_wmem="4096 16384 4194304"
# Increase core socket buffer limits
sudo sysctl -w net.core.rmem_max=16777216
sudo sysctl -w net.core.wmem_max=16777216
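As a rough guide for sizing the maximum values, compute the bandwidth-delay product of your path. For example, assuming a 10 Gb/s link with a 2 ms round-trip time:
# BDP = bandwidth (bytes/s) x RTT (s); 10 Gb/s = 1.25 GB/s, so 1.25 GB/s x 0.002 s = 2.5 MB
echo "10 * 1000 * 1000 * 1000 / 8 * 0.002" | bc -l
The result, about 2.5 MB, sits comfortably below the roughly 3.4 MB maximum configured in tcp_rmem above.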
To make these changes permanent, add the settings to /etc/sysctl.conf and reload with:
sudo sysctl -p /etc/sysctl.conf
7. Hardware Bus Considerations
7.1 PCI, PCI-X, and PCI Express
The bus architecture can be a bottleneck. A legacy PCI bus may deliver only around 350 MB/s, whereas PCIe (especially version 2.0 or later) is required for 10 Gb throughput.
Check your PCI devices:
lspci | grep "10 Gigabit"
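Once you know the adapter's slot address from lspci, you can also confirm the negotiated PCIe link speed and width, which directly bounds the achievable throughput (replace the slot address with your own):
# Check the negotiated PCIe link speed/width of the NIC
sudo lspci -vv -s 0002:01:00.0 | grep -i lnksta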
7.2 Message-Signalled Interrupts (MSI/MSI-X)
MSI and MSI-X improve interrupt handling by allowing multiple, directed interrupts.
Verify MSI Settings:
# Find your NIC’s PCI slot (via lspci) and then check its MSI/MSI-X capabilities:
lspci -vv -s 0002:01:00.0 | grep -i msi
7.3 Memory Bus and RDMA
Ensure that your memory subsystem can sustain high throughput. For ultra-low latency between servers, consider RDMA, which bypasses the kernel TCP/IP stack and moves data directly between the memory of the two hosts.
8. TCP Congestion Control Algorithms
Different congestion control algorithms have pros and cons in high-speed networks.
8.1 Reno, CUBIC, and FAST
Reno is the classic loss-based algorithm; CUBIC, the default on recent Linux kernels, grows its window more aggressively and performs better on high-bandwidth, high-latency paths; FAST is a delay-based algorithm that tries to keep queuing delay low.
Switching Algorithms:
Check available algorithms:
cat /proc/sys/net/ipv4/tcp_available_congestion_control
Check the current algorithm:
cat /proc/sys/net/ipv4/tcp_congestion_control
Change to a new algorithm (e.g., cubic):
echo cubic | sudo tee /proc/sys/net/ipv4/tcp_congestion_control
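The same change can also be made through sysctl, which is the natural place to persist it via /etc/sysctl.conf:
# Switch the default congestion control algorithm to cubic
sudo sysctl -w net.ipv4.tcp_congestion_control=cubic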
9. Benchmarking and Performance Testing
Validating your tuning efforts is critical. Use the following tools:
9.1 Netperf
Netperf measures both throughput and transaction performance.
Examples:
# TCP stream test for 10 seconds against host 192.168.200.2
netperf -t TCP_STREAM -H 192.168.200.2 -l 10
# Request/Response test
netperf -H 192.168.200.2 -t TCP_RR -l 10
For multi-stream testing, you can use a script similar to this:
#!/bin/bash
NUMBER=8
TMPFILE=$(mktemp)
PORT=12895
DURATION=10
PEER="192.168.200.2"
for i in $(seq $NUMBER); do
    netperf -H $PEER -p $PORT -t TCP_MAERTS -P 0 -c -l $DURATION -- -m 32K -M 32K -s 256K -S 256K >> $TMPFILE &
    netperf -H $PEER -p $PORT -t TCP_STREAM -P 0 -c -l $DURATION -- -m 32K -M 32K -s 256K -S 256K >> $TMPFILE &
done
# Wait for every netperf instance to finish before summing the results
wait
echo -n "Total result: "
awk '{sum += $5} END{print sum}' $TMPFILE
rm -f $TMPFILE
9.2 Pktgen
Pktgen is a kernel module for generating high-speed synthetic traffic to stress-test NICs. For more details, see the kernel documentation file (e.g., /usr/src/linux/Documentation/networking/pktgen.txt).
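A minimal pktgen session, assuming a single kpktgend_0 kernel thread and that eth0 should blast fixed-size frames at a test peer (adjust the destination IP and MAC to your target), might look like this:
# Load the pktgen module
sudo modprobe pktgen
# Attach eth0 to the first pktgen kernel thread
echo "add_device eth0" | sudo tee /proc/net/pktgen/kpktgend_0
# Configure the traffic: packet count, size, destination IP and MAC
echo "count 1000000" | sudo tee /proc/net/pktgen/eth0
echo "pkt_size 60" | sudo tee /proc/net/pktgen/eth0
echo "dst 192.168.200.2" | sudo tee /proc/net/pktgen/eth0
echo "dst_mac 00:11:22:33:44:55" | sudo tee /proc/net/pktgen/eth0
# Start the run; results are appended to /proc/net/pktgen/eth0 when it finishes
echo "start" | sudo tee /proc/net/pktgen/pgctrl
cat /proc/net/pktgen/eth0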
9.3 Mpstat
Use Mpstat from the sysstat package to monitor per-CPU load and check that your IRQ affinity settings are effective.
mpstat -P ALL 1
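To correlate the per-CPU load with interrupt distribution, it also helps to watch how the NIC's interrupt counters grow on each CPU while a benchmark is running:
# Watch eth0 interrupt counters per CPU, refreshing every second
watch -n1 "grep eth0 /proc/interrupts"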
10. Conclusion
Achieving near wire-speed performance on 10 Gb network cards in Linux requires coordinated tuning at multiple layers. From adjusting NIC offload features and CPU interrupt affinity to refining TCP/IP kernel parameters and ensuring adequate bus bandwidth, every configuration step counts. By applying the detailed commands and methodologies in this guide, Linux administrators and DevOps engineers can optimize their systems for high-throughput, low-latency environments.
References
Leitao, B. H. "Tuning 10Gb Network Cards on Linux." Proceedings of the Linux Symposium, 2009.