Linux Performance Tuning
Reza Bojnordi
Site Reliability Engineer & System Engineer @ BCW Group | Solutions Architect & Cloud Operations
1.1 Linux process management
A process is an instance of execution that runs on a processor.
task_struct -> process descriptor
Life cycle of processes
parent process -> fork() -> child process -> exec() -> child process -> exit() -> zombie process -> parent process
Copy On Write
The kernel only assigns new physical pages to the child process when the child process calls exec(), which copies the new program into the address space of the child process.
The child process is not completely removed until the parent process learns of the termination of its child through the wait() system call.
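A minimal sketch of this life cycle in C (the ls command used here is only an example): the parent forks, the child calls exec(), and the parent reaps the child with wait() so that no zombie is left behind.

#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    pid_t pid = fork();              /* duplicate the parent (copy on write) */
    if (pid < 0) {
        perror("fork");
        exit(EXIT_FAILURE);
    }
    if (pid == 0) {
        /* child: replace the address space with a new program */
        execlp("ls", "ls", "-l", (char *)NULL);
        perror("execlp");            /* only reached if exec fails */
        _exit(EXIT_FAILURE);
    }
    /* parent: wait() reaps the child so it does not remain a zombie */
    int status;
    waitpid(pid, &status, 0);
    printf("child %d exited with status %d\n", (int)pid, WEXITSTATUS(status));
    return 0;
}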
Thread
A thread is an execution unit generated within a single process. It runs in parallel with other threads in the same process.
Thread creation is less expensive than process creation because a thread does not need to copy resources on creation.
Process priority and nice level
Process priority is a number that determines the order in which the process is handled by the CPU and is determined by dynamic priority and static priority.
Linux supports nice levels from 19 (lowest priority) to -20 (highest priority).
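A small sketch of adjusting a process's own nice level from C, assuming a nice value of 10 is acceptable for the workload; setpriority() maps directly onto the nice level described above.

#include <errno.h>
#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
    /* Set our own nice value to 10, i.e. lower our scheduling priority. */
    if (setpriority(PRIO_PROCESS, 0, 10) != 0)
        perror("setpriority");

    /* getpriority() can legitimately return -1, so check errno instead. */
    errno = 0;
    int level = getpriority(PRIO_PROCESS, 0);
    if (errno != 0)
        perror("getpriority");
    else
        printf("current nice level: %d\n", level);
    return 0;
}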
Context switching
During process execution, information on the running process is stored in registers on the processor and its cache. The set of data that is loaded to the register for the executing process is called the context.
Interrupt handling
The interrupt handler notifies the Linux kernel of an event. It tells the kernel to interrupt process execution and perform interrupt handling as quickly as possible, because some device requires quick responsiveness.
Interrupts cause context switching.
In a multi-processor environment, interrupts are handled by each processor. Binding interrupts to a single physical processor could improve system performance.
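As an illustration, the CPU mask of an interrupt can be changed by writing to its smp_affinity file under /proc/irq. The IRQ number 19 below is a placeholder and must be looked up in /proc/interrupts first; the write requires root privileges.

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    /* Placeholder IRQ number; pick a real one from /proc/interrupts. */
    const char *affinity_file = "/proc/irq/19/smp_affinity";

    FILE *f = fopen(affinity_file, "w");  /* needs root privileges */
    if (f == NULL) {
        perror("fopen");
        return EXIT_FAILURE;
    }
    /* CPU mask 1 = bind this interrupt to CPU 0 only */
    if (fprintf(f, "1\n") < 0)
        perror("fprintf");
    fclose(f);
    return 0;
}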
Process state
Every process has its own state that shows what is currently happening in the process.
Zombie processes
It is not possible to kill a zombie process with the kill command, because it is already considered dead. If you cannot get rid of a zombie, you can kill the parent process and then the zombie disappears as well.
Process memory segments
Linux CPU scheduler
O(1) scheduler: https://en.wikipedia.org/wiki/O(1)_scheduler ; Completely Fair Scheduler: https://www.ibm.com/developerworks/library/l-completely-fair-scheduler/
two process priority arrays
As processes are allocated a timeslice by the scheduler, based on their priority and prior blocking rate, they are placed in a list of processes for their priority in the active array. When they expire their timeslice, they are allocated a new timeslice and placed on the expired array.
When all processes in the active array have expired their timeslice, the two arrays are switched, restarting the algorithm.
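A toy sketch of the active/expired array switch; this is not the kernel's actual code, only the idea that the switch is a pointer exchange and therefore takes constant time.

#include <stdio.h>

/* Toy model of the O(1) scheduler's two priority arrays. */
struct prio_array {
    int nr_active;              /* number of runnable tasks in this array    */
    /* the real kernel also keeps one run list per priority level here       */
};

struct runqueue {
    struct prio_array *active;  /* tasks that still own a timeslice          */
    struct prio_array *expired; /* tasks whose timeslice has run out         */
    struct prio_array arrays[2];
};

/* When the active array becomes empty, exchange the two pointers.
   The switch is a constant-time operation, hence "O(1)". */
static void switch_arrays(struct runqueue *rq)
{
    if (rq->active->nr_active == 0) {
        struct prio_array *tmp = rq->active;
        rq->active = rq->expired;
        rq->expired = tmp;
    }
}

int main(void)
{
    struct runqueue rq;
    rq.arrays[0].nr_active = 0;     /* active array: every task has expired  */
    rq.arrays[1].nr_active = 3;     /* expired array: waiting for a new round */
    rq.active = &rq.arrays[0];
    rq.expired = &rq.arrays[1];

    switch_arrays(&rq);
    printf("active array now holds %d tasks\n", rq.active->nr_active);
    return 0;
}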
1.2 Linux memory architecture
32-bit architectures: 4 GB address space (3 GB user space and 1 GB kernel space). 64-bit architectures: 512 GB or more for both user and kernel space.
Virtual memory manager
Applications do not allocate physical memory directly; they request a memory map of a certain size from the Linux kernel and in exchange receive a map in virtual memory.
This virtual memory does not necessarily have to be backed by physical memory. If your application allocates a large amount of memory, some of it might be mapped to the swap file on the disk subsystem.
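A minimal illustration of this behavior: mmap() only reserves virtual address space, and physical page frames are assigned when the pages are first touched (the 64 MB size is arbitrary).

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
    size_t len = 64 * 1024 * 1024;   /* 64 MB of virtual address space */

    /* Ask the kernel for an anonymous mapping: no physical memory yet. */
    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) {
        perror("mmap");
        return EXIT_FAILURE;
    }

    /* Touching the pages is what actually allocates page frames. */
    memset(p, 0, len);

    munmap(p, len);
    return 0;
}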
Applications usually do not write directly to the disk subsystem, but into cache or buffers.
Page frame allocation
A page is a group of contiguous linear addresses in physical memory (page frame) or virtual memory.
A page is usually 4K bytes in size.
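The page size can be confirmed at run time; a quick check (4 KB is typical on x86, but not guaranteed on every architecture):

#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* sysconf(_SC_PAGESIZE) reports the kernel's page size in bytes. */
    long page_size = sysconf(_SC_PAGESIZE);
    printf("page size: %ld bytes\n", page_size);
    return 0;
}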
Buddy system
The Linux kernel maintains its free pages by using a mechanism called a buddy system.
The buddy system maintains free pages and tries to allocate pages for page allocation requests. It tries to keep the memory area contiguous.
When an attempt at page allocation fails, page reclaiming is activated.
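The buddy allocator's view of free memory can be inspected through /proc/buddyinfo, where each column counts free blocks of order 0, 1, 2, and so on (a block of order n is 2^n contiguous pages). A small sketch that simply prints the file:

#include <stdio.h>

int main(void)
{
    /* Each column is the number of free blocks of order 0, 1, 2, ... */
    FILE *f = fopen("/proc/buddyinfo", "r");
    if (f == NULL) {
        perror("fopen");
        return 1;
    }
    char line[512];
    while (fgets(line, sizeof(line), f) != NULL)
        fputs(line, stdout);
    fclose(f);
    return 0;
}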
Page frame reclaiming
The kswapd kernel thread and the try_to_free_page() kernel function are responsible for page reclaiming.
kswapd tries to find candidate pages to be taken out of the active pages based on the LRU principle.
Pages are used mainly for two purposes: page cache and process address space. The page cache consists of pages mapped to a file on disk. Pages that belong to a process address space are used for the heap and stack.
swap
If the virtual memory manager in Linux realizes that a memory page has been allocated but not used for a significant amount of time, it moves this memory page to swap space.
The fact that swap space is being used does not indicate a memory bottleneck; instead, it proves how efficiently Linux handles system resources.
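To check how much swap is actually in use, the SwapTotal and SwapFree counters in /proc/meminfo can be read; a minimal sketch:

#include <stdio.h>
#include <string.h>

int main(void)
{
    FILE *f = fopen("/proc/meminfo", "r");
    if (f == NULL) {
        perror("fopen");
        return 1;
    }
    char line[256];
    while (fgets(line, sizeof(line), f) != NULL) {
        /* Print only the swap-related counters (values are in kB). */
        if (strncmp(line, "SwapTotal:", 10) == 0 ||
            strncmp(line, "SwapFree:", 9) == 0)
            fputs(line, stdout);
    }
    fclose(f);
    return 0;
}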
1.3 Linux file systems
Virtual file system
VFS is an abstraction interface layer that resides between the user process and various types of Linux file system implementations.
Journaling
On a non-journaling file system, fsck checks all of the metadata and recovers consistency at the next reboot. But when the system has a large volume, this takes a long time to complete, and the system is not operational during the process.
A journaling file system writes the data to be changed to an area called the journal area before writing it to the actual file system. The journal area can be placed either inside or outside the file system. The data written to the journal area is called the journal log; it includes the changes to file system metadata and, where supported, the actual file data.
Ext2
The extended 2 file system is the predecessor of the extended 3 file system.
Ext3
Mode of journaling
1.4 Disk I/O subsystem
Before a processor can decode and execute instructions, data must be retrieved all the way from the sectors on a disk platter to the processor and its registers. The results of the execution can be written back to the disk.
I/O subsystem architecture
Cache
Memory hierarchy
L1 cache, L2 cache, L3 cache, RAM and some other caches between the CPU and disk.
The higher the cache hit rate in faster memory, the faster the access to the data.
Locality of reference
Flushing a dirty buffer
When a process changes data, it changes the memory first. At this point the data in memory and the data on disk are not identical, and the data in memory is referred to as a dirty buffer.
The dirty buffer should be synchronized to the data on the disk as soon as possible, or the data in memory could be lost if a sudden crash occurs.
The synchronization process for a dirty buffer is called a flush.
kupdate -- occurs on a regular basis.
/proc/sys/vm/dirty_background_ratio -- the proportion of dirty buffers in memory.
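A minimal sketch that reads the current value of this tunable; writing a new value works the same way but requires root privileges.

#include <stdio.h>

int main(void)
{
    /* The percentage of memory that may be dirty before background
       writeback begins. */
    FILE *f = fopen("/proc/sys/vm/dirty_background_ratio", "r");
    if (f == NULL) {
        perror("fopen");
        return 1;
    }
    int ratio;
    if (fscanf(f, "%d", &ratio) == 1)
        printf("dirty_background_ratio = %d%%\n", ratio);
    fclose(f);
    return 0;
}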
Block layer
The block layer handles all the activity related to block device operation.
The?bio?structure is an interface between the file system layer and the block layer.
Block sizes
The block size, the smallest amount of data that can be read from or written to a drive, can have a direct impact on a server's performance.
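One way to see the preferred I/O block size reported for a file system object is the st_blksize field of stat(); the path "/" below is just an example.

#include <stdio.h>
#include <sys/stat.h>

int main(void)
{
    struct stat st;
    /* "/" is only an example path; any file or directory works. */
    if (stat("/", &st) != 0) {
        perror("stat");
        return 1;
    }
    /* st_blksize is the preferred block size for efficient file I/O. */
    printf("preferred I/O block size: %ld bytes\n", (long)st.st_blksize);
    return 0;
}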
I/O elevator
I/O device driver
1.4 RAID and storage system
1.5 Network subsystem
Networking implementation
The socket provides an interface for user applications.
Socket buffer
/proc/sys/net/core/rmem_max
/proc/sys/net/core/rmem_default
/proc/sys/net/core/wmem_max
/proc/sys/net/core/wmem_default
/proc/sys/net/ipv4/tcp_mem
/proc/sys/net/ipv4/tcp_rmem
/proc/sys/net/ipv4/tcp_wmem
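These files control the system-wide defaults and maxima; an individual application can also request its own buffer sizes per socket with setsockopt(). The 256 KB request below is arbitrary, and the kernel clamps it to rmem_max / wmem_max.

#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) {
        perror("socket");
        return 1;
    }

    int size = 256 * 1024;            /* requested buffer size in bytes */
    setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &size, sizeof(size));
    setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &size, sizeof(size));

    /* Read back what the kernel actually granted; on Linux it may double
       the value for bookkeeping and clamp it to rmem_max / wmem_max. */
    int actual = 0;
    socklen_t len = sizeof(actual);
    getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &actual, &len);
    printf("receive buffer: %d bytes\n", actual);

    close(fd);
    return 0;
}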
Network API (NAPI)
The standard implementation of the network stack in Linux focuses more on reliability and low latency than on low overhead and high throughput.
Gigabit Ethernet and modern applications can create thousands of packets per second, causing a large number of interrupts and context switches to occur.
For the first packet, NAPI works just like the traditional implementation, issuing an interrupt for it. But after the first packet, the interface goes into a polling mode: as long as there are packets in the DMA ring buffer of the network interface, no new interrupts are raised, effectively reducing context switching and the associated overhead. Once the last packet is processed and the ring buffer is emptied, the interface card falls back into interrupt mode. NAPI also has the advantage of improved multiprocessor scalability by creating soft interrupts that can be handled by multiple processors.
Netfilter
You can manipulate and configure Netfilter using the iptables utility.
Netfilter Connection tracking
TCP/IP
Traffic control
Offload
If the network adapter on your system supports hardware offload functionality, the kernel can offload part of its tasks to the adapter, which can reduce CPU utilization.
Bonding module
1.6 Understanding Linux performance metrics
Processor metrics
Memory metrics
Network interface metrics
Block device metrics