Understanding Spinlocks - How CPU supports Atomic locks

Deepesh Menon

Principal Engineer | Heterogeneous Computing Systems | Virtualization | Embedded Systems

Published: October 27, 2024

In multi-core systems, managing shared resources across threads and cores is essential. For this purpose, spinlocks are a common synchronization tool. They allow threads to “spin” in a loop, waiting until they can acquire the lock and proceed with their operations. Spinlocks are widely used in low-level programming, particularly in:

Operating System Kernels: To synchronize access to shared kernel data.
Embedded Systems: Where waiting threads have minimal workload and need fast synchronization.
High-Performance Applications: Where threads perform frequent, small, and fast critical sections.

Spinlocks depend on atomic operations in the CPU instruction set, such as compare-and-set, which allow a core to safely acquire a lock even while others are trying to do the same. Without atomic instructions, multiple threads or cores could see a lock as free simultaneously and all proceed to take it, leading to race conditions. Let’s dive into how ARM, Intel, and PowerPC handle these atomic operations and see how ARM's LDXR and STXR instructions work in action.

How Spinlocks Use Atomic Operations

A key requirement for implementing spinlocks is atomicity: the ability to perform a sequence of operations (like checking and setting a lock) as a single, indivisible unit, so that no other core can observe or modify the lock between the check and the set. CPU architectures provide instructions for atomic operations, often using a compare-and-set approach to ensure mutual exclusion across cores.
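As a minimal sketch of that check-and-set step, the C11 `atomic_compare_exchange_strong` function from `<stdatomic.h>` can stand in for the hardware instruction. The `cas_lock_t` type and the `cas_*` function names are hypothetical names chosen for this illustration, not part of any library.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical lock type for illustration: 0 = unlocked, 1 = locked. */
typedef struct {
    atomic_int state;
} cas_lock_t;

static void cas_init(cas_lock_t *lock)
{
    atomic_init(&lock->state, 0);
}

/* Try once to take the lock. The comparison with 0 and the write of 1
 * happen as one atomic step, so two callers can never both succeed. */
static bool cas_try_acquire(cas_lock_t *lock)
{
    int expected = 0;
    return atomic_compare_exchange_strong(&lock->state, &expected, 1);
}

/* Release the lock by storing 0 again. */
static void cas_release(cas_lock_t *lock)
{
    atomic_store(&lock->state, 0);
}

/* Single-threaded usage example. */
void cas_demo(void)
{
    cas_lock_t lock;
    cas_init(&lock);

    assert(cas_try_acquire(&lock));   /* lock was free: acquired */
    assert(!cas_try_acquire(&lock));  /* already held: attempt fails */
    cas_release(&lock);
    assert(cas_try_acquire(&lock));   /* free again: acquired */
}
```

A spinlock is then just this single attempt repeated in a loop until it succeeds.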

Here’s how different architectures support atomic operations:

ARM Architecture: 64-bit ARM processors (AArch64) use LDXR (Load-Exclusive Register) and STXR (Store-Exclusive Register) instructions for atomic operations; the 32-bit equivalents are LDREX and STREX. These instructions are specifically designed to handle shared memory updates in a multi-core setup.
Intel x86 Architecture: Intel supports atomicity with the LOCK prefix combined with instructions like CMPXCHG (Compare and Exchange). The LOCK prefix makes the read-modify-write operation atomic across cores. On modern processors this is usually done by locking the cache line, with a bus lock needed only in cases such as an operand that spans two cache lines.
PowerPC Architecture: PowerPC uses LWARX (Load Word and Reserve Indexed) and STWCX. (Store Word Conditional Indexed) for atomic operations. Similar to ARM’s mechanism, these instructions place a reservation on a memory address to ensure atomicity in multi-core environments.

These architecture-specific instructions are optimized to prevent multiple threads from modifying the same memory location simultaneously, enabling efficient synchronization.

Implementing Spinlocks with ARM LDXR and STXR

In ARM, LDXR and STXR work together to provide atomic access to memory. Here’s how each instruction contributes to spinlock functionality:

LDXR (Load-Exclusive Register): Loads a value from memory into a register and marks the memory location as “exclusive” in the exclusive monitor. This does not block other cores from reading or writing the address; it only lets the hardware detect whether another core has written to it before the matching STXR.
STXR (Store-Exclusive Register): Attempts to store a value to the address marked by LDXR. If the exclusive monitor still holds the reservation, meaning no other core has written to that address since LDXR, the store succeeds and STXR writes 0 to its status register. If the reservation was lost, for example because another core wrote to the address, the store is not performed and STXR writes 1 to the status register.

Together, LDXR and STXR provide a way to check a lock’s value, decide on an update, and apply it atomically.
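The load, decide, conditional-store pattern can be sketched in portable C11. This is an analogy rather than the instructions themselves: C does not expose the exclusive monitor, but `atomic_compare_exchange_weak` is allowed to fail spuriously, much as STXR can report failure, and compilers targeting AArch64 without the newer atomic extensions typically lower it to an exclusive load/store loop. The `spinlock_t` type and `spin_*` names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical spinlock type: 0 = unlocked, 1 = locked. */
typedef struct {
    atomic_int state;
} spinlock_t;

static void spin_init(spinlock_t *l)
{
    atomic_init(&l->state, 0);
}

static void spin_lock(spinlock_t *l)
{
    for (;;) {
        /* Step 1: read the lock (the role LDXR plays). */
        int seen = atomic_load(&l->state);

        /* Step 2: decide. If the lock is held, spin and read again. */
        if (seen != 0)
            continue;

        /* Step 3: conditional store (the role STXR plays). The weak form
         * may fail even when the value matched, so failure means retry. */
        if (atomic_compare_exchange_weak(&l->state, &seen, 1))
            return;
    }
}

static void spin_unlock(spinlock_t *l)
{
    atomic_store(&l->state, 0);
}

/* Single-threaded usage example: an uncontended lock/unlock cycle. */
void spin_demo(void)
{
    spinlock_t l;
    spin_init(&l);

    spin_lock(&l);
    assert(atomic_load(&l.state) == 1);
    spin_unlock(&l);
    assert(atomic_load(&l.state) == 0);
}
```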

Here’s an illustration showing two CPU cores, Core 0 and Core 1, both trying to access the same lock memory location using LDXR and STXR. This example demonstrates a typical spinlock scenario where Core 0 loads the lock with LDXR, but before it can store with STXR, Core 1 successfully acquires the lock. As a result, Core 0’s STXR fails, and it enters a loop, retrying LDXR and STXR until it can successfully acquire the lock.

Figure: Spinlock sequence. Core 1 releases the lock, then Core 0 succeeds in acquiring it.

Explanation of the Diagram

Initial State: The lock is initially 0 (unlocked) in shared memory.

Core 1 Acquires Lock:

Both Core 0 and Core 1 perform LDXR on the lock. They both read 0, indicating the lock is free.
Core 1 successfully executes STXR and sets the lock to 1, entering the critical section. Core 0, however, hasn’t yet performed STXR, so it doesn’t have the lock.

Core 0’s STXR Fails:

Core 0 tries STXR but fails since Core 1 has modified the lock. Core 0 then loops back to LDXR, which now reads 1, and keeps re-reading in a "spin" loop until the lock becomes available again.

Core 1 Releases Lock:

After finishing its critical section, Core 1 performs an STLR (Store-Release Register) to set the lock back to 0, releasing it.

Core 0 Acquires Lock:

Core 0’s looped LDXR and STXR operations continue, and it eventually reads 0 for the lock.
Core 0’s next STXR succeeds, setting the lock to 1 and allowing Core 0 to enter the critical section.

This sequence demonstrates how LDXR/STXR enable atomic spinlock acquisition, with STLR handling the release, ensuring only one core can hold the lock at any time.
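The walk-through above can be replayed with a toy, single-threaded model of the lock word and one exclusive-monitor bit per core. This is only a sketch under simplifying assumptions: every name (`model_t`, `model_ldxr`, and so on) is hypothetical, memory ordering is not modelled, and real monitors have more ways to lose a reservation than a store by another core.

```c
#include <assert.h>
#include <stdbool.h>

#define NUM_CORES 2

/* Toy model: one lock word plus one reservation bit per core. */
typedef struct {
    int  lock;                /* 0 = unlocked, 1 = locked */
    bool monitor[NUM_CORES];  /* true while that core's reservation is live */
} model_t;

/* LDXR: read the lock and arm the calling core's monitor. */
static int model_ldxr(model_t *m, int core)
{
    m->monitor[core] = true;
    return m->lock;
}

/* Any store to the lock word clears every core's reservation. */
static void model_store(model_t *m, int value)
{
    m->lock = value;
    for (int i = 0; i < NUM_CORES; i++)
        m->monitor[i] = false;
}

/* STXR: returns 0 on success and 1 on failure, like the status register. */
static int model_stxr(model_t *m, int core, int value)
{
    if (!m->monitor[core])
        return 1;             /* reservation lost: nothing is stored */
    model_store(m, value);
    return 0;
}

/* STLR: the releasing store (its ordering effect is not modelled here). */
static void model_stlr(model_t *m, int value)
{
    model_store(m, value);
}

/* Replays the sequence described in the diagram explanation. */
void model_demo(void)
{
    model_t m = { 0, { false, false } };

    assert(model_ldxr(&m, 0) == 0);     /* Core 0 reads 0 */
    assert(model_ldxr(&m, 1) == 0);     /* Core 1 reads 0 */
    assert(model_stxr(&m, 1, 1) == 0);  /* Core 1's STXR succeeds */
    assert(model_stxr(&m, 0, 1) == 1);  /* Core 0's STXR fails */
    assert(model_ldxr(&m, 0) == 1);     /* Core 0 retries, sees lock held */
    model_stlr(&m, 0);                  /* Core 1 releases the lock */
    assert(model_ldxr(&m, 0) == 0);     /* Core 0 now reads 0 */
    assert(model_stxr(&m, 0, 1) == 0);  /* Core 0's STXR succeeds */
    assert(m.lock == 1);                /* Core 0 holds the lock */
}
```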

Spinlock Atomicity on Other CPU ISAs

Similar Instructions in x86 and PowerPC

While ARM uses LDXR and STXR for atomic operations, other architectures provide their own mechanisms for atomicity in spinlocks:

Intel x86: x86 processors use the LOCK prefix with instructions like CMPXCHG (Compare and Exchange) to enforce atomicity. The LOCK prefix locks the memory bus only if necessary (a cache-line lock usually suffices on modern processors), ensuring that the operation executes as a single, atomic unit across cores.
PowerPC (PPC): PowerPC provides LWARX (Load Word and Reserve Indexed) and STWCX. (Store Word Conditional Indexed), which function similarly to ARM's LDXR and STXR by marking addresses as reserved for atomic operations. The conditional store succeeds only if the reservation is still intact, so the update completes atomically even in a multi-core setting.

These architecture-specific instructions enable efficient locking and unlocking in multi-threaded or multi-core systems, making them ideal for implementing spinlocks.

Why Spinlocks Rely on Atomic Compare-and-Set

The core of any spinlock implementation is an atomic compare-and-set operation. This atomicity is crucial for ensuring that only one core can acquire the lock at any given time, preventing race conditions.

Each architecture provides instructions that enable atomic compare-and-set:

ARM: LDXR and STXR work together with the exclusive monitor to enable atomic updates.
x86: The CMPXCHG (Compare and Exchange) instruction with the LOCK prefix ensures atomicity.
PowerPC: The LWARX and STWCX. instructions reserve and conditionally update memory addresses to prevent race conditions.

These built-in atomic operations are fundamental for fast, reliable spinlocks, allowing cores to coordinate efficiently when accessing shared data.

A Note on Memory Barriers

While this article focuses on the LDXR and STXR instructions for implementing spinlocks, memory barriers are also needed so that memory accesses inside the critical section are not reordered to before the lock is acquired or after it is released. On ARM this is commonly done with the acquire and release forms of the instructions, such as LDAXR for the load and STLR for the releasing store, or with explicit barriers such as DMB. I’ll cover memory barriers in more detail in an upcoming article.
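In portable C11 the same ordering requirement is expressed with acquire and release memory orders rather than explicit barrier instructions. The sketch below is one way to write it, using hypothetical names (`ordered_lock_t`, `ordered_increment`, `shared_counter`); on AArch64 a compiler will typically map the acquire operation to a load-acquire exclusive and the release store to STLR, though the exact instructions depend on the compiler and target.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical lock type: 0 = unlocked, 1 = locked. */
typedef struct {
    atomic_int state;
} ordered_lock_t;

/* Example data that the lock protects. */
static int shared_counter;

static void ordered_lock(ordered_lock_t *l)
{
    int expected = 0;
    /* Acquire on success: accesses in the critical section cannot be
     * reordered to before this point. On failure, `expected` is
     * overwritten with the current value, so it is reset before retrying. */
    while (!atomic_compare_exchange_weak_explicit(&l->state, &expected, 1,
                                                  memory_order_acquire,
                                                  memory_order_relaxed))
        expected = 0;
}

static void ordered_unlock(ordered_lock_t *l)
{
    /* Release: accesses in the critical section cannot be reordered to
     * after this store. */
    atomic_store_explicit(&l->state, 0, memory_order_release);
}

/* Increment the protected counter and return its new value. */
int ordered_increment(ordered_lock_t *l)
{
    ordered_lock(l);
    int value = ++shared_counter;
    ordered_unlock(l);
    return value;
}
```

Using acquire on the lock and release on the unlock is the weakest ordering that still keeps the critical section's reads and writes between the two operations.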

Summary

In multi-core systems, efficient synchronization is the backbone of reliable performance, and spinlocks provide a fast, low-overhead solution for short critical sections. ARM's LDXR and STXR instructions show how hardware enforces atomicity and prevents race conditions, ensuring that only one core controls a critical resource at any moment. By exposing these low-level atomic operations, the various CPU ISAs enable high-performance applications to run correctly in concurrent environments. This atomic foundation of acquiring, waiting, and retrying shows how hardware design meets the demands of modern, multi-threaded workloads, giving developers the tools to build systems that are not just fast, but resilient and consistent.

Vyacheslav Moskvin

Senior Security Researcher / Engineer | Hardware | IoT

1 week ago

A very nice post, wondered how that worked under the hood for a while!

Matthias Rosenfelder

OS Kernel Engineer, ARM Architecture Enthusiast

3 weeks ago

You don't need to spin on the spinlock variable if the lock is taken on ARM. You can use LDXR together with WFE (wait for event instruction). That is because a lost reservation is a wakeup event on ARMv8-A (and IIRC also on ARMv7-A). I.e. the unlock (store to zero) of the lock wakes up the (/all) waiting core(s). This provides a low-power sleep mechanism for acquiring a spinlock that has the same performance as an active wait (spin). I am not aware of any other (relevant) architecture other than ARM that supports such a mechanism - notably this is missing from RISC-V.

Eduard Drusa

Crafting operating system for fun and profit | Software is not a crankshaft

4 weeks ago

The trick here is, that in certain setups, Load/Store exclusive is not guaranteed to be propagated across cores. E.g. in cases like Cortex-M7 + Cortex-M4 cores on same package. Quite often there's some vendor-specific way how to implement spinlocks for SMP or AMP. But that's not a problem. Load/Store Exclusive are usable for much more than just multi-core setups. They are ARM's building block for atomic operations. You can build a whole lot of features on top of it.
