Understanding Spinlocks: How CPUs Support Atomic Locks
Deepesh Menon
Principal Engineer | Heterogeneous Computing Systems | Virtualization | Embedded Systems
In multi-core systems, managing shared resources across threads and cores is essential. For this purpose, spinlocks are a common synchronization tool. They allow threads to “spin” in a loop, waiting until they can acquire the lock and proceed with their operations. Spinlocks are widely used in low-level programming, particularly in operating system kernels, device drivers, and other contexts where waits are expected to be short.
Spinlocks work best when a CPU instruction set supports atomic operations like compare-and-set—allowing a core to safely acquire a lock even while others may be trying to do the same. Without atomic instructions, multiple threads or cores could see a lock as free simultaneously, leading to race conditions. Let’s dive into how ARM, Intel, and PowerPC handle these atomic operations and see how ARM's LDXR and STXR instructions work in action.
How Spinlocks Use Atomic Operations
A key requirement for implementing spinlocks is atomicity—the ability to perform a sequence of operations (like checking and setting a lock) as a single, uninterruptible unit. CPU architectures provide instructions for atomic operations, often using a compare-and-set approach to ensure mutual exclusion across cores.
Here’s how different architectures support atomic operations:
ARM: LDXR and STXR, an exclusive load/store (load-linked/store-conditional) pair.
x86: the LOCK prefix applied to read-modify-write instructions such as CMPXCHG and XCHG.
PowerPC: LWARX and STWCX., a load-and-reserve / store-conditional pair.
These architecture-specific instructions are optimized to prevent multiple threads from modifying the same memory location simultaneously, enabling efficient synchronization.
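To make the compare-and-set idea concrete before looking at the ARM instructions, here is a minimal sketch in portable C11. The type and function names are illustrative (not from the article), the convention is 0 = free and 1 = held, and the compiler lowers the atomic compare-exchange to whatever instructions the target ISA provides.

```c
#include <stdatomic.h>

/* Illustrative compare-and-set spinlock in portable C11.
 * 0 means the lock is free, 1 means it is held. */
typedef struct { atomic_int value; } spinlock_t;

static void spinlock_acquire(spinlock_t *lock)
{
    int expected = 0;
    /* Atomically: if value == 0, set it to 1; otherwise try again. */
    while (!atomic_compare_exchange_weak(&lock->value, &expected, 1)) {
        expected = 0;   /* the failed CAS overwrote 'expected' with the current value */
    }
}

static void spinlock_release(spinlock_t *lock)
{
    atomic_store(&lock->value, 0);   /* mark the lock free again */
}
```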
Implementing Spinlocks with ARM LDXR and STXR
In ARM, LDXR and STXR work together to provide atomic access to memory. Here’s how each instruction contributes to spinlock functionality:
LDXR (Load Exclusive): loads the current value of the lock and marks the address for exclusive access by setting the executing core’s exclusive monitor.
STXR (Store Exclusive): attempts to write a new value to that address and returns a status flag; the store succeeds only if the exclusive monitor is still set, i.e. no other core has written to the location since the LDXR.
Together, LDXR and STXR provide a way to check a lock’s value, decide on an update, and apply it atomically.
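Below is a minimal sketch of the acquire path, assuming AArch64 and GCC/Clang extended inline assembly; the function name and the 0 = free, 1 = held convention are illustrative, not from the article.

```c
#include <stdint.h>

/* Spinlock acquire using LDXR/STXR (AArch64, GCC/Clang inline asm sketch). */
static inline void spin_lock(volatile uint32_t *lock)
{
    uint32_t tmp, status;

    __asm__ volatile(
        "1:                           \n"
        "   ldxr    %w0, [%2]         \n"   /* load lock value, set exclusive monitor */
        "   cbnz    %w0, 1b           \n"   /* lock already held: keep spinning       */
        "   mov     %w0, #1           \n"
        "   stxr    %w1, %w0, [%2]    \n"   /* try to store 1; %w1 == 0 on success    */
        "   cbnz    %w1, 1b           \n"   /* lost exclusivity: retry from LDXR      */
        : "=&r"(tmp), "=&r"(status)
        : "r"(lock)
        : "memory");
}
```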
Here’s an illustration showing two CPU cores, Core 0 and Core 1, both trying to access the same lock memory location using LDXR and STXR. This example demonstrates a typical spinlock scenario where Core 0 loads the lock with LDXR, but before it can store with STXR, Core 1 successfully acquires the lock. As a result, Core 0’s STXR fails, and it enters a loop, retrying LDXR and STXR until it can successfully acquire the lock.
Explanation of the Diagram
Initial State: The lock is initially 0 (unlocked) in shared memory.
Core 1 Acquires Lock: Core 1 executes LDXR, reads 0, and its STXR succeeds in writing 1, so Core 1 now holds the lock.
Core 0’s STXR Fails: Core 0 had also read 0 with LDXR, but Core 1’s successful store cleared Core 0’s exclusive monitor, so Core 0’s STXR reports failure and writes nothing. Core 0 loops back to LDXR and keeps retrying.
Core 1 Releases Lock: After finishing its critical section, Core 1 stores 0 back to the lock location, marking it free.
Core 0 Acquires Lock: On a subsequent retry, Core 0’s LDXR reads 0 and its STXR succeeds, so Core 0 now owns the lock.
This sequence demonstrates how LDXR/STXR enable atomic spinlock acquisition and release, ensuring only one core can hold the lock at any time.
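The release side, under the same AArch64/inline-assembly assumptions as the earlier sketch, is just an ordinary store of 0. Any store to the lock address also clears other cores’ exclusive monitors, which is exactly why Core 0’s pending STXR fails in the scenario above. (Ordering and barrier details are deferred; see the note on memory barriers below.)

```c
#include <stdint.h>

/* Spinlock release sketch: write 0 so the lock is free again. */
static inline void spin_unlock(volatile uint32_t *lock)
{
    __asm__ volatile(
        "str    wzr, [%0]"      /* store the zero register to the lock word */
        :
        : "r"(lock)
        : "memory");
}
```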
Spinlock Atomicity on Other CPU ISAs
Similar Instructions in x86 and PowerPC
While ARM uses LDXR and STXR for atomic operations, other architectures provide their own mechanisms for atomicity in spinlocks:
x86: the LOCK prefix makes read-modify-write instructions such as CMPXCHG (compare-and-exchange) and XCHG (exchange) atomic, and a spinlock can be built from a locked CMPXCHG or XCHG loop.
PowerPC: LWARX (Load Word and Reserve Indexed) and STWCX. (Store Word Conditional Indexed) form a load-and-reserve / store-conditional pair, close in spirit to ARM’s LDXR/STXR.
These architecture-specific instructions enable efficient locking and unlocking in multi-threaded or multi-core systems, making them ideal for implementing spinlocks.
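As a rough illustration of how the same idea looks when written portably and left to the compiler, here is an exchange-based variant in C11. On x86, compilers typically lower atomic_exchange to an (implicitly locked) XCHG; on PowerPC, to an LWARX/STWCX. retry loop. The exact lowering is a compiler and target detail, and the function names are illustrative.

```c
#include <stdatomic.h>

/* Exchange-based spinlock sketch: swap 1 into the lock word and check the
 * old value, another common way to build a spinlock on top of an atomic
 * read-modify-write primitive. */
static void xchg_lock(atomic_int *lock)
{
    /* Keep swapping 1 in until the previous value was 0 (lock was free). */
    while (atomic_exchange(lock, 1) != 0) {
        /* busy-wait; a pause/yield hint could go here */
    }
}

static void xchg_unlock(atomic_int *lock)
{
    atomic_store(lock, 0);   /* release: lock is free again */
}
```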
Why Spinlocks Rely on Atomic Compare-and-Set
The core of any spinlock implementation is an atomic compare-and-set operation. This atomicity is crucial for ensuring that only one core can acquire the lock at any given time, preventing race conditions.
Each architecture provides instructions that enable atomic compare-and-set: exclusive load/store pairs on ARM (LDXR/STXR) and PowerPC (LWARX/STWCX.), and LOCK-prefixed compare-and-exchange (CMPXCHG) on x86.
These built-in atomic operations are fundamental for fast, reliable spinlocks, allowing cores to coordinate efficiently when accessing shared data.
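As a usage sketch of why this matters, the snippet below has two threads bump a shared counter under the hypothetical spinlock_acquire/spinlock_release helpers from the earlier C11 sketch (assumed to be defined above it in the same file). Without the lock, or some other atomic increment, updates could be lost to a race.

```c
#include <pthread.h>
#include <stdio.h>

/* Assumes the spinlock_t / spinlock_acquire / spinlock_release sketch from
 * earlier in the article is defined above in the same file. */
static spinlock_t counter_lock;
static long counter = 0;

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 1000000; i++) {
        spinlock_acquire(&counter_lock);
        counter++;                        /* critical section */
        spinlock_release(&counter_lock);
    }
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %ld\n", counter);   /* expect 2000000 with the lock */
    return 0;
}
```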
A Note on Memory Barriers
While this article focuses on the LDXR and STXR instructions for implementing spinlocks, memory barriers are usually needed as well, to stop the CPU and compiler from reordering critical-section accesses past the lock acquire or release. These barriers ensure that operations become visible in the intended order, avoiding consistency issues across cores. I’ll cover memory barriers in more detail in an upcoming article.
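As a small taste of what that looks like in portable code (a sketch only, with the fuller treatment deferred to that article), C11 lets the lock operations carry the ordering themselves: acquire on the compare-exchange that takes the lock, release on the store that drops it, so critical-section accesses cannot move outside the lock.

```c
#include <stdatomic.h>

/* Sketch of acquire/release ordering attached to the lock operations. */
static void ordered_lock(atomic_int *lock)
{
    int expected = 0;
    while (!atomic_compare_exchange_weak_explicit(
               lock, &expected, 1,
               memory_order_acquire,      /* on success: acquire ordering */
               memory_order_relaxed)) {   /* on failure: just retry       */
        expected = 0;
    }
}

static void ordered_unlock(atomic_int *lock)
{
    atomic_store_explicit(lock, 0, memory_order_release);   /* release ordering */
}
```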
Summary
In multi-core systems, efficient synchronization is the backbone of reliable performance, and spinlocks provide a fast, minimal-overhead solution. Through ARM's LDXR and STXR instructions, we see a mechanism that enforces atomicity and prevents race conditions, ensuring that only one core controls a critical resource at any moment. By providing these low-level atomic operations, CPU ISAs let high-performance applications thrive in concurrent environments. This atomic foundation of securing, waiting, and retrying shows how strategic hardware design meets the demands of modern, multi-threaded workloads, giving developers the tools to build systems that are not just fast but also resilient and consistent.
Senior Security Researcher / Engineer | Hardware | IoT
1 week ago: A very nice post; I'd wondered how that worked under the hood for a while!
OS Kernel Engineer, ARM Architecture Enthusiast
3 weeks ago: You don't need to spin on the spinlock variable if the lock is taken on ARM. You can use LDXR together with WFE (the Wait For Event instruction). That is because a lost reservation is a wakeup event on ARMv8-A (and IIRC also on ARMv7-A), i.e. the unlock (store of zero) to the lock wakes up the waiting core(s). This provides a low-power sleep mechanism for acquiring a spinlock that has the same performance as an active wait (spin). I am not aware of any other (relevant) architecture besides ARM that supports such a mechanism; notably it is missing from RISC-V.
Crafting operating system for fun and profit | Software is not a crankshaft
4 weeks ago: The trick here is that in certain setups, Load/Store Exclusive is not guaranteed to be propagated across cores, e.g. a Cortex-M7 + Cortex-M4 pair on the same package. Quite often there is some vendor-specific way to implement spinlocks for SMP or AMP. But that's not a problem. Load/Store Exclusive is usable for much more than just multi-core setups: it is ARM's building block for atomic operations, and you can build a whole lot of features on top of it.