Understanding Spinlocks - How CPU supports Atomic locks

In multi-core systems, managing shared resources across threads and cores is essential. For this purpose, spinlocks are a common synchronization tool. They allow threads to “spin” in a loop, waiting until they can acquire the lock and proceed with their operations. Spinlocks are widely used in low-level programming, particularly in:

  • Operating System Kernels: To synchronize access to shared kernel data.
  • Embedded Systems: Where waiting threads have minimal workload and need fast synchronization.
  • High-Performance Applications: Where threads perform frequent, small, and fast critical sections.

Spinlocks depend on the CPU instruction set providing atomic operations like compare-and-set (also called compare-and-swap), which allow a core to safely acquire a lock even while others may be trying to do the same. Without atomic instructions, multiple threads or cores could see a lock as free simultaneously and all proceed to take it, leading to race conditions. Let’s dive into how ARM, Intel, and PowerPC handle these atomic operations and see how ARM's LDXR and STXR instructions work in action.

How Spinlocks Use Atomic Operations

A key requirement for implementing spinlocks is atomicity—the ability to perform a sequence of operations (like checking and setting a lock) as a single, uninterruptible unit. CPU architectures provide instructions for atomic operations, often using a compare-and-set approach to ensure mutual exclusion across cores.
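To make the "check and set as one unit" idea concrete, here is a minimal sketch in C11. It assumes a lock word where 0 means free and 1 means held; the helper names `try_acquire` and `release` are made up for this illustration and are not taken from any particular kernel or library.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical helper: try once to move the lock from 0 (free) to 1 (held).
 * The compare and the store happen as one atomic step, so two cores can
 * never both see 0 and both write 1. */
static inline bool try_acquire(atomic_int *lock)
{
    int expected = 0;  /* succeed only if the lock is currently free */
    return atomic_compare_exchange_strong(lock, &expected, 1);
}

/* Hypothetical helper: hand the lock back by storing 0. */
static inline void release(atomic_int *lock)
{
    atomic_store(lock, 0);
}
```

A second `try_acquire` on a held lock returns false and leaves the lock word unchanged, which is exactly the property a plain "if (lock == 0) lock = 1;" cannot guarantee across cores.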

Here’s how different architectures support atomic operations:

  • ARM Architecture: 64-bit ARM (AArch64) processors use the LDXR (Load-Exclusive Register) and STXR (Store-Exclusive Register) instructions for atomic operations; 32-bit ARM provides the equivalent LDREX and STREX. These instructions are specifically designed to handle shared memory updates in a multi-core setup.
  • Intel x86 Architecture: x86 supports atomicity with the LOCK prefix combined with read-modify-write instructions like CMPXCHG (Compare and Exchange). The LOCK prefix makes the whole read-modify-write atomic across cores. On modern processors this is normally done by locking the cache line rather than the memory bus; a bus lock is needed only in cases such as uncacheable memory or an operand that spans two cache lines.
  • PowerPC Architecture: PowerPC uses LWARX (Load Word and Reserve Indexed) and STWCX. (Store Word Conditional Indexed; the trailing dot is part of the mnemonic) for atomic operations. Similar to ARM’s mechanism, these instructions reserve a memory address to ensure atomicity in multi-core environments.

These architecture-specific instructions are optimized to prevent multiple threads from modifying the same memory location simultaneously, enabling efficient synchronization.

Implementing Spinlocks with ARM LDXR and STXR

In ARM, LDXR and STXR work together to provide atomic access to memory. Here’s how each instruction contributes to spinlock functionality:

  1. LDXR (Load-Exclusive Register): Loads a value from memory into a register and marks the memory location as “exclusive” in the exclusive monitor. This does not stop other cores from reading or writing the address; the monitor only tracks whether the location may have been written since the load.
  2. STXR (Store-Exclusive Register): Attempts to store a value to the exclusive address and writes a status result to a register. If the monitor still holds the reservation (no other core has written that address since LDXR), the store succeeds and the status is 0. If the reservation was lost, for example because another core modified the address, the store is not performed and the status is 1.

Together, LDXR and STXR provide a way to check a lock’s value, decide on an update, and apply it atomically.
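The load, decide, conditionally store pattern can be sketched in portable C11. This is only an approximation of the hardware sequence: `atomic_compare_exchange_weak` is allowed to fail spuriously, much like STXR, and compilers targeting AArCh64-style load/store-exclusive hardware typically lower it to such a pair (or to a single CAS instruction where the CPU has one). The function name `acquire_with_limit` and the attempt limit are assumptions added so the example terminates; a real spinlock would loop without a limit.

```c
#include <stdatomic.h>
#include <stdbool.h>

/*
 * For comparison, a hand-written AArch64 acquire loop usually has this shape
 * (illustrative only, not assembled here; LDAXR is LDXR with acquire ordering):
 *
 *   1: ldaxr w1, [x0]      // load-exclusive the lock word
 *      cbnz  w1, 1b        // non-zero: lock is held, try again
 *      stxr  w2, w3, [x0]  // try to store w3 (= 1); status goes to w2
 *      cbnz  w2, 1b        // status 1: exclusive store failed, try again
 */
static inline bool acquire_with_limit(atomic_int *lock, unsigned max_attempts)
{
    for (unsigned i = 0; i < max_attempts; i++) {
        /* Check the lock's value (the LDXR role). */
        if (atomic_load_explicit(lock, memory_order_relaxed) != 0)
            continue;  /* held by another core: keep spinning */

        /* Decide on the update and apply it conditionally (the STXR role).
         * The weak form may fail even when the value matched. */
        int expected = 0;
        if (atomic_compare_exchange_weak_explicit(lock, &expected, 1,
                                                  memory_order_acquire,
                                                  memory_order_relaxed))
            return true;  /* we own the lock */
    }
    return false;  /* gave up after max_attempts tries */
}
```

Calling `acquire_with_limit(&lock, 1000)` on a free lock returns true and leaves the lock word at 1; calling it again while the lock is still held spins through all attempts and returns false.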

Here’s an illustration showing two CPU cores, Core 0 and Core 1, both trying to access the same lock memory location using LDXR and STXR. This example demonstrates a typical spinlock scenario where Core 0 loads the lock with LDXR, but before it can store with STXR, Core 1 successfully acquires the lock. As a result, Core 0’s STXR fails, and it enters a loop, retrying LDXR and STXR until it can successfully acquire the lock.


[Figure: Spinlock]

[Figure: Spinlock - Core 1 releases the lock, Core 0 succeeds and locks the memory]


Explanation of the Diagram

Initial State: The lock is initially 0 (unlocked) in shared memory.

Core 1 Acquires Lock:

  • Both Core 0 and Core 1 perform LDXR on the lock. They both read 0, indicating the lock is free.
  • Core 1 successfully executes STXR and sets the lock to 1, entering the critical section. Core 0, however, hasn’t yet performed STXR, so it doesn’t have the lock.

Core 0’s STXR Fails:

  • Core 0 tries STXR but fails since Core 1 has modified the lock. Core 0 then re-executes LDXR, now reads 1 (locked), and keeps retrying in a "spin" loop; it attempts STXR again only once it reads 0.

Core 1 Releases Lock:

  • After finishing its critical section, Core 1 performs an STLR (Store-Release Register) to set the lock back to 0, releasing it.

Core 0 Acquires Lock:

  • Core 0’s looped LDXR and STXR operations continue, and it eventually reads 0 for the lock.
  • Core 0’s next STXR succeeds, setting the lock to 1 and allowing Core 0 to enter the critical section.

This sequence demonstrates how LDXR/STXR make spinlock acquisition atomic, with a store-release (STLR) handing the lock back, ensuring only one core can hold the lock at any time.
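The walkthrough can be replayed with a deliberately simplified toy model in C. This is not how real hardware is implemented: it assumes one monitor bit per core, a single lock word, and that any successful store clears every core's reservation. All names (`toy_word`, `toy_ldxr`, `toy_stxr`, `toy_store`, `toy_replay`) are invented for this sketch.

```c
#include <stdbool.h>

#define TOY_CORES 2

/* Toy model: one shared word plus one exclusive-monitor bit per core. */
typedef struct {
    int  value;               /* 0 = unlocked, 1 = locked */
    bool monitor[TOY_CORES];  /* true = core holds a reservation */
} toy_word;

static inline void toy_clear_monitors(toy_word *w)
{
    for (int c = 0; c < TOY_CORES; c++)
        w->monitor[c] = false;
}

/* LDXR role: return the current value and arm this core's monitor. */
static inline int toy_ldxr(toy_word *w, int core)
{
    w->monitor[core] = true;
    return w->value;
}

/* STXR role: returns 0 on success, 1 if the reservation was lost. */
static inline int toy_stxr(toy_word *w, int core, int v)
{
    if (!w->monitor[core])
        return 1;             /* store is not performed */
    w->value = v;
    toy_clear_monitors(w);    /* a write invalidates all reservations */
    return 0;
}

/* STLR role: an ordinary store always succeeds and clears reservations. */
static inline void toy_store(toy_word *w, int v)
{
    w->value = v;
    toy_clear_monitors(w);
}

/* Replays the steps described above and records each STXR status. */
static inline void toy_replay(int status[3])
{
    toy_word lock = { 0, { false, false } };
    (void)toy_ldxr(&lock, 0);           /* Core 0 reads 0 */
    (void)toy_ldxr(&lock, 1);           /* Core 1 reads 0 */
    status[0] = toy_stxr(&lock, 1, 1);  /* Core 1 acquires the lock */
    status[1] = toy_stxr(&lock, 0, 1);  /* Core 0's reservation is gone */
    (void)toy_ldxr(&lock, 0);           /* Core 0 spins, reads 1 */
    toy_store(&lock, 0);                /* Core 1 releases the lock */
    (void)toy_ldxr(&lock, 0);           /* Core 0 reads 0 again */
    status[2] = toy_stxr(&lock, 0, 1);  /* Core 0 acquires the lock */
}
```

In this model the three recorded statuses are 0, 1, 0: Core 1 succeeds, Core 0 fails once, and Core 0 succeeds after the release.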

Spinlock Atomicity on Other CPU ISAs

Similar Instructions in x86 and PowerPC

While ARM uses LDXR and STXR for atomic operations, other architectures provide their own mechanisms for atomicity in spinlocks:

  • Intel x86: x86 processors use the LOCK prefix with instructions like CMPXCHG (Compare and Exchange) to enforce atomicity. The LOCK prefix ensures that the operation executes as a single, atomic unit across cores; modern processors usually achieve this by locking the cache line and fall back to locking the memory bus only if necessary.
  • PowerPC (PPC): PowerPC provides LWARX (Load Word and Reserve Indexed) and STWCX. (Store Word Conditional Indexed), which function similarly to ARM's LDXR and STXR by marking addresses as reserved for atomic operations. The conditional store succeeds only if the reservation is still held, so the update completes atomically, even in a multi-core setting.

These architecture-specific instructions enable efficient locking and unlocking in multi-threaded or multi-core systems, making them ideal for implementing spinlocks.
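In practice these instructions are rarely written by hand outside of kernels. A portable way to reach them is C11's `atomic_flag`, which the compiler lowers to whatever the target offers (typically an XCHG on x86, a load/store-exclusive loop or a single atomic instruction on ARM, and a LWARX/STWCX. loop on PowerPC; the exact code generation depends on compiler and target options). The type and function names below are assumptions for this sketch, and it uses the default sequentially consistent ordering for simplicity.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical portable spinlock built on the C11 test-and-set flag. */
typedef struct {
    atomic_flag held;  /* set = locked, clear = unlocked */
} demo_spinlock;

#define DEMO_SPINLOCK_INIT { ATOMIC_FLAG_INIT }

/* One attempt: true if we took the lock, false if it was already held. */
static inline bool spin_trylock(demo_spinlock *l)
{
    /* test_and_set returns the previous state of the flag. */
    return !atomic_flag_test_and_set(&l->held);
}

/* Spin until the flag was clear at the moment we set it. */
static inline void spin_lock(demo_spinlock *l)
{
    while (atomic_flag_test_and_set(&l->held)) {
        /* busy-wait: another core holds the lock */
    }
}

/* Release the lock by clearing the flag. */
static inline void spin_unlock(demo_spinlock *l)
{
    atomic_flag_clear(&l->held);
}
```

A typical use is `demo_spinlock l = DEMO_SPINLOCK_INIT;` followed by `spin_lock(&l); /* critical section */ spin_unlock(&l);`.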

Why Spinlocks Rely on Atomic Compare-and-Set

The core of any spinlock implementation is an atomic compare-and-set operation. This atomicity is crucial for ensuring that only one core can acquire the lock at any given time, preventing race conditions.

Each architecture provides instructions that enable atomic compare-and-set:

  • ARM: LDXR and STXR work together with the exclusive monitor to enable atomic updates.
  • x86: The CMPXCHG (Compare and Exchange) instruction with the LOCK prefix ensures atomicity.
  • PowerPC: The LWARX and STWCX. instructions reserve and conditionally update memory addresses to prevent race conditions.

These built-in atomic operations are fundamental for fast, reliable spinlocks, allowing cores to coordinate efficiently when accessing shared data.

A Note on Memory Barriers

While this article focuses on the LDXR and STXR instructions for implementing spinlocks, memory barriers (or the acquire/release variants of these instructions, such as LDAXR and STLR) are also needed so that memory accesses inside the critical section are not reordered to before the lock is acquired or after it is released. These barriers ensure that other cores observe the protected data in the intended order, avoiding potential consistency issues. I’ll cover memory barriers in more detail in an upcoming article.
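As a small preview, here is a hedged sketch of where the ordering requirements sit in a lock, using C11 memory orders rather than explicit barrier instructions. The struct `guarded_counter` and the function `guarded_increment` are hypothetical names for this example; the lock word uses 0 for free and 1 for held.

```c
#include <stdatomic.h>

/* Hypothetical example: a plain counter protected by a tiny spinlock. */
typedef struct {
    atomic_int lock;     /* 0 = free, 1 = held */
    long       counter;  /* ordinary (non-atomic) data guarded by the lock */
} guarded_counter;

static inline void guarded_increment(guarded_counter *g)
{
    /* Acquire ordering: accesses in the critical section cannot be moved
     * to before this exchange. Exchange returns the old value, so a
     * non-zero result means the lock was already held. */
    while (atomic_exchange_explicit(&g->lock, 1, memory_order_acquire) != 0) {
        /* spin until the previous value was 0 */
    }

    g->counter++;  /* the critical section */

    /* Release ordering: accesses in the critical section cannot be moved
     * to after this store, so the next owner sees the updated counter. */
    atomic_store_explicit(&g->lock, 0, memory_order_release);
}
```

With relaxed ordering on both operations the lock word itself would still be updated atomically, but the increment could become visible to other cores outside the locked region, which is the consistency issue barriers exist to prevent.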

Summary

In multi-core systems, efficient synchronization is the backbone of reliable performance, and spinlocks provide a fast, low-overhead solution for short critical sections. ARM's LDXR and STXR instructions enforce atomicity and prevent race conditions, ensuring that only one core controls a critical resource at any moment. By providing low-level atomic operations like these, the various CPU ISAs allow high-performance applications to run correctly in concurrent environments. This atomic foundation of acquiring, waiting, and retrying shows how hardware design meets the demands of modern, multi-threaded workloads, giving developers the tools to build systems that are not just fast, but also consistent.


Vyacheslav Moskvin

Senior Security Researcher / Engineer | Hardware | IoT

1 week ago

A very nice post, wondered how that worked under the hood for a while!

Matthias Rosenfelder

OS Kernel Engineer, ARM Architecture Enthusiast

3 weeks ago

You don't need to spin on the spinlock variable if the lock is taken on ARM. You can use LDXR together with WFE (wait for event instruction). That is because a lost reservation is a wakeup event on ARMv8-A (and IIRC also on ARMv7-A). I.e. the unlock (store to zero) of the lock wakes up the (/all) waiting core(s). This provides a low-power sleep mechanism for acquiring a spinlock that has the same performance as an active wait (spin). I am not aware of any other (relevant) architecture other than ARM that supports such a mechanism - notably this is missing from RISC-V.

Eduard Drusa

Crafting operating system for fun and profit | Software is not a crankshaft

4 weeks ago

The trick here is, that in certain setups, Load/Store exclusive is not guaranteed to be propagated across cores. E.g. in cases like Cortex-M7 + Cortex-M4 cores on same package. Quite often there's some vendor-specific way how to implement spinlocks for SMP or AMP. But that's not a problem. Load/Store Exclusive are usable for much more than just multi-core setups. They are ARM's building block for atomic operations. You can build a whole lot of features on top of it.
