C++ Lock-Free Stack

I. Overview

What I want to write about today is the creation of a lock-free stack. At first glance this seems not so hard; but there is no learning like doing. A stack is a LIFO (last in, first out) container that normally supports two operations: push (put an element onto the stack) and pop (take an element off of the stack); see the Wikipedia stack overview. Why would you want to use a lock-free stack? The short answer is that you want a lower-latency system, achieved by eliminating context switches. Lower latency does not necessarily translate to higher throughput; and in order to have deterministic latency the stack would also need to be wait-free, which is a much more difficult problem.
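
As a quick illustration of the LIFO contract, using the standard std::stack adapter (a trivial sketch, not part of the implementations below):

#include <cassert>
#include <stack>

int main()
{
    std::stack<int> s;  // LIFO: last in, first out
    s.push(1);
    s.push(2);
    assert(s.top() == 2);  // the most recently pushed element comes off first
    s.pop();
    assert(s.top() == 1);
}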

II. Wrapping Standard Library Stack

To see where we stand if we take the simple approach, we will wrap std::stack from the C++ standard library with a std::mutex for synchronization between multiple threads; this is not lock-free, but it gives us a baseline. One could also use a std::shared_mutex if the workload were read-heavy rather than write-heavy, since a std::shared_mutex is non-blocking to multiple simultaneous readers.
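
As an aside, a sketch of such a read-heavy variant might look like the following (StackSharedLock is a hypothetical name, not benchmarked here); note that pop() mutates the stack, so only pure observers such as peek() can actually take the shared path:

#include <mutex>
#include <optional>
#include <shared_mutex>
#include <stack>
#include <utility>

template <typename DataType>
class StackSharedLock
{
public:
    void push(DataType element)
    {
        std::unique_lock lock{ m_Lock };  // exclusive: writers block everyone
        m_Stack.push(std::move(element));
    }

    [[nodiscard]] auto peek() const -> std::optional<DataType>
    {
        std::shared_lock lock{ m_Lock };  // shared: many readers in parallel
        if (m_Stack.empty())
        {
            return std::nullopt;
        }
        return m_Stack.top();
    }

private:
    mutable std::shared_mutex m_Lock;
    std::stack<DataType> m_Stack;
};

For the benchmarks below, however, we stick with the plain std::mutex version.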

Our implementation looks like the following:

#include <cstdint>
#include <cstdlib>
#include <mutex>
#include <optional>
#include <stack>
#include <type_traits>
#include <utility>

/// Stack wrapping std::stack
/// @tparam DataType The data type of the elements stored in the stack
template<typename DataType>
class StackLock
{
public:
    // typedef's
    typedef DataType data_type;

    /// The default constructor
    StackLock() = default;

    /// Destructor
    ~StackLock();

    StackLock(const StackLock&) = delete;
    StackLock& operator=(const StackLock&) = delete;
    StackLock(StackLock&&) noexcept = delete;
    StackLock& operator=(StackLock&&) noexcept = delete;

    /// Insert an element into the stack
    /// @tparam InsertType The type of the element to insert
    /// @param element The element to be inserted
    template <typename InsertType, typename = std::enable_if_t<std::is_convertible_v<InsertType, data_type>>>
    inline void push(InsertType&& element);

    /// Remove an element from the stack
    /// @return Returns the optional element if successful  
    [[nodiscard]] inline auto pop() noexcept -> std::optional<data_type>;

private:
    std::mutex m_Lock;
    std::stack<data_type> m_Stack;
};

template <typename DataType>
StackLock<DataType>::~StackLock()
{
    while (!m_Stack.empty())
    {
        m_Stack.pop();
    }
}

template <typename DataType>
template <typename InsertType, typename>
inline void StackLock<DataType>::push(InsertType&& element)
{
    std::unique_lock lock{ m_Lock };
    m_Stack.emplace(std::forward<InsertType>(element));
}

template <typename DataType>
[[nodiscard]] inline auto StackLock<DataType>::pop() noexcept -> std::optional<data_type>
{
    std::unique_lock lock{ m_Lock };
    if (m_Stack.empty())
    {
        return std::nullopt;
    }
    auto val{ std::move(m_Stack.top()) };
    m_Stack.pop();
    return std::move(val);
}        

This is pretty simple: we just use a std::unique_lock in both the push and the pop, which locks on construction and releases when it goes out of scope. The benchmarks below come with the caveat that they were taken with verification tracking code built in; true benchmarks would require micro-benchmarking without the verification code. Also, the number of readers and writers is equal, i.e. 50% reads and 50% writes.
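
The exact harness is not reproduced here, but a minimal sketch of the measurement loop might look like the following (the names, the alternating push/pop mix, and counting both operations as elements are assumptions, not the author's actual benchmark code):

#include <chrono>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

template <typename StackType>
double elementsPerSecond(std::size_t threadCount, std::size_t opsPerThread)
{
    StackType stack;
    std::vector<std::thread> threads;
    threads.reserve(threadCount);
    const auto start{ std::chrono::steady_clock::now() };
    for (std::size_t t{ 0U }; t < threadCount; ++t)
    {
        // each thread alternates a push and a pop: 50% writes, 50% reads
        threads.emplace_back([&stack, opsPerThread]
        {
            for (std::size_t i{ 0U }; i < opsPerThread; ++i)
            {
                stack.push(static_cast<std::uint64_t>(i));
                (void)stack.pop();
            }
        });
    }
    for (auto& thread : threads)
    {
        thread.join();
    }
    const std::chrono::duration<double> elapsed{ std::chrono::steady_clock::now() - start };
    return static_cast<double>(threadCount * opsPerThread * 2U) / elapsed.count();
}

Called as, e.g., elementsPerSecond<StackLock<std::uint64_t>>(8U, 1'000'000U). With that in mind, the measured throughput was: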

Threads: 2 - Elements / s 1.36081e+07
Threads: 4 - Elements / s 7.08468e+06
Threads: 8 - Elements / s 4.76185e+06
Threads: 16 - Elements / s 2.88997e+06
Threads: 32 - Elements / s 1.23162e+06        

This is somewhat unfair, since std::stack is an adapter that defaults to std::deque as its underlying container, which stores elements in contiguous blocks and is very efficient at pushing and popping an element at the end, with natural cache friendliness. Our lock-free implementations are going to be linked-list based, so let's look at a std::forward_list wrapper for a fairer baseline.
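
For reference, the underlying container of std::stack can be chosen explicitly (a two-line illustration):

#include <stack>
#include <vector>

std::stack<int> defaultStack;                   // std::stack<int, std::deque<int>>
std::stack<int, std::vector<int>> vectorStack;  // explicitly vector-backed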

III. Wrapping Standard Library Forward List

We will take the same approach of wrapping the pop and push routines with std::unique_lock. Our code is as follows:

#include <forward_list>
#include <mutex>
#include <optional>
#include <type_traits>
#include <utility>

/// A wrapped std::forward_list
/// @tparam DataType The data type of the elements stored in the stack
template<typename DataType>
class StackLockList
{
public:
    // typedef's
    typedef DataType data_type;

    /// The default constructor
    StackLockList() = default;

    /// Destructor
    ~StackLockList();

    StackLockList(const StackLockList&) = delete;
    StackLockList& operator=(const StackLockList&) = delete;
    StackLockList(StackLockList&&) noexcept = delete;
    StackLockList& operator=(StackLockList&&) noexcept = delete;

    /// Insert an element into the stack
    /// @tparam InsertType The type of the element to insert
    /// @param element The element to be inserted
    template <typename InsertType, typename = std::enable_if_t<std::is_convertible_v<InsertType, data_type>>>
    inline void push(InsertType&& element);

    /// Remove an element from the stack
    /// @return Returns the optional element if successful  
    [[nodiscard]] inline auto pop() noexcept -> std::optional<data_type>;

private:
    std::mutex m_Lock;
    std::forward_list<data_type> m_Stack;
};

template <typename DataType>
StackLockList<DataType>::~StackLockList()
{
    while (!m_Stack.empty())
    {
        m_Stack.pop_front();
    }
}

template <typename DataType>
template <typename InsertType, typename>
inline void StackLockList<DataType>::push(InsertType&& element)
{
    std::unique_lock lock{ m_Lock };
    m_Stack.emplace_front(std::forward<InsertType>(element));
}

template <typename DataType>
[[nodiscard]] inline auto StackLockList<DataType>::pop() noexcept -> std::optional<data_type>
{
    std::unique_lock lock{ m_Lock };
    if (m_Stack.empty())
    {
        return std::nullopt;
    }
    auto val{ std::move(m_Stack.front()) };
    m_Stack.pop_front();
    return std::move(val);
}        

The results are not as good as wrapping std::stack, which we would expect, but still pretty decent, except under very high contention, where throughput falls off sharply.

Threads: 2 - Elements / s 8.68647e+06
Threads: 4 - Elements / s 6.73833e+06
Threads: 8 - Elements / s 3.96009e+06
Threads: 16 - Elements / s 2.08256e+06
Threads: 32 - Elements / s 640039        

IV. Initial Implementation

The naive approach, which is prevalent on the web, goes something like this:

#include <atomic>
#include <cassert>
#include <cstdint>
#include <cstdlib>
#include <optional>
#include <type_traits>
#include <utility>

/// @tparam DataType The data type of the elements stored in the stack
template<typename DataType>
class Stack
{
public:
    // typedef's
    typedef DataType data_type;

    /// The default constructor
    Stack() = default;

    /// Destructor
    ~Stack();

    Stack(const Stack&) = delete;
    Stack& operator=(const Stack&) = delete;
    Stack(Stack&&) noexcept = delete;
    Stack& operator=(Stack&&) noexcept = delete;

    /// Insert an element into the stack
    /// @tparam InsertType The type of the element to insert
    /// @param element The element to be inserted
    template <typename InsertType, typename = std::enable_if_t<std::is_convertible_v<InsertType, data_type>>>
    inline void push(InsertType&& element);

    /// Remove an element from the stack
    /// @return Returns the optional element if successful  
    [[nodiscard]] inline auto pop() noexcept -> std::optional<data_type>;

private:
    struct Node
    {
        explicit Node(data_type value_) : value{ std::move(value_) }, pNext{ nullptr } {}
        data_type value;
        Node* pNext;
    };

    std::atomic<Node*> m_Head{ nullptr };
};

template <typename DataType>
Stack<DataType>::~Stack()
{
    while (pop()) {}
}

template <typename DataType>
template <typename InsertType, typename>
inline void Stack<DataType>::push(InsertType&& element)
{
    // create the new node
    Node* pNewNode{ new Node{ std::forward<InsertType>(element) } };
    Node* pHead{ m_Head.load(std::memory_order_acquire) };
    do
    {
        pNewNode->pNext = pHead;
    } while (!m_Head.compare_exchange_weak(pHead, pNewNode, std::memory_order_release, std::memory_order_acquire));
}

template <typename DataType>
[[nodiscard]] inline auto Stack<DataType>::pop() noexcept -> std::optional<data_type>
{
    Node* pHead{ m_Head.load(std::memory_order_acquire) };
    Node* pNext{ nullptr };
    do
    {
        if (!pHead)
        {
            return std::nullopt;
        }
        pNext = pHead->pNext;
    } while (!m_Head.compare_exchange_weak(pHead, pNext, std::memory_order_release, std::memory_order_acquire));

    // return the data
    auto value{ std::move(pHead->value) };
    delete pHead;
    return std::move(value);
}        

Let's look this over.

A. Push routine

We create a new node holding the value we want to insert. We read the current head of the linked list, set the new node's next pointer to the current head, and then try to atomically replace the head with our new node via compare-and-swap (CAS). If the CAS fails, we receive the updated head and repeat until successful.

This routine also uses what I consider the optimal memory ordering. We read the current head with acquire ordering. If we replace the head successfully, the release ordering publishes the new node to other threads; on failure, we acquire the freshly observed head. I have seen relaxed used both for the initial read of the head and for the CAS failure case, but I consider that an optimistic tweak: if the head has been changed by another thread, relaxed ordering makes no guarantee about when this thread sees the update, so the CAS loop would simply keep spinning until it does.

B. Pop routine

We read the current head. If it is null then the stack is empty and we return. If not empty, we try to replace the head with the node that the head's next pointer references. We loop on the CAS until either the stack becomes empty or we successfully replace the head with the next node.

If we won the CAS, we move the value out, delete the node, and return the value. This is still pretty simple, and the memory ordering follows the same pattern as the push.

C. Results

Going lock-free looks fairly simple so far. However, running this code will normally result in one of the following outcomes:

  1. It will work fine
  2. It will seem to work but some of the stack items have disappeared
  3. It will segfault trying to access memory that has been de-allocated

The reason behind this unexpected behavior is something called the ABA problem (see the Wikipedia ABA Problem article). The root cause is that freed memory addresses are recycled by subsequent new operations, which is good for cache locality among other things, but it means a pointer-equality CAS can succeed even though the node at that address has been deleted and replaced.
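
To make the failure concrete, here is one hypothetical interleaving against the pop() above (thread labels and addresses are illustrative):

// Stack: A -> B -> C            (m_Head == A)
//
// Thread 1: pop() reads pHead = A and pNext = B, then stalls.
// Thread 2: pop() removes A, pop() removes B, deletes both nodes,
//           then push() allocates a node that reuses address A.
//           The stack is now: A' -> C   (same address A, different node).
// Thread 1: resumes; its CAS compares m_Head against A, "succeeds",
//           and installs B as the new head -- a pointer to freed memory.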

V. Implementation With Tagged Pointers

One of the ways around the ABA problem is to associate a counter with the pointer, incremented each time the pointer is modified. Let's look at an implementation where we use the lower bits of the address as the counter.

#include <atomic>
#include <cassert>
#include <cstdint>
#include <cstdlib>
#include <optional>
#include <type_traits>
#include <utility>

/// @tparam DataType The data type of the elements stored in the stack
template<typename DataType>
class Stack
{
public:
    // typedef's
    typedef DataType data_type;

    /// The default constructor
    Stack() = default;

    /// Destructor
    ~Stack();

    Stack(const Stack&) = delete;
    Stack& operator=(const Stack&) = delete;
    Stack(Stack&&) noexcept = delete;
    Stack& operator=(Stack&&) noexcept = delete;

    /// Insert an element into the stack
    /// @tparam InsertType The type of the element to insert
    /// @param element The element to be inserted
    template <typename InsertType, typename = std::enable_if_t<std::is_convertible_v<InsertType, data_type>>>
    inline void push(InsertType&& element);

    /// Remove an element from the stack
    /// @return Returns the optional element if successful  
    [[nodiscard]] inline auto pop() noexcept -> std::optional<data_type>;

private:
    static constexpr size_t s_Alignment{ 64U };
    static constexpr uintptr_t s_AlignMask{ static_cast<uintptr_t>(s_Alignment - 1U) }; 
    static constexpr uintptr_t s_BaseMask{ ~s_AlignMask }; 

    struct alignas(s_Alignment) Node
    {
        explicit Node(data_type value_) : value{ std::move(value_) }, pNext{ nullptr } {}
        data_type value;
        Node* pNext;
    };

    std::atomic<Node*> m_Head{ nullptr };
};

template <typename DataType>
Stack<DataType>::~Stack()
{
    while (pop()) {}
}

template <typename DataType>
template <typename InsertType, typename>
inline void Stack<DataType>::push(InsertType&& element)
{
    // create the new node
    Node* pNodeBase{ new Node{ std::forward<InsertType>(element) } };
    Node* pNode{ nullptr };
    Node* pHead{ m_Head.load(std::memory_order_acquire) };
    do
    {
        pNodeBase->pNext = reinterpret_cast<Node*>(reinterpret_cast<uintptr_t>(pHead) & s_BaseMask);
        // ABA: Node Base or'd with counter
        pNode = reinterpret_cast<Node*>(reinterpret_cast<uintptr_t>(pNodeBase) |
                                        ((reinterpret_cast<uintptr_t>(pHead) + 0x1U) & s_AlignMask));
    } while (!m_Head.compare_exchange_weak(pHead, pNode, std::memory_order_release, std::memory_order_acquire));
}

template <typename DataType>
[[nodiscard]] inline auto Stack<DataType>::pop() noexcept -> std::optional<data_type>
{
    Node* pHead{ m_Head.load(std::memory_order_acquire) };
    Node* pNext{ nullptr };
    Node* pBaseHead{ nullptr };
    do
    {
        pBaseHead = reinterpret_cast<Node*>(reinterpret_cast<uintptr_t>(pHead) & s_BaseMask);
        if (!pBaseHead)
        {
            return std::nullopt;
        }
        // ABA: Node Base or'd with counter
        pNext = reinterpret_cast<Node*>(reinterpret_cast<uintptr_t>(pBaseHead->pNext) |
                                        ((reinterpret_cast<uintptr_t>(pHead) + 0x1U) & s_AlignMask));
    } while (!m_Head.compare_exchange_weak(pHead, pNext, std::memory_order_release, std::memory_order_acquire));

    // return the data
    auto value{ std::move(pBaseHead->value) };
    // NOTE: pBaseHead is intentionally not deleted here; another thread may
    // still be dereferencing it. This leaks memory for now (discussed below).
    // delete pBaseHead;
    return std::move(value);
}        

The changes to the structure are

    static constexpr size_t s_Alignment{ 64U };
    static constexpr uintptr_t s_AlignMask{ static_cast<uintptr_t>(s_Alignment - 1U) }; 
    static constexpr uintptr_t s_BaseMask{ ~s_AlignMask }; 

    struct alignas(s_Alignment) Node        

where we define an alignment of 64 bytes, which leaves the low 6 bits of every node address free to act as a counter. We then compute an alignment mask for extracting the counter and a base mask to recover the unadulterated pointer. We also apply alignas to the Node struct, which forces every node to be allocated at an address whose low 6 bits are always 0. If you search the web for this method you will probably find variants that use the upper 16 bits of the pointer as the counter; that relies on 48-bit virtual addressing and is not portable between architectures, whereas the alignas method proposed above is.
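
For readability, the masking arithmetic can be factored into a pair of small helpers inside the class (an illustrative sketch; the listing above inlines these expressions):

    /// Recover the real (untagged) node address
    static Node* untag(Node* pTagged) noexcept
    {
        return reinterpret_cast<Node*>(reinterpret_cast<uintptr_t>(pTagged) & s_BaseMask);
    }

    /// Combine a real node address with the previous tag's counter plus one
    static Node* tag(Node* pBase, Node* pPrevTagged) noexcept
    {
        return reinterpret_cast<Node*>(reinterpret_cast<uintptr_t>(pBase) |
                                       ((reinterpret_cast<uintptr_t>(pPrevTagged) + 1U) & s_AlignMask));
    }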

A. Push routine

The change from the naive implementation is here:

        pNodeBase->pNext = reinterpret_cast<Node*>(reinterpret_cast<uintptr_t>(pHead) & s_BaseMask);
        // ABA: Node Base or'd with counter
        pNode = reinterpret_cast<Node*>(reinterpret_cast<uintptr_t>(pNodeBase) |
                                        ((reinterpret_cast<uintptr_t>(pHead) + 0x1U) & s_AlignMask));

We assign the new node's next pointer to the untagged head, i.e. the real address of the next node. Then we tag the new head pointer with a counter one greater than the current head's, since the head is what we are modifying. The rest of the routine is unchanged.

B. Pop routine

The change from the naive implementation is here:

        pBaseHead = reinterpret_cast<Node*>(reinterpret_cast<uintptr_t>(pHead) & s_BaseMask);
        if (!pBaseHead)
        {
            return std::nullopt;
        }
        // ABA: Node Base or'd with counter
        pNext = reinterpret_cast<Node*>(reinterpret_cast<uintptr_t>(pBaseHead->pNext) |
                                        ((reinterpret_cast<uintptr_t>(pHead) + 0x1U) & s_AlignMask));        

We get the true (untagged) address of the head node. If it is null, the stack is empty. Otherwise, we tag the next pointer with a counter one greater than the current head's, since it is the head we are modifying. The only other significant change is that we no longer delete the node, to avoid another thread accessing memory that has been de-allocated. Obviously this leaks memory, and we will come back to that later on.

C. Results

The benchmark results of this lock-free implementation are as follows:

Threads: 2 - Elements / s 4.91935e+06
Threads: 4 - Elements / s 4.70822e+06
Threads: 8 - Elements / s 4.53945e+06
Threads: 16 - Elements / s 3.9192e+06
Threads: 32 - Elements / s 3.15427e+06        

Not as good a throughput under low thread contention, but much higher throughput under high contention, even than the locked std::stack. Pretty good results. The drawback is that we only have 6 bits for the tag (counter), which wraps after 64 modifications; under high thread counts, or in other situations that cause heavy reuse of the same address, this is insufficient and ABA can reappear.

VI. DWCAS Implementation

Can we improve on the tagged pointer approach? Most modern processors support an atomic compare-and-swap at twice the native word width; for x64 this is an atomic 128-bit operation. That gives us a full 64 bits for the counter, which is sufficient for years of updates on modern systems. Let's look at an implementation that uses DWCAS, which is not to be confused with DCAS, i.e. atomically changing two separate memory locations; see the Wikipedia DCAS/DWCAS article.

#include <atomic>
#include <cstdint>
#include <optional>
#include <type_traits>
#include <utility>

// Stack
/// @tparam DataType The data type of the elements stored in the stack
template<typename DataType>
class StackD
{
public:
    // typedef's
    typedef DataType data_type;

    /// The default constructor
    StackD() = default;

    /// Destructor
    ~StackD();

    StackD(const StackD&) = delete;
    StackD& operator=(const StackD&) = delete;
    StackD(StackD&&) noexcept = delete;
    StackD& operator=(StackD&&) noexcept = delete;

    /// Insert an element into the stack
    /// @tparam InsertType The type of the element to insert
    /// @param element The element to be inserted
    template <typename InsertType, typename = std::enable_if_t<std::is_convertible_v<InsertType, data_type>>>
    inline void push(InsertType&& element);

    /// Remove an element from the stack
    /// @return Returns the optional element if successful  
    [[nodiscard]] inline auto pop() noexcept -> std::optional<data_type>;

private:
    struct Node;

    struct TagPtr
    {
        TagPtr(Node* ptr_ = nullptr, uintptr_t tag_ = 0U) : ptr{ ptr_ }, tag{ tag_ } {};
        Node* ptr;
        uintptr_t tag;
    };

    struct Node
    {
        explicit Node(data_type value_, TagPtr next_ = TagPtr{}) : value{ std::move(value_) }, next{next_} {}
        data_type value;
        alignas(std::atomic_ref<TagPtr>::required_alignment) TagPtr next;
    };

    alignas(std::atomic_ref<TagPtr>::required_alignment) TagPtr m_Head{};
    static_assert(std::atomic_ref<TagPtr>::is_always_lock_free);
 };

template <typename DataType>
StackD<DataType>::~StackD()
{
    while (pop()) {}
}

template <typename DataType>
template <typename InsertType, typename>
inline void StackD<DataType>::push(InsertType&& element)
{
    // create the new node
    TagPtr newHead{ new Node{ std::forward<InsertType>(element) }};
    TagPtr head{ std::atomic_ref<TagPtr>{ m_Head }.load(std::memory_order_acquire) };
    do
    {
        newHead.ptr->next = TagPtr{ head.ptr };
        newHead.tag = head.tag + 1U; // ABA Counter
    } while (!std::atomic_ref<TagPtr>{ m_Head }.compare_exchange_weak(head, newHead, std::memory_order_release,
                                                                      std::memory_order_acquire));                               
}

template <typename DataType>
[[nodiscard]] inline auto StackD<DataType>::pop() noexcept -> std::optional<data_type>
{
    TagPtr head{ std::atomic_ref<TagPtr>{ m_Head }.load(std::memory_order_acquire) };
    TagPtr newHead{};
    do
    {
        if (!head.ptr)
        {
            return std::nullopt;
        }
        // ABA Counter
        newHead = TagPtr{ head.ptr->next.ptr, head.tag + 1U };
    } while (!std::atomic_ref<TagPtr>{ m_Head }.compare_exchange_weak(head, newHead, std::memory_order_release,
                                                                      std::memory_order_acquire));

    // return the data
    data_type value{ std::move(head.ptr->value) };
    // NOTE: head.ptr is intentionally not deleted here; another thread may
    // still be dereferencing it. This leaks memory for now (see the conclusion).
    // delete head.ptr;
    return std::move(value);
}        

The changes to the original structure are as follows:

    struct TagPtr
    {
        TagPtr(Node* ptr_ = nullptr, uintptr_t tag_ = 0U) : ptr{ ptr_ }, tag{ tag_ } {};
        Node* ptr;
        uintptr_t tag;
    };

    struct Node
    {
        explicit Node(data_type value_, TagPtr next_ = TagPtr{}) : value{ std::move(value_) }, next{next_} {}
        data_type value;
        alignas(std::atomic_ref<TagPtr>::required_alignment) TagPtr next;
    };

    alignas(std::atomic_ref<TagPtr>::required_alignment) TagPtr m_Head{};
    static_assert(std::atomic_ref<TagPtr>::is_always_lock_free);        

Why not simply the following?

struct alignas(2 * sizeof(void*)) TagPtr
{
    ...
};
std::atomic<TagPtr> m_Head;

Because that version fails the is_always_lock_free assertion: a plain std::atomic over a 16-byte struct typically falls back to a lock. C++20 std::atomic_ref comes to the rescue here and, given the required_alignment, guarantees the atomicity of the operations. For clang it may be necessary to supply the "-mcx16" flag during compilation.
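
A quick way to see the difference is to probe what each type reports (a minimal, self-contained sketch; ProbePtr is a stand-in for TagPtr, and on gcc you may additionally need to link libatomic):

#include <atomic>
#include <cstdint>
#include <cstdio>

struct alignas(2 * sizeof(void*)) ProbePtr
{
    void* ptr{ nullptr };
    std::uintptr_t tag{ 0U };
};

int main()
{
    // On typical x64 toolchains the plain atomic reports false here,
    // while atomic_ref (with required_alignment honored) can report true.
    std::printf("std::atomic<ProbePtr>     always lock free: %d\n",
                static_cast<int>(std::atomic<ProbePtr>::is_always_lock_free));
    std::printf("std::atomic_ref<ProbePtr> always lock free: %d\n",
                static_cast<int>(std::atomic_ref<ProbePtr>::is_always_lock_free));
}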

A. Push routine

The push routine differs only slightly from the last implementation: the counter is now implemented as a normal addition on a separate tag field rather than bit manipulation on the pointer.

        newHead.ptr->next = TagPtr{ head.ptr };
        newHead.tag = head.tag + 1U; // ABA Counter        

B. Pop routine

The pop routine is simpler in that we just set the next pointer and update the counter.

        if (!head.ptr)
        {
            return std::nullopt;
        }
        // ABA Counter
        newHead = TagPtr{ head.ptr->next.ptr, head.tag + 1U };        

C. Results

The results of this robust implementation are as follows:

Threads: 2 - Elements / s 4.61971e+06
Threads: 4 - Elements / s 4.85694e+06
Threads: 8 - Elements / s 4.16947e+06
Threads: 16 - Elements / s 2.82332e+06
Threads: 32 - Elements / s 2.1455e+06        

This is pretty much what we would expect: slightly slower than the tagged-pointer approach, since we are atomically swapping a larger memory region each time, yet it still outperforms the locked std::stack under high contention.

VII. Conclusion

There are other approaches to this problem, such as hazard pointers (a proposed C++26 feature; see the Hazard Pointer proposal), deferred memory reclamation (garbage collection), and node recycling, i.e. reusing Nodes from a memory pool such as a FIFO (queue) in combination with the LIFO (stack). Time permitting, I will produce a new article covering some of these options.

We have learned that although lock-free programming seems intuitive, it is fraught with issues. In systems where latency is extremely important, however, the increased complexity is warranted.

