C++ Lock Free Stack
Bryan Zimmerman
Expert modern C++ 23+ low latency HPC Software, Hardware, FPGA, and RF Engineer, Consultant, BAZ Innovations, LLC Owner
I. Overview
What I want to write about today is the creation of a lock free stack. At first glance this seems not so hard; however, there is no learning like doing. A stack is a LIFO (last in, first out) container that normally supports two operations: push (put an element onto the stack) and pop (take an element off of the stack). See the Wikipedia Stack Overview. Why would you want to use a lock free stack? The short answer is that you want a lower latency system, achieved by eliminating context switches. Lower latency does not necessarily translate to higher throughput; and in order to have deterministic latency the stack would also need to be wait-free, which is a much more difficult problem.
II. Wrapping Standard Library Stack
To see where we stand with the simple approach, we will wrap std::stack from the C++ standard library with a std::mutex for synchronization between multiple threads; this is not lock free. One could also use a std::shared_mutex if the workload were read-heavy rather than write-heavy, since a std::shared_mutex allows multiple simultaneous readers to proceed without blocking each other.
Our implementation looks like the following:
#include <cstdint>
#include <cstdlib>
#include <mutex>
#include <optional>
#include <stack>
#include <type_traits>
/// Stack wrapping std::stack
/// @tparam DataType The data type of the elements stored in the stack
template<typename DataType>
class StackLock
{
public:
// typedef's
typedef DataType data_type;
/// The default constructor
StackLock() = default;
/// Destructor
~StackLock();
StackLock(const StackLock&) = delete;
StackLock& operator=(const StackLock&) = delete;
StackLock(StackLock&&) noexcept = delete;
StackLock& operator=(StackLock&&) noexcept = delete;
/// Insert an element into the stack
/// @tparam InsertType The type of the element to insert
/// @param element The element to be inserted
template <typename InsertType, typename = std::enable_if_t<std::is_convertible_v<InsertType, data_type>>>
inline void push(InsertType&& element);
/// Remove an element from the stack
/// @return Returns the optional element if successful
[[nodiscard]] inline auto pop() noexcept -> std::optional<data_type>;
private:
std::mutex m_Lock;
std::stack<data_type> m_Stack;
};
template <typename DataType>
StackLock<DataType>::~StackLock()
{
while (!m_Stack.empty())
{
m_Stack.pop();
}
}
template <typename DataType>
template <typename InsertType, typename>
inline void StackLock<DataType>::push(InsertType&& element)
{
std::unique_lock lock{ m_Lock };
m_Stack.emplace(std::forward<InsertType>(element));
}
template <typename DataType>
[[nodiscard]] inline auto StackLock<DataType>::pop() noexcept -> std::optional<data_type>
{
std::unique_lock lock{ m_Lock };
if (m_Stack.empty())
{
return std::nullopt;
}
auto val {std::move(m_Stack.top()) };
m_Stack.pop();
return std::move(val);
}
This is pretty simple: we just use a std::unique_lock in both the push and the pop, which locks on construction and releases when it goes out of scope. The benchmarks below come with the caveat that they were taken with verification tracking code built in; for true benchmarks we would micro-benchmark without the verification code. Also, the number of readers and writers is equal, i.e. 50% reads and 50% writes.
Threads: 2 - Elements / s 1.36081e+07
Threads: 4 - Elements / s 7.08468e+06
Threads: 8 - Elements / s 4.76185e+06
Threads: 16 - Elements / s 2.88997e+06
Threads: 32 - Elements / s 1.23162e+06
This is somewhat unfair, since std::stack is normally backed by a contiguous-storage container (std::deque by default), which makes putting an element on and taking an element off the end very efficient and gives natural cache locality. Our implementations are going to be based on a linked list, as with forward_list, so let's look at that one.
III. Wrapping Standard Library Forward List
We will take the same approach of wrapping the pop and push routines with std::unique_lock. Our code is as follows:
/// A wrapped std::forward_list
/// @tparam DataType The data type of the elements stored in the stack
template<typename DataType>
class StackLockList
{
public:
// typedef's
typedef DataType data_type;
/// The default constructor
StackLockList() = default;
/// Destructor
~StackLockList();
StackLockList(const StackLockList&) = delete;
StackLockList& operator=(const StackLockList&) = delete;
StackLockList(StackLockList&&) noexcept = delete;
StackLockList& operator=(StackLockList&&) noexcept = delete;
/// Insert an element into the stack
/// @tparam InsertType The type of the element to insert
/// @param element The element to be inserted
template <typename InsertType, typename = std::enable_if_t<std::is_convertible_v<InsertType, data_type>>>
inline void push(InsertType&& element);
/// Remove an element from the stack
/// @return Returns the optional element if successful
[[nodiscard]] inline auto pop() noexcept -> std::optional<data_type>;
private:
std::mutex m_Lock;
std::forward_list<data_type> m_Stack;
};
template <typename DataType>
StackLockList<DataType>::~StackLockList()
{
while (!m_Stack.empty())
{
m_Stack.pop_front();
}
}
template <typename DataType>
template <typename InsertType, typename>
inline void StackLockList<DataType>::push(InsertType&& element)
{
std::unique_lock lock{ m_Lock };
m_Stack.emplace_front(std::forward<InsertType>(element));
}
template <typename DataType>
[[nodiscard]] inline auto StackLockList<DataType>::pop() noexcept -> std::optional<data_type>
{
std::unique_lock lock{ m_Lock };
if (m_Stack.empty())
{
return std::nullopt;
}
auto val {std::move(m_Stack.front()) };
m_Stack.pop_front();
return std::move(val);
}
The results are not as good as wrapping std::stack, which we would expect, but still pretty decent except under very high contention, where throughput falls off sharply.
Threads: 2 - Elements / s 8.68647e+06
Threads: 4 - Elements / s 6.73833e+06
Threads: 8 - Elements / s 3.96009e+06
Threads: 16 - Elements / s 2.08256e+06
Threads: 32 - Elements / s 640039
IV. Initial Implementation
The naive approach, which is prevalent on the web, goes something like this:
#include <atomic>
#include <cassert>
#include <cstdint>
#include <cstdlib>
#include <optional>
#include <type_traits>
/// @tparam DataType The data type of the elements stored in the stack
template<typename DataType>
class Stack
{
public:
// typedef's
typedef DataType data_type;
/// The default constructor
Stack() = default;
/// Destructor
~Stack();
Stack(const Stack&) = delete;
Stack& operator=(const Stack&) = delete;
Stack(Stack&&) noexcept = delete;
Stack& operator=(Stack&&) noexcept = delete;
/// Insert an element into the stack
/// @tparam InsertType The type of the element to insert
/// @param element The element to be inserted
template <typename InsertType, typename = std::enable_if_t<std::is_convertible_v<InsertType, data_type>>>
inline void push(InsertType&& element);
/// Remove an element from the stack
/// @return Returns the optional element if successful
[[nodiscard]] inline auto pop() noexcept -> std::optional<data_type>;
private:
struct Node
{
explicit Node(data_type value_) : value{ std::move(value_) }, pNext{ nullptr } {}
data_type value;
Node* pNext;
};
std::atomic<Node*> m_Head{ nullptr };
};
template <typename DataType>
Stack<DataType>::~Stack()
{
while(pop()) {};
}
template <typename DataType>
template <typename InsertType, typename>
inline void Stack<DataType>::push(InsertType&& element)
{
// create the new node
Node* pNewNode{ new Node{ std::forward<InsertType>(element) } };
Node* pHead{ m_Head.load(std::memory_order_acquire) };
do
{
pNewNode->pNext = pHead;
} while (!m_Head.compare_exchange_weak(pHead, pNewNode, std::memory_order_release, std::memory_order_acquire));
}
template <typename DataType>
[[nodiscard]] inline auto Stack<DataType>::pop() noexcept -> std::optional<data_type>
{
Node* pHead{ m_Head.load(std::memory_order_acquire) };
Node* pNext{ nullptr };
do
{
if (!pHead)
{
return std::nullopt;
}
pNext = pHead->pNext;
} while (!m_Head.compare_exchange_weak(pHead, pNext, std::memory_order_release, std::memory_order_acquire));
// return the data
auto value{ std::move(pHead->value) };
delete pHead;
return std::move(value);
}
Let's look this over.
A. Push routine
We create a new node with the value that we want to insert. We read the current head of the linked list, set the new node's next pointer to the current head, and then try to atomically CAS-replace the head with our new node. If the CAS fails, we pick up the new current head and repeat until successful.
This routine also uses what I consider the optimal memory ordering. We read the current head with acquire ordering. If we replace the head successfully, we release the store; otherwise we acquire the new head value. I have seen relaxed used for the first read of the head as well as for the CAS failure case, but I think this is an optimistic enhancement: if the head has been changed by another thread, relaxed gives no guarantee about when this thread sees the update, so the CAS loop simply continues until it does.
B. Pop routine
We read the current head. If it is null, the stack is empty and we return std::nullopt. If not empty, we try to replace the head node with the next node that the head points to. We loop in the CAS until either the stack is empty or we successfully replace the head with the next node.
If we replaced the node, we move out the value, delete the node, and return the value. This is still pretty simple, and the memory ordering follows the same pattern as the push.
C. Results
This looks like a fairly simple way to go lock free. Yet running this code will normally end in one of a few unpleasant outcomes: a crash from dereferencing or deleting already-freed memory, or silently lost and corrupted elements.
The reason behind these unexpected results is something called the ABA problem. See the Wikipedia ABA Problem article. The root cause is that memory addresses are recycled by subsequent new operations, because this improves cache behavior among other factors.
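The failure can be traced step by step; a sketch of one possible interleaving (the thread roles are illustrative):

```
Stack: head -> A -> B -> C

Thread 1 (pop):  reads head = A, reads A->next = B, is preempted
Thread 2 (pop):  pops A, pops B, deletes both nodes
Thread 2 (push): the new node happens to be allocated at A's old address;
                 after the push the stack is head -> A' -> C
Thread 1:        resumes; its CAS compares head (now A') against its stale
                 copy (A) -- the addresses are equal, so the CAS succeeds
                 and installs B as head: a pointer to freed memory
```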
V. Implementation With Tagged Pointers
One of the ways around the ABA problem is to associate a counter with the pointer, incremented each time the pointer is modified. Let's look at an implementation where we use the lower bits of the address as the counter.
#include <atomic>
#include <cassert>
#include <cstdint>
#include <cstdlib>
#include <optional>
#include <type_traits>
/// @tparam DataType The data type of the elements stored in the stack
template<typename DataType>
class Stack
{
public:
// typedef's
typedef DataType data_type;
/// The default constructor
Stack() = default;
/// Destructor
~Stack();
Stack(const Stack&) = delete;
Stack& operator=(const Stack&) = delete;
Stack(Stack&&) noexcept = delete;
Stack& operator=(Stack&&) noexcept = delete;
/// Insert an element into the stack
/// @tparam InsertType The type of the element to insert
/// @param element The element to be inserted
template <typename InsertType, typename = std::enable_if_t<std::is_convertible_v<InsertType, data_type>>>
inline void push(InsertType&& element);
/// Remove an element from the stack
/// @return Returns the optional element if successful
[[nodiscard]] inline auto pop() noexcept -> std::optional<data_type>;
private:
static constexpr size_t s_Alignment{ 64U };
static constexpr uintptr_t s_AlignMask{ static_cast<uintptr_t>(s_Alignment - 1U) };
static constexpr uintptr_t s_BaseMask{ ~s_AlignMask };
struct alignas(s_Alignment) Node
{
explicit Node(data_type value_) : value{ std::move(value_) }, pNext{ nullptr } {}
data_type value;
Node* pNext;
};
std::atomic<Node*> m_Head{ nullptr };
};
template <typename DataType>
Stack<DataType>::~Stack()
{
while (pop()) {}
}
template <typename DataType>
template <typename InsertType, typename>
inline void Stack<DataType>::push(InsertType&& element)
{
// create the new node
Node* pNodeBase{ new Node{ std::forward<InsertType>(element) } };
Node* pNode{ nullptr };
Node* pHead{ m_Head.load(std::memory_order_acquire) };
do
{
pNodeBase->pNext = reinterpret_cast<Node*>(reinterpret_cast<uintptr_t>(pHead) & s_BaseMask);
// ABA: Node Base or'd with counter
pNode = reinterpret_cast<Node*>(reinterpret_cast<uintptr_t>(pNodeBase) |
((reinterpret_cast<uintptr_t>(pHead) + 0x1U) & s_AlignMask));
} while (!m_Head.compare_exchange_weak(pHead, pNode, std::memory_order_release, std::memory_order_acquire));
}
template <typename DataType>
[[nodiscard]] inline auto Stack<DataType>::pop() noexcept -> std::optional<data_type>
{
Node* pHead{ m_Head.load(std::memory_order_acquire) };
Node* pNext{ nullptr };
Node* pBaseHead{ nullptr };
do
{
pBaseHead = reinterpret_cast<Node*>(reinterpret_cast<uintptr_t>(pHead) & s_BaseMask);
if (!pBaseHead)
{
return std::nullopt;
}
// ABA: Node Base or'd with counter
pNext = reinterpret_cast<Node*>(reinterpret_cast<uintptr_t>(pBaseHead->pNext) |
((reinterpret_cast<uintptr_t>(pHead) + 0x1U) & s_AlignMask));
} while (!m_Head.compare_exchange_weak(pHead, pNext, std::memory_order_release, std::memory_order_acquire));
// return the data
auto value{ std::move(pBaseHead->value) };
//delete pBaseHead;
return std::move(value);
}
The changes to the structure are
static constexpr size_t s_Alignment{ 64U };
static constexpr uintptr_t s_AlignMask{ static_cast<uintptr_t>(s_Alignment - 1U) };
static constexpr uintptr_t s_BaseMask{ ~s_AlignMask };
struct alignas(s_Alignment) Node
where we define an alignment of 64 bytes, which leaves us 6 low bits free to act as a counter. We then compute an alignment mask for the counter and a base mask to recover the unmodified pointer. We also apply alignas to the Node struct, which forces it to be allocated at addresses whose last 6 bits are always 0. If you search the web for this method you will probably find variants that store the counter in the upper 16 bits of the pointer. That is not portable across architectures, whereas the alignas method proposed above is.
A. Push routine
The change from the naive implementation is here:
pNodeBase->pNext = reinterpret_cast<Node*>(reinterpret_cast<uintptr_t>(pHead) & s_BaseMask);
// ABA: Node Base or'd with counter
pNode = reinterpret_cast<Node*>(reinterpret_cast<uintptr_t>(pNodeBase) |
((reinterpret_cast<uintptr_t>(pHead) + 0x1U) & s_AlignMask));
We set the new node's next pointer to the untagged head, i.e. the real address of the next node. Then we tag the new pointer with a counter one greater than the current head's, since the head is what we are modifying. The rest of the routine is unchanged.
B. Pop routine
The change from the naive implementation is here:
pBaseHead = reinterpret_cast<Node*>(reinterpret_cast<uintptr_t>(pHead) & s_BaseMask);
if (!pBaseHead)
{
return std::nullopt;
}
// ABA: Node Base or'd with counter
pNext = reinterpret_cast<Node*>(reinterpret_cast<uintptr_t>(pBaseHead->pNext) |
((reinterpret_cast<uintptr_t>(pHead) + 0x1U) & s_AlignMask));
We get the true (untagged) address of the head node. If it is null, the stack is empty. Otherwise, we tag the next pointer with a counter one greater than the current head's, since it is the head we are modifying. The only other significant change is that we no longer delete the node, to avoid another thread accessing memory that has been de-allocated. Obviously this leaks memory, and we will come back to this later on.
C. Results
The benchmark results of this lock free implementation are as follows:
Threads: 2 - Elements / s 4.91935e+06
Threads: 4 - Elements / s 4.70822e+06
Threads: 8 - Elements / s 4.53945e+06
Threads: 16 - Elements / s 3.9192e+06
Threads: 32 - Elements / s 3.15427e+06
Throughput is not as good under low thread contention, but much higher than even the std::stack wrapper under high contention. Pretty good results. The drawback here is that we have only 6 bits for the tag (counter); under high thread counts, or other situations that cause heavy reuse of the same address, this is insufficient.
VI. DWCAS Implementation
Can we improve on the tagged pointer approach? Most modern processors support an atomic swap of twice the architecture's word width; for x64 this is an atomic 128-bit swap. That gives us a full 64 bits for the counter, which is sufficient for years of operation on modern systems. Let's look at an implementation that uses DWCAS, not to be confused with DCAS, which atomically modifies two independent memory locations; see Wikipedia DCAS/DWCAS.
// Stack
/// @tparam DataType The data type of the elements stored in the stack
template<typename DataType>
class StackD
{
public:
// typedef's
typedef DataType data_type;
/// The default constructor
StackD() = default;
/// Destructor
~StackD();
StackD(const StackD&) = delete;
StackD& operator=(const StackD&) = delete;
StackD(StackD&&) noexcept = delete;
StackD& operator=(StackD&&) noexcept = delete;
/// Insert an element into the stack
/// @tparam InsertType The type of the element to insert
/// @param element The element to be inserted
template <typename InsertType, typename = std::is_convertible<InsertType, data_type>>
inline void push(InsertType&& element);
/// Remove an element from the stack
/// @return Returns the optional element if successful
[[nodiscard]] inline auto pop() noexcept -> std::optional<data_type>;
private:
struct Node;
struct TagPtr
{
TagPtr(Node* ptr_ = nullptr, uintptr_t tag_ = 0U) : ptr{ ptr_ }, tag{ tag_ } {};
Node* ptr;
uintptr_t tag;
};
struct Node
{
explicit Node(data_type value_, TagPtr next_ = TagPtr{}) : value{ std::move(value_) }, next{next_} {}
data_type value;
alignas(std::atomic_ref<TagPtr>::required_alignment) TagPtr next;
};
alignas(std::atomic_ref<TagPtr>::required_alignment) TagPtr m_Head{};
static_assert(std::atomic_ref<TagPtr>::is_always_lock_free);
};
template <typename DataType>
StackD<DataType>::~StackD()
{
while (pop()) {};
}
template <typename DataType>
template <typename InsertType, typename>
inline void StackD<DataType>::push(InsertType&& element)
{
// create the new node
TagPtr newHead{ new Node{ std::forward<InsertType>(element) }};
TagPtr head{ std::atomic_ref<TagPtr>{ m_Head }.load(std::memory_order_acquire) };
do
{
newHead.ptr->next = TagPtr{ head.ptr };
newHead.tag = head.tag + 1U; // ABA Counter
} while (!std::atomic_ref<TagPtr>{ m_Head }.compare_exchange_weak(head, newHead, std::memory_order_release,
std::memory_order_acquire));
}
template <typename DataType>
[[nodiscard]] inline auto StackD<DataType>::pop() noexcept -> std::optional<data_type>
{
TagPtr head{ std::atomic_ref<TagPtr>{ m_Head }.load(std::memory_order_acquire) };
TagPtr newHead{};
do
{
if (!head.ptr)
{
return std::nullopt;
}
// ABA Counter
newHead = TagPtr{ head.ptr->next.ptr, head.tag + 1U };
} while (!std::atomic_ref<TagPtr>{ m_Head }.compare_exchange_weak(head, newHead, std::memory_order_release,
std::memory_order_acquire));
// return the data
data_type value{ std::move(head.ptr->value) };
//delete head.ptr;
return std::move(value);
}
The changes to the original structure are as follows:
struct TagPtr
{
TagPtr(Node* ptr_ = nullptr, uintptr_t tag_ = 0U) : ptr{ ptr_ }, tag{ tag_ } {};
Node* ptr;
uintptr_t tag;
};
struct Node
{
explicit Node(data_type value_, TagPtr next_ = TagPtr{}) : value{ std::move(value_) }, next{next_} {}
data_type value;
alignas(std::atomic_ref<TagPtr>::required_alignment) TagPtr next;
};
alignas(std::atomic_ref<TagPtr>::required_alignment) TagPtr m_Head{};
static_assert(std::atomic_ref<TagPtr>::is_always_lock_free);
Why not the following:
struct alignas(2 * sizeof(void*)) TagPtr
{
    ...
};
std::atomic<TagPtr> m_Head;
Because this version fails the is_always_lock_free check. C++ 20 std::atomic_ref comes to the rescue here and guarantees the atomicity of the operations. For clang it may be necessary to supply the "-mcx16" flag during compilation.
A. Push routine
The push routine differs only slightly from the last implementation, in that we now implement the counter as a normal addition:
newHead.ptr->next = TagPtr{ head.ptr };
newHead.tag = head.tag + 1U; // ABA Counter
B. Pop routine
The pop routine is simpler in that we just set the next pointer and update the counter.
if (!head.ptr)
{
return std::nullopt;
}
// ABA Counter
newHead = TagPtr{ head.ptr->next.ptr, head.tag + 1U };
C. Results
The results of this robust implementation are as follows:
Threads: 2 - Elements / s 4.61971e+06
Threads: 4 - Elements / s 4.85694e+06
Threads: 8 - Elements / s 4.16947e+06
Threads: 16 - Elements / s 2.82332e+06
Threads: 32 - Elements / s 2.1455e+06
This is pretty much what we would expect: slightly slower than the tagged pointer approach, since we are atomically swapping a larger memory region each time. It still outperforms the mutex-wrapped stack under high contention.
VII. Conclusion
There are other approaches to this problem, such as hazard pointers (a feature proposed for C++ 26; see Hazard Pointer), deferred memory reclamation (garbage collection), and node recycling, i.e. reusing Nodes from a memory pool such as a FIFO (queue) in combination with the LIFO (stack). Time permitting, I will produce a new article covering some of these options.
We have learned that although lock free programming seems intuitive, it is fraught with issues. In systems where latency is extremely important, however, the increased complexity is warranted.