Lock-Free shared_ptr
Use Lock-Free Reference Counting
Spinlocks, while effective, can be too slow for HFT. Instead, a lock-free reference counting mechanism can be used, leveraging atomic operations.
#include <atomic>

template <typename T>
class ControlBlock {
public:
    T* ptr;
    std::atomic<int> strong_count;
    std::atomic<int> weak_count;

    ControlBlock(T* resource)
        : ptr(resource), strong_count(1), weak_count(0) {}
};

template <typename T>
class SharedPtr {
private:
    ControlBlock<T>* control;

    void release() {
        // The thread that drops the strong count to zero owns the cleanup; acq_rel
        // makes every earlier write to the object visible before it is destroyed.
        if (control && control->strong_count.fetch_sub(1, std::memory_order_acq_rel) == 1) {
            delete control->ptr;       // destroy the managed object exactly once
            control->ptr = nullptr;
            if (control->weak_count.load(std::memory_order_acquire) == 0) {
                delete control;
            }
        }
    }

public:
    explicit SharedPtr(T* resource)
        : control(new ControlBlock<T>(resource)) {}

    SharedPtr(const SharedPtr& other) : control(other.control) {
        // Copying only requires an atomic increment; no lock is taken.
        control->strong_count.fetch_add(1, std::memory_order_relaxed);
    }

    ~SharedPtr() {
        release();
    }
};
The acq_rel ordering on fetch_sub guarantees that all writes to the managed object are visible to the thread that performs the final decrement and deletes it.
Eliminates the need for spinlocks or mutexes.
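As a minimal usage sketch (the Order type here is just a placeholder), copies of the SharedPtr above can be handed to worker threads with nothing more than atomic count updates:
#include <thread>

struct Order { int id; double price; }; // placeholder payload type

void share_across_threads() {
    SharedPtr<Order> order(new Order{1, 101.25});
    // Each lambda captures its own copy; only fetch_add/fetch_sub run, no locks.
    std::thread t1([order] { /* read-only work with the shared Order */ });
    std::thread t2([order] { /* read-only work with the shared Order */ });
    t1.join();
    t2.join();
} // The last surviving copy frees the Order exactly once.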
CPU Affinity for Threads
HFT workloads benefit from pinning threads to specific CPU cores, which minimizes context switches and keeps caches warm.
#include <pthread.h>
#include <sched.h>
#include <cstdio>
#include <cstring>

void set_thread_affinity(int core_id) {
    cpu_set_t cpuset;
    CPU_ZERO(&cpuset);
    CPU_SET(core_id, &cpuset);

    pthread_t current_thread = pthread_self();
    int result = pthread_setaffinity_np(current_thread, sizeof(cpu_set_t), &cpuset);
    if (result != 0) {
        // pthread_setaffinity_np returns the error code directly; it does not set errno.
        std::fprintf(stderr, "pthread_setaffinity_np: %s\n", std::strerror(result));
    }
}

// Example usage
void hft_thread_function() {
    set_thread_affinity(2); // Bind this thread to core 2
    // Perform HFT tasks here
}
Use pthread_setaffinity_np on Linux for precise CPU core control.
Match core affinity with NUMA (Non-Uniform Memory Access) considerations for optimal memory locality.
Use a Low-Latency Spinlock
For spinning scenarios, implement a low-latency spinlock that minimizes contention.
#include <atomic>
#include <thread>

class Spinlock {
private:
    std::atomic_flag lock_flag = ATOMIC_FLAG_INIT;

public:
    void lock() {
        while (lock_flag.test_and_set(std::memory_order_acquire)) {
            // Yield the time slice while contended; a PAUSE/YIELD intrinsic
            // avoids the scheduler call on the hottest paths (see below).
            std::this_thread::yield();
        }
    }

    void unlock() {
        lock_flag.clear(std::memory_order_release);
    }
};
std::this_thread::yield hands the remaining time slice back to the scheduler so other threads can make progress, at the cost of a system call on each spin.
Use PAUSE (x86) or YIELD (ARM) intrinsics for CPU-specific optimizations.
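A sketch of that intrinsic-based variant, assuming an x86 target where <immintrin.h> provides _mm_pause (ARM builds would use a YIELD hint instead):
#include <atomic>
#include <immintrin.h> // _mm_pause

class PauseSpinlock {
private:
    std::atomic_flag lock_flag = ATOMIC_FLAG_INIT;

public:
    void lock() {
        while (lock_flag.test_and_set(std::memory_order_acquire)) {
            _mm_pause(); // CPU hint: spin-waiting, de-prioritize this hardware thread
        }
    }

    void unlock() {
        lock_flag.clear(std::memory_order_release);
    }
};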
Optimize Memory Allocations with Pooling
Dynamic memory allocation can introduce unpredictable latencies. A memory pool can be used to pre-allocate and recycle resources.
#include <vector>
#include <mutex>

template <typename T>
class MemoryPool {
private:
    std::vector<T*> pool;
    std::mutex pool_mutex;

public:
    ~MemoryPool() {
        for (auto ptr : pool) {
            delete ptr;
        }
    }

    T* allocate() {
        std::lock_guard<std::mutex> lock(pool_mutex);
        if (!pool.empty()) {
            T* ptr = pool.back(); // reuse a recycled object
            pool.pop_back();
            return ptr;
        }
        return new T(); // pool empty: fall back to the heap
    }

    void deallocate(T* ptr) {
        std::lock_guard<std::mutex> lock(pool_mutex);
        pool.push_back(ptr); // return the object for reuse instead of deleting
    }
};
Memory pooling minimizes allocations during critical trading events.
Use thread-local storage (TLS) to reduce mutex contention in multithreaded environments.
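One way to apply the TLS suggestion, as a sketch that assumes each object is returned by the same thread that allocated it:
template <typename T>
MemoryPool<T>& thread_local_pool() {
    // One pool per thread (and per type), so allocate()/deallocate() never
    // contend on pool_mutex across threads.
    thread_local MemoryPool<T> pool;
    return pool;
}

// Usage sketch with a placeholder Order type:
// Order* o = thread_local_pool<Order>().allocate();
// ... fill and process the order ...
// thread_local_pool<Order>().deallocate(o);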
Fine-Tune Spin-Based Synchronization
For very short critical sections, a spinlock may still be ideal. Pair it with backoff strategies to avoid wasting CPU cycles.
#include <atomic>
#include <chrono>
#include <thread>
#include <algorithm>

// Acquire the flag, sleeping for exponentially longer intervals while contended.
// The lock holder releases it with flag.clear(std::memory_order_release).
void spin_with_backoff(std::atomic_flag& flag) {
    int backoff = 1;
    while (flag.test_and_set(std::memory_order_acquire)) {
        std::this_thread::sleep_for(std::chrono::nanoseconds(backoff));
        backoff = std::min(backoff * 2, 1024); // Cap the backoff at ~1 microsecond
    }
}
NUMA-Aware Memory Allocations
HFT systems with NUMA architectures require careful memory placement to minimize cross-node latencies.
#include <numa.h>
#include <cstddef>
#include <cstdio>
#include <new>

void* allocate_on_node(size_t size, int node) {
    // numa_alloc_onnode places the pages on the requested NUMA node.
    void* ptr = numa_alloc_onnode(size, node);
    if (!ptr) {
        std::fprintf(stderr, "numa_alloc_onnode failed for node %d\n", node);
        throw std::bad_alloc();
    }
    return ptr;
}

void free_numa_memory(void* ptr, size_t size) {
    numa_free(ptr, size); // libnuma needs the original size on free
}
Allocate memory close to the thread/core accessing it.
Combine with thread affinity for maximum locality.
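Putting the two previous points together, a hedged sketch (assuming libnuma 2.x, which provides numa_node_of_cpu, plus the set_thread_affinity and allocate_on_node helpers from the earlier sections):
#include <numa.h>
#include <cstddef>

// Pin the calling thread to core_id and hand back a buffer on that core's NUMA node,
// so the hot path reads from the local memory controller.
void* setup_local_buffer(int core_id, std::size_t size) {
    set_thread_affinity(core_id);           // helper from the CPU-affinity section
    int node = numa_node_of_cpu(core_id);   // which NUMA node owns this core
    if (node < 0) {
        return nullptr;                     // core_id not known to libnuma
    }
    return allocate_on_node(size, node);    // free later with free_numa_memory(ptr, size)
}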
Batch Memory Deallocation
Deallocate memory in batches to reduce overhead from individual calls.
#include <vector>
#include <cstddef>

template <typename T>
class BatchDeallocator {
private:
    std::vector<T*> batch;
    size_t batch_size;

public:
    explicit BatchDeallocator(size_t size) : batch_size(size) {}

    void deallocate(T* ptr) {
        batch.push_back(ptr);           // defer the delete
        if (batch.size() >= batch_size) {
            flush();                    // reclaim a full batch at once
        }
    }

    void flush() {
        for (auto ptr : batch) {
            delete ptr;
        }
        batch.clear();
    }

    ~BatchDeallocator() {
        flush(); // reclaim anything still pending
    }
};
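A possible usage sketch (Order is again a placeholder type): retired objects are queued and reclaimed in groups, keeping individual delete calls off the per-message path:
BatchDeallocator<Order> retired_orders(64); // reclaim 64 objects per flush

void on_order_complete(Order* order) {
    retired_orders.deallocate(order); // cheap push; actual deletes happen in flush()
}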
Finally, a few compiler-level optimizations (illustrated briefly below):
Inline small functions to eliminate function call overhead.
Use constexpr for compile-time computations.
Avoid virtual function calls where possible.
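As a brief illustration of the inline and constexpr tips (the helpers below are hypothetical, not taken from any particular codebase):
// Computed entirely by the compiler: no run-time cost on the hot path.
constexpr long ticks_per_second(long tick_ns) {
    return 1'000'000'000L / tick_ns;
}
static_assert(ticks_per_second(100) == 10'000'000, "evaluated at compile time");

// Small, non-virtual, and defined in the header so the compiler can inline it.
inline double mid_price(double bid, double ask) {
    return 0.5 * (bid + ask);
}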