Lock-Free shared_ptr
Use Lock-Free Reference Counting
Spinlocks, while effective, can be too slow for HFT. Instead, a lock-free reference counting mechanism can be used, leveraging atomic operations.
#include <atomic>

template <typename T>
class ControlBlock {
public:
    T* ptr;
    std::atomic<int> strong_count;
    std::atomic<int> weak_count;

    ControlBlock(T* resource)
        : ptr(resource), strong_count(1), weak_count(0) {}
};

template <typename T>
class SharedPtr {
private:
    ControlBlock<T>* control;

    void release() {
        // The thread that drops the strong count to zero owns the cleanup; acq_rel
        // makes every earlier write to the object visible before it is destroyed.
        if (control && control->strong_count.fetch_sub(1, std::memory_order_acq_rel) == 1) {
            delete control->ptr;       // destroy the managed object exactly once
            control->ptr = nullptr;
            if (control->weak_count.load(std::memory_order_acquire) == 0) {
                delete control;
            }
        }
    }

public:
    explicit SharedPtr(T* resource)
        : control(new ControlBlock<T>(resource)) {}

    SharedPtr(const SharedPtr& other) : control(other.control) {
        // Copying only requires an atomic increment; no lock is taken.
        control->strong_count.fetch_add(1, std::memory_order_relaxed);
    }

    ~SharedPtr() {
        release();
    }
};
The acq_rel ordering on fetch_sub guarantees that all writes to the managed object are visible to the thread that performs the final decrement and deletes it.
Eliminates the need for spinlocks or mutexes.
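As a minimal usage sketch (the Order type here is just a placeholder), copies of the SharedPtr above can be handed to worker threads with nothing more than atomic count updates:
#include <thread>

struct Order { int id; double price; }; // placeholder payload type

void share_across_threads() {
    SharedPtr<Order> order(new Order{1, 101.25});
    // Each lambda captures its own copy; only fetch_add/fetch_sub run, no locks.
    std::thread t1([order] { /* read-only work with the shared Order */ });
    std::thread t2([order] { /* read-only work with the shared Order */ });
    t1.join();
    t2.join();
} // The last surviving copy frees the Order exactly once.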
CPU Affinity for Threads
HFT workloads benefit from pinning threads to specific CPU cores, which minimizes context switches and keeps caches warm.
#include <pthread.h>
#include <sched.h>
#include <cstdio>
#include <cstring>

void set_thread_affinity(int core_id) {
    cpu_set_t cpuset;
    CPU_ZERO(&cpuset);
    CPU_SET(core_id, &cpuset);

    pthread_t current_thread = pthread_self();
    int result = pthread_setaffinity_np(current_thread, sizeof(cpu_set_t), &cpuset);
    if (result != 0) {
        // pthread_setaffinity_np returns the error code directly; it does not set errno.
        std::fprintf(stderr, "pthread_setaffinity_np: %s\n", std::strerror(result));
    }
}

// Example usage
void hft_thread_function() {
    set_thread_affinity(2); // Bind this thread to core 2
    // Perform HFT tasks here
}
Use pthread_setaffinity_np on Linux for precise CPU core control.
Match core affinity with NUMA (Non-Uniform Memory Access) considerations for optimal memory locality.
Use a Low-Latency Spinlock
For spinning scenarios, implement a low-latency spinlock that minimizes contention.
#include <atomic>
#include <thread>

class Spinlock {
private:
    std::atomic_flag lock_flag = ATOMIC_FLAG_INIT;

public:
    void lock() {
        while (lock_flag.test_and_set(std::memory_order_acquire)) {
            // Yield the time slice while contended; a PAUSE/YIELD intrinsic
            // avoids the scheduler call on the hottest paths (see below).
            std::this_thread::yield();
        }
    }

    void unlock() {
        lock_flag.clear(std::memory_order_release);
    }
};
std::this_thread::yield hands the remaining time slice back to the scheduler so other threads can make progress, at the cost of a system call on each spin.
Use PAUSE (x86) or YIELD (ARM) intrinsics for CPU-specific optimizations.
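A sketch of that intrinsic-based variant, assuming an x86 target where <immintrin.h> provides _mm_pause (ARM builds would use a YIELD hint instead):
#include <atomic>
#include <immintrin.h> // _mm_pause

class PauseSpinlock {
private:
    std::atomic_flag lock_flag = ATOMIC_FLAG_INIT;

public:
    void lock() {
        while (lock_flag.test_and_set(std::memory_order_acquire)) {
            _mm_pause(); // CPU hint: spin-waiting, de-prioritize this hardware thread
        }
    }

    void unlock() {
        lock_flag.clear(std::memory_order_release);
    }
};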
Optimize Memory Allocations with Pooling
Dynamic memory allocation can introduce unpredictable latencies. A memory pool can be used to pre-allocate and recycle resources.
#include <vector>
#include <mutex>

template <typename T>
class MemoryPool {
private:
    std::vector<T*> pool;
    std::mutex pool_mutex;

public:
    ~MemoryPool() {
        for (auto ptr : pool) {
            delete ptr;
        }
    }

    T* allocate() {
        std::lock_guard<std::mutex> lock(pool_mutex);
        if (!pool.empty()) {
            T* ptr = pool.back(); // reuse a recycled object
            pool.pop_back();
            return ptr;
        }
        return new T(); // pool empty: fall back to the heap
    }

    void deallocate(T* ptr) {
        std::lock_guard<std::mutex> lock(pool_mutex);
        pool.push_back(ptr); // return the object for reuse instead of deleting
    }
};
Memory pooling minimizes allocations during critical trading events.
Use thread-local storage (TLS) to reduce mutex contention in multithreaded environments.
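One way to apply the TLS suggestion, as a sketch that assumes each object is returned by the same thread that allocated it:
template <typename T>
MemoryPool<T>& thread_local_pool() {
    // One pool per thread (and per type), so allocate()/deallocate() never
    // contend on pool_mutex across threads.
    thread_local MemoryPool<T> pool;
    return pool;
}

// Usage sketch with a placeholder Order type:
// Order* o = thread_local_pool<Order>().allocate();
// ... fill and process the order ...
// thread_local_pool<Order>().deallocate(o);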
Fine-Tune Spin-Based Synchronization
For very short critical sections, a spinlock may still be ideal. Pair it with backoff strategies to avoid wasting CPU cycles.
#include <atomic>
#include <chrono>
#include <thread>
#include <algorithm>

// Acquire the flag, sleeping for exponentially longer intervals while contended.
// The lock holder releases it with flag.clear(std::memory_order_release).
void spin_with_backoff(std::atomic_flag& flag) {
    int backoff = 1;
    while (flag.test_and_set(std::memory_order_acquire)) {
        std::this_thread::sleep_for(std::chrono::nanoseconds(backoff));
        backoff = std::min(backoff * 2, 1024); // Cap the backoff at ~1 microsecond
    }
}
NUMA-Aware Memory Allocations
HFT systems with NUMA architectures require careful memory placement to minimize cross-node latencies.
#include <numa.h>
#include <cstddef>
#include <cstdio>
#include <new>

void* allocate_on_node(size_t size, int node) {
    // numa_alloc_onnode places the pages on the requested NUMA node.
    void* ptr = numa_alloc_onnode(size, node);
    if (!ptr) {
        std::fprintf(stderr, "numa_alloc_onnode failed for node %d\n", node);
        throw std::bad_alloc();
    }
    return ptr;
}

void free_numa_memory(void* ptr, size_t size) {
    numa_free(ptr, size); // libnuma needs the original size on free
}
Allocate memory close to the thread/core accessing it.
Combine with thread affinity for maximum locality.
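Putting the two previous points together, a hedged sketch (assuming libnuma 2.x, which provides numa_node_of_cpu, plus the set_thread_affinity and allocate_on_node helpers from the earlier sections):
#include <numa.h>
#include <cstddef>

// Pin the calling thread to core_id and hand back a buffer on that core's NUMA node,
// so the hot path reads from the local memory controller.
void* setup_local_buffer(int core_id, std::size_t size) {
    set_thread_affinity(core_id);           // helper from the CPU-affinity section
    int node = numa_node_of_cpu(core_id);   // which NUMA node owns this core
    if (node < 0) {
        return nullptr;                     // core_id not known to libnuma
    }
    return allocate_on_node(size, node);    // free later with free_numa_memory(ptr, size)
}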
Batch Memory Deallocation
Deallocate memory in batches to reduce overhead from individual calls.
#include <vector>
#include <cstddef>

template <typename T>
class BatchDeallocator {
private:
    std::vector<T*> batch;
    size_t batch_size;

public:
    explicit BatchDeallocator(size_t size) : batch_size(size) {}

    void deallocate(T* ptr) {
        batch.push_back(ptr);           // defer the delete
        if (batch.size() >= batch_size) {
            flush();                    // reclaim a full batch at once
        }
    }

    void flush() {
        for (auto ptr : batch) {
            delete ptr;
        }
        batch.clear();
    }

    ~BatchDeallocator() {
        flush(); // reclaim anything still pending
    }
};
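A possible usage sketch (Order is again a placeholder type): retired objects are queued and reclaimed in groups, keeping individual delete calls off the per-message path:
BatchDeallocator<Order> retired_orders(64); // reclaim 64 objects per flush

void on_order_complete(Order* order) {
    retired_orders.deallocate(order); // cheap push; actual deletes happen in flush()
}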
Finally, a few compiler-level optimizations (illustrated briefly below):
Inline small functions to eliminate function call overhead.
Use constexpr for compile-time computations.
Avoid virtual function calls where possible.
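As a brief illustration of the inline and constexpr tips (the helpers below are hypothetical, not taken from any particular codebase):
// Computed entirely by the compiler: no run-time cost on the hot path.
constexpr long ticks_per_second(long tick_ns) {
    return 1'000'000'000L / tick_ns;
}
static_assert(ticks_per_second(100) == 10'000'000, "evaluated at compile time");

// Small, non-virtual, and defined in the header so the compiler can inline it.
inline double mid_price(double bid, double ask) {
    return 0.5 * (bid + ask);
}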