Lock-Free shared_ptr

Use Lock-Free Reference Counting

Spinlocks, while effective, can still cost too much latency for high-frequency trading (HFT). A lock-free reference-counting mechanism built on atomic operations avoids blocking entirely.

template <typename T>
class ControlBlock {
public:
    T* ptr;
    std::atomic<int> strong_count;
    std::atomic<int> weak_count;

    ControlBlock(T* resource)
        : ptr(resource), strong_count(1), weak_count(0) {}

    // The managed object is deleted in SharedPtr::release(), not here,
    // so that the control block can outlive the object for weak references
    // without risking a double delete.
    ~ControlBlock() = default;
};

template <typename T>
class SharedPtr {
private:
    ControlBlock<T>* control;

    void release() {
        if (control && control->strong_count.fetch_sub(1, std::memory_order_acq_rel) == 1) {
            delete control->ptr;
            control->ptr = nullptr; // guard against a second delete in ~ControlBlock
            if (control->weak_count.load(std::memory_order_acquire) == 0) {
                delete control;
            }
        }
    }

public:
    explicit SharedPtr(T* resource)
        : control(new ControlBlock<T>(resource)) {}

    // Copying shares ownership by bumping the strong count atomically.
    SharedPtr(const SharedPtr& other) : control(other.control) {
        if (control) {
            control->strong_count.fetch_add(1, std::memory_order_relaxed);
        }
    }

    SharedPtr& operator=(const SharedPtr&) = delete; // omitted for brevity

    T* get() const { return control ? control->ptr : nullptr; }

    ~SharedPtr() {
        release();
    }
};

The acq_rel ordering on fetch_sub guarantees that every write made by an owner before it releases is visible to the thread that performs the final decrement and destroys the object.

Eliminates the need for spinlocks or mutexes.

CPU Affinity for Threads

Workloads benefit from assigning threads to specific CPU cores to minimize context switching and cache misses.

#include <pthread.h>
#include <sched.h>
#include <cstdio>
#include <cstring>

void set_thread_affinity(int core_id) {
    cpu_set_t cpuset;
    CPU_ZERO(&cpuset);
    CPU_SET(core_id, &cpuset);

    pthread_t current_thread = pthread_self();
    // pthread_setaffinity_np returns the error number directly;
    // it does not set errno, so perror() would print the wrong message.
    int result = pthread_setaffinity_np(current_thread, sizeof(cpu_set_t), &cpuset);
    if (result != 0) {
        std::fprintf(stderr, "pthread_setaffinity_np: %s\n", std::strerror(result));
    }
}

// Example usage
void hft_thread_function() {
    set_thread_affinity(2); // Bind this thread to core 2
    // Perform HFT tasks here
}        

Use pthread_setaffinity_np on Linux for precise CPU core control.

Match core affinity with NUMA (Non-Uniform Memory Access) considerations for optimal memory locality.

Use a Low-Latency Spinlock

For spinning scenarios, implement a low-latency spinlock that minimizes contention.

#include <atomic>
#include <thread>

class Spinlock {
private:
    std::atomic_flag lock_flag = ATOMIC_FLAG_INIT;

public:
    void lock() {
        while (lock_flag.test_and_set(std::memory_order_acquire)) {
            // Yield to the scheduler while waiting; for pure busy-waiting,
            // substitute a PAUSE/YIELD intrinsic here.
            std::this_thread::yield();
        }
    }

    void unlock() {
        lock_flag.clear(std::memory_order_release);
    }
};        

Using std::this_thread::yield reduces contention by allowing other threads to proceed.

Use PAUSE (x86) or YIELD (ARM) intrinsics for CPU-specific optimizations.

Optimize Memory Allocations with Pooling

Dynamic memory allocation can introduce unpredictable latencies. A memory pool can be used to pre-allocate and recycle resources.

#include <vector>
#include <mutex>

template <typename T>
class MemoryPool {
private:
    std::vector<T*> pool;
    std::mutex pool_mutex;

public:
    ~MemoryPool() {
        for (auto ptr : pool) {
            delete ptr;
        }
    }

    T* allocate() {
        std::lock_guard<std::mutex> lock(pool_mutex);
        if (!pool.empty()) {
            T* ptr = pool.back();
            pool.pop_back();
            return ptr;
        }
        return new T();
    }

    void deallocate(T* ptr) {
        std::lock_guard<std::mutex> lock(pool_mutex);
        pool.push_back(ptr);
    }
};        

Memory pooling minimizes allocations during critical trading events.

Use thread-local storage (TLS) to reduce mutex contention in multithreaded environments.


Fine-Tune Spin-Based Synchronization

For very short critical sections, a spinlock may still be ideal. Pair it with backoff strategies to avoid wasting CPU cycles.

#include <algorithm>
#include <atomic>
#include <chrono>
#include <thread>

// Acquire the lock, backing off exponentially between attempts.
// test_and_set/clear are std::atomic_flag operations, so the guard must be
// a std::atomic_flag, not a std::atomic<bool>.
void spin_lock_with_backoff(std::atomic_flag& flag) {
    int backoff = 1;
    while (flag.test_and_set(std::memory_order_acquire)) {
        std::this_thread::sleep_for(std::chrono::nanoseconds(backoff));
        backoff = std::min(backoff * 2, 1024); // Cap the backoff
    }
    // The lock is held on return; the caller releases it with
    // flag.clear(std::memory_order_release) after the critical section.
}

NUMA-Aware Memory Allocations

HFT systems with NUMA architectures require careful memory placement to minimize cross-node latencies.

#include <numa.h>
#include <numaif.h>
#include <cstdio>
#include <new>

void* allocate_on_node(size_t size, int node) {
    // Callers should have verified numa_available() != -1 at startup.
    void* ptr = numa_alloc_onnode(size, node);
    if (!ptr) {
        // numa_alloc_onnode does not reliably set errno, so report plainly.
        std::fprintf(stderr, "numa_alloc_onnode failed\n");
        throw std::bad_alloc();
    }
    return ptr;
}

void free_numa_memory(void* ptr, size_t size) {
    numa_free(ptr, size);
}

Allocate memory close to the thread/core accessing it.

Combine with thread affinity for maximum locality.

Batch Memory Deallocation

Deallocate memory in batches to reduce overhead from individual calls.

template <typename T>
class BatchDeallocator {
private:
    std::vector<T*> batch;
    size_t batch_size;

public:
    explicit BatchDeallocator(size_t size) : batch_size(size) {}

    void deallocate(T* ptr) {
        batch.push_back(ptr);
        if (batch.size() >= batch_size) {
            flush();
        }
    }

    void flush() {
        for (auto ptr : batch) {
            delete ptr;
        }
        batch.clear();
    }

    ~BatchDeallocator() {
        flush();
    }
};        

Compiler-Level Micro-Optimizations

Inline small functions to eliminate function call overhead.

Use constexpr for compile-time computations.

Avoid virtual function calls where possible.


  • Lock-free data structures minimize latency.
  • CPU affinity and NUMA-aware allocations reduce cross-node traffic.
  • Memory pooling and batch deallocation handle allocations predictably.
  • A finely-tuned spinlock with backoff ensures contention is managed efficiently.
