3FS: A Technical Look at AI's Memory Solution
Large language models face a fundamental challenge: generating each new token requires access to all previous tokens' key-value pairs. This memory requirement scales with context length, creating an expensive bottleneck in AI inference. The Fire-Flyer File System (3FS) offers a technical solution by reimagining where and how these KV pairs are stored.
The KV-Cache Challenge
Modern transformer models compute key (K) and value (V) vectors for each token in each attention head across dozens of layers. A typical model with 32 layers and 96 attention heads generates 2-4KB of KV data per token. With 32K token contexts becoming standard, that's up to 128MB per user session—multiplied by thousands of concurrent users.
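As a rough sanity check, here is the arithmetic behind those figures; the per-token size and context length come from above, while the concurrency number is purely illustrative.

```python
# Back-of-the-envelope KV-Cache sizing (the concurrency figure is illustrative).
KV_BYTES_PER_TOKEN = 4 * 1024   # upper end of the 2-4 KB/token estimate above
CONTEXT_TOKENS = 32 * 1024      # 32K-token context window
CONCURRENT_USERS = 1_000        # assumed number of simultaneous sessions

per_session = KV_BYTES_PER_TOKEN * CONTEXT_TOKENS
cluster_wide = per_session * CONCURRENT_USERS
print(f"Per session:  {per_session / 2**20:.0f} MiB")   # ~128 MiB
print(f"Cluster-wide: {cluster_wide / 2**30:.0f} GiB")   # ~125 GiB for 1,000 sessions
```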
Traditional solutions rely on expensive DRAM, but 3FS takes a different approach: distributing KV-Cache across NVMe SSDs connected via RDMA networks.
Technical Architecture
3FS creates a specialized storage tier optimized for AI workloads. It leverages several key technologies: NVMe SSDs for dense, inexpensive capacity; RDMA networking for low-latency, high-bandwidth data movement between storage nodes and GPU servers; and a disaggregated, distributed design that lets storage scale independently of compute.
The performance data is impressive: peak read throughput of 40 GiB/s, which translates to roughly 10-20 million KV pairs per second, enough to feed token generation for even the largest models without stalling on cache reads.
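A minimal check of that conversion, assuming the 2-4 KB per-pair size from earlier:

```python
# Convert aggregate read throughput into KV pairs served per second.
THROUGHPUT = 40 * 2**30                    # 40 GiB/s peak read throughput
for kv_bytes in (2 * 1024, 4 * 1024):      # assumed 2-4 KB per KV pair
    print(f"{kv_bytes // 1024} KB/pair -> {THROUGHPUT / kv_bytes / 1e6:.0f}M pairs/s")
# 2 KB/pair -> ~21M pairs/s, 4 KB/pair -> ~10M pairs/s
```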
Memory Economics Transformed
The cost implications are substantial. DRAM typically costs $10-15 per GB, while enterprise NVMe storage runs around $1-2 per GB. For a large inference cluster requiring hundreds of terabytes of KV-Cache, this represents millions in infrastructure savings.
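To make that concrete, a small sketch using the per-GB prices above and an assumed 500 TB cache footprint (the footprint is an assumption, not a published figure):

```python
# Rough cost comparison for a large KV-Cache footprint (500 TB is an assumption).
CACHE_GB = 500 * 1000                 # 500 TB of KV-Cache capacity
DRAM_PER_GB, NVME_PER_GB = 12.5, 1.5  # midpoints of the $/GB ranges above

dram_cost = CACHE_GB * DRAM_PER_GB
nvme_cost = CACHE_GB * NVME_PER_GB
print(f"DRAM: ${dram_cost / 1e6:.2f}M  NVMe: ${nvme_cost / 1e6:.2f}M  "
      f"savings: ${(dram_cost - nvme_cost) / 1e6:.2f}M")
# DRAM: $6.25M  NVMe: $0.75M  savings: $5.50M
```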
More importantly, it changes the scaling equation. Traditional high-memory servers face superlinear cost increases as memory requirements grow, while 3FS enables linear scaling with commodity storage hardware.
Technical Implementation Details
The system introduces a new tier in the memory hierarchy specifically for AI:
CPU Cache (ns) → DRAM (100s ns) → 3FS KVCache (10s μs) → Traditional Storage (ms)
The I/O pattern optimization is particularly notable. Rather than suffering from small random reads (which SSDs handle poorly), 3FS uses batched access patterns that amortize I/O costs, as sketched below. The garbage collector follows a generational pattern, visible as regular IOPS spikes when obsolete KV pairs are removed.
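The batching idea can be sketched in a few lines: sort the requested token positions, coalesce adjacent ones into contiguous ranges, and issue one large read per range instead of many tiny ones. The fixed-size block layout and file handle below are illustrative assumptions, not the actual 3FS interface.

```python
# Sketch of amortizing I/O: coalesce adjacent KV lookups into large sequential reads.
# The block size and on-disk layout are assumptions, not the real 3FS format.
from typing import Sequence

KV_BLOCK_BYTES = 4 * 1024  # assumed size of one serialized KV entry

def fetch_kv_batch(f, token_indices: Sequence[int]) -> dict[int, bytes]:
    """Coalesce adjacent token indices into ranges and read each range with one call."""
    indices = sorted(set(token_indices))
    results: dict[int, bytes] = {}
    i = 0
    while i < len(indices):
        j = i
        while j + 1 < len(indices) and indices[j + 1] == indices[j] + 1:
            j += 1                                 # extend the run while contiguous
        start, count = indices[i], j - i + 1
        f.seek(start * KV_BLOCK_BYTES)
        blob = f.read(count * KV_BLOCK_BYTES)      # one large read instead of `count` small ones
        for k in range(count):
            results[start + k] = blob[k * KV_BLOCK_BYTES:(k + 1) * KV_BLOCK_BYTES]
        i = j + 1
    return results
```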
For integration, 3FS maintains standard file interfaces rather than introducing specialized APIs. This reduces adoption friction—existing PyTorch or TensorFlow code requires minimal changes to benefit from the distributed storage approach.
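In practice that can be as simple as pointing existing save/load calls at a path on the 3FS mount. The mount point and file-naming scheme below are purely illustrative assumptions:

```python
# Minimal sketch: spilling and restoring per-layer KV tensors through an ordinary
# file path. "/3fs/kvcache" and the naming scheme are assumptions for illustration.
import torch

CACHE_ROOT = "/3fs/kvcache"

def spill_kv(session_id: str, layer: int, k: torch.Tensor, v: torch.Tensor) -> None:
    torch.save({"k": k.cpu(), "v": v.cpu()}, f"{CACHE_ROOT}/{session_id}_layer{layer}.pt")

def restore_kv(session_id: str, layer: int, device: str = "cuda"):
    blob = torch.load(f"{CACHE_ROOT}/{session_id}_layer{layer}.pt", map_location=device)
    return blob["k"], blob["v"]
```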
Technical Implications
The most significant technical implication is the effective removal of memory as the primary constraint in LLM inference. When KV-Cache can be economically scaled to hundreds of terabytes, new possibilities emerge: longer context windows without prohibitive memory cost, far more concurrent sessions per inference node, and cached context that can be retained and reused rather than recomputed.
As NVMe performance continues to improve (PCIe 5.0 and beyond), the gap between DRAM and storage-based KV-Cache will narrow further. The disaggregated architecture also allows independent scaling of compute and storage resources based on workload characteristics.
3FS demonstrates how clever engineering at the systems level can sometimes deliver gains that would otherwise require algorithmic or hardware advances. By repurposing existing technology components—SSDs, RDMA, distributed systems principles—it delivers a solution that addresses one of the most pressing constraints in modern AI deployment.
Thanks for reading. Want more?
Check https://harpagan.com/