Paper Review: Samba: Simple Hybrid State Space Models for Efficient Unlimited Context Language Modeling
Andrey Lukyanenko
Senior Data Scientist @ Careem. Kaggle Competition Master, Notebooks Top-1.
Samba is a new hybrid architecture designed to efficiently model sequences with unlimited context length. It combines Mamba, a selective State Space Model (SSM), with Sliding Window Attention (SWA), allowing Samba to compress sequences into recurrent hidden states while retaining precise memory recall through attention. With 3.8 billion parameters trained on 3.2 trillion tokens, Samba significantly outperforms state-of-the-art models based on pure attention or pure SSMs across a wide range of benchmarks. It extrapolates efficiently from 4K-length training sequences to a 256K context length with perfect memory recall, and its token predictions keep improving up to a 1M context length. Samba achieves 3.73× higher throughput than Transformers with grouped-query attention when processing 128K-length prompts, and a 3.64× speedup when generating 64K tokens with unlimited streaming.
The approach
Mamba is a recently proposed model based on selective state spaces. It employs input-dependent gating to selectively process elements of an input sequence. Mamba first expands the input to a higher dimension using a learnable projection matrix. A Short Convolution (SC) operator, consisting of a depthwise convolution followed by a SiLU activation, is then applied to smooth the input signal.
The model computes a selective gate through a low-rank projection followed by a Softplus activation, with parameters initialized so that the gate values stay within a specific range. At each time step, the recurrent inference of the Selective SSM runs in an expanded state space, combining the previous state with the current input through point-wise and outer-product operations. The final output is produced by a gating mechanism similar to the Gated Linear Unit.
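To make this concrete, here is a minimal sketch of the selective SSM recurrence in its naive sequential form. The actual model uses a hardware-aware parallel scan, and the single-matrix Δ projection and dimension names below are simplifications of the paper's low-rank parameterization, so treat this as an illustration rather than the authors' implementation:

```python
# Naive sequential sketch of a selective SSM step (illustrative only).
import torch
import torch.nn.functional as F

def selective_ssm(x, A, W_delta, W_B, W_C):
    """
    x: (batch, seq_len, d_inner)  -- input after expansion, short conv, and SiLU
    A: (d_inner, d_state)         -- learned (typically negative) transition parameters
    W_delta: (d_inner, d_inner), W_B / W_C: (d_inner, d_state) -- simplified projections
    """
    Bsz, L, D = x.shape
    N = A.shape[1]
    h = torch.zeros(Bsz, D, N, device=x.device)          # expanded recurrent state
    outputs = []
    for t in range(L):
        xt = x[:, t]                                      # (B, D)
        delta = F.softplus(xt @ W_delta)                  # input-dependent gate (B, D)
        Bt = xt @ W_B                                     # input projection (B, N)
        Ct = xt @ W_C                                     # output projection (B, N)
        A_bar = torch.exp(delta.unsqueeze(-1) * A)        # discretized transition (B, D, N)
        B_bar = delta.unsqueeze(-1) * Bt.unsqueeze(1)     # discretized input (B, D, N)
        h = A_bar * h + B_bar * xt.unsqueeze(-1)          # point-wise recurrence
        outputs.append((h * Ct.unsqueeze(1)).sum(-1))     # read out the state -> (B, D)
    return torch.stack(outputs, dim=1)                    # (B, L, D)
```

The key point is that Δ, B, and C are all functions of the current input, which is what makes the state update selective: the model can decide, token by token, how much of the past to keep and how much of the present to write into the state.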
Mamba’s layer in Samba captures the time-dependent semantics of input sequences through its recurrent structure and input selection mechanism, enabling the model to focus on relevant inputs and memorize important information over the long term.
The Sliding Window Attention layer is designed to overcome the limitations of the Mamba layer in capturing non-Markovian dependencies in sequences. Operating with a window of size 2048 that slides over the input sequence, SWA maintains linear computational complexity in sequence length. It applies RoPE relative positions within the window, allowing the model to retrieve high-definition signals from the middle- to short-term history that Mamba's recurrent states cannot capture clearly. SWA uses FlashAttention 2 for an efficient self-attention implementation. The 2048 window size is chosen for efficiency, providing training speed similar to Mamba's selective parallel scan at this sequence length.
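Below is a minimal sketch of the windowed attention pattern using an explicit boolean mask; the actual model applies RoPE and relies on FlashAttention 2 kernels rather than materializing a mask, so this is only meant to show which positions each token may attend to:

```python
# Illustrative sliding window attention: each token attends to itself and at most
# the previous `window - 1` tokens.
import torch
import torch.nn.functional as F

def sliding_window_attention(q, k, v, window=2048):
    # q, k, v: (batch, heads, seq_len, head_dim)
    L = q.shape[-2]
    idx = torch.arange(L, device=q.device)
    # True where attention is allowed: causal AND within the sliding window.
    mask = (idx[None, :] <= idx[:, None]) & (idx[:, None] - idx[None, :] < window)
    return F.scaled_dot_product_attention(q, k, v, attn_mask=mask)
```

Because each query attends to at most `window` keys, compute and memory grow linearly with sequence length instead of quadratically.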
Experiments
Samba outperforms several strong baselines, including Llama 2, Mistral, Mamba, Gemma, Recurrent-Gemma, Llama 3, and TFM++. It achieves the highest average score across all benchmarks and excels on GSM8K, with 18.1% higher accuracy than TFM++.
In an evaluation of six models with around 1.7B parameters trained on the Phi-2 dataset, Samba demonstrated superior performance across 15 downstream benchmarks covering commonsense reasoning, language understanding, TruthfulQA, and code generation. It outperformed both pure attention-based and pure SSM-based models on most tasks and achieved the best average performance.
Replacing Mamba blocks with MLPs did not harm commonsense reasoning but significantly reduced performance in language understanding and complex reasoning tasks. Pure Mamba models struggled with retrieval-intensive tasks like SQuAD due to their lack of precise memory retrieval ability. The best results were achieved through the combination of attention and Mamba modules in the Samba architecture. The Mamba-SWA-MLP combination showed significantly better performance on GSM8K, indicating effective collaboration between Mamba and SWA layers.
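As a rough illustration of how such a hybrid can be wired, the sketch below stacks Mamba, SWA, and MLP sub-layers with pre-norm residual connections. The exact sub-layer ordering and normalization here are assumptions for illustration, and the Mamba and SWA modules are passed in as placeholders rather than reproducing the paper's implementation:

```python
# Hypothetical wiring of one hybrid block: recurrence and attention alternate,
# each followed by an MLP, all wrapped in pre-norm residual connections.
import torch
import torch.nn as nn

class HybridBlock(nn.Module):
    def __init__(self, d_model, mamba_layer: nn.Module, swa_layer: nn.Module, d_ff=None):
        super().__init__()
        d_ff = d_ff or 4 * d_model
        def mlp():
            return nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
        # Assumed order: Mamba -> MLP -> SWA -> MLP
        self.sublayers = nn.ModuleList([mamba_layer, mlp(), swa_layer, mlp()])
        self.norms = nn.ModuleList([nn.LayerNorm(d_model) for _ in range(4)])

    def forward(self, x):
        for norm, layer in zip(self.norms, self.sublayers):
            x = x + layer(norm(x))   # pre-norm residual around each sub-layer
        return x

# Usage with identity stand-ins, just to show the wiring:
block = HybridBlock(64, mamba_layer=nn.Identity(), swa_layer=nn.Identity())
out = block(torch.randn(2, 16, 64))
```

The division of labor is the point: the recurrent sub-layer carries long-term, compressed context, while the attention sub-layer handles precise retrieval within its window.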
Exploration on Attention and Linear Recurrence
Other architectures that combine attention layers with linear recurrence (such as Sliding GLA and Sliding RetNet) were also compared against Samba.
Samba consistently outperforms all other models across different context lengths and model sizes. Its training speed is competitive with pure Transformer-based models at the 1.3B scale.
Efficient Length Extrapolation
The length extrapolation ability of models at a scale of around 1.7B parameters is evaluated on the Proof-Pile dataset. Data pre-processing follows Position Interpolation, and perplexity is computed with a sliding window of size 4096.
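A rough sketch of this kind of sliding-window perplexity evaluation is shown below; the model interface (per-token logits) and the stride value are assumptions, not details from the paper:

```python
# Sliding-window perplexity over a long document: each token is scored once,
# using at most `window` tokens of left context.
import math
import torch
import torch.nn.functional as F

@torch.no_grad()
def sliding_window_perplexity(model, token_ids, window=4096, stride=512):
    """token_ids: 1-D LongTensor holding one long evaluation document."""
    ids = token_ids.unsqueeze(0)
    total_len = ids.shape[1]
    nll, counted, prev_end = 0.0, 0, 0
    for start in range(0, total_len, stride):
        end = min(start + window, total_len)
        target_len = end - prev_end                     # tokens not yet scored
        chunk = ids[:, start:end]
        logits = model(chunk)                           # (1, chunk_len, vocab) -- assumed interface
        logp = F.log_softmax(logits[:, :-1], dim=-1)
        targets = chunk[:, 1:]
        token_nll = -logp.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
        scored = min(target_len, token_nll.shape[1])
        nll += token_nll[:, -scored:].sum().item()      # count only the new positions
        counted += scored
        prev_end = end
        if end == total_len:
            break
    return math.exp(nll / counted)
```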
Long-Context Understanding
Following the same post-training recipe used for the Phi-3-mini series, the instruction-tuned Samba-3.8B-IT was evaluated on both long-context summarization tasks (GovReport, SQuALITY) and main short-context benchmarks (MMLU, GSM8K, HumanEval). Samba-3.8B-IT outperforms Phi-3-mini-4k-instruct on both short-context and long-context tasks.
Analysis
Despite SWA's linear complexity in sequence length, increasing the training sequence length leads to higher validation perplexity because the batch size must shrink accordingly. The optimal ratio of sequence length to window size is 2, giving a training length of 4096.
Hybridizing with full attention is not ideal: it leads to exploding perplexity at longer context lengths and worse training throughput than SWA, even when implemented with FlashAttention 2. Because Mamba already captures low-rank information, the attention layers in Samba can focus on information retrieval and therefore need fewer attention heads; Samba performs better with a smaller number of query heads than Llama-2-SWA.
The hybrid architecture benefits from the specialization of attention layers in Samba: they focus on global information integration in the upper and lower layers and on precise retrieval in the middle layers, which improves downstream performance.
The Short Convolution operator used in Mamba can also enhance other linear recurrent models. Adding SC improves the performance of Llama-2-SWA and Sliding RetNet, but not Sliding GLA, which already has fine-grained decays at the channel level. Even with SC, these models do not outperform the original Samba, justifying the choice of Mamba for hybridization. Adding SC to both the SWA and linear attention layers of hybrid models hurts performance, suggesting that future research is needed to understand why SC is effective in language modeling.
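For reference, a minimal sketch of such a Short Convolution operator is shown below; the kernel size of 4 is an assumption borrowed from common Mamba implementations rather than a detail stated in this review:

```python
# Short Convolution (SC): a depthwise causal 1-D convolution with a small kernel,
# followed by a SiLU activation, used to smooth the projected input signal.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ShortConv(nn.Module):
    def __init__(self, d_model, kernel_size=4):
        super().__init__()
        self.kernel_size = kernel_size
        self.conv = nn.Conv1d(d_model, d_model, kernel_size,
                              groups=d_model, padding=0)  # depthwise: one filter per channel

    def forward(self, x):                        # x: (batch, seq_len, d_model)
        x = x.transpose(1, 2)                    # -> (batch, d_model, seq_len)
        x = F.pad(x, (self.kernel_size - 1, 0))  # left-pad so the convolution stays causal
        x = self.conv(x)
        return F.silu(x).transpose(1, 2)         # back to (batch, seq_len, d_model)

# Example: smoothing a projected input before a recurrent layer
y = ShortConv(d_model=64)(torch.randn(2, 128, 64))
```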