Exploring the Evolution Beyond Transformers: Unveiling the Power of State Space Models with Mamba


Could there be a better architecture than Transformers?

The Transformer architecture has revolutionized the field of machine learning, starting with natural language processing, and has since become the dominant architecture across modalities. However, despite its unparalleled performance, it has inherent limitations, particularly regarding speed and memory efficiency. Could there be a better architecture that addresses these drawbacks? Enter the world of State Space Models (SSMs) and architectures such as Mamba, which improves on SSMs with a selective approach.


Understanding State Space Models (SSMs)

State Space Models (SSMs) are a novel class of sequence models that blend principles of Recurrent Neural Networks (RNNs) and Convolutional Neural Networks (CNNs). SSMs essentially work like linear RNNs: for each token they compute a new hidden state. Just like in RNNs, the hidden state from the previous token and the embedding of the current input token are transformed and combined.

In detail, an SSM is defined by four parameters, Δ, A, B, and C, each with a different responsibility:

  • Δ is a learned step-size parameter (trained via backpropagation) that modifies the A and B matrices in a discretization step; a short formula for this step follows the list.
  • The discretized A determines how much of the previous hidden state is carried forward.
  • The discretized B determines how much of the current input enters the hidden state.
  • C determines how the hidden state is transformed into the output.
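
For reference, the discretization used in S4/Mamba is typically the zero-order hold; written compactly with the Δ, A, and B defined above:

```latex
% Zero-order-hold discretization of the continuous parameters A and B
\bar{A} = \exp(\Delta A), \qquad
\bar{B} = (\Delta A)^{-1}\bigl(\exp(\Delta A) - I\bigr)\,\Delta B
```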


With the discretized matrices, SSMs go over the sequence token by token, like a linear RNN, computing a new hidden state for each token. The hidden state for token t and the final output representation are obtained as follows:
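
In standard SSM notation (restated compactly here rather than reproduced from the original figure), with the hidden state initialized to zero:

```latex
% SSM recurrence with discretized parameters, followed by the output projection
h_t = \bar{A}\, h_{t-1} + \bar{B}\, x_t, \qquad y_t = C\, h_t
```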


Historically, Transformers have outperformed RNNs in training speed. Transformers process the entire input sequence simultaneously, enabling parallel computation, while RNNs handle one token at a time sequentially. SSMs, however, offer a blend of both worlds. Although SSMs process tokens sequentially like RNNs at inference, they achieve fast, parallel computation during training. This is because their computations are linear, so the outputs for all tokens in the input sequence can be precomputed and evaluated at once.

In SSMs, the hidden state and output computations can be run in a convolutional mode. The entire computation is folded into a convolution kernel K, and the input vectors are stacked into a matrix X. By convolving K with X, all output tokens are obtained simultaneously and in parallel. During inference, SSMs switch to a recurrent mode, processing tokens one after the other while carrying only a fixed-size hidden state. This dual-mode operation allows SSMs to handle longer sequences more efficiently, avoiding the out-of-memory errors that Transformers might encounter.
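
The toy NumPy sketch below illustrates the equivalence of the two modes for a single-channel SSM with a diagonal Ā. All names are illustrative; this is for intuition, not an optimized implementation.

```python
import numpy as np

def ssm_recurrent(A_bar, B_bar, C, x):
    """Recurrent mode: process tokens one by one (as used at inference)."""
    h = np.zeros_like(A_bar)           # hidden state, shape (d_state,)
    ys = []
    for x_t in x:                      # x: (seq_len,) scalar inputs for one channel
        h = A_bar * h + B_bar * x_t    # h_t = Ā h_{t-1} + B̄ x_t
        ys.append(C @ h)               # y_t = C h_t
    return np.array(ys)

def ssm_convolutional(A_bar, B_bar, C, x):
    """Convolutional mode: precompute the kernel K, then convolve (as used at training)."""
    L = len(x)
    # K[k] = C Ā^k B̄  -- the SSM's response k steps after an impulse.
    K = np.array([C @ (A_bar ** k * B_bar) for k in range(L)])
    # A causal convolution of K with the inputs yields all outputs at once.
    return np.convolve(x, K)[:L]

# Both modes produce the same outputs (up to numerical precision).
rng = np.random.default_rng(0)
d_state = 4
A_bar = rng.uniform(0.5, 0.9, d_state)   # diagonal Ā kept stable (entries < 1)
B_bar = rng.normal(size=d_state)
C = rng.normal(size=d_state)
x = rng.normal(size=16)
assert np.allclose(ssm_recurrent(A_bar, B_bar, C, x),
                   ssm_convolutional(A_bar, B_bar, C, x))
```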


While SSMs excel in speed and memory efficiency, they have traditionally lagged behind Transformers in modeling quality. However, advancements are closing this gap. Roughly, SSMs scale linearly with sequence length and carry a fixed-size state at inference, whereas Transformer attention scales quadratically with sequence length during training and needs a key-value cache that grows with the sequence at inference; Transformers, in turn, have historically delivered stronger results on language tasks.



Introducing the Mamba Architecture:

One of the main reasons why SSMs have historically underperformed compared to Transformers is their uniform application of the same Δ, A, B and C matrices to all inputs, failing to distinguish between them. It's crucial to process input tokens differently, remembering some while forgetting others as needed.

The Mamba architecture enhances the performance of traditional SSMs by integrating selective state spaces. This selective mechanism allows the model to dynamically choose which information to retain or discard based on the input. In Mamba, different Δ, B, and C values are computed for each input token (they become functions of the input), while A itself remains input-independent and is only modulated through the Δ-dependent discretization. This enables the SSM to focus on, or ignore, specific tokens, akin to the attention mechanism in Transformers.
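
A rough sketch of what input-dependent parameters can look like is below. The module, names, and shapes are my own illustrative assumptions, not the official Mamba code: Δ, B, and C come from small linear projections of each token, while A stays a learned, input-independent parameter.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelectiveParams(nn.Module):
    """Produces per-token Δ, B, C (the 'selection' mechanism), plus discretized Ā, B̄."""
    def __init__(self, d_model: int, d_state: int):
        super().__init__()
        self.to_delta = nn.Linear(d_model, d_model)  # per-token step size Δ
        self.to_B = nn.Linear(d_model, d_state)      # per-token input matrix B_t
        self.to_C = nn.Linear(d_model, d_state)      # per-token output matrix C_t
        # A is input-independent: one learned value per (channel, state) pair.
        self.A_log = nn.Parameter(torch.randn(d_model, d_state))

    def forward(self, x):                            # x: (batch, seq_len, d_model)
        delta = F.softplus(self.to_delta(x))         # positive step sizes, per token
        B = self.to_B(x)                             # (batch, seq_len, d_state)
        C = self.to_C(x)                             # (batch, seq_len, d_state)
        A = -torch.exp(self.A_log)                   # keep the continuous A stable (negative)
        # Simplified discretization: Ā_t = exp(Δ_t · A), B̄_t ≈ Δ_t · B_t
        A_bar = torch.exp(delta.unsqueeze(-1) * A)   # (batch, seq_len, d_model, d_state)
        B_bar = delta.unsqueeze(-1) * B.unsqueeze(2) # (batch, seq_len, d_model, d_state)
        return A_bar, B_bar, C
```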

While this selective approach is beneficial, it complicates fast training, since the convolutional trick used previously no longer works when Δ, B, and C vary per token. However, the authors of Mamba present a solution based on a parallel associative scan, which yields the intermediate states and enables linear-time computation, allowing Mamba to remain efficient in both training and inference. The key strategies they suggest are kernel fusion, a parallel scan, and recomputation of intermediate values during the backward pass; a toy sketch of the scan idea follows.
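
To make the scan idea concrete, here is a toy NumPy sketch (my own illustration, scalar recurrence, no GPU kernels). The recurrence h_t = a_t·h_{t−1} + b_t can be expressed with an associative combine over (a, b) pairs, so the sequence dimension can be processed in O(log n) parallel sweeps.

```python
import numpy as np

def combine(left, right):
    """Associative operator on (a, b) pairs, each representing the map h -> a*h + b."""
    a1, b1 = left
    a2, b2 = right
    return a1 * a2, a2 * b1 + b2

def associative_scan(a, b):
    """Inclusive scan; each sweep could run in parallel across the whole sequence."""
    a, b = a.copy(), b.copy()
    n, step = len(a), 1
    while step < n:                       # Hillis–Steele style: O(log n) sweeps
        a_prev, b_prev = a.copy(), b.copy()
        for i in range(step, n):          # this inner loop is what a GPU runs in parallel
            a[i], b[i] = combine((a_prev[i - step], b_prev[i - step]),
                                 (a_prev[i], b_prev[i]))
        step *= 2
    return b                              # b[t] now equals h_t (with h_{-1} = 0)

# Check against the plain sequential recurrence.
rng = np.random.default_rng(0)
a, b = rng.uniform(0.5, 0.9, 8), rng.normal(size=8)
h, hs = 0.0, []
for a_t, b_t in zip(a, b):
    h = a_t * h + b_t
    hs.append(h)
assert np.allclose(associative_scan(a, b), hs)
```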


Mamba's hardware-aware scan implementation makes it significantly faster during both training and inference. This efficiency is achieved without compromising the model's ability to perform complex tasks, making it a robust alternative to Transformers.

Overall, one Mamba block comprises this selective SSM module combined with linear projection layers, a causal 1D convolution layer, and a SiLU activation function. Mamba blocks can be stacked multiple times without any other layers (such as attention or MLP blocks) in between. A sketch of one Mamba block follows:
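
For orientation, here is a PyTorch-style skeleton of the block layout. It is a simplified sketch under my own naming assumptions (the selective SSM internals are stubbed out), not the reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MambaBlock(nn.Module):
    """Simplified Mamba block: in-projection, causal depthwise conv, SiLU, SSM, gate, out-projection."""
    def __init__(self, d_model: int, d_inner: int, d_conv: int = 4):
        super().__init__()
        self.in_proj = nn.Linear(d_model, 2 * d_inner)               # expand into two branches
        self.conv1d = nn.Conv1d(d_inner, d_inner, d_conv,
                                groups=d_inner, padding=d_conv - 1)  # causal depthwise conv
        self.out_proj = nn.Linear(d_inner, d_model)                  # project back to model width

    def selective_ssm(self, u):
        # Placeholder for the selective SSM scan sketched earlier.
        return u

    def forward(self, x):                                        # x: (batch, seq_len, d_model)
        u, gate = self.in_proj(x).chunk(2, dim=-1)               # two parallel branches
        u = self.conv1d(u.transpose(1, 2))[..., : x.shape[1]]    # trim padding to stay causal
        u = F.silu(u.transpose(1, 2))                            # SiLU after the convolution
        y = self.selective_ssm(u)                                # selective SSM on the main branch
        y = y * F.silu(gate)                                     # gate with the second branch
        return self.out_proj(y)                                  # residual connection added by the caller
```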



Key features of Mamba architecture:

  • Selection Mechanism: SSM parameters are input-dependent, allowing selective propagation or forgetting of information.
  • Hardware-Aware Algorithm: Efficiently utilizes GPU memory hierarchy, leading to faster computations.
  • Simplified Architecture: Combines SSMs with minimalistic design, eliminating the need for attention or MLP blocks.


This innovation of Mamba in SSMs has demonstrated that an SSM can be as performant as a Transformer while having fewer parameters and lower latency. It has also been shown to work at sequence lengths where the Transformer architecture runs out of memory. Several results reported by the authors can be found in the paper: https://arxiv.org/pdf/2312.00752



The advancement of Structured State Space Models, exemplified by the Mamba architecture, marks a significant milestone in sequence modeling. With their speed, memory efficiency, and increasingly competitive performance, SSMs, particularly Mamba, are poised to complement and, in some cases, surpass Transformers. As the field continues to evolve, the potential for SSMs to become a foundational architecture in machine learning looks promising.


I'm excited about the advancements in SSMs, as seen with the Mamba architecture and beyond, and look forward to their impact on the future of machine learning.

