Exploring the Evolution Beyond Transformers: Unveiling the Power of State Space Models with Mamba


Could there be a better architecture than Transformers?

The Transformer architecture has revolutionized the field of machine learning, starting with natural language processing, and has since become the dominant architecture across modalities. However, despite its unparalleled performance, it has inherent limitations, particularly regarding speed and memory efficiency. Could there be a better architecture that addresses these drawbacks? Enter the world of State Space Models (SSMs) and architectures such as Mamba, which improves on SSMs with a selective approach.


Understanding State Space Models (SSMs)

State Space Models (SSMs) are a novel class of sequence models that blend principles of Recurrent Neural Networks (RNNs) and Convolutional Neural Networks (CNNs). SSMs essentially work like linear RNNs: for each token they compute a new hidden state. Just like in RNNs, the hidden state from the previous token and the embedding of the current input token are transformed and combined.

In detail, an SSM is defined by four parameters, Δ, A, B, and C, each with a different responsibility:

  • Δ is a learned step-size parameter (trained via backpropagation) that modifies the A and B matrices in a discretization step; a short formula for this step follows the list.
  • The discretized A determines how much of the previous hidden state is carried forward.
  • The discretized B determines how much of the current input enters the hidden state.
  • C determines how the hidden state is transformed into the output.
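
For reference, the discretization used in S4/Mamba is typically the zero-order hold; written compactly with the Δ, A, and B defined above:

```latex
% Zero-order-hold discretization of the continuous parameters A and B
\bar{A} = \exp(\Delta A), \qquad
\bar{B} = (\Delta A)^{-1}\bigl(\exp(\Delta A) - I\bigr)\,\Delta B
```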


With the discretized matrices, SSMs go over the sequence token by token, like a linear RNN, computing a new hidden state for each token. The hidden state for token t and the final output representation are obtained as follows:
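
In standard SSM notation (restated compactly here rather than reproduced from the original figure), with the hidden state initialized to zero:

```latex
% SSM recurrence with discretized parameters, followed by the output projection
h_t = \bar{A}\, h_{t-1} + \bar{B}\, x_t, \qquad y_t = C\, h_t
```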


Historically, Transformers have outperformed RNNs in training speed. Transformers process the entire input sequence simultaneously, enabling parallel computation, while RNNs handle one token at a time sequentially. SSMs, however, offer a blend of both worlds. Although SSMs process tokens sequentially like RNNs at inference, they achieve fast, parallel computation during training. This is because their computations are linear, so the outputs for all tokens in the input sequence can be precomputed and evaluated at once.

In SSMs, the hidden state and output computations can be run in a convolutional mode. The entire computation is folded into a convolution kernel K, and the input vectors are stacked into a matrix X. By convolving K with X, all output tokens are obtained simultaneously and in parallel. During inference, SSMs switch to a recurrent mode, processing tokens one after the other while carrying only a fixed-size hidden state. This dual-mode operation allows SSMs to handle longer sequences more efficiently, avoiding the out-of-memory errors that Transformers might encounter.
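
The toy NumPy sketch below illustrates the equivalence of the two modes for a single-channel SSM with a diagonal Ā. All names are illustrative; this is for intuition, not an optimized implementation.

```python
import numpy as np

def ssm_recurrent(A_bar, B_bar, C, x):
    """Recurrent mode: process tokens one by one (as used at inference)."""
    h = np.zeros_like(A_bar)           # hidden state, shape (d_state,)
    ys = []
    for x_t in x:                      # x: (seq_len,) scalar inputs for one channel
        h = A_bar * h + B_bar * x_t    # h_t = Ā h_{t-1} + B̄ x_t
        ys.append(C @ h)               # y_t = C h_t
    return np.array(ys)

def ssm_convolutional(A_bar, B_bar, C, x):
    """Convolutional mode: precompute the kernel K, then convolve (as used at training)."""
    L = len(x)
    # K[k] = C Ā^k B̄  -- the SSM's response k steps after an impulse.
    K = np.array([C @ (A_bar ** k * B_bar) for k in range(L)])
    # A causal convolution of K with the inputs yields all outputs at once.
    return np.convolve(x, K)[:L]

# Both modes produce the same outputs (up to numerical precision).
rng = np.random.default_rng(0)
d_state = 4
A_bar = rng.uniform(0.5, 0.9, d_state)   # diagonal Ā kept stable (entries < 1)
B_bar = rng.normal(size=d_state)
C = rng.normal(size=d_state)
x = rng.normal(size=16)
assert np.allclose(ssm_recurrent(A_bar, B_bar, C, x),
                   ssm_convolutional(A_bar, B_bar, C, x))
```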


While SSMs excel in speed and memory efficiency, they have traditionally lagged behind Transformers in modeling quality. However, advancements are closing this gap. Roughly, SSMs scale linearly with sequence length and carry a fixed-size state at inference, whereas Transformer attention scales quadratically with sequence length during training and needs a key-value cache that grows with the sequence at inference; Transformers, in turn, have historically delivered stronger results on language tasks.



Introducing the Mamba Architecture:

One of the main reasons why SSMs have historically underperformed compared to Transformers is their uniform application of the same Δ, A, B and C matrices to all inputs, failing to distinguish between them. It's crucial to process input tokens differently, remembering some while forgetting others as needed.

The Mamba architecture enhances the performance of traditional SSMs by integrating selective state spaces. This selective mechanism allows the model to dynamically choose which information to retain or discard based on the input. In Mamba, different Δ, B, and C values are computed for each input token (they become functions of the input), while A itself remains input-independent and is only modulated through the Δ-dependent discretization. This enables the SSM to focus on, or ignore, specific tokens, akin to the attention mechanism in Transformers.
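
A rough sketch of what input-dependent parameters can look like is below. The module, names, and shapes are my own illustrative assumptions, not the official Mamba code: Δ, B, and C come from small linear projections of each token, while A stays a learned, input-independent parameter.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelectiveParams(nn.Module):
    """Produces per-token Δ, B, C (the 'selection' mechanism), plus discretized Ā, B̄."""
    def __init__(self, d_model: int, d_state: int):
        super().__init__()
        self.to_delta = nn.Linear(d_model, d_model)  # per-token step size Δ
        self.to_B = nn.Linear(d_model, d_state)      # per-token input matrix B_t
        self.to_C = nn.Linear(d_model, d_state)      # per-token output matrix C_t
        # A is input-independent: one learned value per (channel, state) pair.
        self.A_log = nn.Parameter(torch.randn(d_model, d_state))

    def forward(self, x):                            # x: (batch, seq_len, d_model)
        delta = F.softplus(self.to_delta(x))         # positive step sizes, per token
        B = self.to_B(x)                             # (batch, seq_len, d_state)
        C = self.to_C(x)                             # (batch, seq_len, d_state)
        A = -torch.exp(self.A_log)                   # keep the continuous A stable (negative)
        # Simplified discretization: Ā_t = exp(Δ_t · A), B̄_t ≈ Δ_t · B_t
        A_bar = torch.exp(delta.unsqueeze(-1) * A)   # (batch, seq_len, d_model, d_state)
        B_bar = delta.unsqueeze(-1) * B.unsqueeze(2) # (batch, seq_len, d_model, d_state)
        return A_bar, B_bar, C
```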

While this selective approach is beneficial, it complicates fast training, since the convolutional trick used previously no longer works when Δ, B, and C vary per token. However, the authors of Mamba present a solution based on a parallel associative scan, which yields the intermediate states and enables linear-time computation, allowing Mamba to remain efficient in both training and inference. The key strategies they suggest are kernel fusion, a parallel scan, and recomputation of intermediate values during the backward pass; a toy sketch of the scan idea follows.
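
To make the scan idea concrete, here is a toy NumPy sketch (my own illustration, scalar recurrence, no GPU kernels). The recurrence h_t = a_t·h_{t−1} + b_t can be expressed with an associative combine over (a, b) pairs, so the sequence dimension can be processed in O(log n) parallel sweeps.

```python
import numpy as np

def combine(left, right):
    """Associative operator on (a, b) pairs, each representing the map h -> a*h + b."""
    a1, b1 = left
    a2, b2 = right
    return a1 * a2, a2 * b1 + b2

def associative_scan(a, b):
    """Inclusive scan; each sweep could run in parallel across the whole sequence."""
    a, b = a.copy(), b.copy()
    n, step = len(a), 1
    while step < n:                       # Hillis–Steele style: O(log n) sweeps
        a_prev, b_prev = a.copy(), b.copy()
        for i in range(step, n):          # this inner loop is what a GPU runs in parallel
            a[i], b[i] = combine((a_prev[i - step], b_prev[i - step]),
                                 (a_prev[i], b_prev[i]))
        step *= 2
    return b                              # b[t] now equals h_t (with h_{-1} = 0)

# Check against the plain sequential recurrence.
rng = np.random.default_rng(0)
a, b = rng.uniform(0.5, 0.9, 8), rng.normal(size=8)
h, hs = 0.0, []
for a_t, b_t in zip(a, b):
    h = a_t * h + b_t
    hs.append(h)
assert np.allclose(associative_scan(a, b), hs)
```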


Mamba's hardware-aware scan implementation makes it significantly faster during both training and inference. This efficiency is achieved without compromising the model's ability to perform complex tasks, making it a robust alternative to Transformers.

Overall, one Mamba block comprises this selective SSM module combined with linear projection layers, a causal 1D convolution layer, and a SiLU activation function. Mamba blocks can be stacked multiple times without any other layers (such as attention or MLP blocks) in between. A sketch of one Mamba block follows:
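
For orientation, here is a PyTorch-style skeleton of the block layout. It is a simplified sketch under my own naming assumptions (the selective SSM internals are stubbed out), not the reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MambaBlock(nn.Module):
    """Simplified Mamba block: in-projection, causal depthwise conv, SiLU, SSM, gate, out-projection."""
    def __init__(self, d_model: int, d_inner: int, d_conv: int = 4):
        super().__init__()
        self.in_proj = nn.Linear(d_model, 2 * d_inner)               # expand into two branches
        self.conv1d = nn.Conv1d(d_inner, d_inner, d_conv,
                                groups=d_inner, padding=d_conv - 1)  # causal depthwise conv
        self.out_proj = nn.Linear(d_inner, d_model)                  # project back to model width

    def selective_ssm(self, u):
        # Placeholder for the selective SSM scan sketched earlier.
        return u

    def forward(self, x):                                        # x: (batch, seq_len, d_model)
        u, gate = self.in_proj(x).chunk(2, dim=-1)               # two parallel branches
        u = self.conv1d(u.transpose(1, 2))[..., : x.shape[1]]    # trim padding to stay causal
        u = F.silu(u.transpose(1, 2))                            # SiLU after the convolution
        y = self.selective_ssm(u)                                # selective SSM on the main branch
        y = y * F.silu(gate)                                     # gate with the second branch
        return self.out_proj(y)                                  # residual connection added by the caller
```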



Key features of Mamba architecture:

  • Selection Mechanism: SSM parameters are input-dependent, allowing selective propagation or forgetting of information.
  • Hardware-Aware Algorithm: Efficiently utilizes GPU memory hierarchy, leading to faster computations.
  • Simplified Architecture: Combines SSMs with minimalistic design, eliminating the need for attention or MLP blocks.


This innovation of Mamba in SSMs has demonstrated that an SSM can be as performant as a Transformer while having fewer parameters and lower latency. It has also been shown to work at sequence lengths where the Transformer architecture runs out of memory. Several results reported by the authors can be found in the paper: https://arxiv.org/pdf/2312.00752



The advancement of Structured State Space Models, exemplified by the Mamba architecture, marks a significant milestone in sequence modeling. With their speed, memory efficiency, and increasingly competitive performance, SSMs, particularly Mamba, are poised to complement and, in some cases, surpass Transformers. As the field continues to evolve, the potential for SSMs to become a foundational architecture in machine learning looks promising.


I'm excited about the advancements in SSMs, as seen with the Mamba architecture and beyond, and look forward to their impact on the future of machine learning.

