MiniMax-01: Scaling Foundation Models with Lightning Attention

Today's paper introduces MiniMax-01, a series of foundation models that achieve performance comparable to top-tier models while offering significantly longer context windows. The models combine lightning attention with a Mixture of Experts (MoE) architecture to efficiently process sequences of up to 4 million tokens. The paper presents comprehensive details on architecture design, computation optimization, and training.

Method Overview

The approach combines lightning attention, a linear attention variant, with a Mixture of Experts (MoE) architecture to create an efficient and scalable model. The architecture uses a hybrid design in which one transformer block with softmax attention follows every seven TransNormer blocks with lightning attention, as sketched below.
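To make the layer pattern concrete, here is a minimal sketch of how such a hybrid stack could be laid out. The block names and the helper function are illustrative placeholders, not the authors' actual implementation.

```python
# Minimal sketch of the hybrid layer pattern: seven lightning-attention
# blocks followed by one softmax-attention block, repeated through the stack.
# Block names and this helper are illustrative placeholders only.

def build_hybrid_layer_plan(num_layers: int, period: int = 8) -> list[str]:
    plan = []
    for i in range(num_layers):
        # Every `period`-th block uses softmax attention; the rest use
        # lightning (linear) attention.
        if (i + 1) % period == 0:
            plan.append("softmax_attention")
        else:
            plan.append("lightning_attention")
    return plan

print(build_hybrid_layer_plan(16))
# seven 'lightning_attention' entries, one 'softmax_attention', then repeats
```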

The MiniMax-Text-01 model has 32 experts and contains 456 billion total parameters, of which 45.9 billion are activated for each token. This design allows for efficient processing of long sequences while maintaining high performance. The implementation includes optimized parallel strategies and computation-communication overlap techniques specifically designed for MoE and lightning attention.
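For intuition on why only about a tenth of the parameters are active per token, here is a rough sketch of token-level top-k expert routing. The 32-expert count comes from the paper; top_k = 2, the tiny dimensions, and the renormalized gating are assumptions made purely for illustration.

```python
import numpy as np

# Rough sketch of token-level top-k expert routing in an MoE layer.
# 32 experts matches the paper; top_k, dimensions, and gating details
# are illustrative assumptions.
rng = np.random.default_rng(0)
num_experts, d_model, top_k = 32, 64, 2

tokens = rng.standard_normal((10, d_model))             # 10 example tokens
router_w = rng.standard_normal((d_model, num_experts))  # router projection

logits = tokens @ router_w                               # (10, 32) scores
top_idx = np.argsort(logits, axis=-1)[:, -top_k:]        # chosen experts per token
top_gate = np.take_along_axis(logits, top_idx, axis=-1)
top_gate = np.exp(top_gate) / np.exp(top_gate).sum(-1, keepdims=True)

# Each token is sent only to its top_k experts, so only a fraction of the
# total expert parameters is touched per token.
print(top_idx[0], top_gate[0].round(3))
```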

To support training and inference with such long contexts, the paper introduces several optimization techniques. These include varlen ring attention to reduce computation redundancy and an improved version of Linear Attention Sequence Parallelism (LASP). The system also implements specialized CUDA kernels for lightning attention inference, achieving over 75% model FLOPs utilization (MFU) on Nvidia H20 GPUs.
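These optimizations build on the fact that causal linear attention can be computed with a fixed-size running state rather than a growing attention matrix. The snippet below is a naive, loop-based illustration of that recurrence (decay and normalization terms omitted); the actual lightning attention kernels use a tiled intra/inter-block scheme in fused CUDA code, and sequence parallelism such as LASP only needs to pass this fixed-size state between devices.

```python
import numpy as np

# Naive illustration of the causal linear-attention recurrence that
# lightning attention computes efficiently (decay/normalization omitted).
# Cost grows linearly with sequence length because the running state
# kv_state has a fixed (d, d) size.
def linear_attention(q, k, v):
    d = q.shape[-1]
    kv_state = np.zeros((d, d))
    out = np.zeros_like(v)
    for t in range(q.shape[0]):
        kv_state += np.outer(k[t], v[t])   # accumulate k_t^T v_t
        out[t] = q[t] @ kv_state           # o_t = q_t * sum_{s<=t} k_s^T v_s
    return out

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((8, 4)) for _ in range(3))
print(linear_attention(q, k, v).shape)  # (8, 4)
```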

The training process involves careful data curation, quality enhancement through reward-based filtering, and a three-stage training procedure that extends the context window to one million tokens. For vision capabilities, a lightweight Vision Transformer (ViT) module is integrated through continued training on 512 billion vision-language tokens.

Results

The MiniMax-01 models perform comparably to leading commercial models on standard benchmarks while offering context windows 20-32 times longer. They are particularly strong on contexts longer than 200,000 tokens. The vision-language model, MiniMax-VL-01, achieves competitive performance on multimodal benchmarks.

Conclusion

The paper demonstrates that it is possible to build foundation models that support extremely long context windows while remaining competitive with top-tier models. The combination of lightning attention and an MoE architecture, together with the various system optimizations described above, enables efficient processing of sequences of up to 4 million tokens. For more information, please consult the full paper.

Congrats to the authors for their work!

Citation: MiniMax. "MiniMax-01: Scaling Foundation Models with Lightning Attention." arXiv:2501.08313v1 [cs.CL], 14 Jan 2025.
