15x Faster than Llama 2: DeciLM, a NAS-Generated LLM with Variable GQA
Deci AI (Acquired by NVIDIA)
Deci enables deep learning to live up to its true potential by using AI to build better AI.
1. Introduction
As the deep learning community continues to push the boundaries of Large Language Models (LLMs), the computational demands of these models have surged exponentially for both training and inference. This escalation has not only led to increased costs and energy consumption but also introduced barriers to their deployment and scalability. Achieving a balance between model performance, computational efficiency, and latency has thus become a focal point in recent LLM development.
Within this landscape, we are thrilled to introduce DeciLM 6B, a permissively licensed, open-source foundation LLM, and DeciLM 6B-Instruct, fine-tuned from DeciLM 6B for instruction-following use cases. With 5.7 billion parameters, DeciLM 6B delivers a throughput that's 15 times higher than Llama 2 7B while maintaining comparable quality. Impressively, despite having significantly fewer parameters, DeciLM 6B and DeciLM 6B-Instruct consistently rank among the top-performing open-source LLMs in the 7 billion parameter category across various LLM evaluation tasks. Our models thus establish a new benchmark for inference efficiency and speed. The hallmark of DeciLM 6B lies in its unique architecture, generated using AutoNAC, Deci's cutting-edge Neural Architecture Search engine, to push the efficient frontier. Moreover, coupling DeciLM 6B with Deci’s inference SDK results in a substantial throughput enhancement.
The implications of such a significant increase in inference efficiency are far-reaching. They include a better user experience for applications built on top of the model, a meaningful reduction in inference cost due to the ability to run the models on more affordable GPUs, and a sizable reduction in carbon footprint.
DeciLM, with its remarkable evaluation results on standard LLM evaluation benchmarks, can be used in a wide range of generative AI applications—from support chatbots to content creation assistants, and more. The significant efficiency enhancement from DeciLM greatly elevates the viability and scalability of these generative AI applications.
In the subsequent sections, we delve into the technical intricacies that underlie DeciLM, elucidate the novel attention mechanisms it employs, and present empirical results that attest to its superior performance and efficiency.
2. DeciLM’s Architectural Innovations
DeciLM’s decoder-only transformer architecture features a unique implementation of variable Grouped-Query Attention (GQA). Unlike other transformer models utilizing GQA, such as Llama 2 70B, which maintains consistent attention groups per transformer layer, DeciLM varies the number of attention groups, keys, and values across transformer layers. To the best of our knowledge, DeciLM is the first language model architecture where the transformer layers are not structural duplicates of one another.
2.1 Variable Grouped-Query Attention
Background
The 2017 paper “Attention Is All You Need” introduced the Transformer architecture, reshaping sequence-to-sequence modeling. One of its standout features, Multi-Head Attention (MHA), uses multiple parallel “heads.” Each attention head has its own distinct set of weight matrices that produce queries (Q), keys (K), and values (V) from the input data. The intuition is that each head captures different aspects or relationships in the data.
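To make this concrete, here is a minimal PyTorch sketch of standard multi-head attention. It is illustrative only: the dimensions and module names are assumptions for the example, not DeciLM’s actual implementation.

```python
import torch
import torch.nn.functional as F
from torch import nn

class MultiHeadAttention(nn.Module):
    """Standard MHA: every query head has a matching key and value head."""
    def __init__(self, d_model: int = 4096, n_heads: int = 32):
        super().__init__()
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        # Full-width Q, K, V projections, reshaped into n_heads heads below.
        self.q_proj = nn.Linear(d_model, d_model, bias=False)
        self.k_proj = nn.Linear(d_model, d_model, bias=False)
        self.v_proj = nn.Linear(d_model, d_model, bias=False)
        self.o_proj = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, _ = x.shape
        # (batch, seq, d_model) -> (batch, n_heads, seq, d_head)
        q = self.q_proj(x).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        k = self.k_proj(x).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        v = self.v_proj(x).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.o_proj(out.transpose(1, 2).reshape(b, t, -1))
```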
Multi-Query Attention
Despite MHA’s effectiveness, its increased parameters amplify computational and memory demands, especially in larger models. To mitigate this while preserving attention’s expressiveness, researchers introduced Multi-Query Attention (MQA). MQA shares a single key and value head across all query heads, reducing the attention mechanism’s parameter count and memory footprint and speeding up inference.
Grouped-Query Attention
While MQA increases the computational and memory efficiency of the model, it does lead to quality degradation. GQA was introduced as an enhancement over MQA, designed specifically to offer a superior balance between memory and computational efficiency and model quality. Instead of a single shared key/value head, GQA partitions the query heads into groups, with each group sharing one key and value head.
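The sketch below illustrates the idea: a single `n_kv_heads` parameter controls how many key/value heads exist, so `n_kv_heads == n_heads` recovers MHA, `n_kv_heads == 1` is MQA, and anything in between is GQA. As before, this is an illustrative sketch under assumed dimensions, not DeciLM’s implementation.

```python
import torch
import torch.nn.functional as F
from torch import nn

class GroupedQueryAttention(nn.Module):
    """Attention with n_kv_heads key/value heads shared across query heads.
    n_kv_heads == n_heads -> MHA, n_kv_heads == 1 -> MQA, in between -> GQA."""
    def __init__(self, d_model: int = 4096, n_heads: int = 32, n_kv_heads: int = 4):
        super().__init__()
        assert n_heads % n_kv_heads == 0
        self.n_heads, self.n_kv_heads = n_heads, n_kv_heads
        self.d_head = d_model // n_heads
        self.q_proj = nn.Linear(d_model, n_heads * self.d_head, bias=False)
        # Fewer K/V projections: this is where the parameter and KV-cache savings come from.
        self.k_proj = nn.Linear(d_model, n_kv_heads * self.d_head, bias=False)
        self.v_proj = nn.Linear(d_model, n_kv_heads * self.d_head, bias=False)
        self.o_proj = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, _ = x.shape
        q = self.q_proj(x).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        k = self.k_proj(x).view(b, t, self.n_kv_heads, self.d_head).transpose(1, 2)
        v = self.v_proj(x).view(b, t, self.n_kv_heads, self.d_head).transpose(1, 2)
        # Broadcast each K/V head to the query heads in its group.
        rep = self.n_heads // self.n_kv_heads
        k = k.repeat_interleave(rep, dim=1)
        v = v.repeat_interleave(rep, dim=1)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.o_proj(out.transpose(1, 2).reshape(b, t, -1))
```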
Variable Grouped-Query Attention in DeciLM
With DeciLM, we introduce Variable GQA to further optimize the tradeoff between efficiency and model quality. Unlike other models, such as Llama 2 70B, that consistently apply the same number of groups across all layers, DeciLM introduces variability in its approach. Specifically, while maintaining 32 queries/heads per layer, DeciLM’s layers vary in their GQA group parameter.
This strategic layer-specific variation is pivotal. By tailoring the grouping to each layer’s unique requirements, DeciLM strikes an optimal balance between inference speed and the quality of the model’s outputs. It capitalizes on both the computational and memory efficiency of grouped attention and the nuanced understanding derived from diverse attention patterns.
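As a rough illustration of how per-layer variability could be expressed, the snippet below reuses the `GroupedQueryAttention` sketch above with a different group count per layer. The specific group counts shown are hypothetical; DeciLM’s actual per-layer values were selected by AutoNAC and are not listed in this post.

```python
from torch import nn

# Hypothetical per-layer KV-group counts, one entry per transformer layer.
# DeciLM's real configuration is determined by AutoNAC and is not reproduced here.
layer_kv_heads = [4, 2, 1, 4, 2, 2, 1, 4]

attention_layers = nn.ModuleList(
    GroupedQueryAttention(d_model=4096, n_heads=32, n_kv_heads=g)
    for g in layer_kv_heads
)
```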
2.2. AutoNAC: The Engine Behind DeciLM’s Architectural Excellence
The architecture of DeciLM was generated using Deci’s proprietary Neural Architecture Search (NAS) engine, AutoNAC. Traditional NAS methods, albeit promising, are computationally intensive. AutoNAC addresses this challenge by automating the search process in a compute-efficient manner. The engine has been instrumental in generating a wide range of high-efficiency foundation models, including the state-of-the-art object detection model YOLO-NAS, the text-to-image model DeciDiffusion, and the code generation LLM, DeciCoder. In the case of DeciLM, AutoNAC was pivotal in selecting the optimal GQA group parameter for each transformer layer of the model.
3. DeciLM’s Training
DeciLM 6B was trained on a subset of the SlimPajama dataset, leveraging advanced proprietary methodologies that allow for fast training. The SlimPajama dataset is an extensively deduplicated, multi-corpus open-source dataset, publicly available on Hugging Face under the Apache 2.0 license. DeciLM 6B then underwent LoRA fine-tuning, resulting in DeciLM 6B-Instruct.
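For readers who want a sense of what LoRA fine-tuning looks like in practice, here is a minimal sketch using the Hugging Face `peft` library. The repository id, rank, and target module names are assumptions for illustration, not Deci’s published fine-tuning recipe.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Illustrative LoRA setup; the repo id, rank, and target modules are assumptions.
base = AutoModelForCausalLM.from_pretrained("Deci/DeciLM-6b", trust_remote_code=True)

lora_cfg = LoraConfig(
    r=16,                      # low-rank adapter dimension (assumed)
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # names depend on the model code
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # only the low-rank adapters are trainable
```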
4. Performance Analysis: DeciLM’s Standout Metrics Among Peers
We benchmarked DeciLM 6B and DeciLM 6B-Instruct against leading open-source models in the 7-billion parameter class, such as Llama 2 7B, Llama 2 7B-Chat, Falcon 7B, and MPT-Instruct. Even with significantly fewer parameters, DeciLM 6B-Instruct clinched the third spot, trailing Llama 2 7B by just under a percentage point.
5. DeciLM’s Trailblazing Inference Efficiency
5.1 Comparative Analysis: DeciLM 6B vs. Llama 2 7B
To assess the efficiency enhancements brought by DeciLM’s architecture, we performed a side-by-side PyTorch comparison with Llama 2 7B, utilizing an NVIDIA A10G GPU for the tests. Our tests show that DeciLM is both more memory efficient and has a higher throughput.
Memory Efficiency: DeciLM’s architecture showcases enhanced memory efficiency. Our empirical observations indicated that DeciLM optimally processes data at a batch size of 64. In comparison, Llama 2 7B demonstrated a maximum batch size of 8. DeciLM’s capability to handle larger batch sizes translates into more efficient inference with better utilization of the GPU hardware, unhindered by the memory constraints that limited previous models.
Throughput: DeciLM 6B’s throughput (tokens per second, measured with an optimal batch size on an A10G) is 4.8 times that of Llama 2 7B.
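A rough way to reproduce this kind of tokens-per-second measurement with Hugging Face Transformers is sketched below. The model id, batch size, and generation settings are illustrative assumptions rather than the exact benchmark configuration used above.

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Rough tokens-per-second measurement; model id, batch size, and sequence
# lengths are illustrative, not the exact benchmark configuration.
model_id = "Deci/DeciLM-6b"
tok = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
if tok.pad_token is None:
    tok.pad_token = tok.eos_token
tok.padding_side = "left"  # left padding for decoder-only generation

model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, trust_remote_code=True
).cuda()

prompts = ["Write a short story about efficient language models."] * 64  # batch of 64
inputs = tok(prompts, return_tensors="pt", padding=True).to("cuda")

torch.cuda.synchronize()
start = time.perf_counter()
out = model.generate(**inputs, max_new_tokens=256, do_sample=False)
torch.cuda.synchronize()
elapsed = time.perf_counter() - start

# Count generated tokens across the batch (padding on finished sequences makes this approximate).
new_tokens = (out.shape[-1] - inputs["input_ids"].shape[-1]) * out.shape[0]
print(f"{new_tokens / elapsed:.1f} tokens/sec")
```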
5.2 Infery-LLM: The Ultimate Turbo Boost for Large Language Model Inference
Background:
Infery-LLM is a specialized inference SDK developed by Deci, designed to enhance the computational processing of Large Language Models. Built upon advanced engineering techniques such as selective quantization and hybrid compilation, and incorporating proprietary optimized kernels, fast beam search, and more, Infery-LLM stands out as an instrumental tool for LLM inference acceleration.
Throughput Boost and Incomparable Efficiency:
When applied to LLMs such as DeciLM, Infery-LLM consistently delivers superior performance outcomes. The architecture and optimization strategies employed by Infery-LLM ensure that models achieve peak efficiency without compromising output quality. For instance, when running with Infery-LLM, DeciLM 6B’s throughput is 15 times that of Llama 2 7B on NVIDIA’s A10G GPU.
Cost and Environmental Implications:
The efficiency achieved through the combination of DeciLM and Infery-LLM results in a staggering reduction in inference costs. First, it enables you to migrate workloads from pricey A100/H100 GPUs to the more affordable A10G. Second, when running on the same hardware, DeciLM’s cost per 1k tokens is 95% lower than that of Llama 2 7B.
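As a back-of-the-envelope illustration of how throughput translates into cost per 1k tokens, consider the sketch below; the hourly GPU price and throughput values are placeholders, not measured figures.

```python
# Back-of-the-envelope cost per 1k tokens; the hourly GPU price and throughput
# numbers here are illustrative placeholders, not measured figures.
gpu_price_per_hour = 1.00   # e.g. an assumed A10G on-demand rate, in dollars
tokens_per_second = 1000    # assumed sustained throughput

tokens_per_hour = tokens_per_second * 3600
cost_per_1k_tokens = gpu_price_per_hour / (tokens_per_hour / 1000)
print(f"${cost_per_1k_tokens:.5f} per 1k tokens")
# At a fixed hourly price, a higher-throughput model divides this figure accordingly.
```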
This efficiency also translates into a commendable environmental impact, cutting the carbon footprint by 516 kg of CO2 per model instance yearly on the A10G GPU.
What This Means for Generative AI Use Cases:
For applications where real-time response and high throughput are paramount, the integration of Infery-LLM proves invaluable, with the potential to drastically improve user experience. Additionally, it provides a streamlined approach to unlocking the full efficiency of DeciLM and comparable LLMs.
6. DeciLM’s Availability to the Open-Source Community
In alignment with our commitment to propel the wider adoption of efficient models, Deci is proud to release DeciLM to the open-source community. Our intent is to ensure it remains easily accessible and user-friendly, fostering a culture of learning and innovation. DeciLM is available for free download and is permissively licensed under the Llama 2 Community License Agreement. We encourage researchers, developers, and enthusiasts to leverage this state-of-the-art foundation model in their work.
Conclusion
DeciLM 6B signifies a landmark development in the realm of LLMs, paving the way for models that can seamlessly integrate into real-world applications without taxing computational resources. Its innovative design sets new benchmarks and provides valuable insights for future explorations in the domain of efficient AI modeling.
To learn more about Infery-LLM and how you can use it to accelerate DeciLM, book a demo.
This article was first published on the Deci website.