Prompt Compression in Large Language Models
Introduction
In the landscape of large language models like GPT-4, prompt length is a critical consideration. The currency of the LLM world is 'tokens': each word or piece of information is represented as one or more tokens, and the number of tokens in a prompt dictates the computational load of processing it. Much as high-resolution images demand substantial storage and processing because they are made up of many pixels, lengthy prompts increase computational requirements and hence the cost of each LLM request. Prompt compression is therefore not just a parallel to image or audio compression but a necessity: it reduces the size of the input data (the prompt) while preserving its essential meaning, improving response efficiency and reducing the computational resources consumed. This is vital for the practical application of these models, keeping them fast and cost-effective in real-world scenarios.
Methodologies in Prompt Compression
The paper LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models introduces methodologies for prompt compression in language models like GPT-4. These methods make prompts shorter and easier for the models to process while keeping the crucial information, and they consist of several strategies:
Budget Controller:
This technique divides the prompt into its component parts (such as instructions, examples, and questions) and decides how much each part should be compressed. Crucial sections like the instruction and the question are compressed less than potentially redundant demonstrations. This selective compression is analogous to variable bitrate in audio compression, which concentrates quality where it matters most.
The Budget Controller is designed to dynamically allocate different compression ratios to the various components of a prompt, such as instructions, demonstrations, and questions. This allocation occurs at the sentence or demonstration level, as described below.
For this coarse-grained compression, a small language model, such as GPT-2 or LLaMA, calculates the perplexity of each demonstration, helping to determine which parts of the prompt are most essential and how to compress them while maintaining the overall integrity and effectiveness of the prompt. Demonstrations are then selected in descending order of their perplexity values: those that are more complex or informative (higher perplexity) are prioritized for retention in the compressed prompt, while more predictable ones are dropped first.
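A minimal sketch of this selection step is below, using GPT-2 from Hugging Face transformers as the small scoring model. The function names and the simple token-budget rule are illustrative assumptions, not the paper's exact implementation.

```python
# Sketch: rank demonstrations by perplexity under a small LM (GPT-2 here)
# and keep the highest-perplexity ones until a token budget is exhausted.
# The budgeting rule is a simplification, not LLMLingua's exact code.
import math
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    """Perplexity of `text` under the small LM (lower = more predictable)."""
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # Passing labels=input_ids returns the mean cross-entropy over the sequence
        loss = model(**enc, labels=enc["input_ids"]).loss
    return math.exp(loss.item())

def select_demonstrations(demos: list[str], token_budget: int) -> list[str]:
    """Keep demonstrations in descending perplexity order until the budget is used up."""
    ranked = sorted(demos, key=perplexity, reverse=True)
    selected, used = [], 0
    for demo in ranked:
        n_tokens = len(tokenizer(demo)["input_ids"])
        if used + n_tokens > token_budget:
            continue
        selected.append(demo)
        used += n_tokens
    return selected
```

The intuition behind ranking by perplexity is that highly predictable demonstrations carry little extra information for the large model, so they are the safest candidates to drop.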
Iterative Token-Level Prompt Compression:
This is a step-by-step process in which the prompt is broken into segments and each segment is compressed while maintaining semantic integrity. The challenge it addresses is preserving the contextual relationship between tokens, akin to ensuring that key frequencies are retained in audio compression.
The Iterative Token-level Prompt Compression (ITPC) algorithm described in the paper proceeds roughly as follows: the coarsely compressed prompt is split into segments; the small language model computes the perplexity of each token in the current segment, conditioned on the segments already processed; tokens whose perplexity falls below a threshold derived from the target compression ratio are dropped, since they are largely predictable from context; and the compressed segment is appended to the running context before the next segment is handled.
This iterative process ensures that the final compressed prompt retains the essential information and structure required for the large language model to understand and respond effectively, while significantly reducing the size of the input.
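A minimal sketch of the token-level idea follows, again using GPT-2 as the small scoring model. The segmentation, the fixed keep ratio, and the helper names are simplifying assumptions rather than the paper's exact algorithm, which computes its threshold dynamically.

```python
# Sketch of iterative token-level compression: score each token's surprisal
# under a small LM, conditioned on what has already been kept, and drop the
# most predictable tokens segment by segment.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def compress_segment(segment: str, context: str, keep_ratio: float = 0.5) -> str:
    """Keep the highest-surprisal tokens of `segment`, given the compressed `context`."""
    if not context:
        context = tokenizer.bos_token  # give the first segment something to condition on
    ctx_ids = tokenizer(context, return_tensors="pt")["input_ids"]
    seg_ids = tokenizer(segment, return_tensors="pt")["input_ids"]
    input_ids = torch.cat([ctx_ids, seg_ids], dim=1)
    with torch.no_grad():
        logits = model(input_ids).logits
    # Surprisal (negative log-probability) of every token given the tokens before it
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    targets = input_ids[0, 1:]
    surprisal = -log_probs[torch.arange(targets.size(0)), targets]
    seg_surprisal = surprisal[-seg_ids.size(1):]  # scores for the segment's tokens only
    k = max(1, int(keep_ratio * seg_ids.size(1)))
    keep = torch.topk(seg_surprisal, k).indices.sort().values  # preserve token order
    return tokenizer.decode(seg_ids[0, keep])

def iterative_compress(segments: list[str], keep_ratio: float = 0.5) -> str:
    """Compress segments one at a time, conditioning each on the compressed prefix so far."""
    compressed = ""
    for seg in segments:
        compressed = (compressed + " " + compress_segment(seg, compressed, keep_ratio)).strip()
    return compressed
```

Conditioning each segment on the already-compressed prefix is what keeps the token-level decisions aware of the surrounding context rather than scoring tokens in isolation.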
Distribution Alignment:
The paper introduces a concept known as "Distribution Alignment". This concept is a key step in bridging the gap between the compressed prompt and the expectations of the language model. When compressing prompts, there is a risk that the reduced version might not align well with the distribution patterns the language model is accustomed to. This misalignment can lead to inefficiencies or inaccuracies in how the model processes the compressed prompt.
To address this, the paper proposes a method to align the distribution of the compressed prompt with that of the language model. This is achieved through 'instruction tuning,' a process where a pre-trained small language model is instruction-tuned using data generated by the larger language model.
By aligning the distributions, the compressed prompts are better understood and processed by the language model. This alignment is essential for maintaining the effectiveness of the compression, ensuring that the language model continues to generate accurate and contextually relevant responses, despite the reduced prompt size.
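A minimal sketch of what such instruction tuning could look like in practice follows: the larger model generates instruction-response pairs, and the small model is fine-tuned on them so that its token distribution moves closer to the target LLM. The placeholder data, prompt formatting, and hyperparameters are all illustrative assumptions.

```python
# Sketch of distribution alignment via instruction tuning: fine-tune a small LM
# on (instruction, response) pairs produced by the large target LLM.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# In practice these pairs would be generated by the large target LLM;
# the examples here are placeholders.
pairs = [
    {"instruction": "Summarize: The meeting covered Q3 revenue and the hiring plan.",
     "response": "Q3 revenue and hiring were discussed."},
    {"instruction": "Answer briefly: What is prompt compression for?",
     "response": "Reducing prompt length while keeping its essential meaning."},
]

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
model.train()
for epoch in range(3):
    for pair in pairs:
        text = f"{pair['instruction']}\n{pair['response']}{tokenizer.eos_token}"
        batch = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
        # Standard causal-LM objective: the small model learns to reproduce the
        # larger model's outputs, pulling its distribution toward the target LLM.
        out = model(input_ids=batch["input_ids"],
                    attention_mask=batch["attention_mask"],
                    labels=batch["input_ids"])
        out.loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```

Once aligned, the small model's perplexity scores become a better proxy for what the large model actually finds predictable, which is what the compression steps above rely on.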
Findings and Implications of Prompt Compression Techniques
The paper presents results demonstrating the effectiveness of these prompt compression techniques with large language models. In the reported experiments, prompts could be compressed by up to roughly 20x with little loss in downstream performance on reasoning and in-context learning benchmarks, which translates directly into lower latency and lower per-request token cost.
The paper concludes that prompt compression is not just a technical achievement but a necessary step towards making advanced language models more accessible and usable in diverse settings.
Conclusion
In conclusion, the paper is a key first step in the field of prompt compression. By effectively reducing prompt length while maintaining the integrity of information, this research paves the way for more efficient and cost-effective use of LLMs. The methodologies proposed offer a baseline for future research and applications, highlighting the importance of data efficiency in the ever-evolving landscape of Generative AI.
Researchers: Huiqiang Jiang, Qianhui Wu, Chin-Yew Lin, Yuqing Yang, Lili Qiu