Full Fine-Tuning, PEFT, Prompt Engineering, and RAG: Which One Is Right for You?
Deci AI (Acquired by NVIDIA)
Deci enables deep learning to live up to its true potential by using AI to build better AI.
This blog was first published here.
Introduction
Since the introduction of ChatGPT, businesses everywhere have eagerly sought to leverage Large Language Models (LLMs) to elevate their operations and offerings. The potential of LLMs is vast, but there’s a catch: even the most powerful pretrained LLM might not always meet your specific needs right out of the box. It may lack your domain’s specialized terminology, have no knowledge of information that emerged after its training cutoff or that lives in your private documents, and it can hallucinate answers to questions it can’t actually answer.
So, how do you make an LLM align with your unique requirements? You’ll likely need to tweak or “tune” it. Currently, there are four prominent tuning methods: full fine-tuning, parameter-efficient fine-tuning (PEFT), prompt engineering, and retrieval augmented generation (RAG).
They vary in expertise needed, cost, and suitability for different scenarios. This article explores each, shedding light on their nuances, costs, and optimal use cases. Dive in to discover which method is the best fit for your project.
Full fine-tuning
Fine-tuning is the process we use to further train an already pretrained LLM on a smaller, task-specific, labeled dataset. In this way, we adjust some of the model parameters to optimize its performance for a particular task or set of tasks. In full fine-tuning, all the model parameters are updated, making it similar to pretraining, except that it’s done on a labeled and much smaller dataset.
Full fine-tuning in 6 steps
To illustrate, let’s say we want to build a tool that generates abstracts for biotechnology research papers. For full fine-tuning, you’ll need to go through the following steps (a minimal training sketch follows the list):

1. Create the dataset: collect research papers paired with their abstracts.
2. Preprocess the data: clean and tokenize the text into model-ready inputs.
3. Configure the model: load the pretrained weights and set the training hyperparameters.
4. Train the model: update all of the model’s parameters on the task-specific dataset.
5. Evaluate performance: measure the quality of the generated abstracts on a held-out set.
6. Iterate until performance is satisfactory: adjust the data or hyperparameters and retrain.
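To make these steps concrete, here is a rough sketch of what steps 2 through 4 might look like using the Hugging Face transformers library. The base model, dataset file, and hyperparameters are hypothetical placeholders chosen for illustration, not recommendations:

```python
# A minimal full fine-tuning sketch with Hugging Face transformers.
# Base model, file name, and hyperparameters are illustrative placeholders.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("gpt2")     # placeholder base model
tokenizer.pad_token = tokenizer.eos_token             # GPT-2 has no pad token
model = AutoModelForCausalLM.from_pretrained("gpt2")  # ALL weights get updated

# Steps 1-2: a hypothetical JSONL file where each record looks like
# {"text": "<paper body> ABSTRACT: <abstract>"}
dataset = load_dataset("json", data_files="biotech_abstracts.jsonl")["train"]

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=1024)

tokenized = dataset.map(tokenize, batched=True,
                        remove_columns=dataset.column_names)

# Steps 3-4: configure training and update every parameter of the model.
args = TrainingArguments(output_dir="ft-biotech-abstracts",
                         num_train_epochs=3,
                         per_device_train_batch_size=2,
                         learning_rate=2e-5)
trainer = Trainer(model=model, args=args, train_dataset=tokenized,
                  data_collator=DataCollatorForLanguageModeling(tokenizer,
                                                                mlm=False))
trainer.train()  # Steps 5-6: evaluate on held-out data and iterate from here
```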
What makes full fine-tuning worthwhile?
Requires less data than training from scratch
Full fine-tuning can be effective even with a relatively small task-specific dataset. The pretrained LLM already understands general language constructs. The fine-tuning process focuses primarily on adjusting the model’s knowledge to the specificities of the new data. A pretrained LLM, initially trained on roughly 1 trillion tokens and demonstrating solid general performance, can be efficiently fine-tuned with just a few hundred examples, equivalent to several hundred thousand tokens.
Enhanced accuracy
By fine-tuning on a task-specific dataset, the LLM can grasp the nuances of that particular domain. This is especially vital in areas with specialized jargon, concepts, or structures, such as legal documents, medical texts, or financial reports. As a result, when faced with unseen examples from the specific domain or task, the model is likely to make predictions or generate outputs with higher accuracy and relevance.
Increased robustness
Fine-tuning allows us to expose the model to more examples, especially edge cases or less common scenarios in the domain-specific dataset. This makes the model better equipped to handle a wider variety of inputs without producing erroneous outputs.
The downsides of full fine-tuning
High computational costs
Full fine-tuning involves updating all the parameters of a large model. With large-scale LLMs boasting tens or hundreds of billions of parameters, training requires enormous amounts of computing power. Even when the fine-tuning dataset is relatively small, the number of tokens can be huge and expensive to compute.
Substantial memory requirements
Working with large models can necessitate specialized hardware, such as high-end GPUs or TPUs, with significant memory capacities. This is often impractical for many businesses.
Time & expertise intensive
When the model is very large, you often need to distribute the computation over multiple GPUs and nodes. This requires appropriate expertise. Depending on the size of the model and the dataset, fine-tuning can take hours, days, or even weeks.
Parameter-efficient fine-tuning
Parameter-efficient fine-tuning (PEFT) uses techniques to further tune a pretrained model by updating only a small number of its total parameters. A large language model pretrained on vast amounts of data has already learned a broad spectrum of language constructs and knowledge. In short, it already possesses much of the information needed for many tasks. Given the narrower scope of a downstream task, it’s often unnecessary and inefficient to adjust the entire model, so the fine-tuning process is conducted on a small set of parameters.
PEFT methods vary in their approach to determining which components of the model are trainable. Some techniques prioritize training select portions of the original model’s parameters. Others integrate and train smaller additional components, like adapter layers, without modifying the original structure.
LoRA
LoRA, short for Low-Rank Adaptation of Large Language Models, was introduced in 2021. It has since become the most commonly used PEFT method, helping companies and researchers reduce their fine-tuning costs. Using reparameterization, this technique downsizes the set of trainable parameters by performing low-rank approximation.
For example, if we have a 100,000 x 100,000 weight matrix, full fine-tuning requires updating all 10,000,000,000 of its parameters. Using LoRA, we can capture all or most of the crucial information in the weight update with a far smaller set of trainable parameters.

To get this low-rank update, we reparametrize the weight update as the product of two matrices, A and B, with shared inner dimension r. If r = 2, A is 100,000 x 2 and B is 2 x 100,000, so we update (100,000 x 2) + (2 x 100,000) = 400,000 parameters instead of 10,000,000,000. By updating a much smaller number of parameters, we reduce the computational and memory requirements needed for fine-tuning. The original LoRA paper illustrates this reparameterization with a diagram of the frozen weight matrix alongside the trainable matrices A and B.
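To make the arithmetic concrete, here is a quick sanity check in plain Python:

```python
# Parameter counts for the LoRA example above.
d = 100_000            # rows and columns of the square weight matrix
full = d * d           # full fine-tuning updates 10,000,000,000 parameters
r = 2                  # LoRA rank
lora = d * r + r * d   # A is (d x r), B is (r x d): 400,000 parameters
print(f"{full:,} vs {lora:,} -> {full // lora:,}x fewer")  # 25,000x fewer
```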
The following are some of the advantages of LoRA:

Drastically fewer trainable parameters – Training updates only the small A and B matrices, cutting compute and GPU memory requirements.

No added inference latency – After training, the product of A and B can be merged back into the original weights, so the adapted model runs as fast as the base model.

Cheap task switching – The pretrained weights stay frozen, so a single base model can be shared across tasks, with only a small set of LoRA weights stored per task.
Because it’s still relatively new, LoRA’s efficacy in situations where you need to fine-tune a model on multiple tasks is not yet well established. In such situations, the pre-trained LLM needs to be fine-tuned on each task sequentially, and it remains to be seen whether LoRA can maintain the accuracy of full fine-tuning. A minimal configuration sketch follows; for a step-by-step guide on using LoRA to fine-tune Llama 2 7B, read our technical blog post.
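As an illustration of how little code this takes in practice, here is a minimal sketch using Hugging Face’s peft library. The base model, target modules, and hyperparameters are placeholders, not recommendations:

```python
# Wrap a pretrained model with LoRA adapters using the peft library.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("gpt2")  # placeholder base model

config = LoraConfig(
    r=8,                        # rank of the A and B matrices
    lora_alpha=16,              # scaling factor applied to the low-rank update
    target_modules=["c_attn"],  # weight matrices to adapt (GPT-2 attention)
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, config)
model.print_trainable_parameters()  # reports trainable vs. total parameters

# The wrapped model trains with the same Trainer loop as full fine-tuning,
# but only the small adapter matrices receive gradient updates.
```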
Where PEFT beats full fine-tuning
More efficient and faster training – Fewer parameter tweaks mean fewer calculations, which translates directly into speedier training sessions that require less computing power and less memory. This makes fine-tuning in resource-tight scenarios more practical.

Preserving knowledge from pretraining – Extensive pretraining on broad datasets packs models with invaluable general knowledge and capabilities. With PEFT, this treasure trove isn’t lost when adapting the model to novel tasks, because we retain most or all of the original model weights.
Whether PEFT is an effective alternative to full fine-tuning depends on the use case and the particular PEFT technique selected. In PEFT you train far fewer parameters than in full fine-tuning, and if the task is “hard enough,” the difference in the number of parameters trained will show in the results.
Prompt engineering
The methods discussed so far involve training model parameters on a new dataset and task, using either all of the pretrained weights, as in full fine-tuning, or a separate set of weights, as in LoRA. In contrast, prompt engineering doesn’t involve training network weights at all. It is the process of designing and refining the input given to a model to guide and influence the kind of output you want.
Basic Prompting Techniques
Very large LLMs such as GPT-4 are tuned to follow instructions, can generalize from very few examples based on the diverse patterns they’ve seen during training, and exhibit basic reasoning abilities. Prompt engineering leverages these capabilities to elicit desired responses from the model.
Zero-shot prompting
In zero-shot prompting, we prepend a certain instruction to the user’s query without providing the model with any direct examples.
Imagine you’re developing a tech support chatbot using a large language model. To make sure the model focuses on providing tech solutions without having prior examples, you can prepend a specific instruction to all user inputs:
Prompt:
Provide a tech support solution based on the following user concern. User concern: My computer won't turn on.
Solution:
By prepending an instruction to the user query “My computer won’t turn on,” we give the model context for the kind of answer desired. This is a way of adapting its output for tech support even without explicit examples of tech solutions.
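In application code, this amounts to wrapping each user query in the instruction template before calling the model. Here is a sketch assuming the OpenAI Python SDK; the model name is a placeholder:

```python
# Zero-shot prompting: prepend a fixed instruction to every user query.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def tech_support_answer(user_concern: str) -> str:
    prompt = (
        "Provide a tech support solution based on the following user concern.\n"
        f"User concern: {user_concern}\n"
        "Solution:"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

print(tech_support_answer("My computer won't turn on."))
```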
Few-shot prompting
In few-shot prompting, we prepend a few examples to the user’s query. These examples are essentially pairs of sample input and expected model output.
Imagine creating a health app that categorizes dishes into ‘Low Fat’ or ‘High Fat’ using a language model. To orient the model, a couple of examples are prepended to the user query:
Classify the following dish based on its fat content: Grilled chicken, lemon, herbs.
Response: Low Fat

Classify the following dish based on its fat content: Mac and cheese with heavy cream and butter.
Response: High Fat

Classify the following dish based on its fat content: Avocado toast with olive oil.
Response:
Informed by the examples in the prompt, a large enough and well-trained LLM will reliably respond: “High Fat.”
Few-shot prompting is a good way of getting the model to adopt a certain response format. Going back to our tech support app example, if we wanted the model’s response to conform to a certain structure or length restrictions, we could do so through few-shot prompting.
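Few-shot prompts are typically assembled programmatically from a list of (input, expected output) pairs. A small, dependency-free sketch:

```python
# Build a few-shot prompt from (input, expected output) example pairs.
EXAMPLES = [
    ("Grilled chicken, lemon, herbs.", "Low Fat"),
    ("Mac and cheese with heavy cream and butter.", "High Fat"),
]
INSTRUCTION = "Classify the following dish based on its fat content:"

def few_shot_prompt(query: str) -> str:
    lines = []
    for dish, label in EXAMPLES:
        lines.append(f"{INSTRUCTION} {dish}")
        lines.append(f"Response: {label}")
    lines.append(f"{INSTRUCTION} {query}")
    lines.append("Response:")
    return "\n".join(lines)

print(few_shot_prompt("Avocado toast with olive oil."))
```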
Chain-of-thought prompting
Chain-of-thought prompting allows for detailed problem-solving by guiding the model through intermediate reasoning steps. Pairing it with few-shot prompting can enhance performance on tasks that need thoughtful analysis before generating an answer, as in the following example:
Subtracting the smallest number from the largest in this group results in an even number: 5, 8, 9.
A: Subtracting 5 from 9 gives 4. The answer is True.
Subtracting the smallest number from the largest in this group results in an even number: 10, 15, 20.
A: Subtracting 10 from 20 gives 10. The answer is True.
Subtracting the smallest number from the largest in this group results in an even number: 7, 12, 15.
A:
In fact, chain-of-thought prompting can also be paired with zero-shot prompting to enhance performance on tasks that require step-by-step analysis. Going back to our tech support app example, if we wanted to improve the model’s performance, we could ask it to break down the solution step by step:
Break down the tech support solution step by step based on the following user concern. User concern: My computer won't turn on.
Solution:
For a variety of applications, basic prompt engineering of a very large LLM can deliver “good enough” accuracy. It provides an economical adaptation method because it is fast and doesn’t involve large amounts of computing power. The downside is that it’s simply not accurate or robust enough for use cases where additional background knowledge is required.
Retrieval Augmented Generation (RAG)
Introduced by Meta researchers, Retrieval Augmented Generation (RAG) is a powerful technique that combines prompt engineering with context retrieval from external data sources to improve the performance and relevance of LLMs. By grounding the model on additional information, it allows for more accurate and context-aware responses.
How does RAG work?
RAG essentially couples information retrieval mechanisms with text generation models. The information retrieval component helps pull in relevant contextual information from a database, and the text generation model uses this added context to produce a more accurate and “knowledgeable” response. Here’s how it works (a minimal end-to-end sketch follows the list):

1. The application’s documents are split into chunks and passed through an embedding model, and the resulting vectors are stored in a vector store.
2. At query time, the user’s question is embedded with the same model, and the vector store retriever pulls the chunks most similar to it.
3. The retrieved chunks are inserted into the prompt as context alongside the user’s question.
4. The pretrained LLM generates a response grounded in that retrieved context.
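The following is a minimal sketch of steps 1 through 3, using the sentence-transformers library for embeddings and a plain in-memory array standing in for the vector store. The documents and model name are placeholders; a production system would use a dedicated vector database:

```python
# A minimal RAG sketch: embed documents, retrieve by cosine similarity,
# and insert the retrieved context into the prompt.
import numpy as np
from sentence_transformers import SentenceTransformer

DOCS = [  # placeholder knowledge base
    "To reset the router, hold the reset button for 10 seconds.",
    "If a computer won't power on, check the battery and power adapter first.",
    "Printer error E05 means the ink cartridge is not seated correctly.",
]

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder embedding model
doc_vectors = embedder.encode(DOCS, normalize_embeddings=True)

def retrieve(query: str, k: int = 2) -> list[str]:
    q = embedder.encode([query], normalize_embeddings=True)[0]
    scores = doc_vectors @ q  # cosine similarity, since vectors are normalized
    return [DOCS[i] for i in np.argsort(scores)[::-1][:k]]

def build_prompt(query: str) -> str:
    context = "\n".join(retrieve(query))
    return (f"Answer the question using only the context below.\n"
            f"Context:\n{context}\n"
            f"Question: {query}\nAnswer:")

# The assembled prompt is then sent to the pretrained LLM for generation.
print(build_prompt("My computer won't turn on."))
```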
For a practical guide to using RAG, we recommend our technical blog on how to build a RAG system using LangChain.
RAG use cases
RAG is especially useful when an application requires the LLM to base its responses on a large set of documents that are specific to the application’s context. These applications can include a wide variety of familiar tasks. Examples include a technical support chatbot that pulls from the company’s instruction manuals and technical documents to answer customer questions, and an internal question-and-answer app that has access to an enterprise’s internal documentation and can supply answers based on those documents.
RAG is also useful when the application needs to use up-to-date information and documents that weren’t part of the LLM’s training set. Some examples might be news databases or applications that search for medical research associated with new treatments.
Simple prompt engineering can’t handle these cases because of the limited context window of the LLM. Currently, for most use cases you can’t feed the entire set of documents into the prompt of the LLM.

Advantages of RAG
RAG has a number of distinct advantages:
Minimizes hallucinations – When the model makes “best-guess” assumptions, essentially filling in what it “doesn’t know,” the output can be incorrect or pure nonsense. Compared to simple prompt engineering, RAG produces results that are more accurate, with a lower chance of hallucinations.

Easily adapts to new data – RAG can adapt in situations where facts evolve over time, making it useful for generating responses that require up-to-date information.

Interpretable – Using RAG, the source of the LLM’s answer can be pinpointed. Having traceability regarding the source of an answer can be beneficial for internal monitoring, quality assurance, or addressing customer disputes.

Cost effective – Instead of fine-tuning the entire model on the task-specific dataset, you can get comparable results with RAG, which involves far less labeled data and computing resources.
RAG’s potential limitations
RAG is designed to enhance an LLM’s information retrieval capabilities by drawing context from external documents. However, in certain use cases additional context is not enough. If a pretrained LLM struggles with summarizing financial data, or drawing insights from a patient’s medical documentation, it is hard to see how additional context in the form of a single document could help. In such a case, fine-tuning is more likely to result in the desired output.
Choosing the right approach for your application
After reviewing the four approaches to LLM adaptation, let’s compare them on three important metrics: cost, complexity of implementation, and accuracy.
Cost
When measuring the cost of an approach, it makes sense to take into consideration the cost of its initial implementation, along with the cost of maintaining the solution. Given that, let’s compare the costs of our four approaches.
Prompt engineering – Prompt engineering has the lowest cost of the four approaches. It boils down to writing and testing prompts to find ones that deliver good results when fed to the pretrained LLM. It may also involve updating prompts if the pretrained model is itself updated or replaced. This can happen periodically when using a commercial model like OpenAI’s GPT-4.
RAG – The cost of implementing RAG may be higher than that of prompt engineering. This is because of the need for multiple components: an embedding model, a vector store, a vector store retriever, and a pretrained LLM.
PEFT – The cost of PEFT tends to be higher than that of RAG. This is because fine-tuning, even efficient fine-tuning, requires a considerable amount of computing power, time, and ML expertise. Moreover, to maintain this approach, you’ll need to fine-tune periodically to incorporate new relevant data into the model.
Full fine-tuning?– This method is significantly more costly than PEFT, given that it requires even more compute power and time.
Complexity of implementation
From the relatively straightforward approach of prompt engineering to the more intricate configurations of RAG and advanced tuning methods, the complexity can vary significantly. Here’s a quick rundown on what each method entails:
Prompt engineering – This method has relatively low implementation complexity. It requires little to no programming. To draft a good prompt and conduct experiments, a prompt engineer needs good language skills, domain expertise, and familiarity with few-shot learning approaches.
RAG – This approach has higher implementation complexity than prompt engineering. To implement this solution, you need coding and architecture skills. Depending on the RAG components chosen, the complexity could run very high.
PEFT and full fine-tuning – These approaches are the most complex to implement. They demand a deep understanding of deep learning and NLP, as well as data science expertise, to change the model’s weights via tuning scripts. You’ll also need to consider factors such as the training data, learning rate, and loss function.
Accuracy
Evaluating the accuracy of different approaches for LLM adaptation can be intricate, especially because the accuracy often hinges on a blend of distinct metrics. The significance of these metrics can fluctuate based on the specific use case. Certain applications might prioritize domain-specific jargon. Others might prioritize the ability to trace the model’s response back to a particular source. To find the most accurate approach for your needs, it’s imperative to identify the pertinent accuracy metrics for your application and compare the methodologies against those specific criteria. Let’s take a look at some accuracy metrics.
Domain-specific terminology
Fine-tuning can effectively impart domain-specific terminology to an LLM. While RAG is proficient at data retrieval, it may not capture domain-specific patterns, vocabulary, and nuances as well as a fine-tuned model. For tasks that demand strong domain affinity, fine-tuning is the way to go.
Up-to-date responses
A fine-tuned LLM becomes a fixed snapshot of its training dataset and will need regular retraining for data that is evolving. This makes exclusive fine-tuning (both full and PEFT) a less viable approach for applications that require responses to be synced with a dynamic pool of information. In contrast, RAG’s external queries can ensure updated responses, making it ideal for environments with dynamic data.
Transparency and Interpretability
For some applications, understanding the model’s decision-making is crucial. While fine-tuning functions more like a “black box,” obscuring its reasoning, RAG provides clearer insight: its two-step process shows which documents were retrieved to ground each answer, enhancing user trust and comprehension.
Hallucinations
Pretrained LLMs sometimes make up answers that are missing from their training data or the provided input. Fine-tuning can reduce these hallucinations by focusing an LLM on domain-specific data. However, unfamiliar queries may still cause the LLM to fabricate an answer. RAG reduces hallucinations by anchoring the LLM’s response in the retrieved documents: the retrieval step supplies grounded facts, while the subsequent generation is restricted to the context of the retrieved data. For tasks where avoiding hallucinations is paramount, RAG is recommended.
We see that RAG is excellent for cases where interpretability, up-to-date responses, and avoiding hallucinations are paramount. Full fine-tuning and PEFT are clear winners for use cases that put most of the weight on domain-specific style and vocabulary. But what if your use case requires both? In that case, you might want to consider a hybrid approach, using both fine-tuning and RAG.
The road ahead to LLM adaptation
To determine the best approach for adapting an LLM to your needs, you’ll need to go beyond budgetary constraints and available expertise, and delve into the specific requirements of the application. What aspects of LLM accuracy hold the highest significance for you? Is preventing hallucinations paramount, or do you place greater value on the LLM’s creative capabilities? How vital is it for your LLM to stay updated with fresh data, and is this something you can achieve with prompt modifications—like updating interest rates—or does the frequency or complexity of the information demand a strategy like retrieval augmented generation? Could a single adaptation strategy be sufficient, or might a combination serve you better?
Now that you know what questions need answering, all that remains is to try and answer them and navigate your optimal course of action.
Interested in learning how to supercharge the performance of your LLMs, boosting speed by up to 5x while maintaining the same accuracy? Read here.