Precision in Prompting: Key to Effective LLM Interactions
Introduction
My previous article explored various methods for loading and performing local inference with Large Language Models (LLMs). This time, our attention turns to the art of presenting prompts to LLMs. Fine-tuned LLMs bring the formatting of prompts into sharp focus: unlike their general-purpose counterparts, these specialized models may depend heavily on specific keywords and carefully structured prompts to elicit the right responses. This article delves into why getting the prompt syntax just right matters so much, highlighting how even minor deviations can result in less-than-ideal outcomes. We'll also explore the role of templating languages, such as Jinja, in achieving consistent prompt formatting, and show how these concepts come to life through practical examples using llama-cpp-python. This is different from prompt engineering, where the focus is on the content of the text itself; prompt formatting, the subject of this article, is about its presentation to the model.
Pretrained vs. Instruction-Tuned Models
In machine learning, particularly for language models like GPT, we distinguish between "pretrained models" and "fine-tuned models" during their development and deployment.
Pretrained Models refer to the initial versions trained on diverse datasets to grasp language fundamentals, including grammar, syntax, and some world knowledge. This phase is mostly unsupervised, with the model learning from vast amounts of text to build a broad understanding. Pretrained models are also known as foundational or base models.
Fine-Tuned Models are developed by further training the pretrained model on a more focused dataset. This second phase, often supervised, tailors the model to specific tasks like emotion analysis or medical inquiries, refining its capabilities to perform particular functions with greater accuracy. Fine-tuning transforms a generalist model into a specialist, leveraging its foundational knowledge for precise applications. This shift is key to developing AI solutions that meet specific needs with enhanced precision and efficiency. Not all fine-tuned LLMs require a specific prompt syntax, but it's essential to confirm before using one whether it expects prompt inputs in a particular format or whether plain, regular text will be sufficient.
Why Presenting the Right Prompt Matters
When working with AI language models, the approach to framing questions or tasks is crucial and comes down to two key aspects: the content of the prompt and its presentation. Prompt engineering focuses on crafting the text with precise wording to elicit accurate responses. Equally important, but less frequently discussed, is the method of presenting these prompts, involving the use of special symbols or formats to distinguish between the task instructions and contextual information. This structured approach is essential for fine-tuning the models, enabling them to understand and respond more effectively.
Training Large Language Models (LLMs) to recognize specific formats or keywords significantly boosts their capacity for handling specialized tasks and understanding context that might elude broader models. Maintaining the correct prompt format is crucial because deviations can impact output quality. For example, using markers like [INST] in prompts for models such as LLaMA-2 clarifies our requests by distinguishing different prompt parts, thereby enhancing the model's response accuracy and relevance. This approach is akin to using a map to provide clear directions; the text serves as the directions, while the format acts as the map guiding the interpretation. Understanding this is vital for developers, as it not only improves model performance but also expands the potential for innovative AI applications. Experimentation with prompt presentation continues to unveil new ways to refine our interactions with AI, marking an exciting frontier for developers aiming to unlock further capabilities of these advanced technologies.
Prompt Generation
Consider the Llama 2 chat model, which has been fine-tuned to interact in a chat format involving a user and an assistant. In contrast, the pretrained Llama 2 model can only predict or generate the next piece of text based on the provided input and does not have the capability to understand or respond to user queries. The chat version, built upon the foundational model, undergoes supervised learning. It is specially trained with a certain prompt structure to grasp questions within their context and give answers. This model uses a unique prompt format, incorporating specific keywords to mark the beginning and end of the conversation segments from both the user and the assistant.
Typically, the prompt includes a history of the conversation, containing the user's messages and the assistant's previous responses, capped off with the user's latest question or comment. This specially formatted prompt is then fed to the model, which, having been trained to recognize this structure, starts generating the assistant's response.
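As a rough sketch (the roles and messages below are hypothetical placeholders), such a conversation history is commonly represented as a list of role/content pairs before being rendered into the model-specific prompt format:
chat_history = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Suggest a book on astronomy."},
    {"role": "assistant", "content": "You might enjoy 'Cosmos' by Carl Sagan."},
    {"role": "user", "content": "Summarize it in one sentence."},  # the latest user turn
]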
The Hugging Face tokenizer class offers a handy method for assembling prompts in the correct format for the model. Hugging Face models often ship with a prompt template written in the Jinja template language, which can be found in the model's tokenizer_config.json file. The apply_chat_template method of the tokenizer class is given the conversation messages and their roles, applies the Jinja template, and produces the proper prompt format for the model. This method of generating prompts is versatile and can be adapted to any model, as long as its Jinja template is included in its tokenizer_config.json. The apply_chat_template method takes care of formatting and generating the final prompt, which is compatible with the model and can be fed to it to generate output.
Instead of using the tokenizer class, users can also opt for manual prompt creation. In that case, it's crucial to pay attention to the placement of keywords and the punctuation around them, such as spaces and newline characters, to ensure the model produces consistent outputs. Compliance with the precise prompt syntax is key to reliable results from the model.
Here is a code snippet that generates a prompt for the Llama 2 chat fine-tuned model using the apply_chat_template method.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('meta-llama/Llama-2-7b-chat-hf')
chat = [
    {"role": "system", "content": "You are a helpful, respectful and honest assistant."},
    {"role": "user", "content": "Hi there, write me 3 random quotes"},
]
prompt = tokenizer.apply_chat_template(
    chat, tokenize=False,
    chat_template=tokenizer.chat_template)
print(prompt)
In the script above, tokenizer.chat_template is populated from the Jinja template in the tokenizer_config.json file. If the Jinja template is not present in this file, chat_template will be None. In that case, users may supply a Jinja template string for the model's prompt syntax obtained through other means. If no template is available at all, users can create prompts manually, a process detailed later in the article.
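If tokenizer.chat_template does turn out to be None, a template string can be passed explicitly. The sketch below is illustrative only (it mimics Zephyr-style markers) and should be replaced with the template actually published for your model:
fallback_template = (
    "{% for message in messages %}"
    "<|{{ message['role'] }}|>\n{{ message['content'] }}</s>\n"
    "{% endfor %}"
)
prompt = tokenizer.apply_chat_template(chat, tokenize=False, chat_template=fallback_template)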
Llama 2 compatible prompt.
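For the chat defined above, the snippet should produce a prompt of roughly the following shape (a sketch based on the Llama 2 chat template, with escape sequences written out for clarity):
'<s>[INST] <<SYS>>\nYou are a helpful, respectful and honest assistant.\n<</SYS>>\n\nHi there, write me 3 random quotes [/INST]'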
In this prompt, the system message, also known as the context, and the user query are embedded into the template; all other characters belong to the prompt syntax and should remain precisely as they are. Notice the use of '\n' and the space just after [INST] and before [/INST]. The <s> token signals the beginning of content, while [INST] and [/INST] denote the start and end of instructions, respectively. The <<SYS>> and <</SYS>> markers enclose the conversation's context or the desired behavior of the model, such as emulating a mathematician or musician, or setting a specific scenario that might better guide the response.
In the previous code snippet, replacing "meta-llama/Llama-2-7b-chat-hf" with "HuggingFaceH4/zephyr-7b-beta" produces a prompt compatible with the Zephyr fine-tuned model.
Zephyr compatible prompt
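With the Zephyr tokenizer, the same chat yields a prompt roughly of the following shape (again a sketch, based on Zephyr's published chat template):
'<|system|>\nYou are a helpful, respectful and honest assistant.</s>\n<|user|>\nHi there, write me 3 random quotes</s>\n'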
Some other prompt formats
ChatML compatible prompt: [ Template Source ]
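The ChatML layout wraps every turn in <|im_start|> and <|im_end|> markers. An illustrative example (the system message and user query here are placeholders, not taken from the template source):
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
Hi there, write me 3 random quotes<|im_end|>
<|im_start|>assistant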
Mistral compatible prompt [ Template Source ]
<s>[INST] Hello, how are you? [/INST]I'm doing great. How can I help you today?</s> [INST] I'd like to show off how chat templating works! [/INST]
Mixtral compatible prompt [ Template Source ]
"<s>[INST] Hello, how are you? [/INST]I'm doing great. How can I help you today?</s>[INST] I'd like to show off how chat templating works! [/INST]"
The last two prompts, for Mistral and Mixtral, are almost identical except for a space right after </s>, which is present in Mistral but not in Mixtral. Ignoring that level of detail could affect the end result generated by either model.
The Hugging Face tokenizer offers support for generating precise prompts. While it's possible to create the same syntax manually without tools, it's crucial to ensure that keywords and punctuation exactly match the required syntax for consistent results. The apply_chat_template method depends on the template being available in tokenizer_config.json; however, the Jinja template is typically released with the model. If it's not accessible via tokenizer.chat_template, one can supply the prompt template manually. Understanding the correct prompt syntax, especially for fine-tuned models, is essential to achieving the desired output.
Manual prompt generation
Creating prompts manually without Jinja is straightforward but is confined to simple question-answer scenarios. Managing an extended conversation history without Jinja becomes overly complicated, which is why manual prompt creation is better suited to single-query formats. For managing longer conversation histories more efficiently, the apply_chat_template method, used in conjunction with a Jinja template within the tokenizer class, is the superior approach.
# Build the Llama 2 prompt by hand with Python's str.format
template = '<s>[INST] <<SYS>>\n{context}\n<</SYS>>\n\n{question} [/INST]'
prompt = template.format(
    context="You are helpful assistant",
    question="Hi there, write me 3 random quotes"
)
output = llm(prompt)  # llm is an already-loaded model, e.g. a llama_cpp.Llama instance (see below)
The code above uses Python's str.format method to assemble the final prompt, which is then fed into the Large Language Model (LLM) for inference.
Let's do some experiments
This section explores how slight modifications in prompt syntax can impact the output of the model. It aims to provide insight and evidence on the significance of closely following the correct prompt syntax. Please be aware that all inference results presented in this article were obtained using 4-bit quantized models. It's possible that using unquantized models might yield better and more consistent results.
Zephyr 7B fine tuned model
The analysis uses the Zephyr 7B fine-tuned model with 4-bit quantization in conjunction with llama-cpp-python. By invoking the model's prediction call multiple times, the script surfaces any variations or inconsistencies in the results that might not be visible in a single run.
import llama_cpp

model = './zephyr-7b-beta.Q4_K_M.gguf'
llm = llama_cpp.Llama(model_path=model,
                      verbose=True, n_gpu_layers=-1)
prompt = '<|system|>\nYou are a helpful and creative assistant.</s>\n<|user|>\nHi there, write me 3 random quotes</s>\n'
for i in range(10):
    stream = llm(prompt, max_tokens=200, echo=False, temperature=0.9)
    print(stream['choices'][0]['text'])
Output from one of the iterations:
Throughout all the iterations, the formatting and presentation of the results remained consistent. The content, however, varied from run to run because of the non-zero temperature. The output from all of the iterations can be found at link.
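If this run-to-run content variation is undesirable, lowering the temperature pushes the sampling towards greedy decoding; a minimal tweak to the call above (same parameters otherwise) would be:
stream = llm(prompt, max_tokens=200, echo=False, temperature=0.0)  # near-deterministic output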
Let's try removing the closing keyword (i.e. the trailing </s>\n) from the end and see what happens.
prompt = '<|system|>\nYou are a helpful and creative assistant.</s>\n<|user|>\nHi there, write me 3 random quotes'
Output from one of the iterations:
The removal of the final </s> keyword turned off the chat capability and forced the model to behave like standard text generation, i.e. like the foundational model. The output of all iterations can be seen at link.
Now let's remove all newline characters from the original complete prompt and see what happens.
prompt = '<|system|>You are a helpful and creative assistant.</s><|user|>Hi there, write me 3 random quotes</s>'
Output from one of the iterations (see link for all the outputs):
The output varied, displaying a doubled '<' instead of a single '<' character, and the first quote didn't start on a new line. While the generated quotes were somewhat aligned with our expectations, the presentation inaccuracies could potentially cause issues in the content produced for other prompts. This inconsistency was noted in a few instances across numerous trials, underscoring the overall unpredictable nature of the output and emphasizing the crucial importance of adhering strictly to the original prompt syntax.
Removing <|user|> and placing the question directly after <|system|> also resulted in the model delivering the correct output in the anticipated format. This suggests that using just the system keyword to present a question is effective for this model. However, this approach might not apply to other models, as each may respond differently to changes in prompt syntax.
Finally, let's change the prompt to a plain query without any keywords and see what we get.
prompt = 'Hi there, write me 3 random quotes'
The absence of prompt keywords results in the model's output resembling plain text generation (like a foundational model) that picks up directly from where the prompt text ends. Although a few runs did produce the desired answer, overall the output lacked consistency. Output from all iterations can be seen at link.
Llama2 7B Chat tuned model
This time, we'll experiment with a different query using the Llama 2 7B chat fine-tuned model (4-bit quantized). Our focus will be on examining how the model's reasoning abilities are influenced by changes in prompt formatting. We'll compare the effects on output quality between a properly formatted prompt and raw input, sticking exclusively to these two situations. The system context is omitted this round by setting it to an empty string.
With a properly formatted prompt:
import llama_cpp

model = './llama-2-7b-chat.Q4_K_M.gguf'
llm = llama_cpp.Llama(model_path=model,
                      verbose=True, n_gpu_layers=-1)
prompt = "<s>[INST] <<SYS>>\n\n<</SYS>>\n\nAlex has a collection of 50 books. He decides to donate 15 books to the local library. If Alex's friend, Jamie, donates twice as many books as Alex did to the library, how many books does Jamie donate? [/INST]"
for i in range(10):
    stream = llm(prompt, max_tokens=200, echo=False, temperature=0.9)
    print(stream['choices'][0]['text'])
Output from one of the iterations:
Output from all 10 iterations can be found at link.
With a raw prompt:
Switching the prompt from keyword-enhanced text to raw text results in a varied mix of responses, including several that lack the correct answer.
prompt = "Alex has a collection of 50 books. He decides to donate 15 books to the local library. If Alex's friend, Jamie, donates twice as many books as Alex did to the library, how many books does Jamie donate?"
Output from one of the iterations:
The Llama 2 model struggled to produce accurate results with raw text inputs, often yielding incorrect outputs. The outcomes from all 10 iterations are documented at link. However, compared to Zephyr, the Llama 2 model did attempt to understand the question and respond with an answer even when presented with an unformatted raw prompt.
Final thoughts
Not all fine-tuned Large Language Models (LLMs) require or follow a custom prompt syntax. The need for custom prompt syntax largely depends on the specific application, the design of the fine-tuning process, and how the model is intended to be interacted with post-fine-tuning.
When Custom Prompt Syntax Is Used
Custom prompt syntax is often employed in scenarios where the model is fine-tuned for chat or instruction following, where task instructions need to be clearly separated from system context, or where a multi-turn conversation history with distinct roles must be tracked.
When Custom Prompt Syntax Is Not Necessary
However, custom prompt syntax is not always necessary: many fine-tuned models are designed to accept plain natural-language input, particularly for simple, single-turn tasks where there is no need to separate roles or context from the query.
The choice to use custom prompt syntax with a fine-tuned LLM model is influenced by the goals of the fine-tuning, the nature of the application, and the target users. While custom syntax can enhance precision and clarity in model interactions for certain tasks, many applications benefit from the intuitive and flexible nature of natural language interactions, requiring no special prompt syntax.
Finding prompt syntax
Gathering a detailed list of prompt syntaxes for open-source Large Language Models (LLMs) requires some research, since there is no single centralized source. Key strategies include checking the model card on Hugging Face, inspecting the chat_template in the model's tokenizer_config.json, and reviewing the model's official documentation, paper, or repository.
Identifying the LLM of interest and focusing on related resources can streamline finding applicable prompt syntax.
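As a quick first check, the Jinja template (when the model ships one) can be read straight from the tokenizer; a minimal sketch using one of the models from this article:
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained('HuggingFaceH4/zephyr-7b-beta')
print(tok.chat_template)  # prints the Jinja prompt template, or None if the model does not provide one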