Evolution of Prompt Engineering
Intention: The past quarter has been nothing short of phenomenal, with Microsoft announcing game-changing advances in productivity tools, Google hot on their heels, and OpenAI pushing the boundaries of common practice around Large Language Models (LLMs). This article serves as a primer on how LLM prompts have evolved into a new programming language of sorts, changing the way we interact with computers.
NOTE: This post is LIVE, meaning I will be updating it every now and then with new prompting techniques that I find interesting.
Simple prompts
Zero- and few-shot prompting
Zero-shot prompting: In a zero-shot prompt, we ask the model a direct question without providing any examples of how the task should be performed. The LLM is expected to generate a relevant answer based on the context provided. An example prompt is:
Text: i'll bet the video game is a lot more fun than the film.
Sentiment: [The model is expected to provide the sentiment for the text.]
The GPT model, being a completion model, is expected to complete the prompt with the sentiment of the given text.
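As a minimal sketch, here is how such a zero-shot prompt could be assembled and sent off in Python; call_llm is a placeholder for whatever completion API you use, not a specific library call:

# Zero-shot sentiment prompt: the task is framed as a completion with no demonstrations.
def build_zero_shot_prompt(text: str) -> str:
    return f"Text: {text}\nSentiment:"

def call_llm(prompt: str) -> str:
    # Placeholder: swap in the completion endpoint of your choice (hypothetical).
    raise NotImplementedError("plug in your LLM client here")

prompt = build_zero_shot_prompt("i'll bet the video game is a lot more fun than the film.")
# sentiment = call_llm(prompt).strip()   # e.g. "negative"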
To improve on zero-shot performance, we can provide the model with a small number of examples, or "shots." These examples demonstrate how the task should be performed, and the model is expected to generalize from them to produce an accurate response. However, few-shot prompting with GPT-3 has been shown to suffer from various biases, such as majority label bias, recency bias, and common token bias. Choosing diverse, representative examples can help mitigate these issues.
Text: (lawrence bounces) all over the stage, dancing, running, sweating, mopping his face and generally displaying the wacky talent that brought him fame in the first place.
Sentiment: positive
Text: despite all evidence to the contrary, this clunker has somehow managed to pose as an actual feature movie, the kind that charges full admission and gets hyped on tv and purports to amuse small children and ostensible adults.
Sentiment: negative
Text: for the first time in years, de niro digs deep emotionally, perhaps because he's been stirred by the powerful work of his co-stars.
Sentiment: positive
Text: i'll bet the video game is a lot more fun than the film.
Sentiment:
Given the labeled examples, the model is expected to "understand" that we are asking for the sentiment of the final, unlabeled text.
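A small sketch of assembling the same few-shot prompt programmatically from the labeled examples above (call_llm is again a stand-in for your completion endpoint):

# Few-shot prompt: prepend labeled demonstrations, then ask for the label of the new text.
EXAMPLES = [
    ("(lawrence bounces) all over the stage, dancing, running, sweating, mopping his face "
     "and generally displaying the wacky talent that brought him fame in the first place.",
     "positive"),
    ("despite all evidence to the contrary, this clunker has somehow managed to pose as an "
     "actual feature movie, the kind that charges full admission and gets hyped on tv and "
     "purports to amuse small children and ostensible adults.",
     "negative"),
    ("for the first time in years, de niro digs deep emotionally, perhaps because he's been "
     "stirred by the powerful work of his co-stars.",
     "positive"),
]

def build_few_shot_prompt(query: str) -> str:
    shots = "\n".join(f"Text: {text}\nSentiment: {label}" for text, label in EXAMPLES)
    return f"{shots}\nText: {query}\nSentiment:"

prompt = build_few_shot_prompt("i'll bet the video game is a lot more fun than the film.")
# sentiment = call_llm(prompt)   # same placeholder completion call as in the zero-shot sketch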
Instruct LM: While few-shot prompting can be effective, it consumes tokens and may limit the input length due to context restrictions. An alternative is to use direct instructions, which require fewer tokens while still clearly conveying the task. Instructed language models (e.g., InstructGPT, or models trained on Natural Instructions) are fine-tuned on high-quality sets of task instructions, inputs, and correct outputs, which helps the model better understand user intentions and adhere to the given instructions. A popular technique for training such models is Reinforcement Learning from Human Feedback (RLHF). When working with instructed models, be specific and precise, and focus on instructing the model on what it should do rather than what it should not do. An example prompt would be:
Please label the sentiment towards the movie of the given movie review. The sentiment label should be "positive" or "negative".
Text: i'll bet the video game is a lot more fun than the film.
Sentiment:
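A minimal sketch of building this instruction-style prompt; the constant and helper names are illustrative, and the note about splitting into chat messages is an assumption about typical instructed-model APIs rather than a specific library's interface:

# Instruction-style prompt: state the task explicitly instead of showing demonstrations.
INSTRUCTION = (
    "Please label the sentiment towards the movie of the given movie review. "
    'The sentiment label should be "positive" or "negative".'
)

def build_instruct_prompt(review: str) -> str:
    return f"{INSTRUCTION}\nText: {review}\nSentiment:"

# With a chat-style instructed model, the same content is typically split into a
# system message (the instruction) and a user message (the review); the exact call
# depends on the client library you use.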
Summarization and fact extraction are further examples of instruct prompting. By providing a clear task description and context, simple prompts allow LLMs to generate useful outputs for applications such as sentiment analysis, summarization, and fact extraction. This approach was a major step forward and started a wave of industry interest in language models. Of course, these use cases come with their own challenges, such as hallucination and trust.
Chain-of-Thought Prompting
Chain-of-Thought Prompting is a method that significantly improves the ability of large language models (LLMs) to perform complex reasoning tasks. By providing a few chain-of-thought demonstrations as exemplars (in few-shot form) during the prompting process, the LLM is guided to think step-by-step and reason more effectively. This method has been shown to enhance performance on arithmetic, common-sense, and symbolic reasoning tasks. It also achieves state-of-the-art accuracy on the GSM8K benchmark of math word problems, surpassing even fine-tuned GPT-3 models with a verifier.
Example of a Chain-of-Thought Prompt:
Step 1: Read the problem: "John has 8 apples. He gives 3 apples to Mary. How many apples does John have left?
Step 2: Calculate the difference between the initial number of apples and the apples given away: 8 - 3 = 5
Step 3: Answer the question: John has 5 apples left."
In this example, the CoT prompt guides the LLM through the steps needed to solve the problem, enabling the model to reason more effectively and generate a correct answer.
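As a sketch, a chain-of-thought exemplar can be embedded into a few-shot prompt so that new questions are answered in the same step-by-step format; the second question below is a made-up example for illustration:

# Few-shot chain-of-thought: the exemplar contains the intermediate reasoning, not just the
# final answer, so the model imitates the step-by-step format on the new question.
COT_EXEMPLAR = (
    "Q: John has 8 apples. He gives 3 apples to Mary. How many apples does John have left?\n"
    "A: John starts with 8 apples and gives away 3, so 8 - 3 = 5. The answer is 5."
)

def build_cot_prompt(question: str) -> str:
    return f"{COT_EXEMPLAR}\n\nQ: {question}\nA:"

print(build_cot_prompt("Sara has 12 pencils and buys 7 more. How many pencils does she have?"))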
The main advantage of Chain-of-Thought Prompting is that it teaches LLMs to follow a structured thought process, which helps the model better understand and reason about the input. This leads to improved performance on various tasks and makes the LLM more useful for a wide range of applications.
However, it's important to note that CoT prompting may still face challenges, such as hallucination, trust, and error propagation, which can affect the accuracy and reliability of the generated output.
Self-ask
Self-Ask Prompting is an elicitive prompting technique that aims to improve the compositional reasoning abilities of large language models (LLMs) by encouraging the model to ask itself follow-up questions. This method addresses the "compositionality gap," which refers to the fact that LLMs' single-hop performance tends to improve more rapidly than multi-hop performance. As a result, the compositionality gap remains unchanged as the models increase in size and complexity.
Researchers propose using self-ask techniques to narrow this gap and demonstrate improved accuracy when the LLM uses a search engine to ask itself follow-up questions. The idea is to guide the model to think more deeply and critically about the problem at hand, ultimately enhancing its reasoning capabilities.
The main advantage of Self-Ask prompting is that it encourages LLMs to dig deeper and reason more effectively by asking themselves relevant questions. This leads to improved performance on multi-hop reasoning tasks and makes the LLM more useful for a variety of applications.
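A minimal sketch of the self-ask control flow, assuming the "Follow up:" / "Intermediate answer:" / "So the final answer is:" markers used in the paper's prompts; the search function here is a stub, and call_llm stands in for your completion API (the official implementation is linked below):

# Self-ask sketch: the model decomposes the question into follow-up questions that an
# external tool (here a stubbed search function) answers before the final answer is produced.
def search(query: str) -> str:
    # Placeholder for a real search engine call (assumption, not the paper's tooling).
    return "<search result for: " + query + ">"

def self_ask(question: str, call_llm, max_rounds: int = 4) -> str:
    prompt = (
        "Question: " + question + "\n"
        "Are follow up questions needed here: Yes.\n"
    )
    for _ in range(max_rounds):
        continuation = call_llm(prompt)
        if "So the final answer is:" in continuation:
            return continuation.split("So the final answer is:")[-1].strip()
        if "Follow up:" in continuation:
            follow_up = continuation.split("Follow up:")[-1].strip().splitlines()[0]
            prompt += "Follow up: " + follow_up + "\n"
            prompt += "Intermediate answer: " + search(follow_up) + "\n"
    return "no answer found within the round budget"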
Code and demo:?https://github.com/ofirpress/self-ask
ReAct
ReAct, which stands for Reasoning and Acting, is a novel prompting technique that interleaves the generation of reasoning traces and task-specific actions within large language models (LLMs). By synergizing reasoning and acting processes, ReAct enables LLMs to perform more effectively on language and decision-making tasks, outperforming state-of-the-art baselines and improving human interpretability and trustworthiness over methods without reasoning or acting components.
ReAct structures the LLM's output as an interleaved sequence of Thought, Action, and Observation steps: the model reasons about the current state, issues an action (for example, a search), and then conditions on the resulting observation before the next thought. An example prompt for ReAct is:
Solve a question answering task with interleaving Thought, Action, Observation steps. Thought can reason about the current situation, and Action can be three types:
(1) Search[entity], which searches the exact entity on Wikipedia and returns the first paragraph if it exists. If not, it will return some similar entities to search.
(2) Lookup[keyword], which returns the next sentence containing keyword in the current passage.
(3) Finish[answer], which returns the answer and finishes the task.
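A rough sketch of the loop this prompt induces: the model emits a Thought and an Action, the action is parsed and executed by a tool, and the resulting Observation is appended before the next step. The search and lookup functions below are stubs rather than the Wikipedia tools from the paper, and call_llm is a placeholder:

import re

# ReAct sketch: alternate model "Thought"/"Action" output with tool-produced "Observation" input.
def search(entity: str) -> str:
    return "<first paragraph about " + entity + ">"      # stub tool (assumption)

def lookup(keyword: str) -> str:
    return "<next sentence containing " + keyword + ">"  # stub tool (assumption)

def react(question: str, call_llm, max_steps: int = 6) -> str:
    prompt = "Question: " + question + "\n"
    for _ in range(max_steps):
        step = call_llm(prompt)              # model emits "Thought: ...\nAction: ..."
        prompt += step + "\n"
        match = re.search(r"Action.*?: *(Search|Lookup|Finish)\[(.*?)\]", step)
        if not match:
            continue
        action, arg = match.group(1), match.group(2)
        if action == "Finish":
            return arg
        observation = search(arg) if action == "Search" else lookup(arg)
        prompt += "Observation: " + observation + "\n"
    return "no answer within the step budget"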
ReAct helps LLMs overcome issues like hallucination and error propagation, which are prevalent in chain-of-thought reasoning. By generating human-like task-solving trajectories, ReAct outperforms imitation and reinforcement learning methods on two interactive decision-making benchmarks.
The main advantage of ReAct prompting is that it provides LLMs with a systematic way of thinking and acting, making them more effective in solving complex tasks. However, it's essential to consider that ReAct might still face challenges, such as generating suboptimal reasoning traces or failing to improve performance on specific tasks. Additionally, the effectiveness of this technique could vary depending on the model's size and architecture.
Code and demo: ReAct: Synergizing Reasoning and Acting in Language Models
Reflexion
Reflexion is a prompting approach that aims to equip an agent with dynamic memory and self-reflection capabilities, which can enhance its reasoning trace and task-specific action choices. Inspired by the way humans use self-reflection to solve novel problems through trial and error, Reflexion allows the agent to learn from its past experiences and adapt its strategy accordingly.
The primary components of Reflexion are a dynamic memory that stores verbal self-reflections from earlier trials and a self-reflection step that converts failure signals into concrete advice for the next attempt.
Reflexion example prompt:
You will be given the history of a past experience in which you were placed in an environment and given a task to complete.
You were unsuccessful in completing the task. Do not summarize your environment, but rather think about the strategy and path you took to attempt to complete the task.
Devise a concise, new plan of action that accounts for your mistake with reference to specific actions that you should have taken.
For example, if you tried A and B but forgot C, then devise a plan to achieve C with environment-specific actions. You will need this later when you are solving the same task.
Give your plan after "Plan".
Here are two examples: XXX
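A minimal sketch of the Reflexion outer loop under these assumptions: run_trial is a placeholder that rolls out one attempt in the environment and reports success, and call_llm stands in for the model that produces the verbal reflection using a prompt like the one above:

# Reflexion sketch: failed trials produce a verbal self-reflection that is kept in memory
# and prepended to the next attempt at the same task.
def reflexion(task: str, call_llm, run_trial, max_trials: int = 3) -> bool:
    memory = []  # verbal reflections from earlier failed trials
    for _ in range(max_trials):
        context = "\n".join(memory)
        trajectory, success = run_trial(task, context)   # placeholder environment rollout
        if success:
            return True
        reflection_prompt = (
            "You will be given the history of a past experience in which you were placed "
            "in an environment and given a task to complete. You were unsuccessful.\n"
            + trajectory
            + "\nDevise a concise, new plan of action that accounts for your mistake. "
            'Give your plan after "Plan".'
        )
        memory.append(call_llm(reflection_prompt))       # store the reflection for the next trial
    return False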
Reflexion has been evaluated on decision-making tasks in AlfWorld environments and knowledge-intensive, search-based question-and-answer tasks in HotPotQA environments. The approach achieved success rates of 97% and 51%, respectively, demonstrating its effectiveness in a variety of settings.
Reflexion offers several advantages, including the ability to learn from past experiences, adapt strategies dynamically, and improve performance on complex tasks. However, it's important to consider the potential limitations of this approach, such as the reliance on heuristics to identify hallucination instances and the possibility of suboptimal action sequences. The effectiveness of Reflexion may also vary depending on the model's size and architecture.
Language Models can Solve Computer Tasks
RCI (Recursively Criticize and Improve) is a prompting approach that encourages a Large Language Model (LLM) to iteratively criticize and improve its own decisions. This method enhances the LLM's ability to execute computer tasks guided by natural language instructions and boosts its reasoning capabilities.
The key components of RCI are an initial answer, a critique of that answer, and an improvement step that applies the critique, as illustrated by the following prompt skeleton:
Solve the following task by iteratively improving your solution: [Task description]
Step 1: Propose an initial solution.
Step 2: Criticize your initial solution and suggest improvements.
Step 3: Apply the suggested improvements and propose a revised solution.
Step 4: Repeat steps 2 and 3 until a satisfactory solution is reached.
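A rough sketch of that critique-and-improve loop in code; the prompts and the fixed number of rounds are simplifications, and call_llm is a placeholder for your model call:

# RCI sketch: generate an answer, ask the model to criticize it, then ask it to revise the
# answer using that critique; repeat for a fixed number of rounds.
def rci(task: str, call_llm, rounds: int = 2) -> str:
    answer = call_llm("Solve the following task:\n" + task)
    for _ in range(rounds):
        critique = call_llm(
            "Task:\n" + task + "\nProposed solution:\n" + answer +
            "\nReview the solution and point out any problems with it."
        )
        answer = call_llm(
            "Task:\n" + task + "\nPrevious solution:\n" + answer +
            "\nCritique:\n" + critique + "\nWrite an improved solution."
        )
    return answer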
RCI has been shown to significantly outperform existing LLM methods for automating computer tasks, surpassing supervised learning (SL) and reinforcement learning (RL) approaches on the MiniWoB++ benchmark. In addition, RCI is effective in enhancing LLMs' reasoning abilities on a suite of natural language reasoning tasks, outperforming chain of thought (CoT) prompting. When RCI is combined with CoT, the performance surpasses that of either method alone.
RCI offers several advantages, such as enabling LLMs to execute computer tasks more effectively, improving reasoning capabilities, and allowing for iterative refinement of solutions. However, it is essential to consider the potential limitations of this approach, such as the reliance on the model's ability to generate accurate criticism and the possibility of getting stuck in a loop of improvements without convergence. The effectiveness of RCI may also depend on the model's size and architecture.
Project and demo:?https://posgnu.github.io/rci-web/
Large Language Models Are Human-Level Prompt Engineers
APE (Automatic Prompt Engineer) is a method for automatically generating and selecting instructions (prompts) to maximize the performance of Large Language Models (LLMs) on specific tasks. It treats the instruction as a "program" and optimizes it by searching over a pool of candidate prompts. APE's primary goal is to develop instructions that help the LLM better understand the user's intent and provide more accurate and relevant responses.
The key components of APE are a proposal step, in which an LLM generates candidate instructions from input-output demonstrations, and a selection step, in which each candidate is scored on held-out examples and the best-performing instruction is kept.
APE has demonstrated success in automatically generating instructions that outperform the prior LLM baseline by a large margin. It has achieved better or comparable performance to instructions generated by human annotators on 24 out of 24 Instruction Induction tasks and 17 out of 21 curated BIG-Bench tasks.
APE offers several advantages, such as automating the prompt engineering process, improving LLM performance, and reducing the need for human involvement in instruction generation. However, there are potential limitations, such as the computational cost of generating and evaluating multiple prompts, and the possibility that the model may not produce highly diverse or creative prompts. The effectiveness of APE may also depend on the model's size and architecture, as well as the specific task being optimized.
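A simplified sketch of the APE search loop: an LLM proposes candidate instructions from input-output demonstrations, each candidate is scored on a held-out set, and the highest-scoring instruction is kept. The proposal wording and the exact-match scoring metric are simplifying assumptions, and call_llm is a placeholder:

# APE sketch: propose candidate instructions with an LLM, score each on held-out examples,
# and keep the highest-scoring instruction as the "program" for the task.
def ape(demos, dev_set, call_llm, n_candidates: int = 8) -> str:
    demo_text = "\n".join(f"Input: {x}\nOutput: {y}" for x, y in demos)
    candidates = [
        call_llm("I gave a friend an instruction. Based on these input-output pairs,\n"
                 + demo_text + "\nthe instruction was:")
        for _ in range(n_candidates)
    ]

    def score(instruction: str) -> float:
        # Fraction of dev examples the instruction gets exactly right (simplified metric).
        hits = sum(
            call_llm(instruction + "\nInput: " + x + "\nOutput:").strip() == y
            for x, y in dev_set
        )
        return hits / len(dev_set)

    return max(candidates, key=score)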
Code and demo: APE
Conclusion:
As the field of LLMs continues to grow, we can expect even more advances in prompting techniques, further revolutionizing our interaction with computers and AI systems. Stay tuned for updates on this exciting frontier.
Some further reads: