Evolution of Prompt Engineering

(Header image generated by DALL-E)

Intention: The past quarter has been nothing short of phenomenal, with Microsoft announcing game-changing advances in productivity tools, Google hot on their heels, and OpenAI pushing the boundaries of common practice around Large Language Models (LLMs). This article serves as a primer on how LLM prompts have evolved into a new kind of programming language, revolutionizing the way we interact with computers.

NOTE: This post is LIVE, meaning I will be updating it every now and then with new prompting techniques that I find interesting.

Simple prompts

Zero- and few-shot prompting: In a zero-shot prompt, we ask the model a direct question without providing any examples of how the task should be performed. The LLM is expected to generate a relevant answer based only on the context provided. An example prompt is:

Text: i'll bet the video game is a lot more fun than the film.
Sentiment: [The model is expected to provide the sentiment for the text.]        

The GPT model, being a completion model, is expected to complete the prompt with the sentiment of the text provided.
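As a rough sketch, a zero-shot call is just sending the raw prompt to a completion model and reading back the continuation. The snippet below uses a placeholder complete() function standing in for whichever completion API you have access to; it is an illustration, not a specific vendor's SDK.

# Minimal zero-shot sentiment sketch (illustrative; complete() is a
# placeholder for any text-completion API).
def complete(prompt: str) -> str:
    raise NotImplementedError("plug in your completion API here")

def zero_shot_sentiment(text: str) -> str:
    prompt = f"Text: {text}\nSentiment:"
    # The model is expected to continue the prompt with a sentiment label.
    return complete(prompt).strip()

# Example usage:
# zero_shot_sentiment("i'll bet the video game is a lot more fun than the film.")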

To improve on zero-shot performance, we can provide the model with a small number of examples, or "shots." These examples demonstrate how the task should be performed, and the model is expected to generalize from them to produce an accurate response. However, few-shot prompting for GPT-3 has been shown to suffer from various biases, such as majority label bias, recency bias, and common token bias. Choosing diverse, well-balanced examples can help mitigate these issues.

Text: (lawrence bounces) all over the stage, dancing, running, sweating, mopping his face and generally displaying the wacky talent that brought him fame in the first place. 
Sentiment: positive 
Text: despite all evidence to the contrary, this clunker has somehow managed to pose as an actual feature movie, the kind that charges full admission and gets hyped on tv and purports to amuse small children and ostensible adults. 
Sentiment: negative 
Text: for the first time in years, de niro digs deep emotionally, perhaps because he's been stirred by the powerful work of his co-stars. 
Sentiment: positive 
Text: i'll bet the video game is a lot more fun than the film. 
Sentiment:        

Given the examples of sentiment estimations, the model is expected to "understand" that we are asking for the sentiment of the texts provided.
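Programmatically, a few-shot prompt is just the labeled examples concatenated ahead of the new input. A minimal sketch, again assuming the placeholder complete() function from above rather than any particular API:

# Build a few-shot sentiment prompt from (text, label) example pairs.
def build_few_shot_prompt(examples, new_text):
    blocks = [f"Text: {t}\nSentiment: {label}" for t, label in examples]
    blocks.append(f"Text: {new_text}\nSentiment:")
    return "\n".join(blocks)

examples = [
    ("(lawrence bounces) all over the stage ...", "positive"),
    ("despite all evidence to the contrary, this clunker ...", "negative"),
]
prompt = build_few_shot_prompt(examples, "i'll bet the video game is a lot more fun than the film.")
# answer = complete(prompt)

Shuffling or balancing the example labels is one cheap way to reduce the recency and majority-label biases mentioned above.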

Instruct LM: While few-shot prompting can be effective, it consumes tokens and may limit the input length due to context restrictions. An alternative is to use direct instructions, which require fewer tokens while still clearly conveying the task. Instructed language models (e.g., InstructGPT or models trained on natural-instruction datasets) are fine-tuned on high-quality sets of task instructions, inputs, and correct outputs. This helps the model better understand user intentions and adhere to the given instructions. A popular technique for training such models is Reinforcement Learning from Human Feedback (RLHF). When working with instructed models, it's important to be specific and precise, and to instruct the model on what it should do rather than what it should not do. An example prompt would be:

Please label the sentiment towards the movie of the given movie review. The sentiment label should be "positive" or "negative". 
Text: i'll bet the video game is a lot more fun than the film. 
Sentiment:        

Summarization and fact extraction are typical examples of instruct prompting. By providing a clear task description and context, simple prompts allow LLMs to generate useful outputs for various applications, such as sentiment analysis, summarization, and fact extraction. This approach provided a major improvement and started a wave of industry attention toward language models. Of course, these use cases come with their own challenges, such as hallucination and trust.
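With an instruction-tuned chat model, the same task is usually expressed as an explicit instruction rather than a completion pattern. A minimal sketch, assuming a generic chat(messages) helper as a stand-in for any chat-completion API:

# Instruction-style prompting sketch; chat() is a placeholder for any chat API.
def chat(messages) -> str:
    raise NotImplementedError("plug in your chat-completion API here")

def label_sentiment(text: str) -> str:
    messages = [
        {"role": "system", "content": "You label the sentiment of movie reviews."},
        {"role": "user", "content":
            'Please label the sentiment of the given movie review as "positive" or "negative".\n'
            f"Text: {text}\nSentiment:"},
    ]
    return chat(messages)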

Chain-of-Thought Prompting

Chain-of-Thought Prompting is a method that significantly improves the ability of large language models (LLMs) to perform complex reasoning tasks. By providing a few chain-of-thought demonstrations as exemplars (in few-shot form) during the prompting process, the LLM is guided to think step-by-step and reason more effectively. This method has been shown to enhance performance on arithmetic, common-sense, and symbolic reasoning tasks. It also achieves state-of-the-art accuracy on the GSM8K benchmark of math word problems, surpassing even fine-tuned GPT-3 models with a verifier.

Example of a Chain-of-Thought Prompt:

Step 1: Read the problem: "John has 8 apples. He gives 3 apples to Mary. How many apples does John have left?"
Step 2: Calculate the difference between the initial number of apples and the apples given away: 8 - 3 = 5
Step 3: Answer the question: John has 5 apples left.

In this example, the CoT prompt guides the LLM through the steps needed to solve the problem, enabling the model to reason more effectively and generate a correct answer.
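Concretely, a CoT prompt is just a few-shot prompt whose exemplars include the intermediate reasoning before the final answer. A minimal sketch (illustrative exemplar text; complete() is the placeholder completion call from the earlier sketches):

# Chain-of-thought few-shot sketch: each exemplar shows the reasoning steps.
COT_EXEMPLAR = (
    "Q: John has 8 apples. He gives 3 apples to Mary. How many apples does John have left?\n"
    "A: John starts with 8 apples and gives away 3, so 8 - 3 = 5. The answer is 5.\n"
)

def cot_prompt(question: str) -> str:
    return COT_EXEMPLAR + f"Q: {question}\nA:"

# answer = complete(cot_prompt("A farmer has 15 sheep and buys 4 more. How many sheep does she have now?"))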

A full worked example (figure) can be found in the CoT reference paper: https://arxiv.org/abs/2201.11903

The main advantage of Chain-of-Thought Prompting is that it teaches LLMs to follow a structured thought process, which helps the model better understand and reason about the input. This leads to improved performance on various tasks and makes the LLM more useful for a wide range of applications.

However, it's important to note that CoT prompting may still face challenges, such as hallucination, trust, and error propagation, which can affect the accuracy and reliability of the generated output.

Paper: Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

Self-ask

Self-Ask Prompting is an elicitive prompting technique that aims to improve the compositional reasoning abilities of large language models (LLMs) by encouraging the model to ask itself follow-up questions. This method addresses the "compositionality gap": single-hop performance improves more rapidly with scale than multi-hop performance, so the gap between answering the sub-questions correctly and composing them into a correct multi-hop answer does not shrink as models increase in size and complexity.

Researchers propose using self-ask techniques to narrow this gap and demonstrate improved accuracy when the LLM uses a search engine to ask itself follow-up questions. The idea is to guide the model to think more deeply and critically about the problem at hand, ultimately enhancing its reasoning capabilities.

Figure taken from https://arxiv.org/abs/2210.03350

The main advantage of Self-Ask prompting is that it encourages LLMs to dig deeper and reason more effectively by asking themselves relevant questions. This leads to improved performance on multi-hop reasoning tasks and makes the LLM more useful for a variety of applications.
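In practice, self-ask is implemented as a loop: the model is prompted with exemplars that demonstrate follow-up questions, and whenever it emits a follow-up question, an external tool (typically a search engine) answers it before generation continues. A rough sketch of that control loop, with get_search_answer() as a hypothetical stand-in for whatever search backend is used and complete() the placeholder completion call from earlier:

# Self-ask control-loop sketch (placeholder completion and search functions).
def get_search_answer(question: str) -> str:
    raise NotImplementedError("plug in a search engine / retrieval call here")

def self_ask(question: str, exemplars: str, max_hops: int = 4):
    prompt = exemplars + f"Question: {question}\nAre follow up questions needed here:"
    for _ in range(max_hops):
        # In a real implementation you would stop generation at "Intermediate answer:".
        output = complete(prompt)
        prompt += output
        if "So the final answer is:" in output:
            return output.split("So the final answer is:")[-1].strip()
        if "Follow up:" in output:
            follow_up = output.split("Follow up:")[-1].strip()
            answer = get_search_answer(follow_up)
            prompt += f"\nIntermediate answer: {answer}\n"
    return prompt  # fall back to the full trace if no final answer was produced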

Paper: Measuring and Narrowing the Compositionality Gap in Language Models

Code and demo: https://github.com/ofirpress/self-ask

ReAct

ReAct, which stands for Reasoning and Acting, is a novel prompting technique that interleaves the generation of reasoning traces and task-specific actions within large language models (LLMs). By synergizing reasoning and acting processes, ReAct enables LLMs to perform more effectively on language and decision-making tasks, outperforming state-of-the-art baselines and improving human interpretability and trustworthiness over methods without reasoning or acting components.

ReAct operates by structuring the LLM's thought process into distinct steps, including:

  1. Thought: The LLM reasons about the current situation.
  2. Action: The LLM selects from a set of task-specific actions, such as search, lookup, or finish.
  3. Observation: The LLM processes and responds to the outcomes of the chosen action.

An example prompt for ReAct is:

Solve a question answering task with interleaving Thought, Action, Observation steps. Thought can reason about the current situation, and Action can be three types:

 (1) Search[entity], which searches the exact entity on Wikipedia and returns the first paragraph if it exists. If not, it will return some similar entities to search.

 (2) Lookup[keyword], which returns the next sentence containing keyword in the current passage.

 (3) Finish[answer], which returns the answer and finishes the task.

ReAct helps LLMs overcome issues like hallucination and error propagation, which are prevalent in chain-of-thought reasoning. By generating human-like task-solving trajectories, ReAct outperforms imitation and reinforcement learning methods on two interactive decision-making benchmarks.
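Operationally, ReAct is a loop in which the model's output is parsed for an Action, the action is executed against a tool (e.g., a Wikipedia search), and the result is fed back as an Observation. The sketch below shows only the control flow, not the paper's exact implementation; the tool functions are hypothetical stubs and complete() is the placeholder completion call from earlier:

# ReAct loop sketch: alternate model-generated Thought/Action with tool Observations.
def run_react(question: str, prompt_prefix: str, tools: dict, max_steps: int = 8):
    trace = prompt_prefix + f"Question: {question}\n"
    for step in range(1, max_steps + 1):
        output = complete(trace + f"Thought {step}:")   # model emits a Thought followed by an Action
        trace += f"Thought {step}:" + output
        action = output.split("Action")[-1]             # e.g. " 1: Search[Apple Remote]"
        name, _, arg = action.partition("[")
        name = name.split(":")[-1].strip().lower()      # "search", "lookup", or "finish"
        arg = arg.rstrip("]\n ")
        if name == "finish":
            return arg                                  # Finish[answer] ends the episode
        observation = tools[name](arg)                  # e.g. tools["search"], tools["lookup"]
        trace += f"\nObservation {step}: {observation}\n"
    return None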

Figure taken from https://arxiv.org/abs/2210.03629

The main advantage of ReAct prompting is that it provides LLMs with a systematic way of thinking and acting, making them more effective in solving complex tasks. However, it's essential to consider that ReAct might still face challenges, such as generating suboptimal reasoning traces or failing to improve performance on specific tasks. Additionally, the effectiveness of this technique could vary depending on the model's size and architecture.

Paper: ReAct: Synergizing Reasoning and Acting in Language Models

Code and demo: ReAct: Synergizing Reasoning and Acting in Language Models

Reflexion

Reflexion is a prompting approach that aims to equip an agent with dynamic memory and self-reflection capabilities, which can enhance its reasoning trace and task-specific action choices. Inspired by the way humans use self-reflection to solve novel problems through trial and error, Reflexion allows the agent to learn from its past experiences and adapt its strategy accordingly.

The primary components of Reflexion include:

  1. Dynamic Memory: The agent builds an internal memory map of the environment, storing relevant information and using it to make better-informed decisions.
  2. Self-Reflection: The agent evaluates its past actions, identifies mistakes, and devises new plans to overcome these errors in future attempts.

Figure taken from https://arxiv.org/abs/2303.11366

Reflexion example prompt:

You will be given the history of a past experience in which you were placed in an environment and given a task to complete. 
You were unsuccessful in completing the task. Do not summarize your environment, but rather think about the strategy and path you took to attempt to complete the task. 
Devise a concise, new plan of action that accounts for your mistake with reference to specific actions that you should have taken. 
For example, if you tried A and B but forgot C, then devise a plan to achieve C with environment-specific actions. You will need this later when you are solving the same task. 
Give your plan after "Plan". 
Here are two examples: XXX        

Reflexion has been evaluated on decision-making tasks in AlfWorld environments and knowledge-intensive, search-based question-and-answer tasks in HotPotQA environments. The approach achieved success rates of 97% and 51%, respectively, demonstrating its effectiveness in a variety of settings.

Reflexion offers several advantages, including the ability to learn from past experiences, adapt strategies dynamically, and improve performance on complex tasks. However, it's important to consider the potential limitations of this approach, such as the reliance on heuristics to identify hallucination instances and the possibility of suboptimal action sequences. The effectiveness of Reflexion may also vary depending on the model's size and architecture.
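A rough sketch of the Reflexion loop: run an episode, and if it fails, ask the model to reflect on what went wrong (with a prompt like the one above), store that reflection, and condition the next attempt on it. Here run_episode() is a hypothetical stand-in for the agent's interaction with the environment, and complete() is the placeholder completion call from earlier:

# Reflexion loop sketch: accumulate self-reflections across failed attempts.
def reflexion_loop(task: str, run_episode, max_trials: int = 3):
    reflections = []                                     # dynamic memory of past self-reflections
    for _ in range(max_trials):
        memory = "\n".join(reflections)
        success, trajectory = run_episode(task, memory)  # agent attempt, conditioned on memory
        if success:
            return trajectory
        # Ask the model to criticize the failed trajectory and propose a new plan.
        reflection = complete(
            "You will be given the history of a past experience in which you were "
            "placed in an environment and given a task to complete. You were unsuccessful.\n"
            f"{trajectory}\nDevise a concise, new plan of action. Plan:"
        )
        reflections.append(reflection)
    return None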

Paper: Reflexion: an autonomous agent with dynamic memory and self-reflection

Language Models can Solve Computer Tasks

RCI (Recursively Criticize and Improve) is a prompting approach that encourages a Large Language Model (LLM) to iteratively criticize and improve its own decisions. This method enhances the LLM's ability to execute computer tasks guided by natural language instructions and boosts its reasoning capabilities.

The key components of RCI include:

  1. Critic: The LLM evaluates its own actions, identifies errors or areas for improvement, and suggests alternatives.
  2. Improvement: The LLM iteratively refines its actions based on the criticism received from the critic component, resulting in a more accurate and effective solution.

An example prompt illustrating this iterative process is:

Solve the following task by iteratively improving your solution: [Task description]

Step 1: Propose an initial solution.

Step 2: Criticize your initial solution and suggest improvements.

Step 3: Apply the suggested improvements and propose a revised solution.

Step 4: Repeat steps 2 and 3 until a satisfactory solution is reached.

RCI has been shown to significantly outperform existing LLM methods for automating computer tasks, surpassing supervised learning (SL) and reinforcement learning (RL) approaches on the MiniWoB++ benchmark. In addition, RCI is effective in enhancing LLMs' reasoning abilities on a suite of natural language reasoning tasks, outperforming chain of thought (CoT) prompting. When RCI is combined with CoT, the performance surpasses that of either method alone.

Figure taken from https://arxiv.org/abs/2303.17491

RCI offers several advantages, such as enabling LLMs to execute computer tasks more effectively, improving reasoning capabilities, and allowing for iterative refinement of solutions. However, it is essential to consider the potential limitations of this approach, such as the reliance on the model's ability to generate accurate criticism and the possibility of getting stuck in a loop of improvements without convergence. The effectiveness of RCI may also depend on the model's size and architecture.
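A minimal sketch of the critique-and-improve loop follows; it uses the placeholder complete() call from earlier and a fixed number of rounds rather than a real convergence test:

# RCI-style loop sketch: criticize the current answer, then revise it.
def rci(task: str, rounds: int = 2) -> str:
    answer = complete(f"{task}\nAnswer:")
    for _ in range(rounds):
        critique = complete(
            f"{task}\nProposed answer: {answer}\n"
            "Review the answer, find problems with it, and list improvements:"
        )
        answer = complete(
            f"{task}\nProposed answer: {answer}\nCritique: {critique}\n"
            "Based on the critique, give an improved answer:"
        )
    return answer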

Project and demo: https://posgnu.github.io/rci-web/

Large Language Models Are Human-Level Prompt Engineers

APE (Automatic Prompt Engineer) is a method for automatically generating and selecting instructions (prompts) to maximize the performance of Large Language Models (LLMs) on specific tasks. It treats the instruction as a "program" and optimizes it by searching over a pool of candidate prompts. APE's primary goal is to produce instructions that help the LLM better understand the user's intent and provide more accurate and relevant responses.

The key components of APE include:

  1. Generation: The LLM generates a set of possible prompts based on the task at hand.
  2. Evaluation: Each generated prompt is tested to determine how well it performs on the given task.
  3. Refinement: The best-performing prompts are refined by generating variations and choosing the most effective ones.

Figure taken from https://arxiv.org/abs/2211.01910

APE has demonstrated success in automatically generating instructions that outperform the prior LLM baseline by a large margin. It has achieved better or comparable performance to instructions generated by human annotators on 24 out of 24 Instruction Induction tasks and 17 out of 21 curated BIG-Bench tasks.

APE offers several advantages, such as automating the prompt engineering process, improving LLM performance, and reducing the need for human involvement in instruction generation. However, there are potential limitations, such as the computational cost of generating and evaluating multiple prompts, and the possibility that the model may not produce highly diverse or creative prompts. The effectiveness of APE may also depend on the model's size and architecture, as well as the specific task being optimized.
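A rough sketch of the generate/evaluate/refine loop is below. It assumes a small set of labeled task examples and a hypothetical score_prompt() function (e.g., accuracy on held-out examples); complete() is the placeholder completion call from earlier:

# APE-style search sketch: propose candidate instructions, keep the best-scoring ones.
def score_prompt(instruction: str, examples) -> float:
    # Hypothetical scorer: fraction of examples answered correctly under this instruction.
    raise NotImplementedError

def ape_search(task_examples, n_candidates: int = 8, n_keep: int = 2):
    demo = "\n".join(f"Input: {x}\nOutput: {y}" for x, y in task_examples[:3])
    # 1. Generation: ask the model to propose instructions that explain the demonstrations.
    candidates = [
        complete(f"{demo}\nThe instruction that maps these inputs to outputs is:")
        for _ in range(n_candidates)
    ]
    # 2. Evaluation: score each candidate instruction on the task examples.
    scored = sorted(candidates, key=lambda c: score_prompt(c, task_examples), reverse=True)
    # 3. Refinement: generate variations of the best candidates and rescore everything.
    variants = [complete(f"Generate a variation of this instruction:\n{c}") for c in scored[:n_keep]]
    return max(scored[:n_keep] + variants, key=lambda c: score_prompt(c, task_examples))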

Code and demo: APE

Paper: Large Language Models Are Human-Level Prompt Engineers


Conclusion:

As the field of LLMs continues to grow, we can expect even more advances in prompting techniques, further revolutionizing our interaction with computers and AI systems. Stay tuned for updates on this exciting frontier.
