Prompt Engineering | Directional Stimulus Prompting

I missed my flight back home today and had to stay an extra night in Austin. This turned out to be a blessing in disguise: it gave me some additional time to research a prompting technique. Tonight I read the paper https://arxiv.org/pdf/2302.11520. In this article I summarize my learnings from the paper and illustrate where the technique can be applied.

In the rapidly advancing field of natural language processing (NLP), large language models (LLMs) like GPT-3, InstructGPT, and ChatGPT have made remarkable strides. These models exhibit impressive capabilities, such as few-shot prompting, in-context learning, and the ability to perform a wide variety of tasks. However, despite their vast potential, LLMs often fall short of consistently generating the desired output on more specific or nuanced tasks. This lack of fine-grained control over LLM outputs makes it difficult to ensure consistent behavior.

A novel solution has emerged from a team at the University of California, Santa Barbara, and Microsoft Research. The team introduces Directional Stimulus Prompting (DSP), a framework designed to guide LLMs more effectively toward desired outputs by using a small, tunable policy model.

The Challenge with Black-Box LLMs

LLMs like GPT-4 and PaLM are often referred to as "black-box" models because their internal parameters are not accessible for direct tuning. Users interact with these models through API calls, where they provide text-based prompts and receive responses. Although these models are incredibly powerful, their ability to generate task-specific outputs often hinges on the quality of the prompt.

This is where prompt engineering—the process of crafting specific inputs to elicit desired responses—comes into play. While manual and automated prompt engineering have seen success, they have limitations, especially when dealing with tasks that require more fine-grained or instance-specific responses. For example, in tasks like summarization or dialogue generation, models may not fully align their outputs with specific target behaviors, such as including key details or following a particular reasoning pattern.

The DSP framework addresses this issue by introducing a small policy model that generates instance-specific directional stimulus prompts, effectively guiding the LLM's response toward a more desirable outcome.

What Is Directional Stimulus Prompting?

The core concept behind DSP is the introduction of a directional stimulus—a discrete token or set of tokens that act as nuanced hints or clues for the LLM to follow. These stimuli guide the model's generation process by emphasizing key elements or aspects of the desired output.

For instance, in a summarization task, the directional stimulus might consist of important keywords that the summary should cover. In a dialogue response generation task, it could be a specific set of dialogue acts that capture the underlying intention of the response.

Unlike other methods that require additional external knowledge or complex retrieval mechanisms, DSP generates the directional stimulus solely based on the input query. This approach allows for fine-grained control without the computational inefficiencies of directly tuning the LLM's parameters.
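To make this concrete, here is a minimal sketch (my own illustration, not code from the paper) of what a directional-stimulus-augmented summarization prompt could look like in Python. The article text and keywords are made-up placeholders:

```python
# Made-up example: instance-specific keywords are appended as a "Hint"
# so the LLM's summary is steered to cover them.
article = (
    "The city council approved a new bike-lane network on Monday, "
    "with construction expected to start next spring."
)
keywords = ["city council", "bike-lane network", "Monday"]  # the directional stimulus

prompt = (
    f"Article: {article}\n"
    f"Hint (keywords the summary should cover): {', '.join(keywords)}\n"
    "Write a one-sentence summary of the article that includes the hinted keywords."
)
print(prompt)
```

The hint is generated per input, which is what makes the guidance instance-specific rather than a single fixed instruction for the whole task.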

How DSP Works

The DSP framework operates by training a small, tunable policy model to generate the directional stimulus for each input instance. This policy model can be trained through supervised fine-tuning (SFT) and/or reinforcement learning (RL).

  1. Supervised Fine-Tuning: The policy model is initially trained using a small collection of labeled data. For instance, in a summarization task, keywords from the target summaries are extracted and used to train the model to generate the appropriate stimulus prompts.
  2. Reinforcement Learning: After fine-tuning, the policy model can be further optimized using reinforcement learning. This step allows the policy model to explore and generate more effective stimuli by maximizing a reward function, such as improved ROUGE or BLEU scores for summarization tasks.

By training the policy model instead of the LLM itself, DSP sidesteps the challenges of directly tuning black-box models, making the approach more efficient and scalable.
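The sketch below shows, under simplifying assumptions, what the supervised fine-tuning step could look like: it is not the authors' code, and it assumes Hugging Face Transformers, a small T5 policy model, and a single toy (article, keywords) pair. A real setup would iterate over a labeled dataset in batches.

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer
import torch

tokenizer = T5Tokenizer.from_pretrained("t5-small")
policy = T5ForConditionalGeneration.from_pretrained("t5-small")
optimizer = torch.optim.AdamW(policy.parameters(), lr=3e-4)

# Toy training pair: keywords taken from the reference summary act as labels.
article = "The city council approved a new bike-lane network on Monday ..."
keywords = "city council; bike-lane network; Monday"

inputs = tokenizer("extract keywords: " + article, return_tensors="pt", truncation=True)
labels = tokenizer(keywords, return_tensors="pt").input_ids

optimizer.zero_grad()
loss = policy(**inputs, labels=labels).loss  # standard seq2seq cross-entropy
loss.backward()
optimizer.step()

# At inference time, the trained policy model generates the hint for a new input,
# and that hint is inserted into the prompt sent to the black-box LLM.
hint_ids = policy.generate(**tokenizer("extract keywords: " + article, return_tensors="pt"))
print(tokenizer.decode(hint_ids[0], skip_special_tokens=True))
```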

Applications and Results

The authors of the DSP framework tested it across several NLP tasks, including summarization, dialogue response generation, and chain-of-thought reasoning. The results were promising:

  • Summarization: DSP was used to guide ChatGPT to generate summaries that more closely aligned with reference summaries. Using just a small dataset of 4,000 samples from the CNN/Daily Mail dataset, the DSP approach improved ROUGE and BLEU scores by 4-13%, even outperforming some fully supervised models.
  • Dialogue Response Generation: In task-oriented dialogue generation, DSP helped guide ChatGPT to generate more accurate and useful responses. With just 80 dialogues from the MultiWOZ dataset, DSP achieved a performance improvement of up to 41.4%, surpassing several state-of-the-art models trained on the full dataset.
  • Chain-of-Thought Reasoning: The DSP approach also improved chain-of-thought reasoning, where the policy model generated instance-specific prompts that outperformed both human-designed and automatically generated task-specific prompts. This resulted in better reasoning accuracy on datasets like MultiArith and AQuA.

How have I used it?

I wanted to summarize conversations and extract key topics, but I also wanted the extraction to follow certain desired behaviors. Taking inspiration from this approach, I passed the desired behaviors as hints in the prompt. Here is an example:

Let us assume that a meeting happened between a travel assistant and a traveler. The travel assistant now wants to send out the minutes of the meeting.

Sample Chat Transcript:

Traveler: "I need to book a flight to Paris for next Tuesday and a hotel near the city center." 

Travel Assistant: "I’ll look into flights for next Tuesday and get back to you. Do you have a hotel preference?" 

Traveler: "No specific preference, just something comfortable. I’ll also need transportation from the airport to the hotel." 

Travel Assistant: "Got it. I’ll also arrange transportation for you."
        

I can now write a prompt with an injected directional stimulus, or hint, as shown below.

Prompt Example:

Extract the action items from the following conversation between a traveler and a travel assistant. Focus on identifying tasks for both the traveler and the travel assistant, such as ‘Travel assistant will book the flight,’ ‘Traveler will confirm the dates,’ and similar tasks.

Transcript:

Traveler: "I need to book a flight to Paris for next Tuesday and a hotel near the city center."

Travel Assistant: "I’ll look into flights for next Tuesday and get back to you. Do you have a hotel preference?"

Traveler: "No specific preference, just something comfortable. I’ll also need transportation from the airport to the hotel."

Travel Assistant: "Got it. I’ll also arrange transportation for you."

Based on the extracted key topics and action items, draft a professional email summarizing the conversation between the traveler and the travel assistant. Use phrases like "Summary of our discussion," "Next steps," and "Looking forward to your response" to maintain a professional tone. Please specify the action items separately for the traveler and the travel assistant.

Example of output:

Subject: Summary of Travel Arrangements – Trip to Paris

Dear [Traveler's Name],

I hope this message finds you well. Below is a summary of our recent conversation regarding your upcoming trip to Paris:

Summary of our discussion:

  • You plan to fly to Paris next Tuesday.
  • You require a comfortable hotel near the city center, with no specific preferences.
  • You also need transportation from the airport to the hotel.

Next steps:

For me:

  1. I will look into flight options for next Tuesday and provide the available choices.
  2. I will search for a comfortable hotel near the city center.
  3. I will arrange transportation from the airport to the hotel.

For you:

  1. Please confirm if the travel date of next Tuesday is accurate so that I can proceed with booking.

Looking forward to your confirmation, and I'll send you the details of the arrangements shortly thereafter.

Best regards,

[Your Name]

Travel Assistant
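For completeness, here is a rough sketch of how this hint-augmented prompt could be sent to a chat model. It assumes the OpenAI Python client with an API key in the environment, and the model name is only a placeholder:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

transcript = """\
Traveler: "I need to book a flight to Paris for next Tuesday and a hotel near the city center."
Travel Assistant: "I'll look into flights for next Tuesday and get back to you. Do you have a hotel preference?"
Traveler: "No specific preference, just something comfortable. I'll also need transportation from the airport to the hotel."
Travel Assistant: "Got it. I'll also arrange transportation for you."
"""

# The hints below are the injected directional stimulus: desired behaviors
# and phrasing the response should follow.
hints = (
    "Focus on action items for both parties, e.g. 'Travel assistant will book the flight'. "
    "Use phrases like 'Summary of our discussion', 'Next steps', and 'Looking forward to your response'. "
    "List the action items separately for the traveler and the travel assistant."
)

prompt = (
    "Extract the action items from the following conversation, then draft a "
    f"professional minutes-of-meeting email.\nHints: {hints}\n\nTranscript:\n{transcript}"
)

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder: any chat-capable model should work
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)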

Additional points based on questions about this article:

Question (asked by Mainak Sarkar): What does the policy model do, and how is it different from the LLM that generates the final response?

Answer:

The policy model plays a key role in guiding large language models (LLMs) toward specific desired outputs through Directional Stimulus Prompting (DSP). The policy model (e.g., T5) generates auxiliary prompts known as "directional stimulus prompts" that act as instance-specific hints, steering the LLM to align with the target output. Unlike the LLM that produces the final response, the policy model is small and tunable: it only produces the hint, while the frozen, black-box LLM generates the final output conditioned on the original input plus that hint. This helps improve performance on tasks such as summarization, dialogue generation, and reasoning.

The policy model is optimized through:

  1. Supervised Fine-Tuning (SFT): Using labeled data to train the model on specific tasks.
  2. Reinforcement Learning (RL): Further optimizing the model to generate better prompts based on performance metrics (like ROUGE scores or user preferences).

In summary, the policy model provides fine-grained control and guidance for black-box LLMs by producing input-specific hints that help steer the LLM toward the desired behavior.
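As an illustration of the RL signal mentioned above, here is a toy sketch of a ROUGE-based reward. It assumes the `rouge-score` package and made-up summaries; in the RL step, an algorithm such as PPO would use this kind of score to update the policy model so that it generates hints that lead to higher-scoring LLM outputs.

```python
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

# Made-up example: the LLM's summary was generated with the policy model's hint
# in the prompt, and is compared against the human reference summary.
reference_summary = "The council approved a city-wide bike-lane network on Monday."
llm_summary_with_hint = "On Monday the city council approved a new bike-lane network."

reward = scorer.score(reference_summary, llm_summary_with_hint)["rougeL"].fmeasure
print(f"reward = {reward:.3f}")  # higher reward encourages similar hints in the future
```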


