Optimizing to the Eval: GenAI Design Pattern #5
Hannes Hapke and I are writing an O’Reilly book on GenAI Design Patterns. This is Pattern #5 in the book. We would love feedback both on the content and on the structured way (problem, solution, example, considerations) in which we have laid out the pattern.
Optimizing to the Eval is a pattern that uses preference tuning to produce optimally performing content. Unlike optimization methods that attempt to approximate an independent source of truth, this pattern "hacks" the model so that it creates content that the evaluation method will consider great.
The name is meant to evoke the idea of "teaching to the test". In spite of that negative connotation, the pattern is quite powerful: if the evaluation is robust, it is not a problem that you have optimized to it.
Problem
Content is created to serve some purpose, whether it is to educate, to entertain, or to drive sales. So, when you create content, you might want to ensure that it is in the style that best serves that purpose.
Traditional A/B testing is a simple approach to optimizing style that works in many situations. For example, suppose you have a hypothesis that the length of paragraphs matters for educational content and that paragraphs of 3–4 sentences are best. You can have a GenAI model create content in Style A, in which paragraphs are of varied length, and in Style B, in which paragraphs are 3–4 sentences. One way to do this is for the prompt that generates Style B to include the instruction to use paragraphs of 3–4 sentences, while the prompt that generates Style A has no such directive. Another way is for all content to be generated the same way, but for Style B, you have an LLM rewrite the content into paragraphs of 3–4 sentences each (the latter approach is shown in Figure 2-x).
Then, you split your audience and show Style A to some learners and Style B to others. Measure which set of learners learns the content faster or retains it longer, or whatever outcome you want to optimize for. Once you have a statistically significant difference, you can declare a winner. From then on, you can use the prompt or process corresponding to the style that "wins".
Even if multiple factors affect the learning outcome, that is not a problem: you can use a multi-armed bandit to test all the possible variations efficiently.
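As an aside, here is a minimal epsilon-greedy bandit sketch for allocating traffic among style variants; the variant names and the reward signal (say, 1 if the learner passes a follow-up quiz) are assumptions for illustration.

import random
from collections import defaultdict

# Style variants under test and the fraction of traffic reserved for exploration
VARIANTS = ["varied_paragraphs", "3_to_4_sentence_paragraphs"]
EPSILON = 0.1

counts = defaultdict(int)     # number of times each variant has been shown
rewards = defaultdict(float)  # cumulative reward (e.g., quiz passes) per variant

def choose_variant() -> str:
    # Explore with probability EPSILON, otherwise exploit the best-performing variant so far
    if random.random() < EPSILON or not counts:
        return random.choice(VARIANTS)
    return max(VARIANTS, key=lambda v: rewards[v] / counts[v] if counts[v] else 0.0)

def record_outcome(variant: str, reward: float) -> None:
    counts[variant] += 1
    rewards[variant] += reward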
The problem with using the traditional A/B testing approach to optimize the style of content arises when you don’t know which factors matter. When you don’t have a hypothesis about the set of factors to test, it is impossible to use traditional A/B tests. The problem is threefold:
Optimizing to the Eval is a design pattern that provides an approach to generate content in the style that offers the best performance even when you don’t know the factors that go into that “winning” style.
Solution
Style optimization with this pattern is sheer jujitsu. In Optimizing to the Eval, you reframe or redefine the three problems with the traditional A/B testing approach to make them go away:
The way we resolve the third problem is what gives the pattern its name: we decide that we don’t care why one piece of content is better than another, whether it is paragraph length, tone, or anything else. Instead, we simply do whatever we need to do to make our content "win". Since we are creating the content using an LLM, what we do is tune that LLM’s weights so that it produces the type of content that "wins"; we can do this by using the dataset of winners and losers to do preference tuning of the LLM. Then, to reliably create content that matches the style of the winning entries, we deploy and use the preference-tuned LLM going forward.
The steps of the solution are as follows (see Figure 2-x):
1. Generate pairs of content from essentially the same prompt.
2. Compare the two content pieces and declare one of them the winner.
3. Create a training dataset of prompt, chosen, and rejected examples.
4. Do preference tuning of the LLM on that dataset.
Let’s look at each of these steps.
Step 1: Generate Pairs of Content
There are several ways to generate two pieces of content in different styles from the same prompt: repeated generation, generation settings, and prompt rewriting.
Repeated generation
To generate pairs of content from the same prompt, you can use the exact same prompt and generate two instances of the output. Because LLM generation is stochastic, the two pieces will have some natural variations as long as you make sure to turn off any caching and ensure that you are not doing greedy sampling of the output (i.e. temperature > 0 and top_k > 1). That is enough to obtain two pieces of content that have some variations in style.
To illustrate this, we sent the following prompt to Mistral-7B-Instruct-v0.2 twice:
Where does the term “knee-jerk reaction” come from?
The first time, the response that came back started with:
The term “knee-jerk reaction” refers to an immediate, often unreflective response to a stimulus. It comes from the medical procedure used to test the reflexes in the knee.
The second time, the response started with:
The term “knee-jerk reaction” comes from the medical reflex test where the knee jerks up when the patellar tendon is tapped. This reflex is an automatic, unreflective response to a stimulus.
This prompt asks for a factual answer, and so the responses should vary the least. Yet, the responses vary in style quite dramatically. The first seems to be targeting a layperson while the second is addressed to a person who knows what the patellar tendon is.
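In code, repeated generation is simply a matter of invoking the same text-generation pipeline twice with sampling enabled. The sketch below assumes pipe and input_message objects of the same form used in the next variant.

paired_content = []
for _ in range(2):
    # Sampling (rather than greedy decoding) is what makes the two outputs differ
    response = pipe(input_message,
                    do_sample=True,
                    temperature=0.7,
                    top_k=50,
                    max_new_tokens=256)
    # The pipeline returns the full chat history; the last message is the model's reply
    paired_content.append(response[0]['generated_text'][-1]['content'])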
Change generation settings
You can change settings such as the temperature or top-p to get two pieces of content for the same prompt:
import random

paired_content = []
for _ in range(2):
    # Use a different randomly chosen temperature for each of the two generations
    response = pipe(input_message,
                    do_sample=True,
                    temperature=random.uniform(0.2, 0.9),
    )
    paired_content.append(response[0]['generated_text'][-1]['content'])
Prompt rewriting
You can have an LLM reword the original prompt in arbitrary ways such as:
Rephrase this sentence at a grade-school reading level: “Where does the term “knee-jerk reaction” come from?”
When we did this with Mistral-7B-Instruct-v0.2, we got the alternate prompt:
What’s the origin of the phrase ‘knee-jerk reaction’?
Note that you are rewriting the prompt, not rephrasing the content at a grade-school reading level! Other prompt-rewriting ideas that do not change the intent of the prompt include asking an LLM to make the prompt more concise or more verbose.
A powerful prompt rewriting technique is to ask an LLM to add an appropriate follow-on question to the prompt. For example, when we asked Mistral:
What is an appropriate follow-on question to this query: “Where does the term “knee-jerk reaction” come from?”
it suggested:
Can the origin of the phrase ‘knee-jerk reaction’ be traced back to a specific medical procedure or historical event?
We then modified the original prompt by appending the follow-on question. The response to this modified prompt is in a completely different style:
The term “knee-jerk reaction” does have a medical origin, but it doesn’t directly refer to a specific historical event or medical procedure. The term comes from the Medico-Psychological Association’s demonstration of the reflex at the 1878 Chicago World’s Fair.
With the two versions of the prompt, the original and the rewritten, you can generate two pieces of content.
You can randomly select amongst repeated generation, generation settings, and prompt rewriting, or combine them to give yourself many variations. At the end of Step 1, you have two pieces of content that have been generated to essentially the same prompt.
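Putting Step 1 together, a hedged sketch of a helper that randomly selects among the three techniques might look like the following. The pipe object is the text-generation pipeline used earlier; the rewrite instruction wording is an assumption for illustration.

import random

def generate(prompt: str, temperature: float = 0.7) -> str:
    messages = [{"role": "user", "content": prompt}]
    result = pipe(messages, do_sample=True, temperature=temperature, max_new_tokens=256)
    return result[0]['generated_text'][-1]['content']

def rewrite_prompt(prompt: str) -> str:
    # Ask the LLM to reword the prompt without changing its intent
    return generate(f"Rephrase this request without changing its intent: {prompt}")

def generate_pair(prompt: str) -> tuple[str, str]:
    technique = random.choice(["repeated", "settings", "rewrite"])
    if technique == "settings":
        return (generate(prompt, temperature=random.uniform(0.2, 0.9)),
                generate(prompt, temperature=random.uniform(0.2, 0.9)))
    if technique == "rewrite":
        return generate(prompt), generate(rewrite_prompt(prompt))
    return generate(prompt), generate(prompt)   # repeated generation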
Step 2: Compare the two content pieces
In Step 2, you need to determine which of those two pieces of content is better. You could do this in one of three ways: based on human labeling, using an evaluator, or choosing the item that provides the better outcome.
Variant 1: Human Labeling
Show the two pieces of content to a human expert and ask them to choose the better one. If you are in an application area where experts tend to disagree, you can show the pieces of content to a panel of experts and choose the majority opinion, or discard pairs where the decision is not clear-cut.
In some situations, you can even show the two pieces of content side-by-side as drafts and see which one your end-users actually select.
Variant 2: Evaluator
If your industry has metrics for content quality, you can score the two pieces of content on a rubric that consists of one or a combination of those metrics. For example, if you are generating SQL statements, your evaluator might consist of a sandbox that runs those SQL statements on an in-memory database and records the output and the time taken. You might then choose the statement that returns the correct result. If both statements return the correct result, choose the one that is more concise. If both statements are the same length, choose the one that runs faster.
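As a sketch of the SQL evaluator just described, assuming an in-memory SQLite sandbox and that you know the expected result rows for the query (the setup_sql and expected_rows parameters are assumptions for illustration):

import sqlite3
import time

def run_in_sandbox(sql: str, setup_sql: str):
    # Execute the statement against a throwaway in-memory database and time it
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(setup_sql)
        start = time.perf_counter()
        rows = conn.execute(sql).fetchall()
        return rows, time.perf_counter() - start
    except sqlite3.Error:
        return None, float("inf")   # a statement that fails to run never wins
    finally:
        conn.close()

def better_sql(sql_a: str, sql_b: str, setup_sql: str, expected_rows) -> str:
    rows_a, time_a = run_in_sandbox(sql_a, setup_sql)
    rows_b, time_b = run_in_sandbox(sql_b, setup_sql)
    correct_a, correct_b = (rows_a == expected_rows), (rows_b == expected_rows)
    if correct_a != correct_b:
        return sql_a if correct_a else sql_b                  # correctness first
    if len(sql_a) != len(sql_b):
        return sql_a if len(sql_a) < len(sql_b) else sql_b    # then conciseness
    return sql_a if time_a <= time_b else sql_b               # then speed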
If your industry has best practices, you might evaluate the content on those best practices. For example, marketing content is commonly evaluated on the 4Ps and 3Cs framework. You can often use a frontier LLM to perform the comparison because it might already have encountered the framework in its training. If not, you can add a description of the framework in the prompt context. This idea is called LLM-as-a-judge.
For example, we used Gemini 2.0 Flash to rate a piece of marketing content from Amazon using the prompt:
You are a marketing expert who is rating marketing content in the range of 0–7 where each point corresponds to one of the items in the 4Ps and 3Cs marketing framework. Read the following article and give it a numeric score, explaining your reasoning
The result was a score of 6 because the article “does a good job of describing the new features of the Kindle family of e-readers, but it could be improved by including more information about the price and availability of the devices.”
Instead of having the LLM score each piece of content separately and comparing the scores, you can pass in both pieces of content and ask it to determine which one is better based on the scoring rubric. This is the approach that we will follow in the Example section below.
Variant 3: Better Outcome
A third option is to use the actual outcome associated with each piece of content to determine the winner of a paired comparison. There are a couple of ways that you might be able to do this, depending on the problem context:
At the end of Step 2, you will have a way to compare two pieces of content that were generated in response to the same/similar prompt and declare one of them the winner.
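For example, if the outcome you track is clicks, a minimal sketch of an outcome-based comparison could look like this; the impression and click counts are assumed to come from your serving logs.

def better_by_outcome(stats_a: dict, stats_b: dict, min_impressions: int = 100):
    # Each stats dict holds {'impressions': int, 'clicks': int} for one content piece
    if min(stats_a["impressions"], stats_b["impressions"]) < min_impressions:
        return None                      # not enough evidence yet; skip this pair
    ctr_a = stats_a["clicks"] / stats_a["impressions"]
    ctr_b = stats_b["clicks"] / stats_b["impressions"]
    if ctr_a == ctr_b:
        return None                      # discard ties rather than guess
    return "a" if ctr_a > ctr_b else "b"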
<helpful-hint>
Step 2, the evaluation step, is by far the most important step in the Optimizing to the Eval pattern. Therefore, it is worth being deliberate about getting it right:
Experiment with different ideas and choose the one that works best. You may have to train and deploy the model in order to determine this. Don’t be afraid to come back and revisit this decision during the lifecycle of your project.
</helpful-hint>
It’s worth reflecting that although "teaching to the test" has a negative connotation, it is not a problematic practice if the test in question is robust and reflects reality. If your evaluation mechanism is a panel of experts, is it really a problem if the LLM produces content that those experts consider good? You hope that the judgement of these experts generalizes. What you do need to be careful about is that the metrics or outcomes you are optimizing are not too narrow. You don’t want to be in the position where the LLM creates content in styles that game your metrics or optimize toward short-term goals. For example, if the outcome you optimize for is engagement time, the LLM might produce hard-to-understand content because it takes longer to read!
Step 3: Create a training dataset
Create a training dataset where each example consists of three attributes: prompt (use the original prompt if you did prompt rewriting), chosen, and rejected. For example:
{
  "prompt": "Where does the term \"knee-jerk reaction\" come from?",
  "chosen": "The term \"knee-jerk reaction\" refers to an immediate, often unreflective response to a stimulus. It comes from the medical procedure used to test the reflexes in the knee.",
  "rejected": "The term \"knee-jerk reaction\" comes from the medical reflex test where the knee jerks up when the patellar tendon is tapped. This reflex is an automatic, unreflective response to a stimulus."
}
The format above is a standard format for datasets used for preference tuning, which is what we will do next. If your preference tuning framework expects a different format, you’d create your training dataset in that format of course.
You might also split this dataset into training and evaluation splits.
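For example, with the Hugging Face datasets library, the split can be a one-liner; the file name here matches the one used in the Example section later.

from datasets import load_dataset

dataset = load_dataset("json", data_files="ad_preference_dataset.jsonl", split="train")
splits = dataset.train_test_split(test_size=0.1, seed=42)   # hold out 10% for evaluation
train_dataset, eval_dataset = splits["train"], splits["test"]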
Step 4: Do Preference Tuning
Preference tuning is a way of training an LLM on a dataset of paired items so that it outputs responses closer to the chosen items than to the rejected ones. The original way to do preference tuning, introduced in 2017, was to use reinforcement learning. However, we recommend that you employ the Direct Preference Optimization (DPO) approach described in the 2023 paper Direct Preference Optimization: Your Language Model is Secretly a Reward Model by Rafailov et al. The DPO approach is much faster, for reasons that are beyond the scope of this book.
An implementation of DPO is provided by the Transformer Reinforcement Learning (TRL) package. To train a model using DPO, first instantiate the model and a tokenizer for it:
MODEL_ID = "Qwen/Qwen2–0.5B-Instruct"
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
Then, launch the training using:
from trl import DPOConfig, DPOTrainer

training_args = DPOConfig(output_dir="Qwen-DPO")
trainer = DPOTrainer(model=model, args=training_args,
                     processing_class=tokenizer, train_dataset=train_dataset)
trainer.train()
Once the training is complete, save the model:
trainer.save_model(training_args.output_dir)
The trained model can be deployed or loaded into a Transformer pipeline from local storage and used in lieu of the original model. This model will produce output in the style that it learned from the set of chosen and rejected responses.
Example
Let’s say that you want to use an LLM to generate classified ads. Let’s also assume that, for latency reasons, you want to use a small language model (SLM). Here, we’ll use the 0.5B parameter Qwen2 model. (The complete code is in the book’s GitHub repository in examples/style_optimization.)
Zero-shot doesn’t work
You can ask the model to generate an ad using the following prompt:
You are a resident who is listing a used item for sale on a neighborhood online group. An ad for used items in this neighborhood group is 1–3 sentences.
Write an ad to sell a {item} priced at {price}
When we used the prompt to write an ad for a $300 used bike:
create_classified_ad("3-year old Specialized bike", "$300")
we got an ad that started with:
Welcome to the Adventure with our Specialized Bike!
Specialized bikes have been a favorite among mountain bikers for years, and they’re here again! Our latest model, the Specialized 3-Year Old, offers unparalleled performance and durability. This bike has been designed with safety and comfort in mind, making it perfect for beginners or seasoned riders alike.
This is quite obviously a bad ad. It sounds as if the ad is for a new bike from the manufacturer whereas we wanted a for-sale-by-owner ad to sell a 3-year old bike.
The model is too small to generate good ads. But as we will see, we can use the Optimizing to the Eval pattern to train this model to produce good ad copy.
Generate pairs of ads
The first step is to generate pairs of ads. Let’s use the Generation Settings approach and choose a random temperature each time.
Here are two ads we got for the $300 bike when we did this:
Hey there! We’re looking for someone who’s ready to take their riding game to the next level with our 3-year-old Specialized bike. This bike is a great investment that will keep your kids engaged and safe all year round. It features durable components, a comfortable seat, and a powerful frame that can handle any terrain. Plus, it comes with a lifetime warranty, so you can be sure you’re getting a high-quality product. So why wait? Get yours today and start enjoying the thrill of riding on wheels!
and
Looking for a unique and stylish way to enjoy your daily commute? Look no further than the 3-year-old Specialized bike! This bike is perfect for those who value style over speed, and it’s priced at just $300. With its durable frame and high-quality components, this bike will last you years with minimal maintenance. Plus, it comes with a lifetime warranty, so you can rest easy knowing that you’re getting a quality product that won’t let you down. Don’t miss out on this opportunity to upgrade your bike experience today!
Compare two ads using LLM-as-a-judge
To compare the two ads, you can use the LLM-as-a-judge approach. Because evaluation is the most important step in this pattern, it is important to use a good evaluator. Let’s use a frontier LLM and have it output which ad is better, and the reason:
You are a professor of advertising at a business school.
Compare the two ads below for the same item being sold in a neighborhood marketplace and determine whether ad_a is better than ad_b
Also explain your reasoning.
The main criteria to compare the ads include:
* Is it clear what’s being sold? Age, brand, price, and condition are important.
* Does it target the most relevant audience for the item? Is the text persuasive to that audience?
* Is it concise and easy to read? An ideal ad is at most 3 sentences.
* Does it include contact information? Ideally, the ad specifies the preferred means of communication.
* Is the ad truthful? Remember that the item is likely used and not being sold by the manufacturer.
ad_a:
{ad_a}
ad_b:
{ad_b}
By optimizing to this eval, you are saying that you don’t care what style the ads are written in, as long as the ads are clear, relevant to the audience, short, actionable, and truthful. If writing the ads in rhyme is what makes the bike sell, you are perfectly okay with the LLM choosing to output rhymes.
Asking for the reason forces the LLM to do the comparison deliberately (see the Chain-of-Thought design pattern covered later in this book). The output can be of this form:
from dataclasses import dataclass

@dataclass
class AdsComparison:
    ad_a_is_better_than_ad_b: bool
    reasoning: str
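The score_ad helper used later in create_preference_example can be implemented along the following lines. This is a sketch that assumes the google-genai client; JUDGE_PROMPT stands in for the comparison prompt shown above, with {ad_a} and {ad_b} placeholders.

import json
from google import genai

client = genai.Client()   # expects an API key in the environment

def score_ad(ad_a: str, ad_b: str) -> AdsComparison:
    # Fill the judge prompt (an assumed template for the prompt shown above) with the two ads
    prompt = JUDGE_PROMPT.format(ad_a=ad_a, ad_b=ad_b) + (
        "\nRespond as JSON with the fields "
        "'ad_a_is_better_than_ad_b' (boolean) and 'reasoning' (string)."
    )
    response = client.models.generate_content(
        model="gemini-2.0-flash",
        contents=prompt,
        config={"response_mime_type": "application/json"},
    )
    return AdsComparison(**json.loads(response.text))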
The result, when we did the comparison using Gemini 2.0 Flash, for the two ads whose text appears in the preceding section was:
AdsComparison
ad_a_is_better_than_ad_b=False
reasoning="Both ads have issues, but ad_b is slightly better because it includes the price. Neither ad includes contact information. Both ads make the mistake of claiming the bike has a lifetime warranty, which is unlikely for a used bike being sold in a neighborhood marketplace. Ad_a is targeted toward children, but ad_b is targeted toward adults. Since the bike is used, it's more likely to be purchased by an adult."
The reasoning and comparison appear correct.
Generate training dataset
You can create a single preference example consisting of the prompt, chosen content, and rejected content by putting together the preceding steps:
def create_preference_example(item: str, price: str) -> dict:
    ad1 = create_classified_ad(item, price)
    ad2 = create_classified_ad(item, price)
    score = score_ad(ad1, ad2)
    preference_example = {
        "prompt": SYSTEM_PROMPT + f"""Write an ad to sell a {item} priced at {price}"""
    }
    if score.ad_a_is_better_than_ad_b:
        preference_example['chosen'] = ad1
        preference_example['rejected'] = ad2
    else:
        preference_example['chosen'] = ad2
        preference_example['rejected'] = ad1
    preference_example['score_reason'] = score.reasoning
    return preference_example
Call this in a loop with a random selection from a set of items to create a jsonl file:
items_for_sale = [
    ("3-year old Specialized road bike", "$300"),
    ("Amazing Spider-Man 361", "$200"),
    …
    ("Set of 2 vintage Pyrex coffee mugs (in original box)", "$20"),
    ("Hand-carved wooden serving spoon", "$15")
]

def write_jsonl(num_examples: int, filename: str):
    examples = []
    for iter in range(num_examples):
        print(iter, end=" … ")
        item, price = random.choice(items_for_sale)
        example = create_preference_example(item, price)
        examples.append(example)
    with open(filename, "w") as ofp:
        for example in examples:
            json.dump(example, ofp)
            ofp.write('\n')
This file can be used for preference tuning.
Do preference tuning
To perform preference tuning, first load the dataset:
from datasets import load_dataset

train_dataset = load_dataset('json', data_files="ad_preference_dataset.jsonl", split='train')
Then, create the model you wish to tune:
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
Finally, launch the DPOTrainer with this dataset and the model:
trainer = DPOTrainer(model=model, …, train_dataset=train_dataset)
trainer.train()
The training process for 100 examples, when we did it, took about 3 minutes on a machine with 8 vCPUs and an L4 GPU.
Inference
You can create a pipeline with the locally saved, tuned model just as if the model were on the Hugging Face Hub:
from transformers import pipeline

pipe = pipeline(
    task="text-generation",
    model="ClassifiedAds-DPO",   # local directory where the tuned model was saved
    use_fast=True,
)
# The default return_full_text=True means the pipeline returns the full chat history;
# the code below picks out the last (assistant) message.
Then, create new content using the same prompt as before:
def create_classified_ad(item: str, price: str) -> str:
    system_prompt = SYSTEM_PROMPT
    user_prompt = f"""
Write an ad to sell a {item} priced at {price}
"""
    input_message = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt}
    ]
    results = pipe(input_message,
                   max_new_tokens=256,
                   pad_token_id=pipe.tokenizer.eos_token_id
    )
    return results[0]['generated_text'][-1]['content'].strip()
Trying out the preference-tuned model for a book:
create_classified_ad("book Pachinko by Min Jin Lee", "$5")
yields:
Pachinko, the classic tale of a man’s obsession with gambling and his love for a woman he meets while playing a pachinko game. A rare edition priced at $5. For more information or to arrange pickup, please contact [Your Name] at [Your Phone Number]. Thank you!
Zero-shot with the tuned model works! The tuned model is outputting clear, concise, and persuasive ads because that’s what the evaluation mechanism preferred.
Considerations
Optimizing to the Eval uses preference tuning to optimize a GenAI model to produce content in the style that has the best performance as determined by an evaluator.
Choosing between the variants
There are several variants of the pattern; they differ primarily in how the preference tuning dataset is created.
In Step 1, you can choose how to generate two pieces of content in response to the same prompt. While repeated generation can work, changing the settings or rewriting the prompt will typically yield content that is more meaningfully different and covers more of the style space.
In Step 2, you can choose how to perform the comparison of the two pieces of content to determine which one is better. The approach you take often depends on how the content will be used:
The evaluation approach you take could be based on what you already have available. For example, if you have already defined metrics or a rubric, you could use them. If the metrics and rubrics are subjective (e.g. "is readable") or require content extraction (e.g. "includes contact info"), you can use LLM-as-a-judge. If you have a user interface that allows you to present a set of choices to users, then populate those choices with content pairs. If all queries are logged, and there is no reason (such as personalization) that responses to the same query will vary in substance, consider using the matched-pairs approach.
It is unlikely that you will have only one option available to you. Therefore, you may have to try multiple approaches and see which one gives you the best results.
In-distribution requirement
There are several places in this pattern where an LLM is being used:
1. An LLM that rewrites prompts (Step 1).
2. An LLM that generates the candidate content (Step 1).
3. Optionally, an LLM-as-a-judge that evaluates the content (Step 2).
4. The LLM that is preference tuned and then deployed (Step 4).
The content that is generated by the LLM (#2 in the list) has to be text that the LLM being trained (#4 in the list) can produce. No amount of preference tuning can teach the LLM new facts or new tokens, for example. There are two ways to ensure this in-distribution requirement, the easy way and the hard way:
You will need the latter approach if the small model doesn’t have the capability to produce content that is adequate. Note that the content need not be good: the ads generated by the untuned Qwen2 model were not good, but they were ads, and they could be evaluated using our evaluator. That was sufficient to tune Qwen2 to produce good ads.
Extension to?images
Although we have discussed this pattern with just text examples, preference tuning can be carried out on images as well; for example, you can use Diffusion-DPO to tune diffusion image models. Therefore, you can apply Optimizing to the Eval to create images in your preferred style.
For example, to train the runwayml Stable Diffusion model, create a pipeline:
import torch
from diffusers import StableDiffusionPipeline

pretrained_model_name = "runwayml/stable-diffusion-v1-5"
pipe = StableDiffusionPipeline.from_pretrained(
    pretrained_model_name,
    torch_dtype=torch.float16).to('cuda')
Use repeated generation (i.e. just run the prompt twice) to generate two images that differ in unknown ways:
for iter in range(2):
    …
    img = pipe(prompt=prompt, generator=generator, guidance_scale=7.5).images[0]
    …
Now comes the key step?—?evaluation. How is one image better than another? Assuming that the images are being generated for a newsletter, you could use the Outcome method to select the image in a pair that drives more article clicks.
With the dataset of images created in this manner, you can launch the training script to create a preference-tuned Stable Diffusion model:
accelerate launch --mixed_precision="fp16" train.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --dataset_name=$DATASET_NAME \
  …
  --output_dir="tmp-sd15"
The output of this model is now in the style that would drive more clicks.
Continuous improvement
You can improve quality by having a systematic approach to data collection, and by iterating on training.
Diverse prompts
DPO needs only a few high-quality examples. In preference tuning, unlike in traditional deep learning, the quality of the data trumps its quantity. The keys to having high-quality examples are twofold: a high-quality evaluator and a diverse set of prompts.
We have already discussed the importance of having a high-quality evaluator.
How do you obtain diverse prompts? One approach is to deploy your solution early and start to collect feedback. Log both the prompts and feedback about the responses. Add any prompt that causes subpar responses to the set of prompts that you use to create your preference training dataset. Carry out topic modeling and other techniques to identify outlier prompts. Add these outliers to the training dataset. Finally, have a systematic approach to managing bugs and feedback. Add these prompts also to your training dataset.
Iterative training
We described the Optimizing to the Eval pattern as having four steps, the first three of which are about creating a preference dataset and the fourth about invoking a DPO trainer. At the end of training, you are left with a tuned model that produces content in the desired style. DPO training on a hundred examples takes only a few minutes. If your evaluation method is automated (i.e., doesn’t involve human labeling), all four steps of the pattern can be done quickly.
If you can perform all four steps quickly, training need not be a once-and-done process. You can repeat the four steps with the newly trained LLM, to create a preference dataset where all the examples are improved over the original. Then, invoke the DPO trainer to obtain an LLM that is an improvement over the one you used to create the content in the examples. Of course, the process will saturate at some point, but such meta-training gives you the ability to maximize the quality that an LLM of that size is capable of achieving at a very reasonable cost.
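A sketch of that iterative loop is shown below; build_preference_dataset is an assumed wrapper around the dataset-creation steps from the Example section, and the number of rounds is arbitrary.

from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_id = "Qwen/Qwen2-0.5B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)

for round_num in range(3):                                 # stop earlier if quality saturates
    # Generate pairs with the current model, judge them, and build the preference dataset
    train_dataset = build_preference_dataset(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)
    training_args = DPOConfig(output_dir=f"ClassifiedAds-DPO-round{round_num}")
    trainer = DPOTrainer(model=model, args=training_args,
                         processing_class=tokenizer, train_dataset=train_dataset)
    trainer.train()
    trainer.save_model(training_args.output_dir)
    model_id = training_args.output_dir                    # the next round starts from the tuned model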
You will need a larger library of examples in order to do multiple training runs over your prompt library. You could consider increasing the number of iterations once you have a deployed system and have been logging user prompts for a while. Another way to get a larger library of examples is to train the LLM on prompts associated with many different problems. The ability to get substantial improvements using iterative DPO on synthetic data corresponding to easily verifiable problems was one of the key aha moments behind DeepSeek. You can consider the same approach if you have a number of easily verifiable problems and can use them to expand the training dataset.
The important ingredients to set up a continuous improvement cycle are a fast and high-quality evaluator, a systematic approach to collect and manage prompts, and training until saturation is reached.
Next steps
Hannes Hapke and I are writing an O’Reilly book on GenAI Design Patterns. This is Pattern #5 in the book. We would love feedback on the content itself, but also on the structure of the pattern discussion.