Optimizing to the Eval: GenAI Design Pattern #5
Hannes Hapke and I are writing an O’Reilly book on GenAI Design Patterns. This is Pattern #5 in the book. We would love feedback both on the content and on the structured way (problem, solution, example, considerations) in which we have laid out the pattern.
Optimizing to the Eval is a pattern that uses preference tuning to produce optimally performing content. Unlike optimization methods that attempt to approximate an independent source of truth, this pattern "hacks" the model so that it creates content that the evaluation method will consider great.
The name is meant to evoke the idea of "teaching to the test". In spite of that negative connotation, the pattern is quite powerful: if the evaluation is robust, it is not a problem that you have optimized to it.
Problem
Content is created to serve some purpose, whether it is to educate, to entertain, or to drive sales. So, when you create content, you might want to ensure that it is in the style that best serves that purpose.
Traditional A/B testing is a simple approach to optimizing style that works in many situations. For example, suppose you have a hypothesis that the length of paragraphs matters for educational content and that paragraphs of 3–4 sentences are best. You can have a GenAI model create content in Style A, in which paragraphs are of varied length, and in Style B, in which paragraphs are 3–4 sentences. One way to do this is for the prompt that generates Style B to include the instruction to use paragraphs of 3–4 sentences, while the prompt that generates Style A has no such directive. Another way is for all content to be generated the same way, but for Style B, you have an LLM rewrite the content into paragraphs of 3–4 sentences each (the latter approach is shown in Figure 2-x).
Then, you split your audience and show Style A to some learners and Style B to others. Measure which set of learners learns the content faster or retains it longer, or whatever outcome you want to optimize for. Once you have a statistically significant difference, you can declare a winner. From then on, you can use the prompt or process corresponding to the style that "wins".
Even if multiple factors affect the learning outcome, that is not a problem: you can use a multi-armed bandit to test all the possible variations efficiently.
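As an aside, here is a minimal epsilon-greedy bandit sketch for allocating traffic among style variants; the variant names and the reward signal (say, 1 if the learner passes a follow-up quiz) are assumptions for illustration.

import random
from collections import defaultdict

# Style variants under test and the fraction of traffic reserved for exploration
VARIANTS = ["varied_paragraphs", "3_to_4_sentence_paragraphs"]
EPSILON = 0.1

counts = defaultdict(int)     # number of times each variant has been shown
rewards = defaultdict(float)  # cumulative reward (e.g., quiz passes) per variant

def choose_variant() -> str:
    # Explore with probability EPSILON, otherwise exploit the best-performing variant so far
    if random.random() < EPSILON or not counts:
        return random.choice(VARIANTS)
    return max(VARIANTS, key=lambda v: rewards[v] / counts[v] if counts[v] else 0.0)

def record_outcome(variant: str, reward: float) -> None:
    counts[variant] += 1
    rewards[variant] += reward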
The problem with using the traditional A/B testing approach to optimize the style of content arises when you don’t know which factors matter. When you don’t have a hypothesis about the set of factors to test, it is impossible to use traditional A/B tests. The problem is threefold:
Optimizing to the Eval is a design pattern that provides an approach to generate content in the style that offers the best performance even when you don’t know the factors that go into that “winning” style.
Solution
Style optimization with this pattern is sheer jujitsu. In Optimizing to the Eval, you reframe or redefine the three problems with the traditional A/B testing approach to make them go away:
The way we resolve the third problem is what gives the pattern its name: we decide that we don’t care why one piece of content is better than another, whether it is paragraph length, tone, or anything else. Instead, we simply do whatever we need to do to make our content "win". Since we are creating the content using an LLM, what we do is tune that LLM’s weights so that it produces the type of content that "wins"; we can do this by using the dataset of winners and losers to do preference tuning of the LLM. Then, to reliably create content that matches the style of the winning entries, we deploy and use the preference-tuned LLM going forward.
The steps of the solution are as follows (see Figure 2-x):
1. Generate pairs of content from essentially the same prompt.
2. Compare the two content pieces and declare one of them the winner.
3. Create a training dataset of prompt, chosen, and rejected examples.
4. Do preference tuning of the LLM on that dataset.
Let’s look at each of these steps.
Step 1: Generate Pairs of Content
There are several ways to generate two pieces of content in different styles from the same prompt: repeated generation, generation settings, and prompt rewriting.
Repeated generation
To generate pairs of content from the same prompt, you can use the exact same prompt and generate two instances of the output. Because LLM generation is stochastic, the two pieces will have some natural variations as long as you make sure to turn off any caching and ensure that you are not doing greedy sampling of the output (i.e. temperature > 0 and top_k > 1). That is enough to obtain two pieces of content that have some variations in style.
To illustrate this, we sent the following prompt to Mistral-7B-Instruct-v0.2 twice:
Where does the term “knee-jerk reaction” come from?
The first time, the response that came back started with:
The term “knee-jerk reaction” refers to an immediate, often unreflective response to a stimulus. It comes from the medical procedure used to test the reflexes in the knee.
The second time, the response started with:
The term “knee-jerk reaction” comes from the medical reflex test where the knee jerks up when the patellar tendon is tapped. This reflex is an automatic, unreflective response to a stimulus.
This prompt asks for a factual answer, and so the responses should vary the least. Yet, the responses vary in style quite dramatically. The first seems to be targeting a layperson while the second is addressed to a person who knows what the patellar tendon is.
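In code, repeated generation is simply a matter of invoking the same text-generation pipeline twice with sampling enabled. The sketch below assumes pipe and input_message objects of the same form used in the next variant.

paired_content = []
for _ in range(2):
    # Sampling (rather than greedy decoding) is what makes the two outputs differ
    response = pipe(input_message,
                    do_sample=True,
                    temperature=0.7,
                    top_k=50,
                    max_new_tokens=256)
    # The pipeline returns the full chat history; the last message is the model's reply
    paired_content.append(response[0]['generated_text'][-1]['content'])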
Change generation settings
You can change settings such as the temperature or top-p to get two pieces of content for the same prompt:
import random

paired_content = []
for _ in range(2):
    # Use a different randomly chosen temperature for each of the two generations
    response = pipe(input_message,
                    do_sample=True,
                    temperature=random.uniform(0.2, 0.9),
    )
    paired_content.append(response[0]['generated_text'][-1]['content'])
Prompt rewriting
You can have an LLM reword the original prompt in arbitrary ways such as:
Rephrase this sentence at a grade-school reading level: “Where does the term “knee-jerk reaction” come from?”
When we did this with Mistral-7B-Instruct-v0.2, we got the alternate prompt:
What’s the origin of the phrase ‘knee-jerk reaction’?
Note that you are rewriting the prompt, not rephrasing the content at a grade-school reading level! Other prompt-rewriting ideas that do not change the intent of the prompt include asking an LLM to make the prompt more concise or more verbose.
A powerful prompt rewriting technique is to ask an LLM to add an appropriate follow-on question to the prompt. For example, when we asked Mistral:
What is an appropriate follow-on question to this query: “Where does the term “knee-jerk reaction” come from?”
it suggested:
Can the origin of the phrase ‘knee-jerk reaction’ be traced back to a specific medical procedure or historical event?
We then modified the original prompt by appending the follow-on question. The response to this modified prompt is in a completely different style:
The term “knee-jerk reaction” does have a medical origin, but it doesn’t directly refer to a specific historical event or medical procedure. The term comes from the Medico-Psychological Association’s demonstration of the reflex at the 1878 Chicago World’s Fair.
With the two versions of the prompt, the original and the rewritten, you can generate two pieces of content.
You can randomly select amongst repeated generation, generation settings, and prompt rewriting, or combine them to give yourself many variations. At the end of Step 1, you have two pieces of content that have been generated to essentially the same prompt.
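Putting Step 1 together, a hedged sketch of a helper that randomly selects among the three techniques might look like the following. The pipe object is the text-generation pipeline used earlier; the rewrite instruction wording is an assumption for illustration.

import random

def generate(prompt: str, temperature: float = 0.7) -> str:
    messages = [{"role": "user", "content": prompt}]
    result = pipe(messages, do_sample=True, temperature=temperature, max_new_tokens=256)
    return result[0]['generated_text'][-1]['content']

def rewrite_prompt(prompt: str) -> str:
    # Ask the LLM to reword the prompt without changing its intent
    return generate(f"Rephrase this request without changing its intent: {prompt}")

def generate_pair(prompt: str) -> tuple[str, str]:
    technique = random.choice(["repeated", "settings", "rewrite"])
    if technique == "settings":
        return (generate(prompt, temperature=random.uniform(0.2, 0.9)),
                generate(prompt, temperature=random.uniform(0.2, 0.9)))
    if technique == "rewrite":
        return generate(prompt), generate(rewrite_prompt(prompt))
    return generate(prompt), generate(prompt)   # repeated generation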
Step 2: Compare the two content pieces
In Step 2, you need to determine which of those two pieces of content is better. You could do this in one of three ways: based on human labeling, using an evaluator, or choosing the item that provides the better outcome.
Variant 1: Human Labeling
Show the two pieces of content to a human expert and ask them to choose the better one. If you are in an application area where experts tend to disagree, you can show the pieces of content to a panel of experts and choose the majority opinion, or discard pairs where the decision is not clear-cut.
In some situations, you can even show the two pieces of content side-by-side as drafts and see which one your end-users actually select.
Variant 2: Evaluator
If your industry has metrics for content quality, you can score the two pieces of content on a rubric that consists of one or a combination of those metrics. For example, if you are generating SQL statements, your evaluator might consist of a sandbox that runs those SQL statements on an in-memory database and records the output and the time taken. You might then choose the statement that returns the correct result. If both statements return the correct result, choose the one that is more concise. If both statements are the same length, choose the one that runs faster.
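As a sketch of the SQL evaluator just described, assuming an in-memory SQLite sandbox and that you know the expected result rows for the query (the setup_sql and expected_rows parameters are assumptions for illustration):

import sqlite3
import time

def run_in_sandbox(sql: str, setup_sql: str):
    # Execute the statement against a throwaway in-memory database and time it
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(setup_sql)
        start = time.perf_counter()
        rows = conn.execute(sql).fetchall()
        return rows, time.perf_counter() - start
    except sqlite3.Error:
        return None, float("inf")   # a statement that fails to run never wins
    finally:
        conn.close()

def better_sql(sql_a: str, sql_b: str, setup_sql: str, expected_rows) -> str:
    rows_a, time_a = run_in_sandbox(sql_a, setup_sql)
    rows_b, time_b = run_in_sandbox(sql_b, setup_sql)
    correct_a, correct_b = (rows_a == expected_rows), (rows_b == expected_rows)
    if correct_a != correct_b:
        return sql_a if correct_a else sql_b                  # correctness first
    if len(sql_a) != len(sql_b):
        return sql_a if len(sql_a) < len(sql_b) else sql_b    # then conciseness
    return sql_a if time_a <= time_b else sql_b               # then speed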
If your industry has best practices, you might evaluate the content on those best practices. For example, marketing content is commonly evaluated on the 4Ps and 3Cs framework. You can often use a frontier LLM to perform the comparison because it might already have encountered the framework in its training. If not, you can add a description of the framework in the prompt context. This idea is called LLM-as-a-judge.
For example, we used Gemini 2.0 Flash to rate a piece of marketing content from Amazon using the prompt:
You are a marketing expert who is rating marketing content in the range of 0–7 where each point corresponds to one of the items in the 4Ps and 3Cs marketing framework. Read the following article and give it a numeric score, explaining your reasoning
The result was a score of 6 because the article “does a good job of describing the new features of the Kindle family of e-readers, but it could be improved by including more information about the price and availability of the devices.”
Instead of having the LLM score each piece of content separately and comparing the scores, you can pass in both pieces of content and ask it to determine which one is better based on the scoring rubric. This is the approach that we will follow in the Example section below.
Variant 3: Better Outcome
A third option is to use the actual outcome associated with each piece of content to determine the winner of a paired comparison. There are a couple of ways that you might be able to do this, depending on the problem context:
At the end of Step 2, you will have a way to compare two pieces of content that were generated in response to the same/similar prompt and declare one of them the winner.
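For example, if the outcome you track is clicks, a minimal sketch of an outcome-based comparison could look like this; the impression and click counts are assumed to come from your serving logs.

def better_by_outcome(stats_a: dict, stats_b: dict, min_impressions: int = 100):
    # Each stats dict holds {'impressions': int, 'clicks': int} for one content piece
    if min(stats_a["impressions"], stats_b["impressions"]) < min_impressions:
        return None                      # not enough evidence yet; skip this pair
    ctr_a = stats_a["clicks"] / stats_a["impressions"]
    ctr_b = stats_b["clicks"] / stats_b["impressions"]
    if ctr_a == ctr_b:
        return None                      # discard ties rather than guess
    return "a" if ctr_a > ctr_b else "b"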
<helpful-hint>
Step 2, the evaluation step, is by far the most important step in the Optimizing to the Eval pattern. Therefore, it is worth being deliberate about getting it right:
Experiment with different ideas and choose the one that works best. You may have to train and deploy the model in order to determine this. Don’t be afraid to come back and revisit this decision during the lifecycle of your project.
</helpful-hint>
It’s worth reflecting that although "teaching to the test" has a negative connotation, it is not a problematic practice if the test in question is robust and reflects reality. If your evaluation mechanism is a panel of experts, is it really a problem if the LLM produces content that those experts consider good? You hope that the judgement of these experts generalizes. What you do need to be careful about is that the metrics or outcomes you are optimizing are not too narrow. You don’t want to be in the position where the LLM creates content in styles that game your metrics or optimize toward short-term goals. For example, if the outcome you optimize for is engagement time, the LLM might produce hard-to-understand content because it takes longer to read!
Step 3: Create a training dataset
Create a training dataset where each example consists of three attributes: prompt (use the original prompt if you did prompt rewriting), chosen, and rejected. For example:
{
  "prompt": "Where does the term \"knee-jerk reaction\" come from?",
  "chosen": "The term \"knee-jerk reaction\" refers to an immediate, often unreflective response to a stimulus. It comes from the medical procedure used to test the reflexes in the knee.",
  "rejected": "The term \"knee-jerk reaction\" comes from the medical reflex test where the knee jerks up when the patellar tendon is tapped. This reflex is an automatic, unreflective response to a stimulus."
}
The format above is a standard format for datasets used for preference tuning, which is what we will do next. If your preference tuning framework expects a different format, you’d create your training dataset in that format of course.
You might also split this dataset into training and evaluation splits.
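For example, with the Hugging Face datasets library, the split can be a one-liner; the file name here matches the one used in the Example section later.

from datasets import load_dataset

dataset = load_dataset("json", data_files="ad_preference_dataset.jsonl", split="train")
splits = dataset.train_test_split(test_size=0.1, seed=42)   # hold out 10% for evaluation
train_dataset, eval_dataset = splits["train"], splits["test"]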
Step 4: Do Preference Tuning
Preference tuning is a way of training an LLM on a dataset of paired items so that it outputs responses closer to the chosen items than to the rejected ones. The original way to do preference tuning, introduced in 2017, was to use reinforcement learning. However, we recommend that you employ the Direct Preference Optimization (DPO) approach described in the 2023 paper Direct Preference Optimization: Your Language Model is Secretly a Reward Model by Rafailov et al. The DPO approach is much faster, for reasons that are beyond the scope of this book.
An implementation of DPO is provided by the Transformer Reinforcement Learning (TRL) package. To train a model using DPO, first instantiate the model and a tokenizer for it:
MODEL_ID = "Qwen/Qwen2–0.5B-Instruct"
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
Then, launch the training using:
from trl import DPOConfig, DPOTrainer

training_args = DPOConfig(output_dir="Qwen-DPO")
trainer = DPOTrainer(model=model, args=training_args,
                     processing_class=tokenizer, train_dataset=train_dataset)
trainer.train()
Once the training is complete, save the model:
trainer.save_model(training_args.output_dir)
The trained model can be deployed or loaded into a Transformer pipeline from local storage and used in lieu of the original model. This model will produce output in the style that it learned from the set of chosen and rejected responses.
Example
Let’s say that you want to use an LLM to generate classified ads. Let’s also assume that, for latency reasons, you want to use a small language model (SLM). Here, we’ll use the 0.5B parameter Qwen2 model. (The complete code is in the book’s GitHub repository in examples/style_optimization.)
Zero-shot doesn’t work
You can ask the model to generate an ad using the following prompt:
You are a resident who is listing a used item for sale on a neighborhood online group. An ad for used items in this neighborhood group is 1–3 sentences.
Write an ad to sell a {item} priced at {price}
When we used the prompt to write an ad for a $300 used bike:
create_classified_ad("3-year old Specialized bike", "$300")
we got an ad that started with:
Welcome to the Adventure with our Specialized Bike!
Specialized bikes have been a favorite among mountain bikers for years, and they’re here again! Our latest model, the Specialized 3-Year Old, offers unparalleled performance and durability. This bike has been designed with safety and comfort in mind, making it perfect for beginners or seasoned riders alike.
This is quite obviously a bad ad. It sounds as if the ad is for a new bike from the manufacturer whereas we wanted a for-sale-by-owner ad to sell a 3-year old bike.
The model is too small to generate good ads. But as we will see, we can use the Optimizing to the Eval pattern to train this model to produce good ad copy.
Generate pairs of ads
The first step is to generate pairs of ads. Let’s use the Generation Settings approach and choose a random temperature each time.
Here are two ads we got for the $300 bike when we did this:
Hey there! We’re looking for someone who’s ready to take their riding game to the next level with our 3-year-old Specialized bike. This bike is a great investment that will keep your kids engaged and safe all year round. It features durable components, a comfortable seat, and a powerful frame that can handle any terrain. Plus, it comes with a lifetime warranty, so you can be sure you’re getting a high-quality product. So why wait? Get yours today and start enjoying the thrill of riding on wheels!
and
Looking for a unique and stylish way to enjoy your daily commute? Look no further than the 3-year-old Specialized bike! This bike is perfect for those who value style over speed, and it’s priced at just $300. With its durable frame and high-quality components, this bike will last you years with minimal maintenance. Plus, it comes with a lifetime warranty, so you can rest easy knowing that you’re getting a quality product that won’t let you down. Don’t miss out on this opportunity to upgrade your bike experience today!
Compare two ads using LLM-as-a-judge
To compare the two ads, you can use the LLM-as-a-judge approach. Because evaluation is the most important step in this pattern, it is important to use a good evaluator. Let’s use a frontier LLM and have it output which ad is better, and the reason:
You are a professor of advertising at a business school.
Compare the two ads below for the same item being sold in a neighborhood marketplace and determine whether ad_a is better than ad_b
Also explain your reasoning.
The main criteria to compare the ads include:
* Is it clear what’s being sold? Age, brand, price, and condition are important.
* Does it target the most relevant audience for the item? Is the text persuasive to that audience?
* Is it concise and easy to read? An ideal ad is at most 3 sentences.
* Does it include contact information? Ideally, the ad specifies the preferred means of communication.
* Is the ad truthful? Remember that the item is likely used and not being sold by the manufacturer.
ad_a:
{ad_a}
ad_b:
{ad_b}
By optimizing to this eval, you are saying that you don’t care what style the ads are written in, as long as the ads are clear, relevant to the audience, short, actionable, and truthful. If writing the ads in rhyme is what makes the bike sell, you are perfectly okay with the LLM choosing to output rhymes.
Asking for the reason forces the LLM to do the comparison deliberately (see the Chain-of-Thought design pattern covered later in this book). The output can be of this form:
from dataclasses import dataclass

@dataclass
class AdsComparison:
    ad_a_is_better_than_ad_b: bool
    reasoning: str
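The score_ad helper used later in create_preference_example can be implemented along the following lines. This is a sketch that assumes the google-genai client; JUDGE_PROMPT stands in for the comparison prompt shown above, with {ad_a} and {ad_b} placeholders.

import json
from google import genai

client = genai.Client()   # expects an API key in the environment

def score_ad(ad_a: str, ad_b: str) -> AdsComparison:
    # Fill the judge prompt (an assumed template for the prompt shown above) with the two ads
    prompt = JUDGE_PROMPT.format(ad_a=ad_a, ad_b=ad_b) + (
        "\nRespond as JSON with the fields "
        "'ad_a_is_better_than_ad_b' (boolean) and 'reasoning' (string)."
    )
    response = client.models.generate_content(
        model="gemini-2.0-flash",
        contents=prompt,
        config={"response_mime_type": "application/json"},
    )
    return AdsComparison(**json.loads(response.text))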
The result, when we did the comparison using Gemini 2.0 Flash, for the two ads whose text appears in the preceding section was:
AdsComparison
ad_a_is_better_than_ad_b=False
reasoning="Both ads have issues, but ad_b is slightly better because it includes the price. Neither ad includes contact information. Both ads make the mistake of claiming the bike has a lifetime warranty, which is unlikely for a used bike being sold in a neighborhood marketplace. Ad_a is targeted toward children, but ad_b is targeted toward adults. Since the bike is used, it's more likely to be purchased by an adult."
The reasoning and comparison appear correct.
Generate training dataset
You can create a single preference example consisting of the prompt, chosen content, and rejected content by putting together the preceding steps:
def create_preference_example(item: str, price: str) -> dict:
    ad1 = create_classified_ad(item, price)
    ad2 = create_classified_ad(item, price)
    score = score_ad(ad1, ad2)
    preference_example = {
        "prompt": SYSTEM_PROMPT + f"""Write an ad to sell a {item} priced at {price}"""
    }
    if score.ad_a_is_better_than_ad_b:
        preference_example['chosen'] = ad1
        preference_example['rejected'] = ad2
    else:
        preference_example['chosen'] = ad2
        preference_example['rejected'] = ad1
    preference_example['score_reason'] = score.reasoning
    return preference_example
Call this in a loop with a random selection from a set of items to create a jsonl file:
items_for_sale = [
    ("3-year old Specialized road bike", "$300"),
    ("Amazing Spider-Man 361", "$200"),
    …
    ("Set of 2 vintage Pyrex coffee mugs (in original box)", "$20"),
    ("Hand-carved wooden serving spoon", "$15")
]

def write_jsonl(num_examples: int, filename: str):
    examples = []
    for iter in range(num_examples):
        print(iter, end=" … ")
        item, price = random.choice(items_for_sale)
        example = create_preference_example(item, price)
        examples.append(example)
    with open(filename, "w") as ofp:
        for example in examples:
            json.dump(example, ofp)
            ofp.write('\n')
This file can be used for preference tuning.
Do preference tuning
To perform preference tuning, first load the dataset:
from datasets import load_dataset

train_dataset = load_dataset('json', data_files="ad_preference_dataset.jsonl", split='train')
Then, create the model you wish to tune:
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
Finally, launch the DPOTrainer with this dataset and the model:
trainer = DPOTrainer(model=model, …, train_dataset=train_dataset)
trainer.train()
The training process for 100 examples, when we did it, took about 3 minutes on a machine with 8 vCPUs and an L4 GPU.
Inference
You can create a pipeline with the locally saved, tuned model just as if the model were on the Hugging Face Hub:
from transformers import pipeline

pipe = pipeline(
    task="text-generation",
    model="ClassifiedAds-DPO",   # local directory where the tuned model was saved
    use_fast=True,
)
# The default return_full_text=True means the pipeline returns the full chat history;
# the code below picks out the last (assistant) message.
Then, create new content using the same prompt as before:
def create_classified_ad(item: str, price: str) -> str:
    system_prompt = SYSTEM_PROMPT
    user_prompt = f"""
Write an ad to sell a {item} priced at {price}
"""
    input_message = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt}
    ]
    results = pipe(input_message,
                   max_new_tokens=256,
                   pad_token_id=pipe.tokenizer.eos_token_id
    )
    return results[0]['generated_text'][-1]['content'].strip()
Trying out the preference-tuned model for a book:
create_classified_ad("book Pachinko by Min Jin Lee", "$5")
yields:
Pachinko, the classic tale of a man’s obsession with gambling and his love for a woman he meets while playing a pachinko game. A rare edition priced at $5. For more information or to arrange pickup, please contact [Your Name] at [Your Phone Number]. Thank you!
Zero-shot with the tuned model works! The tuned model is outputting clear, concise, and persuasive ads because that’s what the evaluation mechanism preferred.
Considerations
Optimizing to the Eval uses preference tuning to optimize a GenAI model to produce content in the style that has the best performance as determined by an evaluator.
Choosing between the variants
There are several variants of the pattern; they differ primarily in how the preference tuning dataset is created.
In Step 1, you can choose how to generate two pieces of content in response to the same prompt. While repeated generation can work, changing the settings or rewriting the prompt will typically yield content that is more meaningfully different and covers more of the style space.
In Step 2, you can choose how to perform the comparison of the two pieces of content to determine which one is better. The approach you take often depends on how the content will be used:
The evaluation approach you take could be based on what you already have available. For example, if you have already defined metrics or a rubric, you could use them. If the metrics and rubrics are subjective (e.g. "is readable") or require content extraction (e.g. "includes contact info"), you can use LLM-as-a-judge. If you have a user interface that allows you to present a set of choices to users, then populate those choices with content pairs. If all queries are logged, and there is no reason (such as personalization) that responses to the same query will vary in substance, consider using the matched-pairs approach.
It is unlikely that you will have only one option available to you. Therefore, you may have to try multiple approaches and see which one gives you the best results.
In-distribution requirement
There are several places in this pattern where an LLM is being used:
1. An LLM that rewrites prompts (Step 1).
2. An LLM that generates the candidate content (Step 1).
3. Optionally, an LLM-as-a-judge that evaluates the content (Step 2).
4. The LLM that is preference tuned and then deployed (Step 4).
The content that is generated by the LLM (#2 in the list) has to be text that the LLM being trained (#4 in the list) can produce. No amount of preference tuning can teach the LLM new facts or new tokens, for example. There are two ways to ensure this in-distribution requirement, the easy way and the hard way:
You will need the latter approach if the small model doesn’t have the capability to produce content that is adequate. Note that the content need not be good: the ads generated by the untuned Qwen2 model were not good, but they were ads, and they could be evaluated using our evaluator. That was sufficient to tune Qwen2 to produce good ads.
Extension to?images
Although we have discussed this pattern with just text examples, preference tuning can be carried out on images as well; for example, you can use Diffusion-DPO to tune diffusion image models. Therefore, you can apply Optimizing to the Eval to create images in your preferred style.
For example, to train the runwayml Stable Diffusion model, create a pipeline:
import torch
from diffusers import StableDiffusionPipeline

pretrained_model_name = "runwayml/stable-diffusion-v1-5"
pipe = StableDiffusionPipeline.from_pretrained(
    pretrained_model_name,
    torch_dtype=torch.float16).to('cuda')
Use repeated generation (i.e. just run the prompt twice) to generate two images that differ in unknown ways:
for iter in range(2):
    …
    img = pipe(prompt=prompt, generator=generator, guidance_scale=7.5).images[0]
    …
Now comes the key step?—?evaluation. How is one image better than another? Assuming that the images are being generated for a newsletter, you could use the Outcome method to select the image in a pair that drives more article clicks.
With the dataset of images created in this manner, you can launch the training script to create a preference-tuned Stable Diffusion model:
accelerate launch --mixed_precision="fp16" train.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --dataset_name=$DATASET_NAME \
  …
  --output_dir="tmp-sd15"
The output of this model is now in the style that would drive more clicks.
Continuous improvement
You can improve quality by having a systematic approach to data collection, and by iterating on training.
Diverse prompts
DPO needs only a few high-quality examples. In preference tuning, unlike in traditional deep learning, the quality of the data trumps its quantity. The keys to having high-quality examples are twofold: a high-quality evaluator and a diverse set of prompts.
We have already discussed the importance of having a high-quality evaluator.
How do you obtain diverse prompts? One approach is to deploy your solution early and start to collect feedback. Log both the prompts and feedback about the responses. Add any prompt that causes subpar responses to the set of prompts that you use to create your preference training dataset. Carry out topic modeling and other techniques to identify outlier prompts. Add these outliers to the training dataset. Finally, have a systematic approach to managing bugs and feedback. Add these prompts also to your training dataset.
Iterative training
We described the Optimizing to the Eval pattern as having four steps, the first three of which are about creating a preference dataset and the fourth about invoking a DPO trainer. At the end of training, you are left with a tuned model that produces content in the desired style. DPO training on a hundred examples takes only a few minutes. If your evaluation method is automated (i.e., doesn’t involve human labeling), all four steps of the pattern can be done quickly.
If you can perform all four steps quickly, training need not be a once-and-done process. You can repeat the four steps with the newly trained LLM, to create a preference dataset where all the examples are improved over the original. Then, invoke the DPO trainer to obtain an LLM that is an improvement over the one you used to create the content in the examples. Of course, the process will saturate at some point, but such meta-training gives you the ability to maximize the quality that an LLM of that size is capable of achieving at a very reasonable cost.
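A sketch of that iterative loop is shown below; build_preference_dataset is an assumed wrapper around the dataset-creation steps from the Example section, and the number of rounds is arbitrary.

from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_id = "Qwen/Qwen2-0.5B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)

for round_num in range(3):                                 # stop earlier if quality saturates
    # Generate pairs with the current model, judge them, and build the preference dataset
    train_dataset = build_preference_dataset(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)
    training_args = DPOConfig(output_dir=f"ClassifiedAds-DPO-round{round_num}")
    trainer = DPOTrainer(model=model, args=training_args,
                         processing_class=tokenizer, train_dataset=train_dataset)
    trainer.train()
    trainer.save_model(training_args.output_dir)
    model_id = training_args.output_dir                    # the next round starts from the tuned model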
You will need a larger library of examples in order to do multiple training runs over your prompt library. You could consider increasing the number of iterations once you have a deployed system and have been logging user prompts for a while. Another way to get a larger library of examples is to train the LLM on prompts associated with many different problems. The ability to get substantial improvements using iterative DPO on synthetic data corresponding to easily verifiable problems was one of the key aha moments behind DeepSeek. You can consider the same approach if you have a number of easily verifiable problems and can use them to expand the training dataset.
The important ingredients to set up a continuous improvement cycle are a fast and high-quality evaluator, a systematic approach to collect and manage prompts, and training until saturation is reached.
Next steps
Hannes Hapke and I are writing an O’Reilly book on GenAI Design Patterns. This is Pattern #5 in the book. We would love feedback on the content itself, but also on the structure of the pattern discussion.