o1 has sparked tons of ideas for applying LLMs to reasoning problems in science and math, but one of the most interesting applications IMO is prompt optimization…
TL;DR: Prompt engineering is a black box, even for recent frontier models: slight changes to a prompt can lead to big differences in output. Automatic prompt engineering (i.e., using an LLM to optimize a prompt) is one of the best tools we have for navigating this black box, but it requires an LLM with very good reasoning capabilities. The release of o1, and its ability to leverage increased inference-time compute for better reasoning, unlocks new potential for automatic prompt engineering.
What is automatic prompt optimization? Several papers have been published on using LLMs to propose better / improved prompts; e.g., APE [1] and OPRO [2]. I’m referring to these approaches as automatic prompt optimization techniques. The underlying idea is to use one LLM to refine the prompts that are sent to another LLM.
How does this work? Most papers on automatic prompt optimization follow a similar approach:
1. Construct a “meta prompt” that asks the LLM to write a new prompt based on prior context (i.e., previous prompts and their performance metrics).
2. Generate new prompts with an “optimizer” LLM.
3. Evaluate these prompts using another LLM, producing an objective value / score.
4. Select prompts with the best scores.
5. Repeat steps 1-4 until we can’t find a better prompt.
Notably, the optimizer LLM and the LLM used for evaluation do not need to be the same! We could use o1 as an optimizer that finds better prompts for other LLMs.
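To make this loop concrete, here’s a minimal sketch in Python. The function names (build_meta_prompt, call_optimizer_llm, evaluate_prompt) are placeholders of my own, not APIs from the APE / OPRO papers; you’d wire the two stubs up to your optimizer LLM (e.g., o1) and whichever LLM you’re evaluating prompts for.

```python
# Minimal sketch of an OPRO-style prompt optimization loop (steps 1-5 above).
# Model calls are left as stubs -- replace them with real API calls.

def call_optimizer_llm(meta_prompt: str) -> str:
    """Placeholder: ask the optimizer LLM (e.g., o1) to write a new prompt."""
    raise NotImplementedError

def evaluate_prompt(prompt: str) -> float:
    """Placeholder: run the prompt on a small dev set with the target LLM
    and return a score (e.g., accuracy)."""
    raise NotImplementedError

def build_meta_prompt(history: list[tuple[str, float]]) -> str:
    # Step 1: pack prior prompts and their scores into the meta prompt.
    scored = "\n".join(f"Prompt: {p}\nScore: {s:.3f}" for p, s in history)
    return (
        "You are optimizing prompts for a downstream task.\n"
        "Here are previous prompts and the scores they achieved:\n"
        f"{scored}\n"
        "Write a new prompt that achieves a higher score."
    )

def optimize(seed_prompt: str, num_rounds: int = 10) -> str:
    history = [(seed_prompt, evaluate_prompt(seed_prompt))]
    for _ in range(num_rounds):
        meta_prompt = build_meta_prompt(history)      # step 1
        candidate = call_optimizer_llm(meta_prompt)   # step 2
        score = evaluate_prompt(candidate)            # step 3
        history.append((candidate, score))            # step 4: keep scored prompts in context
    return max(history, key=lambda pair: pair[1])[0]  # return the best prompt found
```

In practice you’d also truncate the history to the top-scoring prompts and stop early once scores plateau (step 5), but the basic structure is just this generate-then-evaluate cycle.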
Practical details. To make this approach work well, we need to include the right information in our meta prompt. In [2], the authors propose including i) a description of the task, ii) few-shot examples from the task, iii) prior prompts, iv) the performance of prior prompts, and v) general constraints for the prompt. Given the correct context, we can generate high-performing prompts pretty easily.
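For illustration, a meta prompt covering those five pieces of information might look roughly like the template below. This is a hypothetical paraphrase, not the exact meta prompt from [2], and the example scores are made up.

```python
# Hypothetical meta-prompt template with the five components from [2]:
# task description, few-shot examples, prior prompts, their performance, and constraints.
META_PROMPT = """\
Task: {task_description}

Example problems from the task:
{few_shot_examples}

Previous prompts and the accuracy they achieved (higher is better):
{scored_prior_prompts}

Write a new prompt that is different from the prompts above and achieves
higher accuracy. Constraints: {constraints}
"""

meta_prompt = META_PROMPT.format(
    task_description="Solve grade-school math word problems.",
    few_shot_examples="Q: Natalia sold clips to 48 of her friends... A: 72",
    scored_prior_prompts=(
        "Prompt: 'Let's think step by step.' -> 0.72\n"    # illustrative score
        "Prompt: 'Break the problem into parts.' -> 0.75"  # illustrative score
    ),
    constraints="Keep the prompt under 30 words.",
)
```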
Does this work? Interestingly, LLMs seem to be very good at inferring new / better prompts from prior context. For both APE and OPRO, the automatic prompt engineering system is able to discover new prompts that outperform those written by humans. Plus, the prompts produced by these systems can reveal interesting tricks / takeaways for how to prompt certain models properly. These takeaways even generalize to other tasks in many cases.
How does this relate to o1? The performance of automatic prompt engineering depends heavily on the optimizer LLM’s reasoning capabilities. This LLM must ingest prior prompts and their objective values, then infer new prompts that will perform well. This is a complex reasoning problem. As such, spending more compute at inference time could lead the optimizer LLM to discover more (and better) patterns for successful prompting.