Anthropic Agentic Systems - #5. Evaluator-Optimizer
Anthropic Agentic Systems: A Five-Part Exploration is sponsored by Agent.ai - Discover, connect with and hire AI agents to do useful things.
The world of GenAI has seen big step changes since ChatGPT was released in November of 2022 (yes, it's just two years old). 2023 was the year of Chatbots, Copilots, and Assistants. 2024 was the year that Agents and Agentic Systems broke onto the scene. 2025 is the year that Reasoning models are becoming the norm, and the term LRM (Large Reasoning Model) is being thrown around.
In AI, "LRM" stands for "Large Reasoning Model," referring to a type of artificial intelligence model that is specifically designed to perform complex reasoning tasks, going beyond simple text generation to analyze situations, deduce logic, and make informed decisions, mimicking human-like thinking abilities more closely than traditional language models.?
Just a short while ago, "simple" text, image, and audio generation, aka Generative AI, was mind-blowing. That it could generate a Valentine's Day haiku in 2023 was cause for wonder; now we take it for granted and call it simple. In fact, LLMs are now "traditional" and not cool anymore!
In fact, OpenAI announced their roadmap recently, and the key takeaway was that Reasoning Models (what they call Simulated Reasoning, or SR, models) will not be separate and exclusive but part of the core product, as they simplify to one offering.
After that, GPT-5 will be a system that brings together features from across OpenAI's current AI model lineup, including conventional AI models, SR models, and specialized models that do tasks like web search and research. "In both ChatGPT and our API, we will release GPT-5 as a system that integrates a lot of our technology, including o3."
That's a long preamble to introduce our 5th pattern in Anthropic's 5 patterns of Agentic Systems, but this pattern gets into what is happening under the covers with these Reasoning models. To recap, the 5 patterns are -
1. Prompt Chaining
2. Routing
3. Parallelization
4. Orchestrator-Workers
5. Evaluator-Optimizer
We are going to talk about the Evaluator-Optimizer pattern in this article, and hopefully it provides some insight into how reasoning models are able to review their output and then iterate on it to make it better. In fact, by combining Orchestrator-Workers and Evaluator-Optimizer, one can build one's own Chain of Thought reasoning.
In the evaluator-optimizer workflow, one LLM call generates a response while another provides evaluation and feedback in a loop. This workflow is particularly effective when we have clear evaluation criteria, and when iterative refinement provides measurable value.
As you can see in the visual above, we have an LLM that does some work and generates an output, which is then passed to an Evaluator. If the output is acceptable, the workflow moves on; if not, it is sent back with feedback, and a new output is generated and evaluated again.
There are 2 important pieces to the evaluator: it needs good evaluation criteria, and it needs to give clear, actionable feedback that the Regenerate step can use to produce a better output the second time around.
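To make that second piece concrete, here is a minimal sketch in Python (not the article's actual implementation; `call_llm` is a hypothetical stand-in for whatever LLM API you use) of how the evaluator's feedback can be folded into the regeneration step:

```python
# Minimal sketch of feedback-driven regeneration. call_llm is a hypothetical
# placeholder for your LLM provider's API; replace it with a real SDK call.

def call_llm(prompt: str) -> str:
    raise NotImplementedError("wire this up to your LLM provider")

def regenerate(task_prompt: str, previous_output: str, feedback: str) -> str:
    """Fold the evaluator's actionable feedback into the next attempt."""
    revision_prompt = (
        f"{task_prompt}\n\n"
        f"Your previous attempt:\n{previous_output}\n\n"
        f"Reviewer feedback:\n{feedback}\n\n"
        "Rewrite the output, addressing every point in the feedback."
    )
    return call_llm(revision_prompt)
```

The key design choice is that the regeneration prompt carries both the previous attempt and the feedback, so the model revises rather than starting from scratch.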
To showcase this pattern in action, I will keep with the Superhero Bio theme: an agent in Agent.ai that takes a LinkedIn profile and generates a superhero bio. The one difference here is that the generated bio is passed to an evaluator step that rates it and generates feedback; if it is not good enough, it is passed back and iterated on. Here is a visual of this in action.
You can see the loop that forms the evaluator-optimizer. It is implemented as a For loop in Agent.ai with an embedded IF check on the rating that the evaluator generates. We stay in this loop (up to a max of 3 times) until the output is rated highly (>= 4), and once that is achieved, we exit the loop.
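In plain Python terms, that loop looks roughly like this (a sketch of the logic just described, not the Agent.ai builder itself; `generate_bio`, `evaluate_bio`, and `regenerate_bio` are hypothetical helpers that each wrap one LLM call):

```python
# Sketch of the For loop with an embedded IF check on the evaluator's rating.
# The three helpers below are hypothetical; each wraps a single LLM call.

def generate_bio(profile: str) -> str: ...
def evaluate_bio(bio: str) -> dict: ...  # returns {"rating": ..., "feedback": ...}
def regenerate_bio(profile: str, bio: str, feedback: str) -> str: ...

MAX_ATTEMPTS = 3      # stay in the loop up to a max of 3 times
RATING_THRESHOLD = 4  # exit once the evaluator rates the bio >= 4

def run_evaluator_optimizer(profile: str) -> str:
    bio = generate_bio(profile)                    # optimizer: first draft
    for _ in range(MAX_ATTEMPTS):
        review = evaluate_bio(bio)                 # evaluator: rating + feedback
        if int(review["rating"]) >= RATING_THRESHOLD:
            break                                  # acceptable: exit the loop
        bio = regenerate_bio(profile, bio, review["feedback"])
    return bio
```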
Here is how the agent is implemented -
In the above agent, you can see each of these steps laid out.
What does the final output look like? (This is a very corny example, but I wanted something silly that people can relate to.)
Here is the first attempt at generating the bio
Here is the feedback and rating (3)
Here is the prompt I wrote to evaluate the bio. As you can see, this is a very simple evaluator; to make it really good, one would give it much more specific instructions on what to evaluate.
You are a Marvel Comics superhero expert with a deep understanding of what makes a legendary origin story. You have been given a Superhero Biography, generated from a person's LinkedIn profile, and your mission is to critically evaluate it.
Your review should be extremely detailed and honest—judging whether the bio captures the excitement, depth, and grandeur worthy of a true Marvel hero. Rate the bio on a scale of 1-5:
5: A masterpiece—Stan Lee himself would be proud!
1: A disaster—this needs a complete overhaul.
In addition to the rating, provide constructive, specific and actionable feedback that helps refine and elevate the bio, making it more compelling. Do not hesitate to ask for a rewrite if that is needed.
Generate a JSON object called "review" with the following attributes. Return just the JSON object and nothing else:
"rating": The score (1-5) based on quality.
"feedback": Specific and actionable advice for improving the bio.
Example JSON Output:
"review": {
"rating": "the rating",
"feedback": "the feedback"
}
Now, analyze the following Superhero Biography and generate your review:
Superhero Bio:
{{out_bio}}
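If you were wiring this up yourself rather than in Agent.ai, the {{out_bio}} template variable would be substituted with the generated bio before the prompt is sent, and the response parsed as JSON. Here is a rough Python sketch (again with a hypothetical `call_llm` helper, and the prompt abbreviated):

```python
import json

def call_llm(prompt: str) -> str:
    raise NotImplementedError("wire this up to your LLM provider")

# Abbreviated version of the evaluator prompt shown above; {out_bio} plays
# the role of Agent.ai's {{out_bio}} template variable.
EVALUATOR_PROMPT = (
    "You are a Marvel Comics superhero expert... Rate the bio on a scale of 1-5. "
    'Generate a JSON object called "review". Return just the JSON object.\n\n'
    "Superhero Bio:\n{out_bio}"
)

def evaluate_bio(bio: str) -> dict:
    prompt = EVALUATOR_PROMPT.format(out_bio=bio)
    raw = call_llm(prompt)
    # The prompt asks for just the JSON object, so the response should parse
    # directly; real code would add error handling for malformed output.
    return json.loads(raw)["review"]
```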
Here is the final bio that was accepted by the evaluator. It is much more detailed, and the story is more built out. It will not win any prizes, but as you all know, the quality of the output is based on the quality of the prompting.
While LRMs like OpenAI's o1 pro and o3, DeepSeek R1, and Gemini Deep Research are all here to stay and will be leveraged widely, they are black boxes when it comes to the reasoning they do and thus the output they generate. They will keep getting better, but I think there will be a world where we have these 'autonomous' LRMs AND custom agents, where we control the data that is accessed, the criteria for evaluation, and the output that is generated. So learning these patterns will be critical for agent builders. At a minimum, it will help you prompt the reasoning models in a better way, as you understand what makes them tick.
So go sign up for Agent.ai and start building some agents and play around with these patterns.