Agent-as-a-Judge: Evaluate Agents with Agents

Agent-as-a-Judge: Revolutionizing the Evaluation of Agentic Systems

In recent years, the field of artificial intelligence has witnessed remarkable advancements, particularly in the development of multimodal agentic systems. These systems, designed to solve complex real-world problems, have moved from theoretical concepts to practical applications. However, the evaluation methods for these systems have not kept pace with their rapid development. Traditional evaluation techniques focus solely on final outcomes, failing to capture the step-by-step nature of agentic systems, and often require excessive manual labor. To address these challenges, researchers have introduced the Agent-as-a-Judge framework, together with DevAI, a benchmark of 55 realistic AI development tasks annotated with 365 hierarchical user requirements.

The Need for Agent-as-a-Judge

The primary issue with contemporary evaluation techniques is their inability to provide rich feedback during the intermediate stages of task-solving. Agentic systems, much like humans, operate step by step and often use symbolic communication internally to solve problems. Evaluating these systems solely on final outcomes is akin to assessing a student's knowledge through multiple-choice tests alone: an unreliable estimator. Moreover, human evaluation, while reliable, is time-consuming and requires substantial expertise, making it impractical at scale.


Introducing Agent-as-a-Judge

The Agent-as-a-Judge framework leverages agentic systems to evaluate other agentic systems, providing a cost-effective and scalable solution. This framework extends the LLM-as-a-Judge approach by incorporating features that enable intermediate feedback throughout the task-solving process. By mimicking human evaluators, Agent-as-a-Judge offers a more accurate and reliable evaluation method.

Proof-of-Concept and Components


The proof-of-concept for Agent-as-a-Judge consists of eight modular components (a minimal pipeline sketch follows the list):

1. Graph Module: Constructs a graph capturing the entire project structure, including files, modules, and dependencies.

2. Locate Module: Identifies specific folders or files referred to by a requirement.

3. Read Module: Supports reading and understanding multimodal data across various formats, allowing cross-referencing of data streams.

4. Search Module: Provides contextual understanding of code and retrieves relevant snippets.

5. Retrieve Module: Extracts information from long texts, identifying relevant segments.

6. Ask Module: Determines whether a given requirement is satisfied.

7. Memory Module: Stores historical judgment information for building on past evaluations.

8. Planning Module: Plans subsequent actions, strategizing based on the current state and project goals.
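
To make the division of labor concrete, below is a minimal Python sketch of how a few of these modules might be orchestrated for a single requirement. The class and method names are illustrative stand-ins rather than the paper's actual implementation, and the ask step uses a toy heuristic where a real judge would query an LLM.

```python
# Illustrative sketch of an Agent-as-a-Judge pipeline (graph, locate, read,
# ask, memory). Module internals are simplified stand-ins, not the authors'
# implementation.
from dataclasses import dataclass, field


@dataclass
class Judgment:
    requirement: str
    satisfied: bool
    evidence: str


@dataclass
class AgentJudge:
    workspace: str                                 # path to the evaluated agent's output
    memory: list = field(default_factory=list)     # Memory module: past judgments

    def build_graph(self) -> dict:
        """Graph module: map files, modules, and dependencies (stubbed)."""
        return {"files": [f"{self.workspace}/main.py"], "deps": {}}

    def locate(self, requirement: str, graph: dict) -> list:
        """Locate module: pick the files a requirement refers to (stubbed)."""
        return graph["files"]

    def read(self, paths: list) -> str:
        """Read module: load (possibly multimodal) content (stubbed)."""
        return "\n".join(f"<contents of {p}>" for p in paths)

    def ask(self, requirement: str, context: str) -> Judgment:
        """Ask module: decide if the requirement is satisfied.
        A real system would query an LLM here; this is a toy heuristic."""
        satisfied = "save" in context.lower()
        return Judgment(requirement, satisfied, context[:80])

    def judge(self, requirement: str) -> Judgment:
        graph = self.build_graph()
        paths = self.locate(requirement, graph)
        context = self.read(paths)
        verdict = self.ask(requirement, context)
        self.memory.append(verdict)                # reuse past judgments later
        return verdict


if __name__ == "__main__":
    judge = AgentJudge(workspace="./agent_output")
    print(judge.judge("The model predictions are saved to results.csv"))
```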

Judge Shift

Judge Shift measures the deviation from the Human-as-a-Judge consensus results, with lower values indicating closer alignment. Agent-as-a-Judge consistently outperforms LLM-as-a-Judge across tasks, particularly those with task dependencies. For example, in Requirement (I), Agent-as-a-Judge shows a Judge Shift as low as 0.27%, while LLM-as-a-Judge reaches 31.24% for OpenHands. This underscores Agent-as-a-Judge's stability when judging whether task requirements are met. Furthermore, in the gray-box setting, both Agent-as-a-Judge and LLM-as-a-Judge perform even better than in the black-box setting.
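
As a rough illustration, the snippet below computes a judge-shift style quantity as the absolute gap between an AI judge's requirement-satisfaction rate and the Human-as-a-Judge consensus rate. This is an assumed simplification for illustration; the paper's exact formulation may differ.

```python
# Hypothetical sketch of a judge-shift style metric: the absolute gap between
# an AI judge's satisfaction rate and the human consensus rate (toy data).

def satisfaction_rate(verdicts: list[bool]) -> float:
    return sum(verdicts) / len(verdicts)

def judge_shift(ai_verdicts: list[bool], human_consensus: list[bool]) -> float:
    return abs(satisfaction_rate(ai_verdicts) - satisfaction_rate(human_consensus))

# Toy example: three requirements for one task
human = [True, False, False]
agent_judge = [True, False, False]   # matches consensus -> shift 0.0
llm_judge = [True, True, True]       # overestimates     -> shift 0.67
print(judge_shift(agent_judge, human), round(judge_shift(llm_judge, human), 2))
```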

Alignment Rate

The Alignment Rate reflects how closely the AI judges' evaluations align with human consensus across all 365 requirements. It is defined as the percentage of requirement-level evaluations that match the Human-as-a-Judge consensus. Compared to LLM-as-a-Judge, Agent-as-a-Judge consistently achieves a higher Alignment Rate, closely matching human judgments. For example, when evaluating OpenHands, Agent-as-a-Judge reaches 92.07% (gray-box) and 90.44% (black-box), surpassing LLM-as-a-Judge's 70.76% and 60.38% in the same settings. This shows that Agent-as-a-Judge produces more accurate, human-aligned evaluations, especially in complex scenarios.
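
Since the Alignment Rate is simply the percentage of requirement-level verdicts that match the human consensus, it can be sketched in a few lines (the verdict arrays here are toy data):

```python
# Alignment rate as described above: the fraction of requirement-level
# verdicts that match the Human-as-a-Judge consensus, as a percentage.

def alignment_rate(ai_verdicts: list[bool], human_consensus: list[bool]) -> float:
    matches = sum(a == h for a, h in zip(ai_verdicts, human_consensus))
    return 100.0 * matches / len(human_consensus)

human = [True, False, False, True, False]
judge = [True, False, True, True, False]
print(f"{alignment_rate(judge, human):.2f}%")  # 80.00%
```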

PR Curves

Judging developer agents is a class-imbalanced task: requirements are met far less often than they are failed, so metrics like Judge Shift and Alignment Rate can be misleading. For example, since MetaGPT rarely meets requirements, LLM-as-a-Judge can label most cases as negative and still achieve an 84.15% alignment rate in the black-box setting. Precision-recall (PR) curves offer a clearer performance measure by balancing precision and recall. By this measure, Agent-as-a-Judge even outperforms any single human evaluator on OpenHands and aligns most closely with the majority vote. This suggests that, in some cases, Agent-as-a-Judge can nearly replace human evaluators.
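
The sketch below illustrates why precision and recall are more informative than alignment under this imbalance: a judge that always answers "not satisfied" looks well aligned but has zero recall. It uses scikit-learn point estimates on toy verdicts; a full PR curve would additionally require per-requirement confidence scores (e.g. via sklearn.metrics.precision_recall_curve).

```python
# "Satisfied" is the rare class, so a judge that predicts "not satisfied"
# everywhere still scores a high alignment rate. Precision/recall expose this.
from sklearn.metrics import precision_score, recall_score

human = [1, 0, 0, 0, 0, 1, 0, 0, 0, 0]       # only 2/10 requirements met
always_no = [0] * 10                          # trivial judge: 80% "alignment"
careful = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]      # finds one true positive

for name, pred in [("always_no", always_no), ("careful", careful)]:
    p = precision_score(human, pred, zero_division=0)
    r = recall_score(human, pred, zero_division=0)
    print(f"{name}: precision={p:.2f} recall={r:.2f}")
```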


Ablations for Agent-as-a-Judge


1. Ask Component: With only the ask component, the agent achieves a 65.03% alignment rate.

2. Graph Component: Adding the graph component increases performance to 75.95%, as the agent can better understand the relationships between files.

3. Read Component: The introduction of the read component further improves the alignment rate to 82.24%, reflecting the value of direct access to file contents.

4. Locate Component: Incorporating the locate component brings a substantial boost to 90.44%, as the agent can efficiently target files relevant to the requirements.

5. Retrieve Component: Adding the retrieve component does not provide a significant benefit in this case. However, it is useful for other systems like MetaGPT and GPT-Pilot, as it provides additional valuable information.
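
A cumulative ablation like this can be organized as a simple loop that enables one component at a time and re-measures alignment. The sketch below only shows the scaffolding: run_judge is a stand-in that simulates verdicts, so the printed numbers are synthetic and are not the paper's reported results.

```python
# Hypothetical ablation loop: re-run the judge with a growing set of enabled
# components and report the alignment rate each time. `run_judge` is a
# simulated stand-in for an actual Agent-as-a-Judge evaluation pass.
import random

COMPONENT_ORDER = ["ask", "graph", "read", "locate", "retrieve"]

def run_judge(enabled: set[str], human: list[bool]) -> list[bool]:
    """Stand-in: more components -> verdicts agree with humans more often."""
    p_agree = 0.6 + 0.08 * len(enabled)
    return [h if random.random() < p_agree else not h for h in human]

def alignment_rate(pred: list[bool], human: list[bool]) -> float:
    return 100.0 * sum(p == h for p, h in zip(pred, human)) / len(human)

random.seed(0)
human = [random.random() < 0.3 for _ in range(365)]  # imbalanced ground truth
enabled: set[str] = set()
for component in COMPONENT_ORDER:
    enabled.add(component)
    verdicts = run_judge(enabled, human)
    print(f"+{component:<8} alignment = {alignment_rate(verdicts, human):.2f}% (simulated)")
```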

Evaluation and Results

The evaluation of three open-source code-generating agentic frameworks (MetaGPT, GPT-Pilot, and OpenHands) by human judges, LLM-as-a-Judge, and Agent-as-a-Judge revealed that Agent-as-a-Judge aligns more closely with the human judges (about 90% alignment) than LLM-as-a-Judge does (about 70%). Additionally, Agent-as-a-Judge is far more efficient, saving 97.72% of the time and 97.64% of the cost of human evaluation.


Reducing Human Judgment Errors

Human judgment errors are inevitable, but the authors suggest two methods to reduce them:

1. Debate Round: Introduce a debate round after each judgment, where individuals present evidence and either persuade others or adjust their opinions after discussion.

2. Larger Panel of Experts: Assemble a larger panel of experts and rely on a majority vote. However, due to the high cost of engaging more experts, the debate round is argued to be a more practical solution.
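
For the second option, the aggregation itself is just a per-requirement majority vote over expert verdicts, as in this minimal sketch (expert verdicts are toy data):

```python
# Minimal sketch of the majority-vote option: aggregate several expert
# verdicts per requirement and keep the majority decision.
from collections import Counter

def majority_vote(verdicts_per_expert: list[list[bool]]) -> list[bool]:
    """verdicts_per_expert[i][j] = expert i's verdict on requirement j."""
    consensus = []
    for per_requirement in zip(*verdicts_per_expert):
        counts = Counter(per_requirement)
        consensus.append(counts[True] > counts[False])
    return consensus

experts = [
    [True, False, True],    # expert A
    [True, False, False],   # expert B
    [False, False, True],   # expert C
]
print(majority_vote(experts))  # [True, False, True]
```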


Conclusion

The Agent-as-a-Judge framework marks a significant step forward in the evaluation of agentic systems. By providing rich and reliable feedback throughout the task-solving process, it offers a scalable and cost-effective alternative to traditional evaluation methods. The introduction of DevAI and the successful application of Agent-as-a-Judge to code generation demonstrate its potential to revolutionize the evaluation of modern agentic systems, paving the way for dynamic and scalable self-improvement.

For more details, you can refer to the full paper [here](https://arxiv.org/pdf/2410.10934).
