Edition 35 - Creating Self-Improving LLM Evals
This month's edition of The Evaluator is packed with cutting-edge insights and practical know-how from our team. This time, we cover self-improving evals, dive into OTel, chat about OpenAI's Swarm, and more. There's also information about our ongoing agents series.
As always, we conclude with some of our favorite news, papers, community threads, and upcoming events.
Techniques for Self-Improving LLM Evals
If you’ve implemented a series of LLM-based evaluations or unit tests but aren’t sure your methods are robust, this guide by Eric Xiao is for you. It walks through a systematic approach to building self-improving LLM evals. Dive in.
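As a taste of the approach, here's a minimal sketch of an LLM-as-a-judge eval with a feedback loop, assuming the OpenAI Python SDK; the judge prompt, labels, and golden examples are illustrative stand-ins, not taken from the article.

```python
# A minimal sketch of an LLM-as-a-judge eval, assuming the OpenAI Python SDK.
# The prompt template, label set, and golden example are illustrative.
from openai import OpenAI

client = OpenAI()

JUDGE_TEMPLATE = """You are evaluating an answer for relevance.
Question: {question}
Answer: {answer}
Respond with exactly one word: "relevant" or "irrelevant"."""

def judge_relevance(question: str, answer: str) -> str:
    """Ask a judge model to label one example; returns the raw label."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": JUDGE_TEMPLATE.format(question=question, answer=answer),
        }],
        temperature=0,  # deterministic judgments keep the eval reproducible
    )
    return response.choices[0].message.content.strip().lower()

# The "self-improving" loop: compare judge labels against a small golden set,
# flag disagreements, revise the judge prompt, and re-run.
golden_set = [
    {"question": "What is OTel?",
     "answer": "OpenTelemetry is an open observability framework.",
     "label": "relevant"},
]
disagreements = [
    ex for ex in golden_set
    if judge_relevance(ex["question"], ex["answer"]) != ex["label"]
]
print(f"{len(disagreements)} disagreement(s) to review before the next prompt iteration")
```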
Tracing and Evaluating LangGraph Agents
LangGraph is a powerful library for building stateful, multi-actor applications with large language models. In this post, Greg Chase covers how LangGraph’s traces can be ingested into Arize, and how to use LLM-as-a-judge to evaluate LangGraph agent performance. Read it here.
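For a sense of the setup, here's a minimal sketch assuming the arize-otel and openinference-instrumentation-langchain packages; the credentials and project name are placeholders, and exact parameter names may differ from the post.

```python
# Minimal sketch of sending LangGraph traces to Arize, assuming the
# arize-otel and openinference-instrumentation-langchain packages;
# credentials and the project name below are placeholders.
from arize.otel import register
from openinference.instrumentation.langchain import LangChainInstrumentor

# Point an OpenTelemetry tracer provider at Arize.
tracer_provider = register(
    space_id="YOUR_SPACE_ID",  # found in the Arize UI
    api_key="YOUR_API_KEY",
    project_name="langgraph-agent",
)

# LangGraph runs on LangChain's runtime, so the LangChain instrumentor
# captures graph-node and LLM spans automatically.
LangChainInstrumentor().instrument(tracer_provider=tracer_provider)

# From here, invoking a compiled LangGraph graph emits traces to Arize,
# where each node and LLM call appears as a span ready for LLM-as-a-judge evals.
```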
The Role of OpenTelemetry in LLM Observability
A comprehensive piece on the role of OpenTelemetry in LLM observability, including an overview of OTel itself. Dat Ngo wrote this based on his experience working alongside customers who have productionized consumer-facing LLM applications with real business ROI. Dive into OTel here.
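For context, here's a minimal sketch of what OTel instrumentation looks like around an LLM call, using only the core opentelemetry-sdk package; the span and attribute names are illustrative, not a prescribed semantic convention.

```python
# A minimal OpenTelemetry sketch wrapping an LLM call in a span.
# Span and attribute names here are illustrative choices.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor, ConsoleSpanExporter

# Export spans to stdout; production setups swap in an OTLP exporter
# pointed at a collector or observability backend.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("llm-app")

def call_llm(prompt: str) -> str:
    # Each LLM call becomes a span carrying the prompt and completion
    # as attributes, so failures and latency can be inspected per call.
    with tracer.start_as_current_span("llm.completion") as span:
        span.set_attribute("llm.prompt", prompt)
        completion = "...model output..."  # placeholder for the real client call
        span.set_attribute("llm.completion", completion)
        return completion

call_llm("What is LLM observability?")
```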
Arize + Vertex AI API
By pairing an AI observability and evaluation platform like Arize AI with the advanced capabilities of Google’s suite of AI tools, enterprises looking to push the boundaries of what’s possible with their AI applications gain a robust, compelling option. By Gabe Barcelos. Read it here.
Swarm: OpenAI's Experimental Approach to Multi-Agent Systems
In this paper read, John Gilhuly and Xander Song discuss Swarm’s design, its practical applications, and how it stacks up against other frameworks. Whether you’re new to multi-agent systems or looking to deepen your understanding, Swarm offers a straightforward, hands-on way to get started. Learn more about Swarm.
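For the curious, here's a minimal sketch of Swarm's core handoff pattern based on the public repo: an agent transfers control by returning another Agent from a function. The agent names and instructions below are illustrative.

```python
# Minimal sketch of OpenAI Swarm's handoff pattern, based on the public repo.
# Agent names and instructions are illustrative.
from swarm import Swarm, Agent

client = Swarm()

def transfer_to_support():
    """Handoff: returning another Agent transfers the conversation to it."""
    return support_agent

support_agent = Agent(
    name="Support",
    instructions="Resolve the user's issue directly.",
)

triage_agent = Agent(
    name="Triage",
    instructions="Route the user to the right agent.",
    functions=[transfer_to_support],
)

# The triage agent decides to call transfer_to_support, and Swarm
# continues the conversation with the support agent.
response = client.run(
    agent=triage_agent,
    messages=[{"role": "user", "content": "My export is failing."}],
)
print(response.messages[-1]["content"])
```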
Tracing LLM Function Calls
A quick demo of how to trace LLM function calls in Arize. Eric Xiao shows you how to trace OpenAI function calls for enhanced debugging and structured outputs, and how function calling enables LLMs to interact with external tools and return structured data for tasks like summarization, classification, and code transformation. Watch the video.
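Here's a minimal sketch of the function-calling mechanics, assuming the openai Python SDK; the classify_ticket tool schema is a hypothetical example, not from the demo.

```python
# A minimal sketch of OpenAI function calling, assuming the openai Python SDK.
# The classify_ticket tool schema is a hypothetical example.
import json
from openai import OpenAI

client = OpenAI()

tools = [{
    "type": "function",
    "function": {
        "name": "classify_ticket",
        "description": "Classify a support ticket into a category.",
        "parameters": {
            "type": "object",
            "properties": {
                "category": {"type": "string", "enum": ["billing", "bug", "other"]},
            },
            "required": ["category"],
        },
    },
}]

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "My invoice is wrong."}],
    tools=tools,
)

# The model returns a structured tool call instead of free text --
# this is what surfaces as a function-call span when traced.
tool_call = response.choices[0].message.tool_calls[0]
print(tool_call.function.name, json.loads(tool_call.function.arguments))
```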
Object Detection Modeling
A quick demo of object detection modeling and the computer vision capabilities Arize offers, by Duncan McKinnon. Get a better idea of what's going on in your CV datasets and what's underperforming. Watch the video.
Register for our Agents Workshop
Join us as we walk through a five-part series on real-life agents deployed in production. We’ll dive deep into the architectures of these agents, the systems used in their development, and lessons learned from running them in production. Each week, we’ll unpack a new example agent or a component used in a real-world agent. Register here.
Staff Picks
Here's a roundup of our team's favorite news, research, threads, and things to do.