Phased Approach | All about Evals
Richard Skinner
CEO @ PhasedAI | Helping Enterprise Transform Operations with Generative AI
The Hidden Truth About Successful AI Products
Evals can be more valuable to a product company than its own codebase. That might sound like ridiculous hyperbole, but when you dig into how much product knowledge an eval dataset contains, the statement starts to make sense. While most businesses focus on selecting the best AI models or perfecting their prompts, there's a behind-the-scenes secret that top AI companies don't often talk about: their evaluation systems, or "evals."
Y Combinator CEO Garry Tan recently revealed that "evals are emerging as the real moat for AI startups." Even more telling, one of the fastest companies to reach a $100M run rate publicly talks about having great "taste," but privately credits their success to "ruthless evals."
Why Evals Matter for Your Business AI
Let's be clear: we're not talking about general AI benchmarks like HELM or MMLU that rank different AI models. Those are important for AI research, but for businesses building real applications, what matters are custom evals tailored to your specific use cases.
Think about it: ChatGPT might score well on academic tests, but can it correctly handle your company's unique policies, terminology, products, and edge cases?
This is why some founders now consider their eval datasets more valuable than their actual code. They've discovered that the key to building AI applications that don't hallucinate isn't just about using the latest model—it's about rigorous, business-specific testing.
What Garry and Anjney are saying here is that evals are why Cursor is so good.
Finding Your Business's "Underserved Slices"
As Garry Tan points out, the real insights come from "founders acting almost as ethnographers spelunking in the underserved slices of the GDP pie chart." In practical terms, this means digging into how work actually gets done in your corner of the market, collecting the real questions, documents, and edge cases your users deal with, and turning them into test cases nobody else has.
How to Build Your First Eval Dataset (It's Simpler Than You Think)
Let's break it down with a real example. Say you're building an AI chatbot to help employees with company policies. Here's how you'd create your first eval:
Gather Real Scenarios
1. Document common questions from users
2. Include the exact policy text the AI should reference
3. Write down the ideal response
4. Test your AI and record its answers
Here's roughly what a single eval test case might look like (multiples of these make up your dataset).
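For instance, here's a minimal sketch of one test case in code; the field names (question, reference_policy, ideal_answer, and so on) are illustrative, not a required schema:

```python
# A minimal, illustrative eval dataset: each entry is one test case.
# Field names are examples only; use whatever columns fit your domain.
eval_dataset = [
    {
        "question": "When do I need to submit my expense report after a trip?",
        "reference_policy": "Expense reports must be submitted within 30 days of the travel date.",
        "ideal_answer": "You must submit your expense report within 30 days after the travel date.",
        "model_answer": None,  # filled in when you run your AI against the case
        "passed": None,        # filled in during review, manual or automated
    },
    # ...one entry per real question you've collected
]
```

Whether the cases live in a spreadsheet, a JSON file, or a database table matters far less than the habit of writing them down.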
From Simple Spreadsheets to Sophisticated Automation
Here's the industry secret: while fancy AI companies make evaluation sound complex, you can start incredibly simple. Let's look at the progression:
Level 1: The "Eyeball Test"
Start with a basic spreadsheet like the one above. You or your team can manually review AI responses and mark them correct or incorrect. This is perfectly fine for getting started! Many successful AI products began this way.
Level 2: LLM as Judge
Here's where it gets interesting. Instead of manually checking each response, you can use another AI model as an automated judge. Here's how it works:
Your AI: "You need to submit your expense report within 30 days after completing your travel."
Judge LLM Prompt: "Compare this response to the correct answer: 'You must submit your expense report within 30 days after the travel date.'
Is the response:
1. Factually correct?
2. Complete?
3. Appropriately phrased?
Score out of 100 and explain why."
This allows you to automatically evaluate hundreds or thousands of responses quickly and consistently.
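As a rough sketch of how that judge call might be wired up, assuming the OpenAI Python client (any chat-capable model and client would work the same way):

```python
from openai import OpenAI  # assumes the official OpenAI Python client is installed

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_PROMPT = """Compare this response to the correct answer.

Correct answer: {ideal_answer}
Response to evaluate: {model_answer}

Is the response:
1. Factually correct?
2. Complete?
3. Appropriately phrased?

Score out of 100 and explain why."""

def judge(model_answer: str, ideal_answer: str) -> str:
    """Ask a judge model to grade one response against the ideal answer."""
    completion = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative choice; any capable chat model can act as judge
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(
                ideal_answer=ideal_answer,
                model_answer=model_answer,
            ),
        }],
    )
    return completion.choices[0].message.content

# Grade the expense-report answer from the example above
print(judge(
    model_answer="You need to submit your expense report within 30 days after completing your travel.",
    ideal_answer="You must submit your expense report within 30 days after the travel date.",
))
```

In practice you would loop this over every case in your dataset and store the scores alongside the responses.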
Level 3: Agents as Sophisticated Judges
The most advanced companies take this even further. They create specialized AI agents that act as comprehensive testing systems. Here's what these advanced eval agents can do:
1. Conversation Flow Testing
2. User Behavior Simulation
3. Policy and Safety Compliance
4. Advanced Reasoning Verification
5. Dynamic Test Generation
6. Quality and Consistency Checks
This sophisticated testing approach is why companies like Cursor, Harvey, and Perplexity can maintain such high quality at scale. Their eval agents are constantly running thousands of these tests, helping them catch issues before they reach users and continuously improve their AI's performance.
For example, when testing a customer service AI, an eval agent might simulate an impatient customer through a multi-turn refund conversation, probe for answers that breach policy, and flag responses that contradict the underlying knowledge base. This level of testing might seem extensive, but it's what separates consistently reliable AI products from those that occasionally fail in embarrassing ways.
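To make the "user behavior simulation" idea concrete, here is a minimal, hypothetical test-harness sketch; reply_fn, user_sim_fn, and goal_check_fn are placeholder callables standing in for your chatbot, a persona-prompted LLM, and a judge check, not parts of any particular framework:

```python
from typing import Callable

def run_conversation_test(
    reply_fn: Callable[[list[dict]], str],        # your chatbot: takes the transcript so far, returns a reply
    user_sim_fn: Callable[[list[dict]], str],     # simulated user: e.g. a persona-prompted LLM
    goal_check_fn: Callable[[list[dict]], bool],  # judge: has the simulated user's goal been met?
    max_turns: int = 6,
) -> list[dict]:
    """Drive a multi-turn conversation between a simulated user and the chatbot under test."""
    transcript: list[dict] = []
    for _ in range(max_turns):
        user_message = user_sim_fn(transcript)
        bot_message = reply_fn(transcript + [{"role": "user", "content": user_message}])
        transcript += [
            {"role": "user", "content": user_message},
            {"role": "assistant", "content": bot_message},
        ]
        if goal_check_fn(transcript):
            break
    return transcript
```

Each resulting transcript can then be scored by a judge LLM exactly as in Level 2, checking policy compliance, tone, and whether the simulated user's goal was actually met.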
The Secret to Better Tests: User Feedback Loop
The most successful companies don't stop at their initial test cases. They create a virtuous cycle: real user feedback surfaces failures, those failures become new test cases, and every subsequent change is checked against the growing dataset before it ships.
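In code terms, that loop can be as small as appending every flagged production interaction to the eval dataset as a new regression case; the file name and fields below are illustrative:

```python
import json
from pathlib import Path

EVAL_FILE = Path("eval_dataset.jsonl")  # illustrative path: one JSON test case per line

def add_failure_to_evals(question: str, bad_answer: str, corrected_answer: str) -> None:
    """Turn a flagged production interaction into a permanent regression test."""
    case = {
        "question": question,
        "ideal_answer": corrected_answer,
        "notes": f"Added from user feedback; previously answered incorrectly as: {bad_answer}",
    }
    with EVAL_FILE.open("a") as f:
        f.write(json.dumps(case) + "\n")
```

Once a failure is in the dataset, it can never silently regress: every future model, prompt, or retrieval change has to pass it again.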
Why This Matters for Your Business
Above, we mentioned that Garry Tan recently pointed out that "evals are emerging as the real moat for AI startups." Why? Because good evals create a feedback loop that's hard to replicate: each dataset becomes a unique map of where your product succeeds and where it fails.
This is why some of today's fastest-growing AI companies (including one that reached a $100M run rate in record time) credit their success not to their public image of having great "taste," but to their "ruthless evals" culture behind the scenes.
Getting Started Today
You don't need sophisticated tools to start. Begin with a simple spreadsheet: real questions from your users, the answer you expect for each, and a column marking whether the AI's response was correct.
The key is to start small but be systematic. As your needs grow, you can gradually automate and scale your evaluation process.
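When you do reach the automation stage, the gate can be as simple as a script your CI pipeline runs on every change, failing the build if the pass rate drops; run_eval_case below is a placeholder for whatever calls your model and judge, and the threshold is illustrative:

```python
import json
import sys
from pathlib import Path

PASS_RATE_THRESHOLD = 0.90  # illustrative; set it to whatever your product can tolerate

def run_eval_case(case: dict) -> bool:
    """Placeholder: call your model on case["question"], judge the answer, return pass/fail.

    Replace with your real model and judge calls (e.g. the Level 2 judge above).
    """
    raise NotImplementedError

def main() -> None:
    cases = [json.loads(line) for line in Path("eval_dataset.jsonl").read_text().splitlines()]
    results = [run_eval_case(case) for case in cases]
    pass_rate = sum(results) / len(results)
    print(f"Eval pass rate: {pass_rate:.1%} ({sum(results)}/{len(results)} cases passed)")
    if pass_rate < PASS_RATE_THRESHOLD:
        sys.exit(1)  # a non-zero exit code fails the CI job

if __name__ == "__main__":
    main()
```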
Phased AI helps companies create custom evals and track them on our end-to-end trust platform. If you would like advice on creating evals, contact us; we will be happy to help you progress from spreadsheets to fully automated testing.
Our specialized tool, Phased Loop, helps you build evaluations into your CI/CD pipeline and track your evals across all your products.
As AI continues to evolve, robust evaluation will only become more critical. Companies that master this process now will have a significant advantage. Remember: while everyone else is chasing the latest AI breakthrough, the real winners are quietly building their eval moats, one test case at a time.