Phased Approach | All about Evals

The Hidden Truth About Successful AI Products

Evals can be more valuable to a product company than its own codebase. That might sound like hyperbole, but when you dig into how much knowledge about your product an eval dataset encodes, the statement starts to make sense. While most businesses focus on selecting the best AI models or perfecting their prompts, there's a behind-the-scenes secret that top AI companies don't often talk about: their evaluation systems, or "evals."

Y Combinator CEO Garry Tan recently revealed that "evals are emerging as the real moat for AI startups." Even more telling, one of the fastest companies to reach a $100M run rate publicly talks about having great "taste," but privately credits their success to "ruthless evals."



Why Evals Matter for Your Business AI

Let's be clear: we're not talking about general AI benchmarks like HELM or MMLU that rank different AI models. Those are important for AI research, but for businesses building real applications, what matters are custom evals tailored to your specific use cases.

Think about it: ChatGPT might score well on academic tests, but can it correctly handle your company's unique:

  • Product documentation
  • Customer service scenarios
  • Industry regulations
  • Internal policies
  • Domain-specific knowledge

This is why some founders now consider their eval datasets more valuable than their actual code. They've discovered that the key to building AI applications that don't hallucinate isn't just about using the latest model—it's about rigorous, business-specific testing.

What Garry Tan and Anjney are getting at here is that rigorous evals are a big part of why Cursor is so good.

Finding Your Business's "Underserved Slices"

As Garry Tan points out, the real insights come from "founders acting almost as ethnographers spelunking in the underserved slices of the GDP pie chart." In practical terms, this means:

  1. Identifying your business's unique AI needs
  2. Understanding where general AI models fall short
  3. Building comprehensive test cases for these scenarios
  4. Creating a moat through specialized evaluation data


How to Build Your First Eval Dataset (It's Simpler Than You Think)

Let's break it down with a real example. Say you're building an AI chatbot to help employees with company policies. Here's how you'd create your first eval:

  1. Start With a Simple Spreadsheet

Here's what a single eval test case might look like; a collection of these test cases forms your dataset:
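
A minimal sketch in Python (the field names are illustrative, not a required schema; a spreadsheet row with the same columns works just as well):

    # One eval test case; a dataset is simply a list of these.
    # The field names are illustrative -- use whatever columns fit your product.
    test_case = {
        "question": "How long do I have to submit an expense report?",
        "policy_text": "Expense reports must be submitted within 30 days of the travel date.",
        "ideal_answer": "You must submit your expense report within 30 days after the travel date.",
        "ai_answer": "",   # filled in when you run your assistant against the question
        "verdict": "",     # pass/fail or a score, however you choose to grade
    }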



  2. Gather Real Scenarios
     • Document common questions from users
     • Include the exact policy text they should reference
     • Write down the ideal response
     • Test your AI and record its answers

From Simple Spreadsheets to Sophisticated Automation

Here's the industry secret: while fancy AI companies make evaluation sound complex, you can start incredibly simple. Let's look at the progression:

Level 1: The "Eyeball Test"

Start with a basic spreadsheet like the one above. You or your team can manually review AI responses and mark them correct or incorrect. This is perfectly fine for getting started! Many successful AI products began this way.
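
If the spreadsheet lives in a CSV file, a few lines of Python can speed up the eyeball test. This is a minimal sketch that assumes columns matching the test case above (the file name is illustrative):

    import csv

    # Minimal "eyeball test" helper: print each case so a reviewer can grade it by hand.
    # Assumes a CSV with columns: question, ideal_answer, ai_answer.
    with open("eval_cases.csv", newline="") as f:
        for row in csv.DictReader(f):
            print("QUESTION:    ", row["question"])
            print("IDEAL ANSWER:", row["ideal_answer"])
            print("AI ANSWER:   ", row["ai_answer"])
            print("-" * 60)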

Level 2: LLM as Judge

Here's where it gets interesting. Instead of manually checking each response, you can use another AI model as an automated judge. Here's how it works:

Your AI: "You need to submit your expense report within 30 days after completing your travel."

Judge LLM Prompt: "Compare this response to the correct answer: 'You must submit your expense report within 30 days after the travel date.'
Is the response:
1. Factually correct?
2. Complete?
3. Appropriately phrased?
Score out of 100 and explain why."

This allows you to automatically evaluate hundreds or thousands of responses quickly and consistently.
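
In code, the judge can be a few lines. This is a minimal sketch using the openai Python client; the model name, scoring scale, and prompt wording are assumptions you would tune for your own use case:

    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def judge(ai_answer: str, ideal_answer: str) -> str:
        """Ask a second model to grade one response against the known-good answer."""
        prompt = (
            f"Compare this response to the correct answer.\n"
            f"Response: {ai_answer}\n"
            f"Correct answer: {ideal_answer}\n"
            "Is the response 1) factually correct, 2) complete, 3) appropriately phrased?\n"
            "Score out of 100 and explain why."
        )
        result = client.chat.completions.create(
            model="gpt-4o-mini",  # any capable model can act as the judge
            messages=[{"role": "user", "content": prompt}],
        )
        return result.choices[0].message.content

Looping judge() over every row of your dataset gives you a consistent, repeatable score for each change you ship.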

Level 3: Agents as Sophisticated Judges

The most advanced companies take this even further. They create specialized AI agents that act as comprehensive testing systems. Here's what these advanced eval agents can do:

1. Conversation Flow Testing

  • Simulate entire user conversations, not just single questions
  • Test different conversation paths ("conversation tree testing")
  • Verify that the AI maintains context over multiple exchanges
  • Check if the AI appropriately handles topic switches
  • Ensure the AI can gracefully recover from misunderstandings

2. User Behavior Simulation

  • Act as different user personas (novice, expert, frustrated customer)
  • Test responses to unclear or ambiguous questions
  • Simulate interruptions and conversation restarts
  • Check handling of urgent vs. routine requests
  • Test responses to different communication styles

3. Policy and Safety Compliance

  • Verify responses align with company policies
  • Check for unauthorized information disclosure
  • Test handling of sensitive information requests
  • Ensure consistent enforcement of usage guidelines
  • Monitor for subtle policy violations across conversations

4. Advanced Reasoning Verification

  • Check mathematical calculations and logic
  • Verify citations and source references
  • Test step-by-step explanation quality
  • Evaluate problem-solving approaches
  • Assess accuracy of technical recommendations

5. Dynamic Test Generation

  • Create new test cases based on real user interactions
  • Generate variations of existing test scenarios
  • Identify potential edge cases automatically
  • Create adversarial tests to probe system limitations
  • Build comprehensive test suites for specific domains

6. Quality and Consistency Checks

  • Monitor response tone and professionalism
  • Check for consistency across similar questions
  • Verify appropriate use of technical terminology
  • Test for cultural sensitivity and appropriateness
  • Ensure responses maintain brand voice guidelines

This sophisticated testing approach is why companies like Cursor, Harvey, and Perplexity can maintain such high quality at scale. Their eval agents are constantly running thousands of these tests, helping them catch issues before they reach users and continuously improve their AI's performance.

For example, when testing a customer service AI, an eval agent might:

  1. Start as a frustrated customer with a billing issue
  2. Test the AI's ability to remain professional while gathering information
  3. Verify that the solution provided matches company policy
  4. Check if the AI appropriately escalates when needed
  5. Ensure all relevant account security protocols are followed

This level of testing might seem extensive, but it's what separates consistently reliable AI products from those that occasionally fail in embarrassing ways.
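
A simulated-user agent like that frustrated-customer scenario can be sketched as two models talking to each other. This is only an outline, assuming the same openai client as before and a get_support_reply() function that calls your own product's AI (both names are hypothetical):

    from openai import OpenAI

    client = OpenAI()

    PERSONA = (
        "You are a frustrated customer with a billing issue. "
        "Stay in character, push back once, and only accept a policy-compliant resolution."
    )

    def simulate_conversation(get_support_reply, turns: int = 4) -> list[dict]:
        """Drive a multi-turn conversation between a simulated user and the product's AI."""
        transcript = []
        user_msg = "I was charged twice this month and nobody is helping me!"
        for _ in range(turns):
            reply = get_support_reply(user_msg)  # your product's AI (assumed to exist)
            transcript.append({"user": user_msg, "assistant": reply})
            follow_up = client.chat.completions.create(
                model="gpt-4o-mini",
                messages=[
                    {"role": "system", "content": PERSONA},
                    {"role": "user", "content": f"The support agent said: {reply}\nReply as the customer."},
                ],
            )
            user_msg = follow_up.choices[0].message.content
        return transcript

The resulting transcript can then be scored with the same kind of judge prompt shown earlier, checking tone, policy compliance, and whether the AI escalated when it should have.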

The Secret to Better Tests: User Feedback Loop

The most successful companies don't stop at their initial test cases. They create a virtuous cycle:

  1. Collect Real User Interactions
     • Log actual user questions
     • Note which answers worked (and which didn't)
     • Record edge cases and unexpected queries
  2. Update Your Eval Dataset
     • Add new test cases based on real usage
     • Include examples of both successes and failures
     • Continuously expand your testing scenarios
  3. Automate and Scale
     • Use AI to generate variations of your test cases (see the sketch after this list)
     • Run tests automatically when you make changes
     • Track improvements over time
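
A minimal sketch of step 3, assuming the same openai client, a JSON-lines file of logged user questions, and illustrative file names and prompt wording:

    import json
    from openai import OpenAI

    client = OpenAI()

    def generate_variations(question: str, n: int = 3) -> list[str]:
        """Turn one logged user question into a few paraphrased variants for the eval dataset."""
        resp = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{
                "role": "user",
                "content": f"Rewrite this user question {n} different ways, one per line:\n{question}",
            }],
        )
        return [line.strip() for line in resp.choices[0].message.content.splitlines() if line.strip()]

    # Append new test cases built from real logged questions to the eval dataset.
    with open("user_logs.jsonl") as logs, open("eval_cases.jsonl", "a") as dataset:
        for line in logs:
            question = json.loads(line)["question"]
            for variant in [question] + generate_variations(question):
                # A human still fills in ideal_answer before the case is treated as ground truth.
                dataset.write(json.dumps({"question": variant, "ideal_answer": ""}) + "\n")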

Why This Matters for Your Business

As mentioned above, Garry Tan recently pointed out that "evals are emerging as the real moat for AI startups." Why? Because good evals create a feedback loop that's hard to replicate; each eval dataset becomes a unique map of what works in your product:

  • Better testing → Better product
  • Better product → More users
  • More users → More real-world data
  • More data → Even better testing

This is why some of today's fastest-growing AI companies (including one that reached a $100M run rate in record time) credit their success not to their public image of having great "taste," but to their "ruthless evals" culture behind the scenes.

Getting Started Today

You don't need sophisticated tools to start. Begin with:

  1. A simple spreadsheet of test cases
  2. Real questions from your users
  3. Clear criteria for correct answers
  4. Regular testing and updates

The key is to start small but be systematic. As your needs grow, you can gradually automate and scale your evaluation process.

Phased AI helps companies create custom evals and track them on our end-to-end trust platform. If you would like some advice on creating evals, contact us; we will be happy to help you progress from spreadsheets to fully automated testing.

Our specialized tool, Phased Loop, helps you build evaluations into your CI/CD pipeline and track your evals across all your products.
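
Whatever tooling you use, the CI/CD idea is simple: run the eval suite on every change and fail the build when scores regress. A generic sketch (the threshold, file name, and result format are illustrative, and this is not Phased Loop's API):

    import json
    import sys

    # Generic CI gate: fail the pipeline if the eval pass rate drops below a threshold.
    THRESHOLD = 0.90

    with open("eval_results.json") as f:  # produced by your eval run earlier in the pipeline
        results = json.load(f)

    pass_rate = sum(1 for r in results if r["passed"]) / len(results)
    print(f"Eval pass rate: {pass_rate:.1%}")

    if pass_rate < THRESHOLD:
        sys.exit(1)  # a non-zero exit fails the CI job

Wire a script like this in as the final step of your pipeline and any change to a prompt or model is blocked until the evals pass.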


As AI continues to evolve, robust evaluation will only become more critical. Companies that master this process now will have a significant advantage. Remember: while everyone else is chasing the latest AI breakthrough, the real winners are quietly building their eval moats, one test case at a time.
