Phased Approach | All about Evals

The Hidden Truth About Successful AI Products

Evals can be more valuable to a product company than its own codebase. That might sound like hyperbole, but when you dig into how much knowledge about your product an eval dataset encodes, the statement starts to make sense. While most businesses focus on selecting the best AI models or perfecting their prompts, there's a behind-the-scenes secret that top AI companies don't often talk about: their evaluation systems, or "evals."

Y Combinator CEO Garry Tan recently revealed that "evals are emerging as the real moat for AI startups." Even more telling, one of the fastest companies to reach a $100M run rate publicly talks about having great "taste," but privately credits their success to "ruthless evals."



Why Evals Matter for Your Business AI

Let's be clear: we're not talking about general AI benchmarks like HELM or MMLU that rank different AI models. Those are important for AI research, but for businesses building real applications, what matters are custom evals tailored to your specific use cases.

Think about it: ChatGPT might score well on academic tests, but can it correctly handle your company's unique:

  • Product documentation
  • Customer service scenarios
  • Industry regulations
  • Internal policies
  • Domain-specific knowledge

This is why some founders now consider their eval datasets more valuable than their actual code. They've discovered that the key to building AI applications that don't hallucinate isn't just about using the latest model—it's about rigorous, business-specific testing.

What Garry Tan and Anjney are getting at here is that rigorous evals are a big part of why Cursor is so good.

Finding Your Business's "Underserved Slices"

As Garry Tan points out, the real insights come from "founders acting almost as ethnographers spelunking in the underserved slices of the GDP pie chart." In practical terms, this means:

  1. Identifying your business's unique AI needs
  2. Understanding where general AI models fall short
  3. Building comprehensive test cases for these scenarios
  4. Creating a moat through specialized evaluation data


How to Build Your First Eval Dataset (It's Simpler Than You Think)

Let's break it down with a real example. Say you're building an AI chatbot to help employees with company policies. Here's how you'd create your first eval:

  1. Start With a Simple Spreadsheet

Here's what a single eval test case might look like; a collection of these test cases forms your dataset:
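
A minimal sketch in Python (the field names are illustrative, not a required schema; a spreadsheet row with the same columns works just as well):

    # One eval test case; a dataset is simply a list of these.
    # The field names are illustrative -- use whatever columns fit your product.
    test_case = {
        "question": "How long do I have to submit an expense report?",
        "policy_text": "Expense reports must be submitted within 30 days of the travel date.",
        "ideal_answer": "You must submit your expense report within 30 days after the travel date.",
        "ai_answer": "",   # filled in when you run your assistant against the question
        "verdict": "",     # pass/fail or a score, however you choose to grade
    }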



  2. Gather Real Scenarios
     • Document common questions from users
     • Include the exact policy text they should reference
     • Write down the ideal response
     • Test your AI and record its answers

From Simple Spreadsheets to Sophisticated Automation

Here's the industry secret: while fancy AI companies make evaluation sound complex, you can start incredibly simple. Let's look at the progression:

Level 1: The "Eyeball Test"

Start with a basic spreadsheet like the one above. You or your team can manually review AI responses and mark them correct or incorrect. This is perfectly fine for getting started! Many successful AI products began this way.
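
If the spreadsheet lives in a CSV file, a few lines of Python can speed up the eyeball test. This is a minimal sketch that assumes columns matching the test case above (the file name is illustrative):

    import csv

    # Minimal "eyeball test" helper: print each case so a reviewer can grade it by hand.
    # Assumes a CSV with columns: question, ideal_answer, ai_answer.
    with open("eval_cases.csv", newline="") as f:
        for row in csv.DictReader(f):
            print("QUESTION:    ", row["question"])
            print("IDEAL ANSWER:", row["ideal_answer"])
            print("AI ANSWER:   ", row["ai_answer"])
            print("-" * 60)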

Level 2: LLM as Judge

Here's where it gets interesting. Instead of manually checking each response, you can use another AI model as an automated judge. Here's how it works:

Your AI: "You need to submit your expense report within 30 days after completing your travel."

Judge LLM Prompt: "Compare this response to the correct answer: 'You must submit your expense report within 30 days after the travel date.'
Is the response:
1. Factually correct?
2. Complete?
3. Appropriately phrased?
Score out of 100 and explain why."

This allows you to automatically evaluate hundreds or thousands of responses quickly and consistently.
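
In code, the judge can be a few lines. This is a minimal sketch using the openai Python client; the model name, scoring scale, and prompt wording are assumptions you would tune for your own use case:

    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def judge(ai_answer: str, ideal_answer: str) -> str:
        """Ask a second model to grade one response against the known-good answer."""
        prompt = (
            f"Compare this response to the correct answer.\n"
            f"Response: {ai_answer}\n"
            f"Correct answer: {ideal_answer}\n"
            "Is the response 1) factually correct, 2) complete, 3) appropriately phrased?\n"
            "Score out of 100 and explain why."
        )
        result = client.chat.completions.create(
            model="gpt-4o-mini",  # any capable model can act as the judge
            messages=[{"role": "user", "content": prompt}],
        )
        return result.choices[0].message.content

Looping judge() over every row of your dataset gives you a consistent, repeatable score for each change you ship.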

Level 3: Agents as Sophisticated Judges

The most advanced companies take this even further. They create specialized AI agents that act as comprehensive testing systems. Here's what these advanced eval agents can do:

1. Conversation Flow Testing

  • Simulate entire user conversations, not just single questions
  • Test different conversation paths ("conversation tree testing")
  • Verify that the AI maintains context over multiple exchanges
  • Check if the AI appropriately handles topic switches
  • Ensure the AI can gracefully recover from misunderstandings

2. User Behavior Simulation

  • Act as different user personas (novice, expert, frustrated customer)
  • Test responses to unclear or ambiguous questions
  • Simulate interruptions and conversation restarts
  • Check handling of urgent vs. routine requests
  • Test responses to different communication styles

3. Policy and Safety Compliance

  • Verify responses align with company policies
  • Check for unauthorized information disclosure
  • Test handling of sensitive information requests
  • Ensure consistent enforcement of usage guidelines
  • Monitor for subtle policy violations across conversations

4. Advanced Reasoning Verification

  • Check mathematical calculations and logic
  • Verify citations and source references
  • Test step-by-step explanation quality
  • Evaluate problem-solving approaches
  • Assess accuracy of technical recommendations

5. Dynamic Test Generation

  • Create new test cases based on real user interactions
  • Generate variations of existing test scenarios
  • Identify potential edge cases automatically
  • Create adversarial tests to probe system limitations
  • Build comprehensive test suites for specific domains

6. Quality and Consistency Checks

  • Monitor response tone and professionalism
  • Check for consistency across similar questions
  • Verify appropriate use of technical terminology
  • Test for cultural sensitivity and appropriateness
  • Ensure responses maintain brand voice guidelines

This sophisticated testing approach is why companies like Cursor, Harvey, and Perplexity can maintain such high quality at scale. Their eval agents are constantly running thousands of these tests, helping them catch issues before they reach users and continuously improve their AI's performance.

For example, when testing a customer service AI, an eval agent might:

  1. Start as a frustrated customer with a billing issue
  2. Test the AI's ability to remain professional while gathering information
  3. Verify that the solution provided matches company policy
  4. Check if the AI appropriately escalates when needed
  5. Ensure all relevant account security protocols are followed

This level of testing might seem extensive, but it's what separates consistently reliable AI products from those that occasionally fail in embarrassing ways.
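
A simulated-user agent like that frustrated-customer scenario can be sketched as two models talking to each other. This is only an outline, assuming the same openai client as before and a get_support_reply() function that calls your own product's AI (both names are hypothetical):

    from openai import OpenAI

    client = OpenAI()

    PERSONA = (
        "You are a frustrated customer with a billing issue. "
        "Stay in character, push back once, and only accept a policy-compliant resolution."
    )

    def simulate_conversation(get_support_reply, turns: int = 4) -> list[dict]:
        """Drive a multi-turn conversation between a simulated user and the product's AI."""
        transcript = []
        user_msg = "I was charged twice this month and nobody is helping me!"
        for _ in range(turns):
            reply = get_support_reply(user_msg)  # your product's AI (assumed to exist)
            transcript.append({"user": user_msg, "assistant": reply})
            follow_up = client.chat.completions.create(
                model="gpt-4o-mini",
                messages=[
                    {"role": "system", "content": PERSONA},
                    {"role": "user", "content": f"The support agent said: {reply}\nReply as the customer."},
                ],
            )
            user_msg = follow_up.choices[0].message.content
        return transcript

The resulting transcript can then be scored with the same kind of judge prompt shown earlier, checking tone, policy compliance, and whether the AI escalated when it should have.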

The Secret to Better Tests: User Feedback Loop

The most successful companies don't stop at their initial test cases. They create a virtuous cycle:

  1. Collect Real User Interactions
     • Log actual user questions
     • Note which answers worked (and which didn't)
     • Record edge cases and unexpected queries
  2. Update Your Eval Dataset
     • Add new test cases based on real usage
     • Include examples of both successes and failures
     • Continuously expand your testing scenarios
  3. Automate and Scale
     • Use AI to generate variations of your test cases (see the sketch after this list)
     • Run tests automatically when you make changes
     • Track improvements over time
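
A minimal sketch of step 3, assuming the same openai client, a JSON-lines file of logged user questions, and illustrative file names and prompt wording:

    import json
    from openai import OpenAI

    client = OpenAI()

    def generate_variations(question: str, n: int = 3) -> list[str]:
        """Turn one logged user question into a few paraphrased variants for the eval dataset."""
        resp = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{
                "role": "user",
                "content": f"Rewrite this user question {n} different ways, one per line:\n{question}",
            }],
        )
        return [line.strip() for line in resp.choices[0].message.content.splitlines() if line.strip()]

    # Append new test cases built from real logged questions to the eval dataset.
    with open("user_logs.jsonl") as logs, open("eval_cases.jsonl", "a") as dataset:
        for line in logs:
            question = json.loads(line)["question"]
            for variant in [question] + generate_variations(question):
                # A human still fills in ideal_answer before the case is treated as ground truth.
                dataset.write(json.dumps({"question": variant, "ideal_answer": ""}) + "\n")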

Why This Matters for Your Business

As mentioned above, Garry Tan recently pointed out that "evals are emerging as the real moat for AI startups." Why? Because good evals create a feedback loop that's hard to replicate; each eval dataset becomes a unique map of what works in your product:

  • Better testing → Better product
  • Better product → More users
  • More users → More real-world data
  • More data → Even better testing

This is why some of today's fastest-growing AI companies (including one that reached a $100M run rate in record time) credit their success not to their public image of having great "taste," but to their "ruthless evals" culture behind the scenes.

Getting Started Today

You don't need sophisticated tools to start. Begin with:

  1. A simple spreadsheet of test cases
  2. Real questions from your users
  3. Clear criteria for correct answers
  4. Regular testing and updates

The key is to start small but be systematic. As your needs grow, you can gradually automate and scale your evaluation process.

Phased AI helps companies create custom evals and track them on our end-to-end trust platform. If you would like some advice on creating evals, contact us; we will be happy to help you progress from spreadsheets to fully automated testing.

Our specialized tool, Phased Loop, helps you build evaluations into your CI/CD pipeline and track your evals across all your products.
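
Whatever tooling you use, the CI/CD idea is simple: run the eval suite on every change and fail the build when scores regress. A generic sketch (the threshold, file name, and result format are illustrative, and this is not Phased Loop's API):

    import json
    import sys

    # Generic CI gate: fail the pipeline if the eval pass rate drops below a threshold.
    THRESHOLD = 0.90

    with open("eval_results.json") as f:  # produced by your eval run earlier in the pipeline
        results = json.load(f)

    pass_rate = sum(1 for r in results if r["passed"]) / len(results)
    print(f"Eval pass rate: {pass_rate:.1%}")

    if pass_rate < THRESHOLD:
        sys.exit(1)  # a non-zero exit fails the CI job

Wire a script like this in as the final step of your pipeline and any change to a prompt or model is blocked until the evals pass.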


As AI continues to evolve, robust evaluation will only become more critical. Companies that master this process now will have a significant advantage. Remember: while everyone else is chasing the latest AI breakthrough, the real winners are quietly building their eval moats, one test case at a time.
