Quotient AI reposted this
“When we first started hooking Copilot Chat in, we realized we’d get everything under the sun—people asking for random stuff that had nothing to do with code. We had billions of requests, so we had to cluster the logs just to figure out what was actually happening. That’s how we discovered real usage patterns—and that’s how we got serious about building our eval harness [...] At the end of the day, without evaluations, you’re flying completely blind. If you can’t measure it, you can’t improve it.” Check out Freddie Vargus and Reid Mayo from OpenPipe dropping some knowledge in the latest podcast.
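If you’re wondering what “cluster the logs” can look like in practice, here’s a minimal sketch of the idea – my own illustration, not GitHub’s actual pipeline – using off-the-shelf sentence embeddings and k-means. The model name, sample prompts, and cluster count are all placeholder assumptions.

```python
# Minimal sketch: cluster raw chat prompts to surface real usage patterns.
# Assumes sentence-transformers and scikit-learn are installed; the model
# name, sample prompts, and cluster count are illustrative placeholders.
from collections import Counter

from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

prompts = [
    "Write a unit test for this function",
    "Explain what this regex does",
    "What's a good restaurant near the office?",  # off-topic traffic shows up too
    "Refactor this loop to use a list comprehension",
    "Summarize this stack trace",
]

# Embed each prompt, then group semantically similar ones together.
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(prompts)

kmeans = KMeans(n_clusters=3, n_init="auto", random_state=0)
labels = kmeans.fit_predict(embeddings)

# Inspect cluster sizes and a sample prompt per cluster to name the patterns.
for cluster_id, count in Counter(labels).most_common():
    example = next(p for p, l in zip(prompts, labels) if l == cluster_id)
    print(f"cluster {cluster_id}: {count} prompts, e.g. {example!r}")
```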
Engineers who know me know I’ve been on an Evals kick for a few months now – interviewing top Founders, Staff AI Engineers, and thought leaders in the space. I’ve traveled all over the US going to AI conferences back to back to back, and I keep doubling down on the Evals topic for one practical reason: it continues to come up, conversation after conversation, as the single most challenging problem in the Applied AI engineering space. That was my experience in late 2023 – and it’s still true in 2025.

For this reason I’m incredibly excited to announce my interview with one of the leading minds in Evals – Freddie Vargus.

Freddie (and his co-founder Julia Neagu) led the team that built the Evals for the first significant LLM-backed product post-ChatGPT. You know, the one whose name has defined "human in the loop" AI products ever since? GitHub Copilot. So they’ve been deeply serious about this topic for years. After GitHub, they went all-in by founding the evals company Quotient AI. Their mission since has been to make SOTA Evals techniques accessible to builders (who want to get sh*t done, but in a way that doesn’t compromise the future of their tech).

Key insights from our convo:

- CIAI (Continuous Improvement of AI): Audit usage logs to surface gaps in your Knowledge Base or other canonical sources of Ground Truth, then patch those gaps to systematically improve Agent/Copilot quality. (Jared Scheel knows all about this.)
- Monitor Outcome Distributions: Map real-world output distributions against expected output distributions to surface potential issues. Agent has four tools but calls one of them 99% of the time? Look into that. (Rough sketch at the bottom of this post.)
- Evals ARE your product: Measuring and monitoring quality is more than just “tech debt reduction.” Unless you are OpenAI, Anthropic, DeepSeek, or some other SOTA lab building foundational AI, the fundamental value of your GenAI product is its ability to STEER foundational AI and ALIGN IT to your end customer’s needs. Evals are critical to both.
- Two-Week Evals Sprint: Bootstrapping evals for a project can feel daunting. Take (balanced) action by predefining the evals objectives/tasks you will execute on, and set hard deadlines to avoid quagmires.
- Evaluate Subcomponents: Don’t just evaluate final outputs – isolate and test retrieval pipelines, tool calls, and everything else upstream, since each has potential side effects on the final output. (Sketch at the bottom of this post.)

Freddie is a hardcore technical founder and he’s unusually hardcore on this topic – don’t miss it.

https://lnkd.in/gHUx3wbD
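To make the “monitor outcome distributions” point concrete, here’s a rough sketch of the kind of check it implies: count observed tool calls and flag any tool whose share drifts far from what you expected. The tool names, expected shares, and alert threshold below are made-up assumptions for illustration, not anything from the episode.

```python
# Sketch: compare observed tool-call frequencies against the distribution
# you expected when you designed the agent. Tool names, expected shares,
# and the alert threshold are illustrative assumptions, not real numbers.
from collections import Counter

EXPECTED_SHARE = {  # what you intended when you gave the agent four tools
    "search_docs": 0.40,
    "run_code": 0.30,
    "open_ticket": 0.20,
    "send_email": 0.10,
}
ALERT_THRESHOLD = 0.25  # flag tools whose observed share drifts this far

def tool_call_drift(logged_calls: list[str]) -> dict[str, float]:
    """Return observed-minus-expected share per tool."""
    counts = Counter(logged_calls)
    total = sum(counts.values()) or 1
    return {
        tool: counts.get(tool, 0) / total - expected
        for tool, expected in EXPECTED_SHARE.items()
    }

# e.g. an agent that calls one tool almost all the time
logged = ["search_docs"] * 99 + ["run_code"]
for tool, drift in tool_call_drift(logged).items():
    if abs(drift) > ALERT_THRESHOLD:
        print(f"look into {tool}: observed share is off by {drift:+.0%}")
```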
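And for “evaluate subcomponents,” a minimal sketch of scoring the retrieval step on its own with recall@k, before the LLM ever sees the context. The labeled queries, doc IDs, and retrieve() stub are hypothetical placeholders for your own pipeline and ground truth.

```python
# Sketch: score the retrieval step on its own with recall@k, independent of
# whatever the LLM does downstream. The labeled examples and the retrieve()
# stub are hypothetical placeholders for your own pipeline and ground truth.

# query -> doc IDs a human marked as relevant (your ground truth)
LABELED = {
    "how do I rotate an API key": {"doc_security_12", "doc_api_03"},
    "what is the refund policy": {"doc_billing_07"},
}

def retrieve(query: str, k: int = 5) -> list[str]:
    """Stand-in for your real retriever (vector store, BM25, etc.)."""
    fake_index = {
        "how do I rotate an API key": ["doc_api_03", "doc_faq_01", "doc_security_12"],
        "what is the refund policy": ["doc_faq_01", "doc_shipping_02"],
    }
    return fake_index.get(query, [])[:k]

def recall_at_k(k: int = 5) -> float:
    """Average fraction of labeled-relevant docs that show up in the top k."""
    scores = []
    for query, relevant in LABELED.items():
        retrieved = set(retrieve(query, k))
        scores.append(len(retrieved & relevant) / len(relevant))
    return sum(scores) / len(scores)

print(f"retrieval recall@5: {recall_at_k(5):.2f}")  # 1.0 and 0.0 -> 0.50
```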