Our new research paper: Adding Error Bars to Evals. AI model evaluations don’t usually include statistics or uncertainty. We think they should. Read the blog post: https://lnkd.in/d2jKfpyT

When a new AI model is released, the accompanying model card typically reports a matrix of evaluation scores on a variety of standard evaluations, such as MMLU, GPQA, or the LSAT. But it’s unusual for these scores to include any indication of the uncertainty, or randomness, surrounding them. This omission makes it difficult to compare the evaluation scores of two models in a rigorous way.

“Randomness” in language model evaluations may take a couple of forms. Any stream of output tokens from a model may be nondeterministic, so re-evaluating the same model on the same evaluation may produce slightly different results each time. This randomness is known as measurement error. But there’s another form of randomness that’s no longer visible by the time an evaluation is performed. This is the sampling error: of all possible questions one could ask about a topic, we decide to include some questions in the evaluation, but not others.

In our research paper, we recommend techniques for reducing measurement error and properly quantifying sampling error in model evaluations. With a simple assumption in place—that evaluation questions were randomly drawn from some underlying distribution—we develop an analytic framework for model evaluations using statistical theory. Drawing on the science of experimental design, we make a series of recommendations for performing evaluations and reporting the results in a way that maximizes the amount of information conveyed.

Our paper makes five core recommendations. They will likely not surprise readers with a background in statistics or experimentation, but they are not yet standard in the world of model evaluations. Specifically, our paper recommends:

1. Computing standard errors using the Central Limit Theorem
2. Using clustered standard errors when questions are drawn in related groups
3. Reducing variance by resampling answers and by analyzing next-token probabilities
4. Using paired analysis when two models are tested on the same questions
5. Conducting power analysis to determine whether an evaluation can answer a specific hypothesis

For mathematical details on the theory behind each recommendation, read the full research paper here: https://lnkd.in/dBrr9zFi
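The sketch below is a minimal, hypothetical illustration of recommendations 1, 2, 4, and 5. It is not code from the paper: the scores, cluster structure, and effect sizes are all simulated for demonstration. It computes a Central Limit Theorem standard error for a mean eval score, a clustered standard error when questions come in related groups, a paired standard error for the per-question score difference between two models on the same questions, and a rough normal-approximation power calculation.

```python
# Hypothetical, simulated eval data: 1,000 questions drawn in clusters of 5
# related items (e.g. several questions per reading passage), scored 0/1 for
# two made-up models that answer the same questions.
import numpy as np

rng = np.random.default_rng(0)
n_questions, cluster_size = 1000, 5
n_clusters = n_questions // cluster_size
clusters = np.repeat(np.arange(n_clusters), cluster_size)

# Shared per-cluster difficulty induces correlation among related questions.
difficulty = rng.normal(0.0, 0.15, n_clusters)
p_a = np.clip(0.70 + difficulty[clusters], 0.05, 0.95)
model_a = rng.binomial(1, p_a).astype(float)
# Model B answers everything A gets right, plus a few extra questions.
model_b = np.clip(model_a + rng.binomial(1, 0.05, n_questions), 0, 1).astype(float)

# Recommendation 1: CLT standard error of the mean score, s / sqrt(n).
def clt_se(scores):
    return scores.std(ddof=1) / np.sqrt(len(scores))

# Recommendation 2: clustered standard error -- aggregate to cluster means so
# correlated questions within a cluster are not counted as independent draws.
def clustered_se(scores, clusters):
    cluster_means = np.array([scores[clusters == c].mean()
                              for c in np.unique(clusters)])
    return cluster_means.std(ddof=1) / np.sqrt(len(cluster_means))

# Recommendation 4: paired analysis -- standard error of the per-question
# score difference, which cancels shared question difficulty.
diff = model_b - model_a

print(f"Model A: {model_a.mean():.3f} +/- {clt_se(model_a):.3f} (naive CLT SE)")
print(f"Model A clustered SE: {clustered_se(model_a, clusters):.3f}")
print(f"Model B: {model_b.mean():.3f} +/- {clt_se(model_b):.3f} (naive CLT SE)")
print(f"B - A (paired): {diff.mean():.3f} +/- {clt_se(diff):.3f}")

# Recommendation 5: power analysis -- normal-approximation sample size needed
# to detect a mean paired difference of `min_effect` with ~80% power at the
# 5% significance level: n = ((z_0.975 + z_0.80) * s / d)^2.
min_effect = 0.02
z_alpha, z_power = 1.96, 0.84
n_needed = ((z_alpha + z_power) * diff.std(ddof=1) / min_effect) ** 2
print(f"Questions needed to detect a {min_effect} paired gap: {int(np.ceil(n_needed))}")
```

In this simulated setup, the clustered standard error tends to come out larger than the naive one because related questions share difficulty, while the paired standard error on the difference is much smaller than either model’s individual standard error. That is the intuition behind recommendations 2 and 4: ignoring clustering understates uncertainty, and ignoring pairing overstates it when comparing two models on the same questions.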
Practicality isn’t about ignoring rigor; it’s about knowing where to draw the line. For AI teams, precision beyond what shifts the needle might as well be a luxury yacht in a swamp—fancy, but stuck. At some point, computationally expensive techniques like clustering or resampling risk becoming solutions in search of a problem. The challenge, then, is balance. Push for rigor where it matters—critical systems, safety-sensitive AI—but accept that most real-world deployments are about survival, not perfection. And in the game of iteration, “good enough” often wins because it shows up first.
This is a fascinating and important research paper! I completely agree that adding error bars to AI model evaluations is crucial for rigorous comparison and interpretation of results. It's great to see recommendations for reducing measurement error and properly quantifying sampling error in model evaluations. I particularly appreciate the emphasis on using statistical theory and experimental design to maximize the amount of information conveyed. Thank you for sharing this valuable contribution to the field!
Thanks for sharing. For those looking for data science/engineering/analyst internships or new grad roles, you can apply to hundreds of them at https://tinyurl.com/2dbps6em for free. New positions are added daily.
Thank you for emphasizing the importance of statistical rigor in LLM results. Given the inherent randomness in LLM outputs and their potential impact on business decisions, it's vital to properly account for uncertainty. By employing techniques like power analysis, standard error calculations, and variance analysis, we can better assess the risks associated with these models. In the future, it would be beneficial to see LLM results presented with error bars or confidence intervals, rather than just point estimates. This approach would provide a more comprehensive understanding of the model's predictions and their reliability.
Uncertainty is a key factor in decision making.
This is truly an interesting point that is rarely discussed when talking about LLMs. I also appreciate the part of this post that highlights the need to add these parameters to evaluation standards. From my point of view, it would also be necessary to have a common understanding of these standards, because every time a new LLM is released, different measurements are reported, and for those without in-depth knowledge of the subject, it can be confusing.
As a sociotechnical scholar, I view the opacity in existing evaluation processes as dangerous. These outdated methods are often uncontested despite their influence on critical decisions. It is refreshing to engage with research that reflects a commitment to interrogating and redefining methodologies to align with the evolving nature of our systems. Anthropic is setting a precedent by investigating the anatomy of AI with the rigor and integrity we need to build intentionally. Their initiatives are living processes designed to adapt. This is a standard the industry ought to embrace if we seek meaningful innovation.