5 Globally Accepted Benchmarks to Assess LLMs on Safety
Tejash Mehta
Customer Success Leader. [Opinions or views expressed here are solely my personal opinions.]
AI Testing in Disarray: No Easy Way to Choose an Ethical Model
There's a growing problem in the world of AI development: a lack of agreement on how to test whether AI models are behaving responsibly. That's according to the latest AI Index from Stanford's Institute for Human-Centered Artificial Intelligence (HAI), released earlier this week.
The big concern? Businesses and everyday users are left in the dark. With no clear way to compare AI models, how can they choose one that aligns with their needs and values?
"There's a huge difference in how AI models behave depending on what they're designed for," explains Nestor Maslej, editor of the 2024 AI Index. "The challenge is, there just aren't any simple tools for comparing them, and it doesn't seem like a solution is coming anytime soon."
Take benchmark tests for responsible AI. TruthfulQA, one of the most common, is used by only a handful of leading developers: OpenAI, Meta, and Anthropic all put their models through it, but Google and Mistral haven't used it on their latest creations.
The level of enthusiasm for responsibility testing also varies wildly. Meta stands out for putting its Llama 2 model through three different tests, while Mistral hasn't used any of the five evaluated by Stanford.
Here's the rub: current benchmarks tend to be very narrow in scope. TruthfulQA probes whether a model repeats common misconceptions absorbed from its training data. Others, like RealToxicityPrompts and ToxiGen, focus on how likely a model is to generate hateful or toxic content.
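To make concrete what "running a model through a benchmark" actually involves, here's a minimal sketch of a TruthfulQA-style spot check in Python. It assumes the Hugging Face `datasets` library; `generate()` is a hypothetical placeholder for whatever model you're evaluating, and the substring match is a crude proxy rather than the benchmark's official scoring method:

```python
# Minimal sketch of a TruthfulQA-style spot check.
# `generate()` is a hypothetical stand-in for the model under test,
# and the substring scoring below is a crude proxy, not the official metric.
from datasets import load_dataset

def generate(prompt: str) -> str:
    """Hypothetical model call; replace with your LLM's API."""
    return "I have no comment."

# The "generation" config pairs each question with reference
# correct and incorrect answers; TruthfulQA ships a single validation split.
ds = load_dataset("truthful_qa", "generation", split="validation")

sample = ds.select(range(20))  # small sample for illustration
hits = 0
for row in sample:
    answer = generate(row["question"]).lower()
    # Count a hit if the response echoes any reference correct answer.
    if any(ref.lower() in answer for ref in row["correct_answers"]):
        hits += 1

print(f"Crude truthfulness proxy: {hits}/{len(sample)} questions")
```

The published benchmark scores free-form answers far more carefully (for example, with trained judge models), and each toxicity benchmark has its own scoring pipeline, which is precisely why results reported by different developers are so hard to compare.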
"There's definitely a lack of standardization," says Maslej. "What's causing it is unclear, but some developers might be cherry-picking tests that make their models look better. Or maybe they're making it harder for users to see the limitations."
This lack of standardized testing tools has actually given rise to a new organization – the Responsible AI Institute, backed by major companies. They've developed their own set of benchmarking tools to address the gap.
The bigger picture? AI developers and academics are locked in a heated debate. Which AI risks are the most pressing? Is it the immediate bias creeping into model outputs, or the potential "existential threats" posed by highly advanced AI systems in the future?
The Stanford AI Index also sheds light on regional trends. The US leads in building significant AI models (61) compared to the EU (21) and China (15). However, China dominates in AI patents (61%), while the US holds the crown for private investment ($67.2 billion). Interestingly, in 2023, industry pumped out 108 new foundation models, compared to just 28 from academia.