Imagine you're a parent looking at your kid's report card…
They got an A+ in AIME, a C+ in SWE-bench, and a D+ in SimpleQA. You'd probably wonder: what does that even mean? Welcome to the world of LLM evaluation benchmarks.
When you compare LLMs, you are basically looking at their report cards.
Those wild charts showing:
- Gemini 2.5 scored 92.0% on AIME
- GPT-4.5 got 62.5% on SimpleQA
- Claude aced SWE-bench with 70.3%
It’s not random. Each of these is like a subject in school — testing different skills.
The question is, why should you care?
“Gemini 2.5 crushed AIME with 92%!”
You should know that’s math olympiad-level performance — not just chatbot conversation.
When Claude tops SWE-bench, it means it's better at playing the role of an actual junior developer.
Let's look at what each of these means, along with a light-hearted grading of the model for each subject… er, metric!
- Humanity's Last Exam (Reasoning & Knowledge) is designed to test raw intelligence, whether human or machine, without the use of external tools or resources like a Google search or a calculator. It is meant to simulate a "final exam" for humanity, where the model must rely solely on its internal knowledge and ability to reason. The questions evaluate deep reasoning and comprehensive general knowledge across domains such as history, science, and philosophy. Some examples: "Discuss the philosophical implications of quantum mechanics on the concept of free will." "Analyze the long-term effects of the Silk Road on the spread of disease and culture." "Explain the key differences between various schools of economic thought, such as Keynesian and Classical, and their practical implications."
- GPQA Diamond is a specialized benchmark designed to evaluate a model's ability to answer extremely challenging, graduate-level science questions. GPQA stands for Graduate-Level Google-Proof Q&A, and Diamond is the hardest tier of the dataset, with questions spanning biology, physics, and chemistry that are difficult even for experts with internet access. The high scores achieved by some models indicate significant progress in AI's capacity for advanced scientific reasoning.
- AIME (American Invitational Mathematics Examination) 2024 & 2025 is used to evaluate a model's mathematical problem-solving skills. It tests advanced high-school, competition-level math, something like: if x² - 4x + 7 = 0, find the product of its roots (by Vieta's formulas, the product is 7). It is evaluated in two ways: single attempt, where the model provides its answer right away, and multiple attempts, where the model can correct its answer upon feedback.
- LiveCodeBench v5 is a benchmark designed to evaluate a model's ability to generate functional code from natural language descriptions. It tests whether the model can understand the instructions and produce working code, checking the output's accuracy, syntactic correctness, and relevance to the instructions provided.
- Aider Polyglot is a benchmark that evaluates a model's ability to edit and improve existing code across various programming languages. It checks the model's code-editing capabilities, its support for multiple languages (hence "polyglot"), and its real-world relevance.
- SWE-bench is a benchmark that evaluates a model's ability to operate like a software engineer, covering the entire workflow from identifying issues to implementing and validating fixes, much like an autonomous agent. Ex: given a GitHub issue, the model identifies the relevant files, writes the fix, and passes the unit tests.
- SimpleQA is a benchmark designed to evaluate a model's ability to provide factually accurate answers to straightforward, direct questions. It measures the model's capacity to retrieve and present correct factual information; the questions are simple and require concise, factual responses. Ex: "What year did the Berlin Wall fall?" → Correct answer: 1989. (A toy sketch of how this kind of grading works follows the list below.)
- MMMU (Massive Multi-discipline Multimodal Understanding) is a benchmark designed to evaluate a model's ability to understand and reason about information presented in both images and text (multimodal). It assesses the model's capacity to process and integrate information from visual and textual inputs. Ex: present a graph of stock price trends and ask, "Which month saw the highest volatility?"
- Vibe-Eval is a benchmark designed to assess a model's ability to understand and interpret real-world images. Ex: "Describe what's happening in this street scene: cars, traffic lights, pedestrians." It is vital for applications that require accurate image interpretation, such as image captioning, autonomous systems (e.g., self-driving cars), and accessibility technologies for visually impaired people.
- MRCR (Multi-Round Co-reference Resolution) is a long-context benchmark that evaluates a model's ability to process and understand extremely long inputs, up to 1 million tokens. Ex: ask a question about a specific legal provision buried within a 500-page document. It is crucial for applications that involve processing and analyzing large volumes of text, such as legal document analysis, research paper summarization, and policy document review.
- Global MMLU (Massive Multitask Language Understanding) is a benchmark that assesses a model's general knowledge and reasoning abilities across a wide range of languages. Ex: "Who was the first president of India?" asked in Hindi. It is crucial for applications that require multilingual support, such as global customer service, localization of content, and cross-cultural communication.
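To make the report-card analogy a bit more concrete, here is a minimal sketch of how a SimpleQA-style exact-match grader could work. The `dataset`, `fake_model`, and `score_exact_match` names below are hypothetical illustrations, not any benchmark's actual code; real benchmarks like AIME and SWE-bench use richer checks (verified numeric answers, unit tests), but the grading idea is the same: ask a question, compare the model's answer to a reference, and report the percentage correct.

```python
# Hypothetical sketch of a SimpleQA-style exact-match grader.
# All names and sample questions here are illustrative only.

def fake_model(question: str) -> str:
    """Stand-in for a real LLM call; this toy version always answers '1989'."""
    return "1989"

def score_exact_match(prediction: str, reference: str) -> bool:
    """Give credit only if the answer matches the reference after light normalization."""
    return prediction.strip().lower() == reference.strip().lower()

dataset = [
    {"question": "What year did the Berlin Wall fall?", "answer": "1989"},
    {"question": "Who wrote 'Pride and Prejudice'?", "answer": "Jane Austen"},
]

# Grade every question and report the percentage correct, just like a benchmark score.
correct = sum(
    score_exact_match(fake_model(item["question"]), item["answer"])
    for item in dataset
)
print(f"Score: {correct}/{len(dataset)} = {correct / len(dataset):.0%}")
```

Running this toy grader prints "Score: 1/2 = 50%", which is exactly the kind of number that ends up on those benchmark charts.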
If Gemini 2.5 were a student and the above were the exams it sat for:
Overall, Gemini 2.5 turned out to be a B+ student with bursts of genius.
Finally, here are the latest LiveBench rankings of LLMs, where we can see Gemini 2.5 acing most of the tests.
On a funny note, here's what ChatGPT-4o had to say about Gemini 2.5 when I fed it these hypothetical grades :) - 'Gemini is that kid who sleeps through History but aces advanced calculus, writes code like a wizard, and reads entire textbooks before lunch. A little quirky, but definitely gifted. Needs snacks during tests.'