Imagine you're a parent looking at your kid's report card…
They got an A+ in AIME, a C+ in SWE-bench, and a D+ in SimpleQA. You'd probably wonder: what does that even mean? Welcome to the world of LLM evaluation benchmarks.
When you compare LLMs, you are basically looking at their report cards.
Those wild charts showing:
- Gemini 2.5 scored 92.0% on AIME
- GPT-4.5 got 62.5% on SimpleQA
- Claude aced SWE-bench with 70.3%
It’s not random. Each of these is like a subject in school — testing different skills.
The question is, why should you care?
“Gemini 2.5 crushed AIME with 92%!”
You should know that’s math olympiad-level performance — not just chatbot conversation.
When Claude tops SWE-bench, it means it's better at playing the role of an actual junior developer.
Let's look at what each of these means, along with a light-hearted grading of the model for each subject… er, metric!
- Humanity's Last Exam (Reasoning & Knowledge) is designed to test raw intelligence, whether human or machine, without the use of external tools or resources like a Google search or a calculator. It is meant to simulate a "final exam" for humanity, where the model must rely solely on its internal knowledge and ability to reason. The questions evaluate deep reasoning and comprehensive general knowledge across domains such as history, science, and philosophy. Some examples: "Discuss the philosophical implications of quantum mechanics on the concept of free will." "Analyze the long-term effects of the Silk Road on the spread of disease and culture." "Explain the key differences between various schools of economic thought, such as Keynesian and Classical, and their practical implications."
- GPQA Diamond is a specialized benchmark designed to evaluate a model's ability to answer extremely challenging, graduate-level science questions. GPQA stands for Graduate-Level Google-Proof Q&A, and Diamond is the hardest tier of the dataset, with questions spanning biology, physics, and chemistry that are difficult even for experts with internet access. The high scores achieved by some models indicate significant progress in AI's capacity for advanced scientific reasoning.
- AIME (American Invitational Mathematics Examination) 2024 & 2025 is used to evaluate a model's mathematical problem-solving skills. It tests advanced high-school, competition-level math, something like: if x² - 4x + 7 = 0, find the product of its roots (by Vieta's formulas, the product is 7). It is evaluated in two ways: single attempt, where the model provides its answer right away, and multiple attempts, where the model can correct its answer upon feedback.
- LiveCodeBench v5 is a benchmark designed to evaluate a model's ability to generate functional code from natural language descriptions. It tests whether the model can understand the instructions and produce working code, checking the output's accuracy, syntactic correctness, and relevance to the instructions provided.
- Aider Polyglot is a benchmark that evaluates a model's ability to edit and improve existing code across various programming languages. It checks the model's code-editing capabilities, its support for multiple languages (hence "polyglot"), and its real-world relevance.
- SWE-bench is a benchmark that evaluates a model's ability to operate like a software engineer, covering the entire workflow from identifying issues to implementing and validating fixes, much like an autonomous agent. Ex: given a GitHub issue, the model identifies the relevant files, writes the fix, and passes the unit tests.
- SimpleQA is a benchmark designed to evaluate a model's ability to provide factually accurate answers to straightforward, direct questions. It measures the model's capacity to retrieve and present correct factual information; the questions are simple and require concise, factual responses. Ex: "What year did the Berlin Wall fall?" → Correct answer: 1989. (A toy sketch of how this kind of grading works follows the list below.)
- MMMU (Massive Multi-discipline Multimodal Understanding) is a benchmark designed to evaluate a model's ability to understand and reason about information presented in both images and text (multimodal). It assesses the model's capacity to process and integrate information from visual and textual inputs. Ex: present a graph of stock price trends and ask, "Which month saw the highest volatility?"
- Vibe-Eval is a benchmark designed to assess a model's ability to understand and interpret real-world images. Ex: "Describe what's happening in this street scene: cars, traffic lights, pedestrians." It is vital for applications that require accurate image interpretation, such as image captioning, autonomous systems (e.g., self-driving cars), and accessibility technologies for visually impaired people.
- MRCR (Multi-Round Co-reference Resolution) is a long-context benchmark that evaluates a model's ability to process and understand extremely long inputs, up to 1 million tokens. Ex: ask a question about a specific legal provision buried within a 500-page document. It is crucial for applications that involve processing and analyzing large volumes of text, such as legal document analysis, research paper summarization, and policy document review.
- Global MMLU (Massive Multitask Language Understanding) is a benchmark that assesses a model's general knowledge and reasoning abilities across a wide range of languages. Ex: "Who was the first president of India?" asked in Hindi. It is crucial for applications that require multilingual support, such as global customer service, localization of content, and cross-cultural communication.
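To make the report-card analogy a bit more concrete, here is a minimal sketch of how a SimpleQA-style exact-match grader could work. The `dataset`, `fake_model`, and `score_exact_match` names below are hypothetical illustrations, not any benchmark's actual code; real benchmarks like AIME and SWE-bench use richer checks (verified numeric answers, unit tests), but the grading idea is the same: ask a question, compare the model's answer to a reference, and report the percentage correct.

```python
# Hypothetical sketch of a SimpleQA-style exact-match grader.
# All names and sample questions here are illustrative only.

def fake_model(question: str) -> str:
    """Stand-in for a real LLM call; this toy version always answers '1989'."""
    return "1989"

def score_exact_match(prediction: str, reference: str) -> bool:
    """Give credit only if the answer matches the reference after light normalization."""
    return prediction.strip().lower() == reference.strip().lower()

dataset = [
    {"question": "What year did the Berlin Wall fall?", "answer": "1989"},
    {"question": "Who wrote 'Pride and Prejudice'?", "answer": "Jane Austen"},
]

# Grade every question and report the percentage correct, just like a benchmark score.
correct = sum(
    score_exact_match(fake_model(item["question"]), item["answer"])
    for item in dataset
)
print(f"Score: {correct}/{len(dataset)} = {correct / len(dataset):.0%}")
```

Running this toy grader prints "Score: 1/2 = 50%", which is exactly the kind of number that ends up on those benchmark charts.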
If Gemini 2.5 were a student and the above were the exams it sat for:
Overall, Gemini 2.5 turned out to be a B+ student with bursts of genius.
Finally, here are the latest LiveBench rankings of LLMs, where we can see Gemini 2.5 acing most of the tests.
On a funny note, here's what ChatGPT-4o had to say about Gemini 2.5 when I fed it these hypothetical grades :) - 'Gemini is that kid who sleeps through History but aces advanced calculus, writes code like a wizard, and reads entire textbooks before lunch. A little quirky, but definitely gifted. Needs snacks during tests.'