“Gemini 2.5 Got an A+ in Math and a D+ in SimpleQA” — Wait, What???

Imagine you're a parent looking at your kid's report card…

They got an A+ in AIME, a C+ in SWE-bench, and a D+ in SimpleQA. You'd be like, "What does that even mean?" Welcome to the world of LLM evaluation benchmarks.

When you compare LLMs, you are basically looking at their report cards.

Those wild charts showing:

  • Gemini 2.5 scored 92.0% on AIME
  • GPT-4.5 got 62.5% on SimpleQA
  • Claude aced SWE-bench with 70.3%

It’s not random. Each of these is like a subject in school — testing different skills.

The question is, why should you care?

When someone says:

“Gemini 2.5 crushed AIME with 92%!”

You should know that’s math olympiad-level performance — not just chatbot conversation.

When Claude tops SWE-bench? It means it's better at playing the role of an actual junior developer.

Let's look at what each of these means, along with a fun grading of the model for each subject… I mean, metric!

  1. Humanity's Last Exam (Reasoning & Knowledge) is designed to test the raw intelligence of a test-taker, whether human or machine, without external tools or resources such as a Google search or a calculator. It is meant to simulate a "final exam" for humanity, where the model must rely solely on its internal knowledge and ability to reason. The questions evaluate deep reasoning and comprehensive general knowledge across domains such as history, science, and philosophy. Some examples: "Discuss the philosophical implications of quantum mechanics on the concept of free will." "Analyze the long-term effects of the Silk Road on the spread of disease and culture." "Explain the key differences between various schools of economic thought, such as Keynesian and Classical, and their practical implications."
  2. GPQA Diamond is a specialized benchmark designed to evaluate a model's ability to answer extremely challenging, graduate-level science questions in physics, chemistry, and biology. GPQA stands for Graduate-Level Google-Proof Q&A — the questions are written so that even skilled non-experts struggle to answer them despite unrestricted web access — and Diamond is the hardest tier of the dataset. The high scores achieved by some models indicate significant progress in AI's capacity for advanced scientific reasoning.
  3. AIME (American Invitational Mathematics Examination) 2024 & 2025 is used to evaluate a model's mathematical problem-solving skills. It tests competition-level high school mathematics — for example: if x² − 4x + 7 = 0, find the product of its roots. It is evaluated in two ways: single attempt, where the model provides an answer right away, and multiple attempts, where the model can correct its answer upon feedback.
  4. LiveCodeBench v5 is a benchmark designed to evaluate a model's ability to generate functional code from natural language descriptions. It tests whether the model can produce working code by understanding instructions, checking the output's accuracy, syntactic correctness, and relevance to the instructions provided.
  5. Aider Polyglot is a benchmark that evaluates a model's ability to edit and improve existing code across various programming languages. It checks the model's code-editing capabilities, its support for multiple languages (hence "polyglot"), and its real-world relevance.
  6. SWE-bench is a benchmark that evaluates a model's ability to operate like a software engineer, covering the entire workflow from identifying issues to implementing and validating fixes, much like an autonomous agent. Ex: given a GitHub issue, the model identifies the relevant files, writes the fix, and passes the unit tests.
  7. SimpleQA is a benchmark designed to evaluate a model's ability to provide factually accurate answers to straightforward, direct questions. It measures the model's capacity to retrieve and present correct factual information. The questions are simple and require concise, factual responses. Ex: "What year did the Berlin Wall fall?" → Correct answer: 1989.
  8. MMMU (Massive Multi-discipline Multimodal Understanding) is a benchmark designed to evaluate a model's ability to understand and reason about information presented in both images and text (multimodal). It assesses the model's capacity to process and integrate visual and textual inputs. Ex: present a graph of stock price trends and ask, "Which month saw the highest volatility?"
  9. Vibe-Eval is a benchmark designed to assess a model's ability to understand and interpret real-world images. Ex: "Describe what's happening in this street scene: cars, traffic lights, pedestrians." It is vital for applications that require accurate image interpretation, such as image captioning, autonomous systems (e.g., self-driving cars), and accessibility technologies for visually impaired users.
  10. MRCR (Multi-Round Coreference Resolution) is a long-context benchmark designed to evaluate a model's ability to process and understand extremely long inputs, such as those containing up to 1 million tokens. Ex: ask a question about a specific legal provision buried within a 500-page document. It is crucial for applications that involve processing and analyzing large volumes of text, such as legal document analysis, research paper summarization, and policy document review.
  11. Global MMLU (Massive Multitask Language Understanding) is a benchmark that assesses a model's general knowledge and reasoning abilities across a wide range of languages. Ex: "Who was the first president of India?" — asked in Hindi. It is crucial for applications that require multilingual support, such as global customer service, content localization, and cross-cultural communication.
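As a quick aside on the AIME-style example: Vieta's formulas give the product of the roots of x² − 4x + 7 = 0 directly as c/a = 7, with no root-finding needed. Here is a minimal Python sanity check — illustrative only, not part of any benchmark harness:

```python
import cmath

# Coefficients of x^2 - 4x + 7 = 0 (the AIME-style example above).
a, b, c = 1, -4, 7

# Quadratic formula; the discriminant is negative, so the roots
# are complex: 2 +/- i*sqrt(3).
disc = cmath.sqrt(b * b - 4 * a * c)
r1 = (-b + disc) / (2 * a)
r2 = (-b - disc) / (2 * a)

# Vieta's formulas: product of roots = c/a, sum of roots = -b/a.
product = r1 * r2
print(round(product.real, 6))  # → 7.0
```

Even though the individual roots are complex, their product is the real number c/a — exactly the shortcut a strong competition solver (human or model) is expected to spot.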

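Conceptually, a SimpleQA-style score boils down to normalized exact matching over question/answer pairs. The sketch below shows the idea; the `normalize` and `grade` helpers and the sample pairs are illustrative, not the actual benchmark code:

```python
def normalize(text: str) -> str:
    """Lowercase, trim, and strip punctuation so '1989.' matches '1989'."""
    kept = (ch for ch in text.lower().strip() if ch.isalnum() or ch.isspace())
    return "".join(kept).strip()

def grade(prediction: str, reference: str) -> bool:
    """Exact match after normalization -- a simplified SimpleQA-style check."""
    return normalize(prediction) == normalize(reference)

# Illustrative (question, reference, model prediction) triples.
qa_pairs = [
    ("What year did the Berlin Wall fall?", "1989", "1989."),
    ("Who wrote '1984'?", "George Orwell", "george orwell"),
]

score = sum(grade(pred, ref) for _q, ref, pred in qa_pairs) / len(qa_pairs)
print(f"accuracy: {score:.0%}")  # → accuracy: 100%
```

The real benchmark is of course larger and its grading more careful (e.g. handling aliases and partially correct answers), but the core metric is this kind of per-question accuracy.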
If Gemini 2.5 were a student and the above were the exams it sat for:

Overall, Gemini 2.5 turned out to be a B+ student with bursts of genius.

Finally, here are the latest LiveBench rankings of LLMs, where we can see Gemini 2.5 acing most of the tests.

On a funny note, here's what ChatGPT (GPT-4o) had to say about Gemini 2.5 when I fed it these hypothetical grades :) - 'Gemini is that kid who sleeps through History but aces advanced calculus, writes code like a wizard, and reads entire textbooks before lunch. A little quirky, but definitely gifted. Needs snacks during tests.'

Thanks for reading!

Nikhil (Srikrishna) Challa