How Good is Claude 3 Opus Compared to GPT-4, Gemini Ultra?



Formatting will be easier to read in the original article.

Hey Everyone,

With Claude 3 Opus now available on the Poe App by Quora, it’s a really exciting time to dig into the benchmarks around the leading LLMs.

I was so curious about the benchmarks and leaderboards that I asked Alex Irina Sandu of The Strategy Deck to dig into them in greater detail and help us visualize the results.

Product and Corporate Strategy Expert

I’m really a fan of her ability to visualize complex business scenarios, including the competitive landscapes of startups and products. I often rely on her market research and infographics when she joins as a guest contributor, as in this post.

Learn More


Contact Her Directly

You can read her bio at the end of the article. Or contact her on LinkedIn.


Articles to Revisit

  1. Market Map: Gen AI Companies with Foundational Models
  2. How BigTech Invested in AI Companies in 2023
  3. An Overview of Google’s AI Product Strategy


From our sponsor:

Who is Nebius AI?

AI-centric cloud platform Nebius AI offers GPUs from NVIDIA’s latest lineup — the L4, L40, and H100, along with the last-gen A100. There is a special offer for all our subscribers: sign up and receive a $1,000 USD trial for testing the platform.


Start Training


If you want to support my emerging tech coverage and get more deep dives in AI, consider subscribing and visiting my other publications.


Subscribe now




By Alex Irina Sandu, March 14th, 2024.


With every release of a Large Language Model comes a technical report on its architecture and benchmark results, as well as comparisons with peer models. There are a lot of AI tests to measure skills ranging from general understanding of text to specialized knowledge and abilities in a specific field.

In recent months, eight benchmarks have become the most popular for inclusion in a new model’s report, and they can be used to directly compare performance across major releases.

They are:

  • MMLU, DROP and HellaSwag for language understanding and reasoning
  • MATH, GSM8K, MGSM and HumanEval for math and programming
  • Big Bench Hard for advanced, abstract tasks

This post describes what each of these tests measures and visualizes results of the largest, most recent models from Anthropic (Claude 3 Opus), OpenAI (GPT-4) and Google (Gemini Ultra), as they are reported in the models’ technical papers.

If you share the infographics in this post, please credit the author and the source.


The MMLU (Massive Multitask Language Understanding) benchmark, developed in 2021 by researchers from UC Berkeley, Columbia University and University of Chicago, evaluates the world knowledge and problem-solving capabilities of text models across 57 tasks. These tasks encompass a wide range of subjects, including elementary mathematics, US history, computer science, law, and more.

The estimated human expert-level accuracy on this test is 89.5%, per its technical report. The latest LLMs perform at around that level on 5-shot evaluations, with Claude 3 Opus and GPT-4 scoring around 86% and Gemini Ultra at almost 84%.
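
To make the 5-shot setup concrete, here is a minimal sketch of how an MMLU-style multiple-choice prompt is typically assembled: a handful of solved examples followed by the question to be answered. The items, the `dev_shots` list and the formatting are illustrative assumptions, not the official evaluation harness.

```python
# Minimal sketch of a few-shot MMLU-style prompt. The example items are
# invented placeholders, not real MMLU questions.

def format_item(question, choices, answer=None):
    letters = "ABCD"
    lines = [question]
    lines += [f"{letters[i]}. {c}" for i, c in enumerate(choices)]
    lines.append(f"Answer: {answer}" if answer else "Answer:")
    return "\n".join(lines)

# In a real 5-shot run there would be five solved examples ("shots") here.
dev_shots = [
    ("What is 2 + 2?", ["3", "4", "5", "6"], "B"),
    ("Which gas do plants absorb for photosynthesis?",
     ["Oxygen", "Nitrogen", "Carbon dioxide", "Helium"], "C"),
]

def build_prompt(test_question, test_choices):
    shots = "\n\n".join(format_item(q, c, a) for q, c, a in dev_shots)
    return shots + "\n\n" + format_item(test_question, test_choices)

print(build_prompt("Which planet is closest to the Sun?",
                   ["Venus", "Mercury", "Earth", "Mars"]))
# The model's predicted answer letter (A-D) is compared with the gold letter,
# and accuracy is averaged across MMLU's 57 subjects.
```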

The DROP (Discrete Reasoning Over Paragraphs) benchmark, created in 2019 by a team at the Allen Institute for AI, challenges AI models to perform complex reasoning over textual data. It requires models to parse detailed narratives and it measures their ability to perform tasks like numerical reasoning, understanding events in sequence, and extracting information that may not be explicitly stated.

According to the technical report, expert human performance on this test is 96.4%. The latest models fall short of that, with results between 80% and 83%.
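
As a rough illustration of what “reasoning over paragraphs” means, here is a tiny DROP-style example. The passage, question and scoring function are invented for this sketch, not taken from the dataset, and real DROP grading uses exact match plus a token-level F1 rather than the bare numeric check shown here.

```python
# Invented DROP-style item: the answer must be computed from numbers in the
# passage rather than copied out of it verbatim.

passage = ("The home team scored 21 points in the first half "
           "and 13 points in the second half.")
question = "How many points did the home team score in total?"

gold_answer = 21 + 13  # the model has to extract 21 and 13, then add them

def numeric_match(prediction: str, gold: float) -> bool:
    """Bare numeric comparison; a real grader also handles spans and dates."""
    try:
        return float(prediction) == float(gold)
    except ValueError:
        return False

print(numeric_match("34", gold_answer))  # True
```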

HellaSwag (Harder Endings, Longer Contexts and Low-shot Activities for Situations With Adversarial Generations), developed in 2019 by researchers at the University of Washington and the Allen Institute for AI, assesses a model's commonsense reasoning ability. It presents scenarios requiring the model to predict plausible continuations, relying on its understanding of real-world dynamics, cause-and-effect relationships, and everyday knowledge. This benchmark gauges how well AI can navigate and interpret typical human experiences and narratives.

According to its technical report, typical human results on this test are above 95%. Two of the three LLMs featured here have reached this level on 10-shot evaluations, with Claude 3 Opus and GPT-4 scoring 95% and Gemini Ultra 74%.
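
Benchmarks like HellaSwag are usually scored by asking which candidate ending the model finds most likely. The sketch below shows only that selection logic; `score_continuation` is a stand-in for a real model call (here it returns a deterministic pseudo-random score so the snippet runs on its own), and the scenario is an invented example rather than a HellaSwag item.

```python
# Sketch of multiple-choice scoring: the model assigns a (length-normalized)
# log-likelihood to each candidate ending and the highest-scoring one counts
# as its answer.
import random

def score_continuation(context: str, ending: str) -> float:
    """Placeholder for a real model call that would return the average
    per-token log-probability of `ending` given `context`."""
    rng = random.Random(hash((context, ending)) % (2 ** 32))
    return -rng.uniform(0.5, 3.0)

def pick_ending(context: str, endings: list[str]) -> int:
    scores = [score_continuation(context, e) for e in endings]
    return max(range(len(endings)), key=scores.__getitem__)

context = "She put the kettle on the stove and"
endings = [
    "waited for the water to boil.",          # the commonsense continuation
    "the stove flew out of the window.",
    "painted the kettle with peanut butter.",
    "recited the alphabet backwards to it.",
]
# With the placeholder scorer the pick is arbitrary; a capable LLM should
# assign the highest likelihood to the first, commonsense ending.
print(endings[pick_ending(context, endings)])
```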

If you share the infographics in this post, please credit the author and the source.


MATH, developed in 2021 by researchers at UC Berkeley and UChicago, is designed to evaluate AI models' capabilities in math and comprises 12,500 problems from high school competitions. The LLMs get a little more than half of them right, with scores of 61% for Claude 3 Opus, 52.8% for GPT-4 and 53.2% for Gemini Ultra.
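
MATH reference solutions conventionally mark the final answer with \boxed{...}, so graders extract and compare that expression. The snippet below is a minimal sketch of that idea; real graders also normalize LaTeX (fractions, units, equivalent forms), which is deliberately omitted here.

```python
# Minimal MATH-style answer check: pull the \boxed{...} expression from the
# model output and the reference solution and compare them. Nested braces and
# LaTeX normalization (e.g. \frac{1}{2} vs 0.5) are ignored in this sketch.
import re

def extract_boxed(solution: str) -> str | None:
    match = re.search(r"\\boxed\{([^{}]*)\}", solution)
    return match.group(1).strip() if match else None

model_output = r"The roots sum to $-b/a = 7$, so the answer is \boxed{7}."
reference = r"... therefore the sum of the roots is \boxed{7}."

print(extract_boxed(model_output) == extract_boxed(reference))  # True
```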

GSM8K, developed in 2021 by researchers from OpenAI, contains 8,500 high-quality, linguistically diverse grade school math word problems. The problems take between 2 and 8 steps to solve, and solutions primarily involve performing a sequence of elementary calculations using basic arithmetic operations (+, -, /, *) to reach the final answer.

According to the technical report, a bright middle school student should be able to solve every problem. Claude 3 Opus, GPT-4 and Gemini Ultra all reach high scores on this test, ranging from 92% to 95%.
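
To make the “2 to 8 elementary steps” concrete, here is an invented GSM8K-style problem worked out step by step, together with the dataset’s answer convention: reference solutions end with a line of the form “#### &lt;final answer&gt;”, which evaluators parse out and compare.

```python
# An invented GSM8K-style word problem (not from the dataset) solved with a
# short chain of elementary arithmetic steps.

problem = ("A bakery sells muffins for $3 each. It bakes 12 muffins in the "
           "morning and 8 in the afternoon, and 5 go unsold. How much money "
           "does it make?")

baked = 12 + 8        # step 1: total muffins baked
sold = baked - 5      # step 2: muffins actually sold
revenue = sold * 3    # step 3: dollars earned
print(revenue)        # 45

def extract_final_answer(solution: str) -> str:
    """Pull the value after '####', the GSM8K reference-answer marker."""
    return solution.split("####")[-1].strip()

print(extract_final_answer("... so the bakery makes $45.\n#### 45"))  # "45"
```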

MGSM (Multilingual Grade School Math) is a collection of 250 grade-school math problems from GSM8K, translated into 10 languages: Spanish, French, German, Russian, Chinese, Japanese, Thai, Swahili, Bengali and Telugu.

Among the three models, Claude 3 Opus performs significantly better, with a score of 90.7%, followed by Gemini Ultra with 79% and GPT-4 with 74.5%.

HumanEval, introduced in 2021 by a team of researchers, including those from OpenAI, is designed to assess the coding capabilities of models, focusing on their ability to generate functional Python code.

Among the three models, Claude 3 Opus performs best, with 84.9% on 0-shot evaluations, followed by Gemini Ultra with 74.4% and GPT-4 with 67%.
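
A HumanEval task gives the model a function signature and docstring, and the task counts as solved only if the completed function passes hidden unit tests. The sketch below mimics that check with a hard-coded “completion”; a real harness would sandbox the exec() call and sample many completions to compute pass@k.

```python
# Sketch of a HumanEval-style functional check. The "model_completion" is
# hard-coded for illustration; running untrusted generated code with exec()
# should only ever happen inside a sandbox.

task_prompt = "def add(a, b):\n    \"\"\"Return the sum of a and b.\"\"\"\n"
model_completion = "    return a + b\n"   # pretend this came from the LLM

test_program = task_prompt + model_completion + """
def check(candidate):
    assert candidate(2, 3) == 5
    assert candidate(-1, 1) == 0

check(add)
"""

def passes(program: str) -> bool:
    try:
        exec(program, {})   # unsafe outside a sandbox; fine for this sketch
        return True
    except Exception:
        return False

print(passes(test_program))  # True -> this sample would count toward pass@1
```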

If you share the infographics in this post, please credit the author and the source.


BIG-Bench-Hard, released in 2022, is an advanced benchmark designed to evaluate the abilities of large language models in tackling challenging tasks that are considered hard for current AI systems. These include problems that require creative thinking, nuanced understanding of human emotions and intentions, and the ability to handle abstract concepts or poorly defined problems.

The tasks are diverse, ranging from advanced language understanding and commonsense reasoning to creative problem solving and ethical judgment.

Current models do well on this benchmark with 3-shot Chain-of-Thought prompting, with scores between 83% for GPT-4 and 86.8% for Claude 3 Opus.
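
For context, “3-shot Chain-of-Thought” means the prompt contains three worked examples whose reasoning is written out before the test question is asked. Below is a minimal sketch of such a prompt; the example questions are invented, not actual BIG-Bench-Hard items.

```python
# Sketch of a 3-shot Chain-of-Thought prompt: three worked examples with
# step-by-step reasoning, then the test question the model must answer.

cot_shots = [
    ("Q: I have two apples and buy three more. How many apples do I have?",
     "A: I start with 2 apples and add 3, so 2 + 3 = 5. The answer is 5."),
    ("Q: Is the word 'level' a palindrome?",
     "A: Reversed, 'level' is still 'level', so yes. The answer is yes."),
    ("Q: Which is larger, 0.7 or 0.56?",
     "A: 0.70 is greater than 0.56. The answer is 0.7."),
]

def build_cot_prompt(question: str) -> str:
    shots = "\n\n".join(f"{q}\n{a}" for q, a in cot_shots)
    return f"{shots}\n\nQ: {question}\nA:"

print(build_cot_prompt("If all blickets are wugs and no wugs are red, "
                       "can a blicket be red?"))
# The model is expected to continue with step-by-step reasoning and finish
# with a line like "The answer is no."
```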

It’s been great to see the rapid advance of AI models on benchmarks, culminating in the most recent results. But it’s not over! While results on language understanding and reasoning are impressive, hallucinations are still an important issue that the industry needs to address in order to provide even more reliable foundation models and AI applications. Similarly, performance on math problems is a great area to tackle, and it will be amazing to see LLMs become able to solve the hardest, most advanced challenges. And last, but certainly not least, better multilingual support for people across the globe is another achievement that will come (hopefully) soon.

Author’s Bio

Alex Sandu, the author of this guest post, a popular contributor to AI Supremacy, and the writer of The Strategy Deck, a newsletter focused on AI market analysis, is seeking a new role.

  • With expertise in building global consumer, developer and open source tech products, Alex brings 15 years of experience in cross-functional management roles, including Technical Product Management and Corporate Strategy.
  • Alex is a seasoned expert in building features from concept to launch for hundreds of millions of users worldwide, driving product strategy through data and insights, and managing strategic operations and annual planning for large organizations.
  • An experienced Product and Strategy Manager, Alex excels in aligning product vision with customer and market requirements, building competitive differentiation for your company and driving product development and impact in market.

To talk to Alex about how she can be a valuable addition to your team, reach out at alex [at] TheStrategyDeck [dot] com, or on LinkedIn.

Editor’s Notes

I hope you found this article stimulating. Let’s talk a little more about all of these commonly used benchmarks so you can get a layman’s overview.

LLM benchmarks and leaderboards have become a bit “over-hyped” in early 2024.

Model Performance Across Key LLM Benchmarks

These are the most commonly used LLM benchmarks in models’ technical reports:

  1. MMLU - Multitask accuracy
  2. HellaSwag - Reasoning
  3. HumanEval - Python coding tasks
  4. BBHard - Probing models for future capabilities
  5. GSM-8K - Grade school math
  6. MATH - Math problems with 7 difficulty levels

In general, as of March 2024, Claude 3 Opus has the best average score across these benchmarks. Some users have found that Opus excels at human alignment in ways that make it their preferred tool.

Further Reading

42 slides:

Read more into Claude 3


Anthropic’s benchmark results across ten different evaluations show Claude 3 beating Gemini and GPT-4 on every one of them.

However, it’s looking like a lot of these benchmarks need a redo five years on. When benchmarks are used this heavily in marketing, they are likely to be gamed, although this is just my opinion.


Try Claude 3


But how do these model performance benchmarks actually work? You might want to read their papers.

Commonsense Inference

  • HellaSwag - Measuring Commonsense Inference

paper | dataset [released 2019]

Read the Paper

Reasoning Benchmark

ARC - Reasoning benchmark

paper | dataset [released 2019]

Read the Paper

Reading Comprehension + Discrete Reasoning

DROP - A Reading Comprehension + Discrete Reasoning Benchmark

paper | dataset [released 2019]

Read the Paper

Multitask Accuracy

MMLU - Measuring Massive Multitask Language Understanding

paper | dataset [released 2021]

Read the Paper

Measuring Truthfulness

TruthfulQA

paper | dataset [released 2022]

Read the Paper


12,500 Questions on Mathematical Reasoning

MATH - Arithmetic Reasoning

paper | dataset [released 2021]

Read the Paper

Arithmetic Reasoning

GSM8K - Arithmetic Reasoning

paper | dataset [released 2021]

Read the Paper

Code Generation Tasks

HumanEval - Coding Benchmark

paper | dataset [released 2021]

Read the Paper

Python Coding Fundamentals

MBPP - Coding Benchmark for Python problems

paper | dataset [released 2021]

Read the Paper

Limitations of LLM Benchmarks

While a full treatment of the limitations is beyond the scope of this editorial, we can draw some quick conclusions:

  1. Many benchmarks have a restricted scope, usually targeting capabilities on which LLMs have already proven some proficiency.
  2. One limitation is the lack of distinction between questions that can be answered using the internal knowledge of LLMs and those that require external tools.
  3. Critics argue that LLM benchmarks can be unreliable due to factors such as training data contamination (a rough sketch of one contamination check follows this list) and models overperforming on carefully crafted inputs.
  4. Benchmarks for language modeling often don't stay useful for long. Once models reach human-level performance on them, they are typically retired, swapped out, or updated with harder challenges.
  5. AI benchmarks are flawed, suffering from dataset contamination and biases, and they are often not representative of real-world use cases, but until we find alternatives they will continue to be created and used.
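
As one concrete example of the contamination problem mentioned above, a crude check is to count how many long word n-grams from a benchmark item already occur in the training corpus; heavy overlap suggests the item may simply have been memorized. The snippet below is a simplified sketch of that idea; real decontamination pipelines use far larger corpora, text normalization and more robust matching.

```python
# Crude contamination probe: count how many 8-word n-grams from a benchmark
# item already appear in a training-corpus snippet. Both texts here are
# invented for illustration.

def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

benchmark_item = ("A train travels 60 miles in the first hour and 45 miles "
                  "in the second hour. How far does it travel in total?")
training_snippet = ("... a train travels 60 miles in the first hour and 45 "
                    "miles in the second hour, so the total distance is ...")

overlap = ngrams(benchmark_item) & ngrams(training_snippet)
print(f"{len(overlap)} shared 8-grams")  # non-zero overlap flags possible contamination
```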

Some of the points above are per Vellum.AI.

Future Capabilities

BigBench - Predicting future potential

paper | dataset [released 2022]

Read the Paper


