How Good is Claude 3 Opus Compared to GPT-4, Gemini Ultra?
Michael Spencer
A.I. Writer, researcher and curator - full-time Newsletter publication manager.
Hey Everyone,
With Claude 3 Opus now available on the Poe App by Quora, it’s a really exciting time to dig into the benchmarks around the leading LLMs.
I was so curious about the benchmarks and leaderboards, I asked Alex Irina Sandu of The Strategy Deck to dig into this in greater detail and help us visualize this.
Product and Corporate Strategy Expert
I’m a big fan of her ability to visualize complex business scenarios, from the competitive landscapes of startups to product strategy. I rely on her market research and infographics often, and she appears here as a guest contributor, as in this post.
You can read her bio at the end of the article. Or contact her on LinkedIn.
Articles to Revisit
From our sponsor:
Who is Nebius AI?
AI-centric cloud platform Nebius AI offers GPUs from NVIDIA’s latest lineup — the L4, L40, and H100, along with the last-gen A100. There is a special offer for all our subscribers: sign up and receive a $1,000 USD trial for testing the platform.
If you want to support my emerging tech coverage and get more deep dives in AI, consider supporting my efforts, and visit my other publications.
By Alex Irina Sandu, March 14th, 2024.
With every release of a Large Language Model comes a technical report on its architecture and benchmark results, as well as comparisons with peer models. There are a lot of AI tests to measure skills ranging from general understanding of text to specialized knowledge and abilities in a specific field.
Over the past few months, eight benchmarks have become the most popular for inclusion in a new model’s report, and they can be used to directly compare performance across major releases.
They are: MMLU, DROP, HellaSwag, MATH, GSM8K, MGSM, HumanEval, and BIG-Bench-Hard.
This post describes what each of these tests measures and visualizes results of the largest, most recent models from Anthropic (Claude 3 Opus), OpenAI (GPT-4) and Google (Gemini Ultra), as they are reported in the models’ technical papers.
If you share the infographics in this post, please credit the author and the source.
The MMLU (Massive Multitask Language Understanding) benchmark, developed in 2021 by researchers from UC Berkeley, Columbia University and University of Chicago, evaluates the world knowledge and problem-solving capabilities of text models across 57 tasks. These tasks encompass a wide range of subjects, including elementary mathematics, US history, computer science, law, and more.
The estimated human expert-level accuracy on this test is 89.5%, per its technical report. The latest LLMs perform at around the same level on 5-shot tries, with Claude 3 Opus and GPT-4 scoring at 86% and Gemini Ultra at almost 84%.
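To make the “5-shot” setting concrete, here is a minimal sketch of how a few-shot multiple-choice prompt is typically assembled and scored for accuracy. The question data and formatting details are illustrative assumptions, not the official MMLU evaluation harness.

```python
# Illustrative sketch of 5-shot multiple-choice evaluation (not the official MMLU harness).

def format_question(item):
    """Render one multiple-choice item in a typical MMLU-style layout."""
    choices = "\n".join(f"{letter}. {text}" for letter, text in zip("ABCD", item["choices"]))
    return f"Question: {item['question']}\n{choices}\nAnswer:"

def build_five_shot_prompt(solved_examples, test_item):
    """Prepend five solved examples (question plus correct letter) before the test question."""
    shots = "\n\n".join(format_question(ex) + f" {ex['answer']}" for ex in solved_examples[:5])
    return shots + "\n\n" + format_question(test_item)

def accuracy(predicted_letters, gold_letters):
    """MMLU reports plain accuracy: the fraction of questions answered with the correct letter."""
    return sum(p == g for p, g in zip(predicted_letters, gold_letters)) / len(gold_letters)
```

The model’s reply is reduced to a single letter (A, B, C or D) and compared against the answer key, and that accuracy is aggregated over the 57 subjects to give scores like the ones above.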
The DROP (Discrete Reasoning Over Paragraphs) benchmark, created in 2019 by a team at the Allen Institute for AI, challenges AI models to perform complex reasoning over textual data. It requires models to parse detailed narratives and it measures their ability to perform tasks like numerical reasoning, understanding events in sequence, and extracting information that may not be explicitly stated.
According to the technical report, expert human performance on this test is 96.4%. The latest models fall short of that, with results between 80% and 83%.
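To give a sense of what “reasoning over paragraphs” means in practice, here is a made-up DROP-style item (it is not taken from the actual dataset): the answer never appears verbatim in the passage and has to be computed from numbers scattered across it.

```python
# Made-up DROP-style item, for illustration only (not from the real dataset).
passage = (
    "The Falcons kicked a 42-yard field goal before halftime "
    "and added a 25-yard field goal late in the fourth quarter."
)
question = "How many yards longer was the first field goal than the second?"

# Answering correctly requires extracting both numbers and subtracting them;
# the value 17 is never stated in the passage itself.
answer = 42 - 25
print(answer)  # 17
```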
HellaSwag (Harder Endings, Longer Contexts and Low-shot Activities for Situations With Adversarial Generations), developed in 2019 by researchers at the University of Washington and the Allen Institute for AI, assesses a model's commonsense reasoning ability. It presents scenarios requiring the model to predict plausible continuations, relying on its understanding of real-world dynamics, cause-and-effect relationships, and everyday knowledge. This benchmark gauges how well AI can navigate and interpret typical human experiences and narratives.
According to its technical report, the typical human results on this test are >95%. Two of the three LLMs featured here have reached this level with 10-shot tries, with Claude 3 Opus and GPT-4 having scored 95%, and Gemini Ultra 74%.
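In practice, HellaSwag is usually scored by asking which of four candidate endings the model rates as most likely given the context. The sketch below assumes a hypothetical `sequence_logprob(context, ending)` scoring function standing in for a real model; it is not the official evaluation code.

```python
# Illustrative HellaSwag-style scoring: choose the ending the model finds most probable.
# `sequence_logprob` is a hypothetical stand-in for a real model's log-probability function.

def pick_ending(context, endings, sequence_logprob):
    """Return the index of the candidate ending with the highest model score."""
    scores = [sequence_logprob(context, ending) for ending in endings]
    return max(range(len(endings)), key=lambda i: scores[i])

# Toy demo with a scorer that simply prefers shorter endings (a stand-in for an actual LLM):
toy_scorer = lambda context, ending: -len(ending)
context = "She plugged in the kettle and waited."
endings = [
    "Soon the water began to boil.",
    "The kettle turned into a pumpkin and rolled out the door.",
]
print(pick_ending(context, endings, toy_scorer))  # 0
```

An answer counts as correct when the highest-scoring ending matches the human-written one, and accuracy over the full set gives percentages like those reported above.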
MATH, developed in 2021 by researchers at UC Berkeley and UChicago, is designed to evaluate AI models' capabilities in math and comprises 12,500 problems from high school competitions. The LLMs get more than half of them right, with scores of 61% for Claude 3 Opus, 52.8% for GPT-4 and 53.2% for Gemini Ultra.
GSM8K, developed in 2021 by researchers from OpenAI, contains 8,500 high-quality, linguistically diverse grade school math word problems. The problems take between 2 and 8 steps to solve, and solutions primarily involve performing a sequence of elementary calculations using basic arithmetic operations (+ - / *) to reach the final answer.
According to the technical report, a bright middle school student should be able to solve every problem. Claude 3 Opus, GPT-4 and Gemini Ultra all reach high scores on this test, ranging from 92% to 95%.
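As an illustration of that “2 to 8 elementary steps” format, here is a made-up grade-school word problem (not taken from the real dataset), worked out as the short chain of basic arithmetic operations a solver would string together.

```python
# Made-up GSM8K-style problem, for illustration only:
# "A bakery sells muffins for $3 each. On Monday it sold 14 muffins and on
#  Tuesday it sold twice as many. How much money did the bakery make in total?"

monday_muffins = 14
tuesday_muffins = 2 * monday_muffins              # step 1: 28 muffins
total_muffins = monday_muffins + tuesday_muffins  # step 2: 42 muffins
revenue = 3 * total_muffins                       # step 3: $126
print(revenue)  # 126
```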
MGSM (Multilingual Grade School Math) is a collection of 250 grade-school math problems from GSM8K translated into 10 languages: Spanish, French, German, Russian, Chinese, Japanese, Thai, Swahili, Bengali and Telugu.
Among the three models, Claude 3 Opus performs significantly better, with a score of 90.7%, followed by Gemini Ultra with 79% and GPT-4 with 74.5%.
HumanEval, introduced in 2021 by a team of researchers, including those from OpenAI, is designed to assess the coding capabilities of models, focusing on their ability to generate functional Python code.
Among the three models, Claude 3 Opus performs the best, with 84.9% for 0-shot tries, followed by Gemini Ultra with 74.4% and GPT-4 with 67%.
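Each HumanEval task hands the model a Python function signature and docstring and checks the generated body by running unit tests; a completion counts only if every test passes. The example below mimics that shape with a made-up task and made-up tests rather than an actual problem from the benchmark.

```python
# Made-up HumanEval-style task (illustrative; not an actual benchmark problem).
# The model sees the signature and docstring and must generate the body,
# which is then executed against hidden unit tests.

def running_max(values):
    """Return a list where element i is the maximum of values[: i + 1]."""
    result, current = [], float("-inf")
    for v in values:
        current = max(current, v)
        result.append(current)
    return result

def check(candidate):
    """Hidden tests of the kind used to decide whether a completion passes."""
    assert candidate([1, 3, 2, 5, 4]) == [1, 3, 3, 5, 5]
    assert candidate([]) == []
    assert candidate([-2, -5, -1]) == [-2, -2, -1]

check(running_max)  # the completion is credited only if all assertions pass
```

The “0-shot” scores above mean the model is given only the function stub to complete, with no solved coding examples in the prompt.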
BIG-Bench-Hard was released in 2023 and is an advanced benchmark designed to evaluate the abilities of large language models in tackling challenging tasks that are considered hard for current AI systems. These include problems that require creative thinking, nuanced understanding of human emotions and intentions, and the ability to handle abstract concepts or poorly defined problems.
The tasks are diverse, ranging from advanced language understanding and commonsense reasoning to creative problem solving and ethical judgment.
Current models do well on this benchmark on 3-shot Chain-of-Thought tries, with scores between 83% for GPT-4 and 86.8% for Claude 3 Opus.
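The “3-shot Chain-of-Thought” setting means the prompt includes three solved examples whose answers spell out intermediate reasoning before the final answer, and the model is expected to do the same for the new question. Below is a minimal sketch of how such a prompt might be assembled; the examples are made up and this is not the official BIG-Bench-Hard prompt format.

```python
# Minimal sketch of a 3-shot chain-of-thought prompt (made-up examples,
# not the official BIG-Bench-Hard prompts).
examples = [
    {
        "question": "I have 3 red pens and twice as many blue pens. How many pens do I have?",
        "reasoning": "Twice 3 is 6 blue pens, and 3 + 6 = 9.",
        "answer": "9",
    },
    {
        "question": "Is the word 'level' a palindrome?",
        "reasoning": "Reversed, 'level' is still 'level', so it reads the same both ways.",
        "answer": "yes",
    },
    {
        "question": "Which is larger, 2**5 or 5**2?",
        "reasoning": "2**5 is 32 and 5**2 is 25, and 32 is greater than 25.",
        "answer": "2**5",
    },
]

def build_cot_prompt(examples, new_question):
    """Concatenate solved examples, reasoning included, ahead of the new question."""
    shots = "\n\n".join(
        f"Q: {ex['question']}\nA: {ex['reasoning']} The answer is {ex['answer']}."
        for ex in examples
    )
    return f"{shots}\n\nQ: {new_question}\nA:"

print(build_cot_prompt(examples, "A train leaves at 3:00pm and arrives at 5:30pm. How long is the trip?"))
```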
It’s been great to see the rapid advance of AI models on benchmarks, culminating with the most recent results below. But it’s not over! While results on language understanding and reasoning are impressive, hallucinations are still an important issue that the industry needs to address in order to provide even more reliable foundation models and AI applications. Similarly, performance in math problems is a great area to tackle and it will be amazing to see LLMs become able to solve the hardest, most advanced challenges. And last, but certainly not least, better multilingual support for people across the globe is another achievement that will come (hopefully) soon.
Author’s Bio
Alex Sandu, the author of this guest post, is a popular contributor to AI Supremacy and the writer of The Strategy Deck, a newsletter focused on AI market analysis. She is currently seeking a new role.
To talk to Alex about how she can be a valuable addition to your team, reach out at alex [at] TheStrategyDeck [dot] com, or on LinkedIn.
Editor’s Notes
I hope you found this article stimulating. Let’s talk a little more about all of these commonly used benchmarks so you can get a layman’s overview.
LLM benchmarks and leaderboards have become a bit “over-hyped” in early 2024.
Model Performance Across Key LLM Benchmarks
These are the most commonly used LLM benchmarks in models’ technical reports:
In general, as of March 2024, Claude 3 Opus has the best average score across all benchmarks. Some users have found that Opus excels at human alignment in ways that make it their preferred tool.
Further Reading
Anthropic’s benchmark results across ten different evaluations show Claude 3 beating Gemini and GPT-4 on every one of them.
However, it looks like a lot of these benchmarks need a re-do five years later. When benchmarks are being used this much in marketing, they are likely to be manipulated, although this is just my opinion.
But how do these model performance benchmarks actually work? You might want to read their papers:
HellaSwag - Commonsense inference
ARC - Reasoning benchmark
DROP - Reading comprehension and discrete reasoning
MMLU - Measuring massive multitask language understanding
TruthfulQA - Measuring truthfulness
MATH - 12,500 questions on mathematical reasoning
GSM8K - Arithmetic reasoning
HumanEval - Code generation benchmark
MBPP - Coding benchmark for Python problems
Limitations of LLM Benchmarks
While a deep dive into the limitations is beyond the scope of my editorial notes, we can draw some very quick conclusions:
BigBench - Predicting future capabilities and potential