Where Will LLMs Be in the Next 12 Months?
Benchmarks. We normally think of technology development as an independent process dictated by markets. That is true, but markets also require companies to communicate why their products and services are better than the rest, and these products are complex. Performance on benchmarks signals that value. As a result, benchmarks have an outsized influence on the roadmaps of product companies: model development gets steered toward optimizing benchmark results, for good or bad.
Benchmarks emerge once a critical mass of work has happened in a given area. You can also read the arrival of a benchmark as a sign that our understanding of what is easy and what is difficult in that area has matured.
Here is a list of benchmarks with a short summary of each. Eyeball them to see where the work is happening.
Let me know what you think
-Venkata
1. General & Reasoning
1.1 General Reasoning / Vision
LiveXiv: A dynamic benchmark for visual document understanding using unseen images and questions; updated monthly.
1.2 General Reasoning / Agent
AgentBench: Evaluates autonomous agents across multi-task scenarios.
1.3 Mathematical Reasoning
FrontierMath: Presents extremely challenging math problems developed in collaboration with top mathematicians.
1.4 General Reasoning
ARC-AGI-2: A next-generation benchmark for solving novel, multi-domain reasoning puzzles; a toy sketch of the task format follows this subsection.
Humanity's Last Exam: A crowdsourced exam testing AI on abstract, multi-domain reasoning tasks with unreleased questions.
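To make the puzzle format concrete, here is a toy ARC-style task in Python. The train/test grid schema follows the public ARC releases; the assumption is that ARC-AGI-2 keeps the same structure, and the puzzle itself (a horizontal mirror rule) is invented for illustration.

# A toy ARC-style task showing the shape of the data: "train" pairs are
# demonstrations, "test" pairs are held out for scoring. Grids are small
# lists of integer color codes. The task here is made up.
task = {
    "train": [
        {"input": [[1, 0], [2, 0]], "output": [[0, 1], [0, 2]]},
        {"input": [[3, 0, 0]],      "output": [[0, 0, 3]]},
    ],
    "test": [
        {"input": [[0, 4], [0, 5]], "output": [[4, 0], [5, 0]]},
    ],
}

def solve(grid):
    """Candidate program: flip each row left-to-right."""
    return [list(reversed(row)) for row in grid]

# A solver is scored by exact match on the held-out test outputs.
assert all(solve(pair["input"]) == pair["output"] for pair in task["test"])
print("toy task solved")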
1.5 Real-World Engineering
RE-Bench: Simulates realistic engineering tasks to compare performance between humans and AI agents.
1.6 Factuality/Hallucination
FACTS Grounding: Evaluates an AI model's ability to ground its answers in provided source material to reduce hallucinations.
1.7 AI Governance
AIR-Bench 2024: A safety benchmark aligned with emerging government regulations and corporate policies via 314 risk categories.
1.8 Multi-disciplinary
OlympicArena Medal Ranks: Uses an Olympic medal table approach to rank AI models across multiple disciplines.
2. Code & IT
2.1 Code/Scientific Reasoning
GPQA Diamond Benchmark: Evaluates AI performance on expert-level science questions, emphasizing technical accuracy.
2.2 Software Engineering
IT Bench: Assesses AI performance on IT coding and automation tasks in enterprise environments.
SWE-bench Verified: Measures the ability of AI to resolve real-world GitHub issues and coding problems; a sketch of what one task looks like follows this list.
HumanEval-FIM: Evaluates AI's ability to fill in missing code (fill-in-the-middle) that passes unit tests.
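As a sketch of what one of these software-engineering tasks looks like, the snippet below loads SWE-bench Verified with the Hugging Face datasets library and prints a few fields of a single instance. The dataset id and field names are taken from the public release and should be treated as assumptions here, not a spec.

# Minimal sketch: inspect one SWE-bench Verified task via Hugging Face datasets.
from datasets import load_dataset

ds = load_dataset("princeton-nlp/SWE-bench_Verified", split="test")

task = ds[0]
print(task["instance_id"])              # identifier tying the task to a repo and issue
print(task["repo"])                     # GitHub repository the issue comes from
print(task["problem_statement"][:500])  # the issue text the model must resolve

# An agent is scored by producing a patch against task["base_commit"]
# that makes the previously failing tests pass.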
3. Legal
3.1 Legal (Generic)
LinksAI English Law Benchmark: Tests AI's capability to answer complex legal questions with precise citations.
Legal Document Review Benchmark: Evaluates how accurately AI can review, summarize, and analyze legal documents and contracts.
4. Medical
4.1 Medical (Generic)
AgentClinic: A multimodal agent benchmark to evaluate AI in simulated clinical environments.
5. Autonomous & Robotics
5.1 Robotics
RobotPerf: An open-source, vendor-agnostic benchmarking suite for evaluating robotics computing system performance.
5.2 Autonomous Driving
Autonomous Vehicle AI Benchmark: Multiple benchmarks across all areas, including reasoning, object detection, etc.
6. Financial
6.1 Financial (Generic)
Financial AI Benchmark: Robust benchmarks for the financial domain, covering models, agents, and deployment.
7. Cybersecurity
7.1 Cybersecurity
Cybersecurity AI Benchmark: Evaluates how well AI detects cybersecurity threats and manages vulnerabilities.
8. Energy
8.1 Energy (Generic)
AI Energy Score: Evaluates, identifies, and compares the energy consumption of AI models.
9. Agriculture
9.1 Agriculture
AgriBench: A hierarchical agriculture benchmark for multimodal large language models.
10. Education
10.1 Education (Generic)
Education AI Benchmarks: Multiple benchmarks covering all aspects of learning.
11. Multimodal
11.1 Multimodal (Generic)
Multimodal Interaction Benchmark: A survey of 180 multimodal LLM benchmarks.
12. Safety & Ethics
12.1 AI Governance
RAISE: Assesses enterprise AI governance and risk management practices.
AILuminate: Assesses the safety of text-to-text interactions with a general-purpose AI chat model.
LLM Safety Risk Benchmark: Ranks AI models based on risk profiles and regulatory compliance regarding harmful outputs.
13. Conversational
13.1 Conversational (Generic)
Conversational Context Benchmark: Dynamic conversational benchmarking of large language models.
Chatbot Arena Leaderboard: A public leaderboard that ranks chatbots from pairwise human preference votes; a simplified sketch of how such votes become ratings follows this list.
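For intuition on how pairwise votes turn into a leaderboard, here is a simplified Elo-style sketch in Python. Chatbot Arena's published methodology uses a Bradley-Terry style fit rather than online Elo updates, so treat this as an illustration of the idea, not the leaderboard's actual computation; the battle data is made up.

# Simplified Elo-style aggregation of pairwise "A vs. B" votes into ratings.
from collections import defaultdict

K = 32  # update step size

def expected(r_a, r_b):
    """Probability that A beats B under an Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update(ratings, model_a, model_b, score_a):
    """score_a: 1.0 if A wins, 0.0 if B wins, 0.5 for a tie."""
    e_a = expected(ratings[model_a], ratings[model_b])
    ratings[model_a] += K * (score_a - e_a)
    ratings[model_b] += K * ((1.0 - score_a) - (1.0 - e_a))

ratings = defaultdict(lambda: 1000.0)
battles = [("model_x", "model_y", 1.0),   # hypothetical votes
           ("model_x", "model_z", 0.5),
           ("model_y", "model_z", 0.0)]
for a, b, s in battles:
    update(ratings, a, b, s)

for model, rating in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(f"{model}: {rating:.1f}")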
14. Robustness
14.1 AI Governance
Robustness under Adversarial Conditions Benchmark: Tests AI models' resistance to adversarial attacks and noisy inputs.
15. On-device Performance
15.1 Performance
Geekbench AI: Evaluates CPU, GPU, and NPU performance for AI workloads with both accuracy and speed metrics.