Where Will LLMs Be in the Next 12 Months?
Benchmarks. We normally think of technology development as an independent process dictated by markets. That is true, but markets also require companies to communicate why their products and services are better than the rest, and these products are complex. Performance on benchmarks signals that value. As a result, benchmarks have an outsized influence on the roadmaps of product companies: model development gets steered toward optimizing benchmark results, for good or bad.
Benchmarks emerge once a critical mass of work has happened in a given area. You can also read the arrival of a benchmark as a sign that our understanding of what is easy and what is difficult in that area has matured.
Here is a list of benchmarks with a short summary of each. Eyeball them to see where the work is happening.
Let me know what you think
-Venkata
1. General & Reasoning
1.1 General Reasoning / Vision
LiveXiv: A dynamic benchmark for visual document understanding using unseen images and questions; updated monthly.
1.2 General Reasoning / Agent
AgentBench: Evaluates autonomous agents across multi-task scenarios.
1.3 Mathematical Reasoning
FrontierMath: Presents extremely challenging math problems developed in collaboration with top mathematicians.
1.4 General Reasoning
ARC-AGI-2: A next-generation benchmark for solving novel, multi-domain reasoning puzzles; a toy sketch of the task format follows this subsection.
Humanity's Last Exam: A crowdsourced exam testing AI on abstract, multi-domain reasoning tasks with unreleased questions.
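To make the puzzle format concrete, here is a toy ARC-style task in Python. The train/test grid schema follows the public ARC releases; the assumption is that ARC-AGI-2 keeps the same structure, and the puzzle itself (a horizontal mirror rule) is invented for illustration.

# A toy ARC-style task showing the shape of the data: "train" pairs are
# demonstrations, "test" pairs are held out for scoring. Grids are small
# lists of integer color codes. The task here is made up.
task = {
    "train": [
        {"input": [[1, 0], [2, 0]], "output": [[0, 1], [0, 2]]},
        {"input": [[3, 0, 0]],      "output": [[0, 0, 3]]},
    ],
    "test": [
        {"input": [[0, 4], [0, 5]], "output": [[4, 0], [5, 0]]},
    ],
}

def solve(grid):
    """Candidate program: flip each row left-to-right."""
    return [list(reversed(row)) for row in grid]

# A solver is scored by exact match on the held-out test outputs.
assert all(solve(pair["input"]) == pair["output"] for pair in task["test"])
print("toy task solved")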
1.5 Real-World Engineering
RE-Bench: Simulates realistic engineering tasks to compare performance between humans and AI agents.
1.6 Factuality/Hallucination
FACTS Grounding: Evaluates an AI model's ability to ground its answers in provided source material to reduce hallucinations.
1.7 AI Governance
AIR-Bench 2024: A safety benchmark aligned with emerging government regulations and corporate policies via 314 risk categories.
1.8 Multi-disciplinary
OlympicArena Medal Ranks: Uses an Olympic medal table approach to rank AI models across multiple disciplines.
2. Code & IT
2.1 Code/Scientific Reasoning
GPQA Diamond Benchmark: Evaluates AI performance on expert-level science questions, emphasizing technical accuracy.
2.2 Software Engineering
IT Bench: Assesses AI performance on IT coding and automation tasks in enterprise environments.
SWE-bench Verified: Measures the ability of AI to resolve real-world GitHub issues and coding problems; a sketch of what one task looks like follows this list.
HumanEval-FIM: Evaluates AI's ability to fill in missing code (fill-in-the-middle) that passes unit tests.
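As a sketch of what one of these software-engineering tasks looks like, the snippet below loads SWE-bench Verified with the Hugging Face datasets library and prints a few fields of a single instance. The dataset id and field names are taken from the public release and should be treated as assumptions here, not a spec.

# Minimal sketch: inspect one SWE-bench Verified task via Hugging Face datasets.
from datasets import load_dataset

ds = load_dataset("princeton-nlp/SWE-bench_Verified", split="test")

task = ds[0]
print(task["instance_id"])              # identifier tying the task to a repo and issue
print(task["repo"])                     # GitHub repository the issue comes from
print(task["problem_statement"][:500])  # the issue text the model must resolve

# An agent is scored by producing a patch against task["base_commit"]
# that makes the previously failing tests pass.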
3. Legal
3.1 Legal (Generic)
LinksAI English Law Benchmark: Tests AI's capability to answer complex legal questions with precise citations.
Legal Document Review Benchmark: Evaluates how accurately AI can review, summarize, and analyze legal documents and contracts.
4. Medical
4.1 Medical (Generic)
AgentClinic: A multimodal agent benchmark to evaluate AI in simulated clinical environments.
5. Autonomous & Robotics
5.1 Robotics
RobotPerf: An open-source, vendor-agnostic benchmarking suite for evaluating robotics computing system performance.
5.2 Autonomous Driving
Autonomous Vehicle AI Benchmark: Multiple benchmarks across all areas, including reasoning, object detection, etc.
6. Financial
6.1 Financial (Generic)
Financial AI Benchmark: Robust benchmarks for the financial domain, covering models, agents, and deployment.
7. Cybersecurity
7.1 Cybersecurity
Cybersecurity AI Benchmark: Evaluates how well AI detects cybersecurity threats and manages vulnerabilities.
8. Energy
8.1 Energy (Generic)
AI Energy Score: Evaluates, identifies, and compares the energy consumption of AI models.
9. Agriculture
9.1 Agriculture
AgriBench: A hierarchical agriculture benchmark for multimodal large language models.
10. Education
10.1 Education (Generic)
Education AI Benchmarks: Multiple benchmarks covering all aspects of learning.
11. Multimodal
11.1 Multimodal (Generic)
Multimodal Interaction Benchmark: A survey of 180 multimodal LLM benchmarks.
12. Safety & Ethics
12.1 AI Governance
RAISE: Assesses enterprise AI governance and risk management practices.
AILuminate: Assesses the safety of text-to-text interactions with a general-purpose AI chat model.
LLM Safety Risk Benchmark: Ranks AI models based on risk profiles and regulatory compliance regarding harmful outputs.
13. Conversational
13.1 Conversational (Generic)
Conversational Context Benchmark: Dynamic conversational benchmarking of large language models.
Chatbot Arena Leaderboard: A public leaderboard that ranks chatbots from pairwise human preference votes; a simplified sketch of how such votes become ratings follows this list.
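For intuition on how pairwise votes turn into a leaderboard, here is a simplified Elo-style sketch in Python. Chatbot Arena's published methodology uses a Bradley-Terry style fit rather than online Elo updates, so treat this as an illustration of the idea, not the leaderboard's actual computation; the battle data is made up.

# Simplified Elo-style aggregation of pairwise "A vs. B" votes into ratings.
from collections import defaultdict

K = 32  # update step size

def expected(r_a, r_b):
    """Probability that A beats B under an Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update(ratings, model_a, model_b, score_a):
    """score_a: 1.0 if A wins, 0.0 if B wins, 0.5 for a tie."""
    e_a = expected(ratings[model_a], ratings[model_b])
    ratings[model_a] += K * (score_a - e_a)
    ratings[model_b] += K * ((1.0 - score_a) - (1.0 - e_a))

ratings = defaultdict(lambda: 1000.0)
battles = [("model_x", "model_y", 1.0),   # hypothetical votes
           ("model_x", "model_z", 0.5),
           ("model_y", "model_z", 0.0)]
for a, b, s in battles:
    update(ratings, a, b, s)

for model, rating in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(f"{model}: {rating:.1f}")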
14. Robustness
14.1 AI Governance
Robustness under Adversarial Conditions Benchmark: Tests AI models' resistance to adversarial attacks and noisy inputs.
15. On-device Performance
15.1 Performance
Geekbench AI: Evaluates CPU, GPU, and NPU performance for AI workloads with both accuracy and speed metrics.