On the IQ of AI, or Why We Cannot Trust AI Benchmarks
https://sherwood.news/tech/how-do-ai-models-stack-up-vs-humans-on-standardized-benchmarks/

There is a great meta-review from the European Commission's Joint Research Centre (JRC):

Can We Trust AI Benchmarks? An Interdisciplinary Review of Current Issues in AI Evaluation

Below is our brief commentary on why we cannot trust AI benchmarks...

The Humanoid AI's IQ

AI’s IQ could be derived from a set of standardized tests or subtests designed to assess how well a machine fakes human intelligence:

  • the Turing Test, a method of inquiry in AI for determining whether or not a computer is capable of thinking like a human being
  • Data testing
  • Model testing
  • Code testing
  • Integration testing

AI/LLM benchmarks define the standard scenarios, tasks, contexts, settings, and conditions under which an LLM's performance is assessed or tested (a minimal scoring sketch follows the list):

  • Question Answering
  • Reasoning
  • Machine Translation
  • Text Generation and Natural Language Processing.
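
To make this concrete, here is a minimal, hypothetical sketch of a question-answering benchmark harness: it runs any model callable over a handful of items and reports exact-match accuracy. The `model_fn` callable and the toy items are illustrative assumptions, not any real benchmark.

```python
# Minimal sketch of a question-answering benchmark harness (illustrative only).
# `model_fn` stands in for any LLM call; the items below are made-up examples.
from typing import Callable, List, Tuple

def exact_match_accuracy(model_fn: Callable[[str], str],
                         items: List[Tuple[str, str]]) -> float:
    """Fraction of questions whose normalized answer exactly matches the reference."""
    hits = 0
    for question, reference in items:
        prediction = model_fn(question)
        if prediction.strip().lower() == reference.strip().lower():
            hits += 1
    return hits / len(items)

if __name__ == "__main__":
    toy_items = [("What is the capital of France?", "Paris"),
                 ("How many legs does a spider have?", "8")]
    placeholder_model = lambda q: "Paris"   # stands in for a real LLM call
    print(f"exact-match accuracy: {exact_match_accuracy(placeholder_model, toy_items):.2f}")
```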

Never Trust AI Benchmarks

There is a large number of benchmarking datasets (invented sets of trials and tests) covering accuracy, precision, latency or inference speed, reliability, general knowledge, bias, common-sense reasoning, mathematical problem-solving, efficiency, and cost. Most focus on text, while other modalities (audio, images, video, and multimodal systems) remain largely unexamined.
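
For the latency and inference-speed side of these test batteries, a hedged sketch of how per-query latency might be measured is given below; the stand-in model function and sample prompts are assumptions for illustration.

```python
# Sketch: measuring per-query latency for a stand-in model call (illustrative only).
import statistics
import time

def measure_latency(model_fn, prompts, warmup: int = 2):
    """Return (mean, p95) wall-clock latency in milliseconds over the given prompts."""
    for p in prompts[:warmup]:          # warm-up calls, excluded from timing
        model_fn(p)
    timings = []
    for p in prompts:
        start = time.perf_counter()
        model_fn(p)
        timings.append((time.perf_counter() - start) * 1000.0)
    timings.sort()
    p95 = timings[int(0.95 * (len(timings) - 1))]
    return statistics.mean(timings), p95

if __name__ == "__main__":
    fake_model = lambda prompt: "ok"    # placeholder for a real LLM call
    mean_ms, p95_ms = measure_latency(fake_model, ["test prompt"] * 50)
    print(f"mean {mean_ms:.3f} ms, p95 {p95_ms:.3f} ms")
```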

User privacy, copyright infringement, interpretability, ethics, safety, and explainability are practically missing altogether.

Some composite benchmarks combine a diversity of evaluation datasets and tasks, such as GLUE, SuperGLUE, MMLU, BIG-bench, HELM, and HLE (Humanity's Last Exam).
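
To show how such composite suites are typically consumed, here is a hedged sketch that scores a deliberately naive majority-class baseline on one GLUE subtask (SST-2) using the Hugging Face `datasets` and `evaluate` libraries; the baseline stands in for any real model.

```python
# Sketch: scoring a naive baseline on one GLUE subtask with Hugging Face tooling.
# Requires: pip install datasets evaluate
from datasets import load_dataset
import evaluate

val = load_dataset("glue", "sst2", split="validation")   # SST-2 validation split
metric = evaluate.load("glue", "sst2")                    # accuracy metric for SST-2

# Majority-class baseline: always predict "positive" (label 1); a real evaluation
# would replace this list with model predictions.
predictions = [1] * len(val)
result = metric.compute(predictions=predictions, references=val["label"])
print(result)   # e.g. {'accuracy': ...}
```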

Microsoft published PromptBench: A Unified Library for Evaluating and Understanding Large Language Models.

https://www.confident-ai.com/blog/the-current-state-of-benchmarking-llms

Overall, one cannot trust AI benchmarks: they suffer from the same issues as training datasets, and models can perform well in controlled, invented environments while failing in critical, real-world circumstances over time.

One of those issues is that the practical utility of benchmarks ignores the discriminatory and environmental damages of AI technologies, which has allowed highly energy-inefficient and deeply biased AI models to reach the top of most benchmark leaderboards.

For example, LLM leaderboards feature benchmarks such as HellaSwag, MMLU (Massive Multitask Language Understanding), GSM8K, and ARC, covering reasoning, commonsense, and in-depth text understanding, alongside metrics such as hallucination rate.
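
As a toy illustration of how something like a hallucination rate might be approximated, the sketch below flags any generated answer that does not appear verbatim in its reference passage; this naive substring check is our own simplification, not how any leaderboard actually computes the metric.

```python
# Naive sketch of a "hallucination rate": the share of generated answers that do not
# appear verbatim in the supplied reference passage (a deliberate oversimplification).
from typing import List, Tuple

def hallucination_rate(pairs: List[Tuple[str, str]]) -> float:
    """pairs = (generated_answer, reference_passage); returns the unsupported fraction."""
    unsupported = sum(
        1 for answer, passage in pairs
        if answer.strip().lower() not in passage.lower()
    )
    return unsupported / len(pairs)

if __name__ == "__main__":
    samples = [
        ("Paris", "Paris is the capital and largest city of France."),
        ("Berlin", "Paris is the capital and largest city of France."),  # unsupported
    ]
    print(f"hallucination rate: {hallucination_rate(samples):.2f}")  # 0.50
```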

It is all about increasing the AI hype: benchmarks "serve as the technological spectacle through which companies such as OpenAI and Google can market their technologies", good for nothing else.

The issue of optimising for high benchmark scores at the expense of insight and explanation is known as SOTA-chasing, a wild goose chase in which benchmark tests are rigged, tricked, and gamed.
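
One common way such gaming happens is data contamination, where benchmark items leak into training corpora. The sketch below is a simple heuristic, assuming an arbitrary n-gram size and threshold, for flagging test items whose word n-grams overlap heavily with a training corpus.

```python
# Sketch: flagging possible train/test contamination via word n-gram overlap
# (illustrative heuristic; the n-gram size and threshold are arbitrary choices).
def ngrams(text: str, n: int = 8) -> set:
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def contaminated(test_item: str, training_corpus: str,
                 n: int = 8, threshold: float = 0.5) -> bool:
    """True if a large share of the item's n-grams also occur in the training corpus."""
    item_grams = ngrams(test_item, n)
    if not item_grams:
        return False
    overlap = len(item_grams & ngrams(training_corpus, n)) / len(item_grams)
    return overlap >= threshold

if __name__ == "__main__":
    corpus = "the quick brown fox jumps over the lazy dog near the quiet river bank today"
    item = "the quick brown fox jumps over the lazy dog near the quiet river"
    print(contaminated(item, corpus))   # True: the test item appears almost verbatim
```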

Resources

Building General Intelligence Machines by 2027: MAN-MACHINE HYPERINTELLIGENCE as the Unified Intelligence Platform

Abstract.

Quantitative Artificial Intelligence (AI) Benchmarks have emerged as fundamental tools for evaluating the performance, capability, and safety of AI models and systems. Currently, they shape the direction of AI development and are playing an increasingly prominent role in regulatory frameworks. As their influence grows, however, so too do concerns about how and with what effects they evaluate highly sensitive topics such as capabilities, including high-impact capabilities, safety and systemic risks. This paper presents an interdisciplinary meta-review of about 100 studies that discuss shortcomings in quantitative benchmarking practices, published in the last 10 years. It brings together many fine-grained issues in the design and application of benchmarks (such as biases in dataset creation, inadequate documentation, data contamination, and failures to distinguish signal from noise) with broader sociotechnical issues (such as an over-focus on evaluating text-based AI models according to one-time testing logic that fails to account for how AI models are increasingly multimodal and interact with humans and other technical systems). Our review also highlights a series of systemic flaws in current benchmarking practices, such as misaligned incentives, construct validity issues, unknown unknowns, and problems with the gaming of benchmark results. Furthermore, it underscores how benchmark practices are fundamentally shaped by cultural, commercial and competitive dynamics that often prioritise state-of-the-art performance at the expense of broader societal concerns. By providing an overview of risks associated with existing benchmarking procedures, we problematise disproportionate trust placed in benchmarks and contribute to ongoing efforts to improve the accountability and relevance of quantitative AI benchmarks within the complexities of real-world scenarios.

AI benchmarks, Benchmark critique, AI evaluation, Safety evaluation, AI Regulation

Can We Trust AI Benchmarks? An Interdisciplinary Review of Current Issues in AI Evaluation

