Evolving AGI benchmarks
From https://lifearchitect.ai/basis/

The Turing test was long considered a solid benchmark for measuring AI capability, and it has now been well surpassed. Today there is a race both to create AGI and to measure whether a given "system or model" has reached it.

This question matters for all of humanity. The OpenAI saga can also be traced to it. Microsoft's agreement with OpenAI excludes models that have reached AGI, so it was important for Microsoft to have Sam Altman at OpenAI so that he could claim that the models they have created have not reached it. OpenAI's six-person board of directors will determine when the company has “attained AGI”, a threshold that will exclude Microsoft (in theory).

There are now two new benchmarks that I find fascinating: BASIS and GPQA.

BASIS

The idea behind this benchmark, developed with Mensa researcher and metaphysician Dr Jason Betts, is to design a suite of test items that prioritizes imminent artificial superintelligence (ASI) while also covering the lower ceiling of advanced AGI.

The BASIS project ensures that superintelligence can be appropriately assessed against very high human biological intelligence ceilings, and removes workarounds like holding out a subset of catalogued data (Common Crawl, books, journals, Wikipedia) from models. Instead, the testing mechanism is replaced with new, unique, and offline questions.

Every question is designed to have an answer that can be independently verified by at least one other human. Importantly, the question, the answer, and even the combination of keywords do not appear on the web and should never have been seen before.
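The "never seen before" requirement can be pictured as a check that no single document in a reference corpus contains all of a question's keywords together. The sketch below only illustrates that idea and is not the BASIS procedure: the function names, the tiny stopword list, and the two-sentence in-memory corpus are all assumptions, and a real check would need a web-scale index.

```python
import re

# Hypothetical, minimal stopword list; a real check would use a fuller one.
STOPWORDS = {"a", "an", "the", "of", "is", "in", "to", "and",
             "what", "which", "for", "on"}


def keywords(text):
    """Lowercase the text and return its set of non-stopword tokens."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {t for t in tokens if t not in STOPWORDS}


def combination_seen(question, corpus):
    """Return True if any single document contains every keyword of the question."""
    wanted = keywords(question)
    if not wanted:
        return False
    return any(wanted <= keywords(doc) for doc in corpus)


# Toy corpus standing in for "the web" (an assumption for illustration only).
corpus = [
    "The capital of France is Paris.",
    "Paris hosted the Olympic Games in 2024.",
]

print(combination_seen("What is the capital of France?", corpus))
print(combination_seen("Which capital hosted the 2024 Olympic Games?", corpus))
```

The first question is flagged as already seen because one document covers all of its keywords. The second is not, because its keywords are spread across two documents and no single one contains the whole combination.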

GPQA

GPQA is a challenging dataset of 448 multiple-choice questions written by domain experts in biology, physics, and chemistry. The authors ensure that the questions are high-quality and extremely difficult: experts who have or are pursuing PhDs in the corresponding domains reach 65% accuracy (74% when discounting clear mistakes the experts identified in retrospect), while highly skilled non-expert validators reach only 34% accuracy, despite spending on average over 30 minutes with unrestricted access to the web (i.e., the questions are "Google-proof").
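To put these accuracy figures in context, it helps to compare them with random guessing. The sketch below scores a set of multiple-choice answers and computes the chance baseline. The helper names are hypothetical, and the four-option format is an assumption not stated in the text above; only the 65% and 34% figures come from the description.

```python
def accuracy(predictions, answers):
    """Fraction of predictions that match the reference answers."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("at least one question is required")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)


def chance_baseline(num_options):
    """Expected accuracy of uniform random guessing over num_options choices."""
    return 1 / num_options


# Assumption: four answer options per question.
baseline = chance_baseline(4)

# Accuracy figures quoted in the text above.
reported = {"experts": 0.65, "non-expert validators": 0.34}

for group, acc in reported.items():
    margin = acc - baseline
    print(f"{group}: {acc:.0%} accuracy, {margin:.0%} above chance")
```

Under the four-option assumption, the non-expert validators sit only modestly above random guessing, which is what the "Google-proof" label is meant to convey.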

From the paper -

