When will AI systems be able to carry out long projects independently? In new research, we find a kind of "Moore's Law for AI agents": the length of tasks that AIs can do is doubling about every 7 months.

At a high level, our method is simple:
1. We ask both skilled humans and AI systems to attempt tasks in similar conditions.
2. We measure how long the humans take.
3. We then measure how AI success rates vary depending on how long the humans took to do those tasks.

We measure human and AI performance on a variety of software tasks. Human completion times on these tasks range from 1 second to 16 hours. We then fit a curve that predicts the success rate of an AI based on how long it took humans to do each task. This curve characterizes how capable an AI is at different task lengths. We then summarize the curve with the task length at which a model's success rate is 50% (a sketch of this step follows this post).

We are fairly confident in the rough trend of 1-4 doublings in horizon length per year. That is fast! Measures like these help make the notion of "degrees of autonomy" more concrete and let us quantify when AI abilities may rise above specific useful (or dangerous) thresholds.

We give more high-level information about these results and what they might imply on the METR blog: https://lnkd.in/gWpeChk3
For the details, read "Measuring AI Ability to Complete Long Tasks," now available on arXiv: https://lnkd.in/gR8fGXr4
If you are interested in contributing to more research like this, on quantitative evaluation of frontier AI capabilities, METR is hiring! https://hiring.metr.org/
Read more debate about these results and what they imply in Nature: https://lnkd.in/gA22d8Gn
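As an illustration of the fit-and-summarize step described above, here is a minimal sketch, not METR's actual analysis code: fit a logistic curve of AI success against the log of human completion time, then solve for the time at which predicted success crosses 50%. The toy data and variable names are invented for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data (invented): per-task human completion times (minutes) and whether
# the AI agent succeeded on that task (1/0).
human_minutes = np.array([1, 2, 4, 8, 15, 30, 60, 120, 240, 480, 960], dtype=float)
ai_success    = np.array([1, 1, 1, 1,  1,  1,  0,   1,   0,   0,   0])

# Fit success probability as a logistic function of log2(human time).
X = np.log2(human_minutes).reshape(-1, 1)
model = LogisticRegression().fit(X, ai_success)

# The 50% time horizon is where the predicted probability crosses 0.5,
# i.e. where the logit w * log2(t) + b equals zero.
w = model.coef_[0, 0]
b = model.intercept_[0]
horizon_minutes = 2 ** (-b / w)
print(f"Estimated 50% time horizon: ~{horizon_minutes:.0f} human-minutes")
```

With more models evaluated on the same tasks, repeating this fit per model and plotting the resulting horizons over release dates is one way to see the doubling trend described in the post.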
About us
METR works on assessing whether cutting-edge AI systems could pose catastrophic risks to civilization.
- Website: https://metr.org/
- Industry: Non-profit organization
- Company size: 11-50 employees
- Headquarters: Berkeley, CA
- Type: Nonprofit
- Founded: 2022

Locations
- Primary: Berkeley, CA, US

Posts
Today we are excited to share details about HCAST (Human-Calibrated Autonomy Software Tasks), a benchmark we've been developing at METR for the past year to measure the abilities of frontier AI systems to complete diverse software tasks autonomously.

AI systems are clearly improving quickly, and benchmarks help measure this progress. Many benchmarks focus on tasks that are intellectually demanding for humans, like graduate-level science or mathematics questions, or competition-level programming problems. But compared to humans, AI systems are advantaged on tasks that mainly require knowledge, and progress on these benchmarks seems to outpace people's direct experience of using frontier AIs.

One intuitive measure of an AI agent's capabilities is to ask: "can I hand an agent a task that would take me X hours to complete, and be confident that the agent will complete it?" To measure this, you need a) a realistic task distribution, and b) accurate task time estimates.

Over the past year we've manually created 189 realistic, open-ended, agentic tasks across software engineering, machine learning engineering, cybersecurity, and general reasoning. We then carefully measured how long these tasks take humans to complete. Typically, results from AI agents and humans are not directly comparable, e.g. because humans are given significantly more time or resources than AI agents. Uniquely, we measure how long tasks take humans under essentially the same conditions that agents are given. We had 140 people skilled in the relevant domains spend over 1,500 hours in total attempting the tasks. We find that the tasks in HCAST take humans between one minute and over eight hours.

We then evaluate AI agents built on four foundation models, including the new Claude 3.7 Sonnet model with extended thinking mode. The best models succeed 70-80% of the time on <1hr tasks, and less than 20% of the time on >4hr tasks (a sketch of this kind of binned summary follows this post).

Read the paper here: https://metr.org/hcast.pdf
Github: https://lnkd.in/gdqFMFNz
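To make headline numbers like "70-80% on <1hr tasks" concrete, here is a minimal sketch of how such a summary could be computed; it is not the HCAST analysis code, and the record format and bucket boundaries are assumptions for illustration.

```python
from collections import defaultdict

# Toy records (invented): each run pairs a task's human completion time
# (in minutes) with whether the agent succeeded on that attempt.
runs = [
    {"human_minutes": 12,  "success": True},
    {"human_minutes": 45,  "success": True},
    {"human_minutes": 90,  "success": False},
    {"human_minutes": 300, "success": False},
    {"human_minutes": 500, "success": True},
]

# Assumed buckets, roughly matching the "<1hr" and ">4hr" summary in the post.
def bucket(minutes):
    if minutes < 60:
        return "<1hr"
    if minutes <= 240:
        return "1-4hr"
    return ">4hr"

totals, successes = defaultdict(int), defaultdict(int)
for run in runs:
    b = bucket(run["human_minutes"])
    totals[b] += 1
    successes[b] += run["success"]

for b in ("<1hr", "1-4hr", ">4hr"):
    if totals[b]:
        print(f"{b}: {successes[b] / totals[b]:.0%} success ({totals[b]} runs)")
```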
Legible, faithful reasoning would let us better understand AI decisions. Current reasoning models show promise here, promise that we think should not be given up lightly.

By legible, we mean reasoning that is human-readable (for example, done in clear natural language). By faithful, we mean reasoning that accurately reflects the model's internal decision-making process and allows us to reliably predict its actual behavior.

We think the benefits of legible, faithful reasoning are compelling: it would make it easier to find mistakes, easier to surface hidden tendencies, easier to monitor for premeditated attempts at harm, and easier to generally understand the capabilities of AI systems.

However, the fate of legible, faithful reasoning in AI is uncertain. Chain-of-thought reasoning isn't entirely faithful even in today's models. And to the extent that legibility and faithfulness require specific design choices, future AI developers may not prioritize them. We also make several recommendations for how AI developers and researchers should respond, given our present uncertainties about the situation.

OpenAI recently published a blogpost and paper on monitoring reasoning, in which they state that "CoT monitoring may be one of few tools we will have to oversee superhuman models of the future". We look forward to continued open dialogue on this subject.

Read our blog to see more of our current thoughts on legible, faithful reasoning: https://lnkd.in/gWQ5UCuN
METR evaluated DeepSeek-R1's ability to act as an autonomous agent.

On generic SWE tasks it performs on par with o1-preview but worse than 3.5 Sonnet (new) or o1. Overall, R1 is ~6 months behind leading US AI companies at agentic SWE tasks and is only a small improvement on V3. But on six challenging AI R&D tasks that take humans over a day to complete, R1 performs much worse than o1-preview or Claude 3.5 Sonnet (old) and is on par with Claude Opus (a model from 11 months ago). It performs as well as a 24th-percentile human expert when given 8 hours.

DeepSeek reported impressive benchmark scores, and they seem authentic. We made sure V3 wasn't memorizing answers by evaluating it on a held-out set of GPQA questions as well as a paraphrased version of GPQA, and found its performance didn't degrade at all (a sketch of this kind of check follows this post). Overall, R1 differentially excels at knowledge-based tasks over agentic ones.

We hope that, before making a model capable of very advanced agentic tasks, DeepSeek joins other leading AI developers in creating a concrete protocol of safety and security measures for these capabilities (read them here: metr.org/faisc).

It's surprising that R1 is barely better at autonomy than V3, since reasoning has caused a step-change in performance for other developers. This may be because R1 is trained from a base model, or because R1 and V3 weren't post-trained enough to develop agency and coherence. For example, a simple task in our dev set is to make a 3x3 crossword from a word list. V3 decides to add the word "eat", but there's no room for it. Instead of reflecting and noticing the error, it assumes the suggestion must be correct and makes something up (https://lnkd.in/gTZuRZ4a). R1's reasoning (in its CoT) looks even less coherent than V3's, but we shouldn't necessarily expect reasoning models to think in a legible way. The garbled thoughts might mean the model is confused, or just that it's thinking in an inscrutable way (https://lnkd.in/g5paYZ9T).

For more information on our methodology, see our full report on V3: https://lnkd.in/g-QR-nH4
And R1: https://lnkd.in/gEf6XXPi
You can read transcripts of DeepSeek models on some of our tasks at https://lnkd.in/gK5equPu
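As an illustration of the memorization check described above, here is a minimal sketch, not METR's evaluation harness: score the same model on original and paraphrased versions of held-out questions and compare the two accuracies. The per-question results below are invented; in practice they would come from running the model on both versions of each question.

```python
# Minimal sketch of a contamination check: if a model has memorized a benchmark,
# its accuracy should drop noticeably on paraphrased versions of the same questions.
original_correct    = [True, True, False, True, False, True, True, False]
paraphrased_correct = [True, False, False, True, True, True, True, False]

def accuracy(results):
    """Fraction of questions answered correctly."""
    return sum(results) / len(results)

orig_acc = accuracy(original_correct)
para_acc = accuracy(paraphrased_correct)
print(f"original: {orig_acc:.1%}, paraphrased: {para_acc:.1%}, gap: {orig_acc - para_acc:+.1%}")
# A large positive gap (original >> paraphrased) would suggest memorization;
# roughly equal accuracy is consistent with genuine capability.
```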
We appreciated the opportunity to provide feedback on iterative drafts of Amazon's Frontier Model Safety Framework, which outlines capability thresholds related to CBRN, cyber, and automated AI R&D, and points toward improved AI safety and security measures to contain those risks. You can read the full framework here: https://lnkd.in/g7Q964zF

We're excited to see an emerging consensus that frontier AI developers should publish concrete plans for enhanced safety and security, and capability thresholds at which those requirements would take effect. We hope to see these policies improve in specificity and strength over time. You can read other comparable frameworks here: metr.org/faisc
METR received access to an earlier checkpoint of OpenAI's GPT-4.5, 7 days before release. We ran quick experiments to measure the model's performance. Consistent with OpenAI's results, GPT-4.5 performs above GPT-4o but below o1 or Claude 3.5 Sonnet, with a time horizon score of ~30 minutes.

OpenAI also provided us with context and technical information about GPT-4.5, as well as some of GPT-4.5's benchmark results, which helped us interpret our observations. We think that a promising direction for mitigating risks posed by advanced AI systems is third-party oversight based on verifying developers' internal results, and we appreciate OpenAI's help in starting to prototype this.

We currently believe that it's unlikely that GPT-4.5 poses large autonomy risks, especially relative to existing models. But this belief stems mainly from our understanding of frontier model capabilities in general, not our specific eval results. We would also caution that pre-deployment evaluations such as our GPT-4.5 evals are insufficient to rule out all large risks posed by frontier models, because many of the risks posed by models occur before public deployment and because we may be underestimating GPT-4.5's capabilities.

As discussed in a previous blog post, even if evals are accurate, we believe that AI models can be dangerous before public deployment, due to potential model theft by malicious actors and risks resulting from both sanctioned and rogue internal usage (https://lnkd.in/gJJU_n6f). There may exist techniques by which model capabilities could be dramatically improved with comparatively little compute, such as SFT on specific datasets or outcomes-based RL. This makes it hard to upper-bound the risks from further development, rogue employees, or model theft. We also cannot rule out deliberate "sandbagging": the model deliberately performing below its true capabilities in order to further longer-term goals. Related "alignment-faking" behavior has recently been demonstrated in less capable models (https://lnkd.in/gZ8krMwh).

For more details on why these concerns might be plausible, and what information or experiments we think are needed to satisfactorily rule them out, see our accompanying blog post: https://lnkd.in/gqQpvVtW
Last year, AI developers from around the world agreed to the Frontier AI Safety Commitments and to publish frontier AI safety frameworks aimed at evaluating and managing severe AI risks. We have compiled a list of the policies that were published during the Paris AI Action Summit, intended to follow through on this commitment. Find links to all of the published company policies, and related resources, here: https://metr.org/faisc
METR is running a pilot field experiment to measure how AI tools affect open source developer productivity. If you're an open source developer who wants to make $150/hour working on issues of your own choosing, consider expressing interest: https://lnkd.in/gqSEkUdN
Can frontier models cost-effectively accelerate ML workloads via optimizing GPU kernels? Our take at METR: yes, and they're improving pretty steeply, but it's easy to miss these capabilities without good elicitation and "fair" compute spend.

We measure the average speedup achieved by our "KernelAgent" on a filtered + extended version of KernelBench, obtaining a ~2x speedup for a fraction of the estimated cost of paying an expert kernel engineer. The existing leaderboard suggests minimal speedup, around 1.05x. We attribute our much higher speedup to scaffolding improvements plus higher spending: these tasks are easily checkable, so we take the best of K attempts, at ~$20 per task per model (a sketch of this step follows this post). We estimate that on these tasks this is cost-effective for workloads that take over 30 hours. The speedup achievable with the best model ~doubled over the last ~6 months.

Code optimization is only a small part of frontier AI R&D workflows, but many positive feedback loops like this could lead to very rapid progress that outstrips oversight or safety mechanisms.

Overall, elicitation took us around 4 engineer-weeks. The compute costs for the whole project were around $50k. We also found that a small amount of finetuning appeared to close most of the gap between GPT-4o and o1 (our error bars here are large).

The existing KernelBench levels primarily feature "classic" ML architectures. We added a "Level 5" to KernelBench: 14 tasks adapted from frontier generative AI workloads in 2024. We also filtered out 45 tasks that were cheatable or noisy.

Our team is pretty confident in its key takeaways, but AI is moving fast and we wanted to share our results quickly, so there's a higher risk of bugs; these numbers might change with more work. There's also some noise, although this analysis was performed with >200 independent tasks.

Full report here: https://lnkd.in/gcUyu-9N
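To make the best-of-K elicitation step concrete, here is a minimal sketch, not KernelAgent itself: generate K candidate kernels, discard any that fail a correctness check, benchmark the rest, and keep the fastest. The `generate_candidate`, `is_correct`, and `benchmark_runtime` functions are hypothetical placeholders; in a real harness they would call a model, verify outputs against a reference implementation, and time the kernel on representative inputs.

```python
import random

# Hypothetical placeholders (stand-ins for model calls and GPU measurements).
def generate_candidate(task, seed):
    return f"kernel_source_for_{task}_seed_{seed}"  # stand-in for model output

def is_correct(kernel_source):
    return random.random() > 0.3  # stand-in for an output-equivalence check

def benchmark_runtime(kernel_source):
    return random.uniform(0.5, 2.0)  # stand-in for measured runtime (ms)

def best_of_k(task, k, baseline_runtime):
    """Best-of-K: keep the fastest candidate that passes the correctness check."""
    best = None
    for seed in range(k):
        candidate = generate_candidate(task, seed)
        if not is_correct(candidate):
            continue  # checkable tasks let us cheaply reject wrong kernels
        runtime = benchmark_runtime(candidate)
        if best is None or runtime < best[1]:
            best = (candidate, runtime)
    if best is None:
        return None, 1.0  # no valid candidate: fall back to the baseline (1x speedup)
    return best[0], baseline_runtime / best[1]

kernel, speedup = best_of_k("toy_matmul", k=8, baseline_runtime=1.5)
print(f"selected speedup: {speedup:.2f}x")
```

Because candidates can be verified and benchmarked automatically, spending more on K trades compute for reliability; the ~$20-per-task figure in the post reflects that kind of spend.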