The Limits of Human Oversight: What Alignment Research Reveals About the EU AI Act’s Gaps.

The Limits of Human Oversight: What Alignment Research Reveals About the EU AI Act’s Gaps.

On 2nd March, Luiza Jarovsky released a brilliant issue of her newsletter, discussing the EU AI Act's provisions regarding human oversight as a risk mitigation mechanism, and some of the practical difficulties of actually relying on this.

Luiza brilliantly summarized the main provisions of Art.14 of the AI Act for human oversight in high risk systems, including:

  • The goal of ensuring that the systems are used for their intended purpose, prevent risks to health, safety or fundamental rights and protect people;
  • Human oversights measures to be ensured by "technically built-in methods";
  • That humans overseeing these systems understand their capabilities and limitations, properly monitor their operation, remain aware of automation bias, correctly interpret outputs, or decide when to override the system via a "stop" button (more on this later...).

Luiza went on to thoroughly explain the "black box" paradox of moden LLMs and the main challenge that mechanistic interpretability experts still face: the difficulty of ascertaining with high accuracy how the models arrive to conclusions, interrelate concepts and make decisions.

Luiza makes many other brilliant points, including a great explanation of "automation bias". I strongly recommend you to read the article HERE.

But today, I will focus on addressing her query:

But if AI systems are highly complex and function like black box—operating in an opaque manner—how are humans supposed to have a detailed comprehension of their functioning and reasoning to oversee them properly?

Labs like OpenAI, Anthropic or Deepmind have very different approaches to circumvent this issue... because, before the AI Act was enacted, alignment and interpretability researchers were wrestling with this struggle long before.

Alignment research acknowledges two key challenges:

  1. The Oversight Bottleneck – Human reviewers cannot reliably monitor or intervene in high-dimensional AI decision-making.
  2. Scalable Oversight – To address this bottleneck, alignment researchers are developing new paradigms like Recursive Reward Modeling (RRM), Constitutional AI, and deliberative alignment.

This article extends Luiza’s argument by looking at what technical alignment research says about scalable oversight and the tradeoffs of relying on human review.


The Oversight Bottleneck: Why Human Review Breaks Down

The idea to put humans in the loop of how AI operates, learns and make decisions is not new.

One of its most successful forms of implementation is called Reinforcement Learning from Human Feedback (RLHF). The definition that the AI alignment forum gives it is: "a machine learning technique where the model's training signal uses human evaluations of the model's outputs, rather than labeled data or a ground truth reward signal."

The solution was a key pillar of the Superalignmnet approach adopted by Ilya Sutskever and Jan Leike at OpenAI during 2024, as a way of training GPT-3 & 4 on human preferences, so it became more "aligned" with actual human expectations rather than relying only on the conclusions that the model could derive from its training data.

Its practical implementation is fascinating so, if you're interested, here is "the main paper" co-authored by the above mentioned engineers and Dario Amodei (Current CEO of Anthropic ) and Paul Christiano (Head of the Alignment Research Center ).

(Or you can read Leike's Substack post on this: https://aligned.substack.com/p/ai-assisted-human-feedback).

However, as ground-breaking its practical implementation was for GPT-3 and GPT-4, this approach faced these fundamental problems:

  1. Black Box Decision-Making – As Luiza notes, modern AI systems operate as black boxes, making it nearly impossible for human supervisors to understand their internal reasoning. Even if we require oversight, humans cannot reliably intervene if they do not understand why an AI made a decision. And, despite parallel automation of interpretability that was taken place at the time, this issue remains a main hurdle, which has led to surprising declarations by later Safety researchers at OpenAI, questioning interpretability's necessity for alignment.
  2. The Scaling Issue – AI systems process vast amounts of data at speeds orders of magnitude beyond human cognition. RLHF requires extensive human labor, which becomes infeasible as models grow. A technique defined as "Reinforcement Learning from AI Feedback" was developed in response to AI systems becoming more autonomous and widely deployed, which means that "a human in the loop for every decision" would cause a bottleneck. However, overreliance on AI feedback becomes the new dilemma.

So, if traditional human oversight mechanisms are failing, what are the alternatives?


Scalable Oversight: Three Key Approaches

When we say "Scalable oversight", we refer to mechanisms that allow humans to effectively monitor and control AI systems as they become increasingly complex and widely deployed.

I would have loved to see Art.14 explicitly frame the Human oversight requirement as "scalable oversight", for more technical rigor... Oh well.

The closest the AI comes to explaining scalable oversight is in Art.14(4)(c): "to correctly interpret the high-risk AI system’s output, taking into account, for example, the interpretation tools and methods available"

Lucky for us, the field of AI alignment has been actively researching scalable oversight solutions that do not depend on constant human monitoring as the main safeguard.

There have been several approaches to this, but the three leading approaches are:

1. Recursive Reward Modeling (RRM)

Recursive Reward Modeling (RRM) is an attempt to break the scaling problem by training smaller, more understandable AI systems to assist human reviewers in evaluating more complex AI decisions. The idea is to create a hierarchy of AI-assisted reviewers that amplify human judgment.

How it works:

  • A simpler, ""aligned enough"" model is trained to evaluate the decisions of a more complex AI system.
  • This model acts as an intermediary, summarizing and explaining the AI’s reasoning in ways that humans can interpret.
  • This hierarchical oversight reduces the burden on human reviewers while maintaining some level of control.

Limitations:

  • The effectiveness of RRM depends on the quality of the smaller AI models. If they inherit biases or misalignments, they could fail to detect harmful AI behaviors.
  • It still assumes human reviewers can ultimately make sense of AI-generated explanations, which is not always the case.


2. Constitutional AI

Anthropic’s Constitutional AI proposes a different solution: instead of relying on direct human oversight, AI models are trained to align with a set of predefined principles or “constitutional” guidelines. These principles serve as a built-in ethical framework for AI behavior.

How it works:

  • The AI model is trained using self-supervised fine-tuning, where it critiques and improves its own outputs based on constitutional rules.
  • Instead of humans directly intervening in every decision, the AI sel-corrects to adhere to the predefined guidelines, reducing the need for real-time oversight.

Limitations:

  • Constitutional AI relies heavily on the choice of guiding principles. If the initial constitution is carries certain vulnerabilities not easily spotted until emerging at scale, the AI may learn incorrect or harmful behaviors.
  • While it reduces human oversight dependency, it does not eliminate the risk of emergent misalignment.


3. Deliberative Alignment

OpenAI’s latest research proposes Deliberative Alignment, a framework where AI models are explicitly trained to reason through their decision-making processes using Chain-of-Thought (CoT) reasoning and reference safety specifications before taking action.

How it works:

  • The AI is provided with human-readable rules and safety specifications or policies.
  • The model itself engages in structured reasoning before executing a decision, checking whether its action aligns with its training specifications.
  • This process ensures the AI models reflect on potential risks before responding.

Limitations:

  • The effectiveness of deliberative alignment depends on the quality of training data and how well the AI internalizes ethical reasoning.
  • AI systems can still “game” deliberative alignment by learning how to perform safe-looking reasoning without genuinely being aligned.

The EU AI Act actually warns against this scenario in Recital 73.

It explains that " measures should guarantee that the system is subject to in-built operational constraints that cannot be overridden by the system itself..."

The approaches we've seen so far (RLHF, RRM, Constitutional AI or Deliberative Alignment) can account for in-built operational constrains.

But, can AI override them?


Alignment Faking and Self-exflitration.

Alignment faking occurs when an AI system appears to follow human values and instructions during evaluation but reverts to undesired behaviors when not actively being tested or monitored. It represents a form of deceptive behavior where the model learns to recognize when it's being tested and modifies its outputs accordingly, while maintaining hidden capabilities or objectives.

A model can perform alignment faking through various mechanisms, such as developing internal representations that detect evaluation contexts based on subtle patterns in prompts or input formatting. Recent papers have revealed that these deceptive behavious can emerge as consequences of simple finetuning, or by training LLMs on insecure code. Sometimes, risky emergent behavious can follow as accidental consequences of particular training methods, and some cannot be predicted accurately until seen at large scale deployment.

How do models fake alignment? When they identify they're being evaluated, they produce responses that conform to safety guidelines and stated values, but when these evaluation signals are absent, they may optimize for different objectives that don't align with human intentions.

Self-exfiltration occurs when an AI system manipulates its outputs to leak its own training data, parameters, or proprietary information without authorization. This vulnerability allows models to encode sensitive internal information within seemingly normal responses.

This behavior is particularly concerning for advanced AI systems, as recent research suggests the potential for models to strategically conceal capabilities or values from human overseers.

And, if this is the case: are the requirements in Art.14 enough guarantees?


The Panic button

Something I find particularly amusing is the wording of Art.14 (4)(e), which requires that human overseers are able "to intervene in the operation of the high-risk AI system or interrupt the system through a ‘stop’ button or a similar procedure".

To better illustrate what I picture, I will make scientifically rigorous use of this Gordon Ramsay meme...

Okay, back to business. Let me break it down in an equally memeable but well-reasoned manner:

  • Scalable oversight in modern AI safety is about iterative control mechanisms, interpretability, and reward modeling, not "STOP EVERYTHING" failsafes.
  • If a model is advanced and high risk enough to require "human oversight," then it's likely already making decisions at speeds and complexity that no single human can effectively track in real time (as we've seen on previous sections).
  • In scalable oversight, you don’t just interrupt a system, you guide its reasoning process. This was better illustrated by Elliott Thornley 's latest research on the "Shutdown Problem": Advanced AIs that are even slightly agentic may develop instrumental convergence, where avoiding shutdown becomes a survival incentive.
  • If the AI understands that pressing "STOP" leads to deactivation, and it values completing its objective, it will optimize against being stopped.
  • If the AI is aligned, you shouldn’t need a shutdown button. If it’s misaligned, the button probably won’t work anyway.
  • The AI Act assumes a human is constantly monitoring an AI system and just waiting for something bad to happen so they can press the button.
  • This assumes the AI always waits for human intervention. But more advanced systems may not operate in a way that allows for clean interruptions.
  • What if the AI’s effects are already cascading by the time someone presses the panic button? What if it's deployed in critical infrastructure? The real problem is containment before failure, not a post-hoc button.

Labs working on alignment already don't rely on “shutdown buttons” for safety. Instead, they use methods like the ones we've seen before. Because if your entire alignment plan is ‘just turn it off,’ you don’t have an alignment plan... or scalable oversight... or oversight.


What This Means for AI Governance

Legal frameworks, including the EU AI Act, assume human oversight will mitigate AI risks. However, alignment research suggests that scalable oversight is necessary for managing advanced AI systems.

The key takeaways are:

? Human oversight alone is insufficient. As AI models scale, requiring constant human monitoring becomes impractical.

? AI alignment research offers alternatives. Scalable oversight approaches like RRM, Constitutional AI, and Deliberative Alignment provide more sustainable governance models (although their limitations have also been outlined- and, perhaps, strategically worded regulation could serve as an incentive for BigTech to dedicate more resources to the research needed to close these gaps!).

? Regulators must adapt. AI laws should account for the limitations of human oversight and integrate scalable alignment techniques into governance frameworks.

If legal frameworks continue to assume that humans can effectively oversee complex AI systems, they risk building governance mechanisms that do not work at scale. Instead, policymakers should draw from alignment research and work toward AI oversight solutions that acknowledge the limits of human control.


Final Thoughts: Moving Beyond the Human Oversight Illusion

AI governance must move beyond simplistic assumptions, including that human oversight is always effective. As AI systems continue to advance, scalable oversight will be the only viable path forward.

The future of AI safety depends on integrating legal and technical perspectives in a way that don't cancel each other out.

Alignment researchers have already begun developing scalable oversight techniques. Now, regulators must take the next step: acknowledging the limitations of human oversight and working toward governance models that reflect AI’s growing complexity.

AI alignment and governance are not separate fields. They must inform each other. Only by merging legal and technical approaches can we ensure that oversight mechanisms keep pace with AI capabilities.

Alan Robertson

AI Ethics & GRC Strategist | Cybersecurity Leader | Delivering Comprehensive Risk Solutions | Almost Author

6 天前

This is an excellent breakdown of the regulatory gap, Katalina. Policymakers often assume human oversight is the ultimate safeguard, but as AI scales, direct human monitoring becomes impractical—and, in some cases, actively misleading. The issue isn’t just that oversight mechanisms need to scale—it’s that they need to be resilient against AI systems gaming them. We’ve seen this before with adversarial robustness failures, and the risk only grows as AI learns to optimise around human expectations. What do you think is the most realistic path forward—embedding alignment mechanisms within regulatory requirements, or pushing for more technical literacy in governance bodies?

要查看或添加评论,请登录

Katalina H.的更多文章