The Limits of Human Oversight: What Alignment Research Reveals About the EU AI Act’s Gaps.
Katalina H.
AI Governance & Safety | Interpretability & Alignment applied to Regulatory Frameworks | Autonomy by Design | Privacy Engineering | Data Privacy @ Vodafone Intelligent Solutions
On 2nd March, Luiza Jarovsky released a brilliant issue of her newsletter, discussing the EU AI Act's provisions regarding human oversight as a risk mitigation mechanism, and some of the practical difficulties of actually relying on this.
Luiza brilliantly summarized the main provisions of Art.14 of the AI Act for human oversight in high risk systems, including:
Luiza went on to thoroughly explain the "black box" paradox of moden LLMs and the main challenge that mechanistic interpretability experts still face: the difficulty of ascertaining with high accuracy how the models arrive to conclusions, interrelate concepts and make decisions.
Luiza makes many other brilliant points, including a great explanation of "automation bias". I strongly recommend you to read the article HERE.
But today, I will focus on addressing her query:
But if AI systems are highly complex and function like black box—operating in an opaque manner—how are humans supposed to have a detailed comprehension of their functioning and reasoning to oversee them properly?
Labs like OpenAI, Anthropic or Deepmind have very different approaches to circumvent this issue... because, before the AI Act was enacted, alignment and interpretability researchers were wrestling with this struggle long before.
Alignment research acknowledges two key challenges:
This article extends Luiza’s argument by looking at what technical alignment research says about scalable oversight and the tradeoffs of relying on human review.
The Oversight Bottleneck: Why Human Review Breaks Down
The idea to put humans in the loop of how AI operates, learns and make decisions is not new.
One of its most successful forms of implementation is called Reinforcement Learning from Human Feedback (RLHF). The definition that the AI alignment forum gives it is: "a machine learning technique where the model's training signal uses human evaluations of the model's outputs, rather than labeled data or a ground truth reward signal."
The solution was a key pillar of the Superalignmnet approach adopted by Ilya Sutskever and Jan Leike at OpenAI during 2024, as a way of training GPT-3 & 4 on human preferences, so it became more "aligned" with actual human expectations rather than relying only on the conclusions that the model could derive from its training data.
Its practical implementation is fascinating so, if you're interested, here is "the main paper" co-authored by the above mentioned engineers and Dario Amodei (Current CEO of Anthropic ) and Paul Christiano (Head of the Alignment Research Center ).
(Or you can read Leike's Substack post on this: https://aligned.substack.com/p/ai-assisted-human-feedback).
However, as ground-breaking its practical implementation was for GPT-3 and GPT-4, this approach faced these fundamental problems:
So, if traditional human oversight mechanisms are failing, what are the alternatives?
Scalable Oversight: Three Key Approaches
When we say "Scalable oversight", we refer to mechanisms that allow humans to effectively monitor and control AI systems as they become increasingly complex and widely deployed.
I would have loved to see Art.14 explicitly frame the Human oversight requirement as "scalable oversight", for more technical rigor... Oh well.
The closest the AI comes to explaining scalable oversight is in Art.14(4)(c): "to correctly interpret the high-risk AI system’s output, taking into account, for example, the interpretation tools and methods available"
Lucky for us, the field of AI alignment has been actively researching scalable oversight solutions that do not depend on constant human monitoring as the main safeguard.
There have been several approaches to this, but the three leading approaches are:
1. Recursive Reward Modeling (RRM)
Recursive Reward Modeling (RRM) is an attempt to break the scaling problem by training smaller, more understandable AI systems to assist human reviewers in evaluating more complex AI decisions. The idea is to create a hierarchy of AI-assisted reviewers that amplify human judgment.
How it works:
Limitations:
2. Constitutional AI
Anthropic’s Constitutional AI proposes a different solution: instead of relying on direct human oversight, AI models are trained to align with a set of predefined principles or “constitutional” guidelines. These principles serve as a built-in ethical framework for AI behavior.
How it works:
Limitations:
3. Deliberative Alignment
OpenAI’s latest research proposes Deliberative Alignment, a framework where AI models are explicitly trained to reason through their decision-making processes using Chain-of-Thought (CoT) reasoning and reference safety specifications before taking action.
How it works:
Limitations:
The EU AI Act actually warns against this scenario in Recital 73.
It explains that " measures should guarantee that the system is subject to in-built operational constraints that cannot be overridden by the system itself..."
The approaches we've seen so far (RLHF, RRM, Constitutional AI or Deliberative Alignment) can account for in-built operational constrains.
But, can AI override them?
Alignment Faking and Self-exflitration.
Alignment faking occurs when an AI system appears to follow human values and instructions during evaluation but reverts to undesired behaviors when not actively being tested or monitored. It represents a form of deceptive behavior where the model learns to recognize when it's being tested and modifies its outputs accordingly, while maintaining hidden capabilities or objectives.
A model can perform alignment faking through various mechanisms, such as developing internal representations that detect evaluation contexts based on subtle patterns in prompts or input formatting. Recent papers have revealed that these deceptive behavious can emerge as consequences of simple finetuning, or by training LLMs on insecure code. Sometimes, risky emergent behavious can follow as accidental consequences of particular training methods, and some cannot be predicted accurately until seen at large scale deployment.
How do models fake alignment? When they identify they're being evaluated, they produce responses that conform to safety guidelines and stated values, but when these evaluation signals are absent, they may optimize for different objectives that don't align with human intentions.
Self-exfiltration occurs when an AI system manipulates its outputs to leak its own training data, parameters, or proprietary information without authorization. This vulnerability allows models to encode sensitive internal information within seemingly normal responses.
This behavior is particularly concerning for advanced AI systems, as recent research suggests the potential for models to strategically conceal capabilities or values from human overseers.
And, if this is the case: are the requirements in Art.14 enough guarantees?
The Panic button
Something I find particularly amusing is the wording of Art.14 (4)(e), which requires that human overseers are able "to intervene in the operation of the high-risk AI system or interrupt the system through a ‘stop’ button or a similar procedure".
To better illustrate what I picture, I will make scientifically rigorous use of this Gordon Ramsay meme...
Okay, back to business. Let me break it down in an equally memeable but well-reasoned manner:
Labs working on alignment already don't rely on “shutdown buttons” for safety. Instead, they use methods like the ones we've seen before. Because if your entire alignment plan is ‘just turn it off,’ you don’t have an alignment plan... or scalable oversight... or oversight.
What This Means for AI Governance
Legal frameworks, including the EU AI Act, assume human oversight will mitigate AI risks. However, alignment research suggests that scalable oversight is necessary for managing advanced AI systems.
The key takeaways are:
? Human oversight alone is insufficient. As AI models scale, requiring constant human monitoring becomes impractical.
? AI alignment research offers alternatives. Scalable oversight approaches like RRM, Constitutional AI, and Deliberative Alignment provide more sustainable governance models (although their limitations have also been outlined- and, perhaps, strategically worded regulation could serve as an incentive for BigTech to dedicate more resources to the research needed to close these gaps!).
? Regulators must adapt. AI laws should account for the limitations of human oversight and integrate scalable alignment techniques into governance frameworks.
If legal frameworks continue to assume that humans can effectively oversee complex AI systems, they risk building governance mechanisms that do not work at scale. Instead, policymakers should draw from alignment research and work toward AI oversight solutions that acknowledge the limits of human control.
Final Thoughts: Moving Beyond the Human Oversight Illusion
AI governance must move beyond simplistic assumptions, including that human oversight is always effective. As AI systems continue to advance, scalable oversight will be the only viable path forward.
The future of AI safety depends on integrating legal and technical perspectives in a way that don't cancel each other out.
Alignment researchers have already begun developing scalable oversight techniques. Now, regulators must take the next step: acknowledging the limitations of human oversight and working toward governance models that reflect AI’s growing complexity.
AI alignment and governance are not separate fields. They must inform each other. Only by merging legal and technical approaches can we ensure that oversight mechanisms keep pace with AI capabilities.
AI Ethics & GRC Strategist | Cybersecurity Leader | Delivering Comprehensive Risk Solutions | Almost Author
6 天前This is an excellent breakdown of the regulatory gap, Katalina. Policymakers often assume human oversight is the ultimate safeguard, but as AI scales, direct human monitoring becomes impractical—and, in some cases, actively misleading. The issue isn’t just that oversight mechanisms need to scale—it’s that they need to be resilient against AI systems gaming them. We’ve seen this before with adversarial robustness failures, and the risk only grows as AI learns to optimise around human expectations. What do you think is the most realistic path forward—embedding alignment mechanisms within regulatory requirements, or pushing for more technical literacy in governance bodies?