AI supervision loops
[This article is a work in progress, to be improved with comments and input from readers. Thanks!]
As Artificial Intelligence (AI) becomes integral to evaluation processes, the need for effective AI supervision grows with it.
The starting point: Learning loops
An old favourite of mine is the concept of double-loop learning, introduced by organizational theorist Chris Argyris:
"[A] thermostat that automatically turns on the heat whenever the temperature in a room drops below 69°F is a good example of single-loop learning. A thermostat that could ask, 'Why am I set to 69°F?' and then explore whether a different temperature might more economically achieve the goal of heating the room would be engaged in double-loop learning."
Single, double, and triple-loop learning have become widely used in evaluative thinking and reflexive practice. These models are simple to understand yet profound:
- Single-loop learning asks: are we doing things right?
- Double-loop learning asks: are we doing the right things?
- Triple-loop learning asks: how do we decide what is right, and whose values shape that decision?
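To make the distinction concrete before turning to AI, here is a minimal, purely illustrative Python sketch (the class names, thresholds, and energy-price figure are hypothetical, not drawn from Argyris): the single-loop controller only corrects deviations from a fixed setpoint, while the double-loop controller also revisits whether that setpoint still serves the underlying goal.

```python
class SingleLoopThermostat:
    """Single-loop learning: act to close the gap against a fixed setpoint."""
    def __init__(self, setpoint_f=69.0):
        self.setpoint_f = setpoint_f

    def step(self, room_temp_f):
        # The rule itself is never questioned: heat whenever below the setpoint.
        return "heat_on" if room_temp_f < self.setpoint_f else "heat_off"


class DoubleLoopThermostat(SingleLoopThermostat):
    """Double-loop learning: also ask whether the setpoint is the right one."""
    def review_setpoint(self, occupancy_history, energy_prices):
        # Hypothetical governing variables: comfort when occupied, economy otherwise.
        if not any(occupancy_history):       # room rarely used: relax the goal itself
            self.setpoint_f -= 2.0
        elif max(energy_prices) > 0.30:      # assumed price threshold in $/kWh
            self.setpoint_f -= 1.0
        return self.setpoint_f


# The inner loop runs continuously; the outer loop revisits the goal.
t = DoubleLoopThermostat()
print(t.step(67.5))                               # -> "heat_on"
print(t.review_setpoint([False, False], [0.35]))  # -> 67.0 (setpoint lowered)
print(t.step(67.5))                               # -> "heat_off"
```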
So, how might these loops apply to the supervision of AI use in evaluation?
Machines (including thermostats!) are becoming smarter.
A few years ago, a thermostat capable of double-loop learning was purely hypothetical; the higher loops were exclusively a domain for human supervision. Now, smart thermostats can perform double-loop learning, analyzing patterns of use, environmental conditions, and energy efficiency goals to suggest or automatically implement new settings. As AI advances, we're entering a realm where machines can handle both autonomous decision-making and reflective learning. The concept of "supervision" of AI in evaluation also needs to evolve. It's no longer just about overseeing machine outputs, but about managing a complex process where AI engages in higher-order decisions, tasks that were once the sole domain of humans. We must then balance automation and human agency: what type and level of AI versus human involvement is necessary, and where?
Which loops for AI supervision?
I ended up with more than three loops, though they still echo the original concept.
1. Compliance Loop: Tick-boxing. Is AI doing what we specifically asked for?
This basic loop checks whether we are getting what we asked for. Are outputs aligning with the pre-established blueprint? Do reports contain the needed information in the required language and formats? At its core, this is a "tick-boxing" exercise, but not a trivial one. AI's probabilistic nature means its outputs aren't always reproducible, so even something as simple as verifying consistency can be tricky.
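As a rough illustration of what this loop could look like when partly automated, the Python sketch below assumes a hypothetical report blueprint (`REQUIRED_SECTIONS`) and a caller-supplied `generate_report` function standing in for whatever AI pipeline is in use; it tick-boxes the required elements and re-runs the same prompt a few times to gauge how reproducible the output actually is.

```python
import re

# Assumed blueprint: the elements we specifically asked the AI to include.
REQUIRED_SECTIONS = ["Executive summary", "Methodology", "Findings", "Recommendations"]

def check_compliance(report_text: str) -> dict:
    """Tick-box check: does this output contain what we asked for?"""
    missing = [s for s in REQUIRED_SECTIONS if s.lower() not in report_text.lower()]
    word_count = len(re.findall(r"\w+", report_text))
    return {"missing_sections": missing, "within_length": word_count <= 3000}

def check_consistency(generate_report, prompt: str, runs: int = 3) -> float:
    """Because outputs are probabilistic, re-run the same prompt and compare
    which required sections appear each time (a crude stability score)."""
    seen = []
    for _ in range(runs):
        text = generate_report(prompt)  # hypothetical call into the AI pipeline
        seen.append(frozenset(s for s in REQUIRED_SECTIONS
                              if s.lower() in text.lower()))
    return sum(s == seen[0] for s in seen) / runs  # share of runs matching the first
```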
2. Correctness Loop: Accuracy and Validation. Is AI doing things right?
This loop ensures the factual accuracy of AI's work. It corresponds to single-loop learning. The focus is validating outputs and catching errors, hallucinations, or misalignments. It can be daunting: given the volume of data AI can process and the information it can generate, reviewing the outputs can be more demanding than producing them. We need efficient strategies, like sampling, anomaly detection, and human-in-the-loop systems. It's also crucial to avoid automation bias, where we might assume "the machine must be right" and fail to assess its output critically.
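One way to keep this loop tractable, sketched below under assumed data structures (a list of AI-coded items, each carrying a model confidence score), is to combine a small random sample with simple anomaly flags, so that human reviewers see both a representative slice and the most suspect outputs.

```python
import random

def select_for_human_review(items, sample_rate=0.05, low_conf=0.6, seed=42):
    """Correctness-loop helper: route a random sample plus anomalies to reviewers.

    `items` is assumed to be a list of dicts like
    {"id": ..., "ai_label": ..., "confidence": float, "source_text": str}.
    """
    if not items:
        return []
    rng = random.Random(seed)  # fixed seed so the audit itself is reproducible
    sampled = rng.sample(items, max(1, int(len(items) * sample_rate)))
    anomalies = [x for x in items
                 if x["confidence"] < low_conf        # the model is unsure of itself
                 or not x["source_text"].strip()]     # a label with no supporting text
    # Anomalies first, then the random sample, without duplicates.
    queue, seen = [], set()
    for x in anomalies + sampled:
        if x["id"] not in seen:
            seen.add(x["id"])
            queue.append(x)
    return queue
```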
3. Methodological Loop: Suitability of Tools and Approaches. Is AI doing the right thing, choosing the most suitable tools and approaches?
This reflects double-loop learning. As AI becomes increasingly involved in choosing tools, designing methodologies and whole evaluative processes, supervision ensures these choices are contextually relevant and appropriate for the evaluation. This might include, for example, assessing whether an AI-suggested sampling strategy or data collection method is appropriate for the evaluation's context.
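Although this loop is judgment-heavy, even a rough checklist can make the review systematic. The sketch below is illustrative only: the proposal and context fields, and the thresholds, are hypothetical rather than a validated rubric.

```python
def review_sampling_strategy(proposal: dict, context: dict) -> list:
    """Methodological-loop helper: flag mismatches between an AI-proposed
    sampling strategy and the evaluation context. Both dicts are assumed
    structures, e.g.
    proposal = {"method": "online survey", "sample_size": 400, "languages": ["en"]}
    context  = {"population": 5000, "connectivity": "low", "languages": ["en", "sw"]}
    """
    flags = []
    if proposal.get("method") == "online survey" and context.get("connectivity") == "low":
        flags.append("Online survey proposed for a low-connectivity population.")
    if proposal.get("sample_size", 0) < 0.02 * context.get("population", 0):
        flags.append("Sample may be too small relative to the population.")
    missing = set(context.get("languages", [])) - set(proposal.get("languages", []))
    if missing:
        flags.append("Instruments not offered in: " + ", ".join(sorted(missing)) + ".")
    return flags  # an empty list means no flag fired, not that the design is sound
```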
4. Cultural and Ethical Loop: Worldviews—AI’s and Ours. Whose values and views are determining what is right?
This loop encompasses aspects of triple-loop learning, assessing whether AI's decisions, processing, tools and documentation are culturally appropriate. Whose perspective, whose ideology do they reflect? For example: is an AI-conducted interview sensitive to diverse norms, values, and communication styles? While AI systems increasingly incorporate ethical guardrails, we must question their adequacy: are they biased or skewed toward mainstream values? Too broad or too restrictive? Inclusive or at risk of stigmatizing or marginalizing people? Guardrails can sometimes interfere with expressing local realities or unique contexts. As AIs with varying personalities and alignments develop, we must continually reassess the perspectives they adopt. This loop should also address AI degradation, where AIs trained on outputs from other AIs may produce increasingly artificial, less accurate information. This is a challenging loop to supervise, since AI misalignment may mirror the worldview or biases of the very humans supervising it, and so be hard to spot.
5. Epistemological Loop: AI’s Role in Evaluation. What / whose / which "intelligence" and "knowledge"?
At the highest level, this loop questions how AI influences the very nature of evaluation and evaluative thinking. Is AI enhancing or undermining our understanding of change? How does AI's hidden functioning relate to our ways of making sense of reality and change? Is AI narrowing or expanding our perspectives? Is it responsive to our deepest aspirations and principles? This loop interrogates AI's role in knowledge production, challenging the explainability of its processes and outcomes. It raises questions about the broader systems we operate within, including the existing interconnections, the powers, and the principles at stake. It demands that we consider whether and how AI use can fundamentally alter the nature of knowledge and its generation, as well as our humanity.
Within each loop, we need to consider several factors.
Why loops?
There are already good frameworks, such as Montrosse-Moorhead's "Seven Criteria for Judging AI in Evaluation", highlighting important factors for supervising AI use in evaluation, such as efficiency, equity, trust, methodological validity, and explainability. [Source: Montrosse-Moorhead, B. (2023). Evaluation criteria for artificial intelligence. New Directions for Evaluation, 2023, 123–134. https://doi.org/10.1002/ev.20566] The work by MERL Tech and AI ethics champions such as Linda Raftree or Emily Springer also pinpoints and shares valuable concerns and criteria.
The added value of loops is to capture the dynamic nature of AI supervision in evaluation, emphasizing interconnectedness across various concerns: technical, ethical, and cultural. Here are some points worth highlighting:
When the learning loop idea was introduced in the organizational management field, it was revolutionary and challenging. In its simplicity, it revealed that organizations could not think about "effectiveness" without also examining their ways of working and their own policies. Underlying assumptions had to be challenged.
Similarly, the supervision loops push us not to stop at low-level concerns (such as obtaining replicable and valid results) but to also examine the nature of AI and of the evaluation systems employing it. They highlight important aspects often overlooked in traditional evaluation, increasingly flagged by value-driven, transformative approaches, so AI supervision may catalyze much-needed debate on evaluation methodologies and practices. The loops also reveal that AI biases often reflect our biases, and AI limitations mirror our own. For instance, the current emphasis on AI "explainability" is particularly telling. We now stress the need to "look into the black box" of AI decision-making, but this raises a crucial question: was our traditional evaluation approach ever truly explainable to begin with? Have we considered whether our methods and paradigms were comprehensible to communities with different cultures or alternative ways of making sense of reality? This newfound focus on explainability in AI serves as a mirror, compelling us to examine the transparency and accessibility of our own long-standing evaluation practices.
By supervising AI at higher levels, we're compelled to interrogate our own work in evaluation. While this makes higher-level supervision challenging, it also offers an opportunity for transformation. Seeing our biases and challenges from a different perspective can help us address them more effectively.
So, who checks?
A key question is: who should oversee each loop? Interestingly, when I posed this question to an AI, it immediately presented the supervision process as a top-down hierarchy, with "higher echelons in power" overseeing the upper loops. This would clearly miss the point.
However, the AI answer tends to mirror mainstream thinking, reflecting an existing trend in evaluation where senior experts oversee higher-level assessments while junior staff handle routine checks at the lowest loops. While this division might be practical at times, it risks perpetuating the same issues that evaluation has faced historically—being used as a top-down mechanism for controlling change. The slow adoption of complex, reflective models in favour of linear data processing also reinforces a focus on outputs rather than systemic insights. If AI supervision follows this path, it could further marginalize vulnerable voices and constrain decision-making to those already in power.
To address this, we must rethink the entire approach to evaluation, not just AI supervision. Democratizing supervision loops can empower everyone—from junior staff to marginalized communities—fostering a deeper understanding of knowledge generation and its use in informing change. The aim is to challenge the foundational systems and values driving both AI and our assessments of reality, not just verify outputs.
These concerns aren't new. The field of cybernetics, which informs AI, has long grappled with similar questions. Project Cybersyn in Chile, now seeing renewed interest, is a prime example. Developed by Stafford Beer for Salvador Allende's government, it aimed to manage the economy in real time using the Viable System Model (VSM), which functioned like a biological organism's brain and nervous system. While Cybersyn's goal differed from AI supervision, its architecture reflected supervisory loops. The core challenge Cybersyn addressed, ensuring control without falling into authoritarianism, remains relevant today. In Cybersyn, the operations room, the "brain" of the system, wasn't reserved for the president or the elite; it was meant for the people. This offers a crucial lesson for AI supervision: oversight should be democratized, not reserved for those already in power.
Where to?
Comments

Independent Consultant (4 months ago)
Ah! And this adds new challenges to supervision! https://gizmodo.com/human-feedback-makes-ai-better-at-deceiving-humans-study-shows-2000503919
Boru Douthwaite, you might be interested in this!
Independent Consultant (5 months ago)
Note: I feel guilty sharing something that might be perceived as a distraction from where our human and humanitarian focus should be. All attention remains on Gaza, Lebanon, and the region.
Independent Consultant (5 months ago)
Thanks, Rick! What a trove of comments! I'm glad it sparked such rich conversation. I think you see the point: loops help highlight the fact that different levels of supervision exist. However, there's also a need to show how these levels connect. Many existing frameworks tend to align with one loop or another, but I believe there's valuable insight to be gained in examining how these levels interact. It's useful to have a tool that captures considerations at various levels, but it's equally important to question the links between them. I do see loops as "relaxed," as you mentioned, but as a way to overcome my fear that top-down thinking and linearity may also be applied to supervision. I wish people could recognize that misalignment in outputs isn't always something to "fix" with a tool. Sometimes, these tools can actually sweep deeper issues under the rug. My worry is that we'll see increasing fragmentation of tools and expertise, which is why it's essential to focus on the complexity and richness rather than the linearity of these layers of supervision. Rather than creating a massive, overcomplicated hype-framework, we need a "set of thinking glasses" to join the dots when approaching the complexities of AI supervision.
Independent Monitoring and Evaluation Consultant (5 months ago)
Re "Why loops": I think this idea is harder to implement than it appears at first sight, at least in its original conception (a hierarchy of logical types). You can have activities, criteria for assessing those activities, then criteria for assessing those criteria... You see all three in dialogues around MSChanges. But I don't think I have ever seen attempts to debate at higher levels of abstraction. Still, I think your loops, interpreted in a more relaxed way, are well worth thinking about.