AI supervision loops

[This article is a work in progress, to be improved with comments and input from readers. Thanks!]

As Artificial Intelligence (AI) becomes integral to evaluation processes, effective AI supervision is increasingly crucial. This article explores how the concept of learning loops can be adapted to address the unique challenges of supervising AI in evaluation work.

The starting point: Learning loops

An old favourite of mine is the concept of double-loop learning, introduced by organizational theorist Chris Argyris, along with later explorations of the "triple loop". Argyris explained double-loop learning with a simple analogy:

"[A] thermostat that automatically turns on the heat whenever the temperature in a room drops below 69°F is a good example of single-loop learning. A thermostat that could ask, 'Why am I set to 69°F?' and then explore whether a different temperature might more economically achieve the goal of heating the room would be engaged in double-loop learning."

Single, double, and triple-loop learning have become widely used in evaluative thinking and reflexive practice. These models are simple to understand yet profound:

  • Single-loop learning addresses the question, "Are we doing things right?"
  • Double-loop learning asks, "Are we doing the right things?"
  • Some felt the need to introduce a third loop of learning, which asks: "Who decides what is right?" or "Based on what values do we determine what is right?"

So, how might these loops apply to the supervision of AI use in evaluation?

Machines (including thermostats!) are becoming smarter

A few years ago, a thermostat capable of double-loop learning was purely hypothetical; the higher loops were exclusively a domain for human supervision. Now, smart thermostats can perform double-loop learning, analyzing patterns of use, environmental conditions, and energy efficiency goals to suggest or automatically implement new settings. As AI advances, we're entering a realm where machines can handle both autonomous decisions and reflective learning. The concept of "supervision" of AI in evaluation also needs to evolve. It's no longer just about overseeing machine outputs, but about managing a complex process where AI engages in higher-order decisions, tasks that were once the sole domain of humans. We must then balance automation and human agency: what type and level of AI versus human involvement is necessary, and where?
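
To make the analogy concrete, here is a minimal sketch in Python (all class names and the occupancy/energy-price signals are invented for illustration) contrasting a single-loop thermostat, which only enforces its setpoint, with a double-loop one that also revisits the setpoint itself:

```python
# Purely illustrative sketch: hypothetical names; the "occupancy" and
# "energy price" signals are invented, not from any real device.

class SingleLoopThermostat:
    """Single-loop learning: act to meet a fixed goal."""

    def __init__(self, setpoint_f: float = 69.0):
        self.setpoint_f = setpoint_f

    def control(self, room_temp_f: float) -> str:
        # "Are we doing things right?" -- compare the reading to the goal.
        return "heat on" if room_temp_f < self.setpoint_f else "heat off"


class DoubleLoopThermostat(SingleLoopThermostat):
    """Double-loop learning: also question the goal itself."""

    def review_setpoint(self, occupancy: float, energy_price: float) -> None:
        # "Are we doing the right things?" -- revise the setpoint when the
        # room is rarely occupied or energy is expensive (toy heuristic).
        if occupancy < 0.2 or energy_price > 0.30:
            self.setpoint_f -= 1.0


thermostat = DoubleLoopThermostat()
print(thermostat.control(room_temp_f=67.0))           # single loop: "heat on"
thermostat.review_setpoint(occupancy=0.1, energy_price=0.35)
print(thermostat.setpoint_f)                           # double loop: goal revised to 68.0
```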

Which loops for AI supervision?

I ended up with more than the original three loops, though they still echo Argyris's concept.

1. Compliance Loop: Tick-boxing. Is AI doing what we specifically asked for?

This basic loop checks whether we are getting what we asked for. Are outputs aligning with the pre-established blueprint? Do reports contain the needed information in the required language and formats? At its core, this is a “tick-boxing” exercise—but not a trivial one. AI’s probabilistic nature means its outputs aren’t always reproducible, so even something as simple as verifying consistency can be tricky.
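
As a small illustration of why this tick-boxing is not trivial, the sketch below (Python; `generate_report` is a hypothetical stand-in for whatever AI system is actually used, and the required sections are an assumed blueprint) re-runs the same request several times and checks each output against the blueprint, precisely because probabilistic outputs can differ from run to run:

```python
# Hypothetical compliance check: the model call and the required sections
# are placeholders; the point is the pattern -- check every run, not just
# one, because outputs of a probabilistic system can vary.

REQUIRED_SECTIONS = ["Summary", "Findings", "Recommendations"]

def generate_report(prompt: str) -> str:
    """Stand-in for the actual AI call; replace with the real model."""
    return "Summary\n...\nFindings\n...\nRecommendations\n..."

def missing_sections(report: str) -> list[str]:
    """Return the required sections that are absent from a report."""
    return [section for section in REQUIRED_SECTIONS if section not in report]

def compliance_check(prompt: str, runs: int = 5) -> None:
    for i in range(runs):
        missing = missing_sections(generate_report(prompt))
        status = "compliant" if not missing else f"missing {missing}"
        print(f"run {i + 1}: {status}")

compliance_check("Draft the evaluation report for project X")
```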

2. Correctness Loop: Accuracy and Validation. Is AI doing things right?

This loop ensures the factual accuracy of AI’s work. It corresponds to single-loop learning. The focus is on validating outputs and catching errors, hallucinations, or misalignments. It can be daunting: given the volume of data AI can process and the amount of information it can generate, reviewing the outputs can be more demanding than producing them. We need efficient strategies—like sampling, anomaly detection, and human-in-the-loop systems. It's also crucial to avoid automation bias, where we might assume "the machine must be right" and fail to assess its output critically.
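
By way of illustration, one such strategy is sketched below in Python (the findings are dummy data and the review step is a placeholder for whatever verification process the team actually uses): draw a random, auditable sample of AI outputs and route it to human reviewers instead of checking every item.

```python
import random

# Illustrative only: dummy findings, and the "review" is just a print;
# in practice the sampled items would go to a human fact-checking step.

def sample_for_human_review(outputs: list[str], rate: float = 0.1,
                            seed: int = 42) -> list[str]:
    """Select roughly `rate` of the outputs for manual verification."""
    rng = random.Random(seed)  # fixed seed keeps the sample auditable
    k = max(1, round(len(outputs) * rate))
    return rng.sample(outputs, k)

findings = [f"AI-generated finding #{i}" for i in range(200)]
for item in sample_for_human_review(findings):
    # A human checks each sampled item against the underlying data;
    # repeated problems in the sample should trigger a wider review.
    print("queued for human review:", item)
```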

3. Methodological Loop: Suitability of Tools and Approaches. Is AI doing the right thing - choosing the most suitable tools and approaches?

This reflects double-loop learning. As AI becomes increasingly involved in choosing tools, designing methodologies and whole evaluative processes, supervision ensures these choices are contextually relevant and appropriate for the evaluation. This might include, for example, assessing whether an AI-suggested sampling strategy or data collection method is appropriate for the evaluation's context.

4. Cultural and Ethical Loop: Worldviews—AI’s and Ours. Whose values and views are determining what is right?

This loop encompasses aspects of triple-loop learning, assessing whether AI's decisions, processing, tools and documentation are culturally appropriate. Whose perspective, whose ideology do they reflect? For example: is an AI-conducted interview sensitive to diverse norms, values, and communication styles? While AI systems increasingly incorporate ethical guardrails, we must question their adequacy: are they biased or skewed toward mainstream values? Too broad or too restrictive? Inclusive or at risk of stigmatizing or marginalizing people? Guardrails can sometimes interfere with expressing local realities or unique contexts. As AIs with varying personalities and alignments develop, we must continually reassess the perspectives they adopt. This loop should also address AI degradation, where AIs trained on outputs from other AIs may produce increasingly artificial, less accurate information. This is a challenging loop to supervise, since AI misalignment may reflect the worldview or biases of the very humans supervising it, and may therefore be hard to spot.

5. Epistemological Loop: AI’s Role in Evaluation. What / whose / which "intelligence" and "knowledge"?

At the highest level, this loop questions how AI influences the very nature of evaluation and evaluative thinking. Is AI enhancing or undermining our understanding of change? How does AI's hidden functioning relate to our ways of making sense of reality and change? Is AI narrowing or expanding our perspectives? Is it responsive to our deepest aspirations and principles? This loop interrogates AI's role in knowledge production, challenging the explainability of its processes and outcomes. It raises questions about the broader systems we operate within, including the existing interconnections, the powers, and the principles at stake. It demands that we consider whether and how AI use can fundamentally alter the nature of knowledge and its generation, as well as our humanity.


Within each loop, we need to consider several factors:

  1. AI vs. human supervision: Many supervisory tasks can now be performed autonomously by AI. It’s crucial to determine which tasks can remain AI-driven and where human intervention is still essential to ensure proper oversight.
  2. Complexity of supervisory tasks: Even within the same loop, tasks vary in complexity. Routine tasks like validation checks can be easily standardized, while more complex activities—such as designing sampling strategies or identifying anomalies—require higher-level expertise, whether from humans or advanced AI systems.
  3. Range of supervisory approaches: Supervision now goes beyond merely controlling outputs. It’s increasingly about shaping the systems within which AI operates. Supervisory approaches might range from deploying automated tools for routine monitoring to guiding and auditing AI’s decision-making frameworks.

Why loops?

There are already good frameworks, such as Montrosse-Moorhead’s “Seven Criteria for Judging AI in Evaluation”, highlighting important factors for supervising AI use in evaluation, such as efficiency, equity, trust, methodological validity, and explainability. [Source: Montrosse-Moorhead, B. (2023). Evaluation criteria for artificial intelligence. New Directions for Evaluation, 2023, 123–134. https://doi.org/10.1002/ev.20566] The work by MERL Tech and AI ethics champions such as Linda Raftree and Emily Springer also pinpoints valuable concerns and criteria.

The added value of loops is to capture the dynamic nature of AI supervision in evaluation, emphasizing interconnectedness across various concerns—technical, ethical, and cultural. Here are some points worth highlighting:

  • Interconnectedness Across Layers: Loops emphasize how issues at different levels interact. A minor discrepancy in AI output could indicate deeper methodological or ethical issues, such as biased training data reflecting historical inequalities. So loops ensure that lower-level minor errors are seen for what they are: potential indicators of larger, systemic flaws.
  • Balance of technical checks and systemic oversight: In the evaluation field, lower-level supervision (data validation, aligning findings to frameworks) often takes precedence over deeper systemic inquiries. The loop model encourages routine challenging of the foundational models, methods, and values shaping AI-generated findings—and, with this, our work. As AI frees up more time in data processing, are we keen to use it to question the deeper loops, in ways that might challenge existing power structures or call into question our own approaches?
  • The know-how needed: integration and collaboration. The loop model discourages siloed expertise, where junior staff handle technical tasks while senior experts address ethical considerations. Instead, it promotes rounded professional profiles and integrated exchanges, bridging technical proficiency and ethical awareness.
  • Accountability and Inclusivity: AI supervision risks mirroring the structure of tokenistic participatory evaluations, where low-level output may be scrutinized by all participants, but the underlying methodologies and epistemological choices remain largely unchallenged. Loops invite accountability at every level, potentially challenging stigma and bias, as well as values, choices, and power in evaluation.

When the learning loop idea was introduced in the organizational management field, it was revolutionary and challenging. In its simplicity, it revealed that organizations could not think about "effectiveness" without also examining their ways of working and their own policies. Underlying assumptions had to be challenged.

Similarly, the supervision loops push us not to stop at low-level concerns (such as obtaining replicable and valid results) but also to check the nature of AI and of the evaluation systems employing it. They highlight important aspects often overlooked in traditional evaluation, increasingly flagged by value-driven, transformative approaches. AI supervision may thus catalyze much-needed debate on evaluation methodologies and practices. It reveals that AI biases often reflect our biases, and AI limitations mirror our own.

For instance, the current emphasis on AI "explainability" is particularly telling. We now stress the need to "look into the black box" of AI decision-making, but this raises a crucial question: was our traditional evaluation approach ever truly explainable to begin with? Have we considered whether our methods and paradigms were comprehensible to communities with different cultures or alternative ways of making sense of reality? This newfound focus on explainability in AI serves as a mirror, compelling us to examine the transparency and accessibility of our own long-standing evaluation practices.

By supervising AI at higher levels, we're compelled to interrogate our own work in evaluation. While this makes higher-level supervision challenging, it also offers an opportunity for transformation. Seeing our biases and challenges from a different perspective can help us address them more effectively.

So, who checks?

A key question is: who should oversee each loop? Interestingly, when I posed this question to an AI, it immediately presented the supervision process as a top-down hierarchy, with "higher echelons in power" overseeing the upper loops. This would clearly miss the point.

However, the AI answer tends to mirror mainstream thinking, reflecting an existing trend in evaluation where senior experts oversee higher-level assessments while junior staff handle routine checks at the lowest loops. While this division might be practical at times, it risks perpetuating the same issues that evaluation has faced historically—being used as a top-down mechanism for controlling change. The slow adoption of complex, reflective models in favour of linear data processing also reinforces a focus on outputs rather than systemic insights. If AI supervision follows this path, it could further marginalize vulnerable voices and constrain decision-making to those already in power.

To address this, we must rethink the entire approach to evaluation, not just AI supervision. Democratizing supervision loops can empower everyone—from junior staff to marginalized communities—fostering a deeper understanding of knowledge generation and its use in informing change. The aim is to challenge the foundational systems and values driving both AI and our assessments of reality, not just verify outputs.

These concerns aren’t new. The field of cybernetics, which informs AI, has long grappled with similar questions. Project Cybersyn in Chile, now seeing renewed interest, is a prime example. Developed by Stafford Beer for Salvador Allende’s government, it aimed to manage the economy in real-time using the Viable System Model (VSM), which functioned like a biological organism’s brain and nervous system. While Cybersyn’s goal differed from AI supervision, its architecture reflected supervisory loops. The core challenge Cybersyn addressed—ensuring control without falling into authoritarianism—remains relevant today. In Cybersyn, the operations room, or the “brain” of the system, wasn’t reserved for the president or the elite; it was meant for the people. This offers a crucial lesson for AI supervision: democratizing oversight can transform AI from a potential top-down tool into a driver of inclusive and transformative change.

Where to?

Ethical governance of AI requires questioning all loops, ensuring everyone, regardless of their role, is invited to participate in the supervision process - and has the right to do so. From an evaluator’s perspective, two parallel approaches can support this:

  1. Designing evaluation processes that invite scrutiny at every loop ensures that higher loops are not exclusively examined by managers or select individuals. This democratizes the evaluation process, enabling broader participation and ensuring that all levels contribute to high-level decision-making.
  2. Supporting continuous evaluation of AI itself. This is a massive task that no single individual can handle alone. However, evaluators play a crucial role by 1) advocating for accountable AI processes and outputs that are transparent and open to scrutiny, and 2) applying their expertise to the growing field of AI evaluation. While not all evaluators will specialize in AI, they are all uniquely positioned to demand accountability. The evaluation community is a key player in advancing tools and approaches to strengthen collective efforts in this direction. This continuous oversight is essential to improving AI systems and, in turn, creating a virtuous cycle that enhances the field of evaluation itself.

Silva F.

Independent Consultant

4 months ago

Boru Douthwaite, you might be interested in this!

Silva F.

Independent Consultant

5 months ago

Note: I feel guilty sharing something that might be perceived as a distraction from where our human and humanitarian focus should be. All attention remains on Gaza, Lebanon, and the region.

Silva F.

Independent Consultant

5 months ago

Thanks, Rick! What a trove of comments—I'm glad it sparked such rich conversation. I think you see the point: loops help highlight the fact that different levels of supervision exist. However, there's also a need to show how these levels connect. Many existing frameworks tend to align with one loop or another, but I believe there's valuable insight to be gained in examining how these levels interact. It's useful to have a tool that captures considerations at various levels, but it's equally important to question the links between them. I do see loops as "relaxed," as you mentioned, but as a way to overcome my fear that top-down thinking and linearity may also be applied to supervision. I wish people could recognize that misalignment in outputs isn't always something to "fix" with a tool. Sometimes, these tools can actually sweep deeper issues under the rug. My worry is that we'll see increasing fragmentation of tools and expertise, which is why it's essential to focus on the complexity and richness rather than the linearity of these layers of supervision. Rather than creating a massive, overcomplicated hype-framework, we need a "set of thinking glasses" to join the dots when approaching the complexities of AI supervision.

Rick Davies

Independent Monitoring and Evaluation Consultant

5 months ago

Re "Why loops": I think this idea is harder to implement than it appears at first sight, at least in its original conception (a hierarchy of logical types). You can have activities, criteria for assessing those activities, then criteria for assessing those criteria.... You see all three in dialogues around MSChanges. But I don't think I have ever seen attempts to debate at higher levels of abstraction. Still, I think your loops, interpreted in a more relaxed way, are well worth thinking about.


