Bringing Humans into the Loop: User Researchers and User Feedback's role in LLM Observability
Humans in the loop, generated using DALL·E 3


Introduction

In Part 1 of this series, I described how our team at Cisco built an internal generative AI assistant—our “safe and authorized” way to boost productivity and creativity at work. Guided by user research insights and a robust prioritization framework, we set a clear roadmap for our generative AI tool.

Now, in Part 2, I’ll discuss the next critical step of our journey: ensuring the quality of AI responses after launch and designing a feedback loop that people across our enterprise will actually use.

Why Feedback Loops Matter in Generative AI

When you get a subpar response from a tool like ChatGPT or Claude, what do you typically do?

  • Try a different prompt?
  • Hit the thumbs-down button?
  • Type an exasperated follow-up message?

When I first started learning about generative AI chat experiences, I often heard the refrain: “gen AI learns from user input.” Many people around me held a vague belief that if they provided a negative reaction – be it a thumbs-down or a reprimanding follow-up prompt – their generative AI tool would improve immediately, or at least overnight.

It’s natural to assume that your input alone will instantly “teach” the AI to do better. But in an enterprise context, improving generative AI isn’t that simple. Yes, generative AI chat tools implemented inside a company can improve – but how?

The Reality of “Self-Learning”

Many large language models (LLMs) already incorporate Reinforcement Learning from Human Feedback (RLHF)—a process where human reviewers label generated responses as “good” or “bad,” which trains a secondary model to improve outputs. Yet this training is extremely resource-intensive.
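
To make the RLHF idea concrete, the reward-modeling step at its core can be sketched in a few lines of PyTorch. This is a minimal illustration of the standard pairwise preference loss, using made-up stand-in scores; it is not any vendor's actual training code.

```python
import torch
import torch.nn.functional as F

# Hypothetical reward-model scores for pairs of candidate responses to the same prompts.
# In a real RLHF pipeline these come from a learned reward model; here they are stand-ins.
reward_chosen = torch.tensor([1.8, 0.4, 2.1])     # responses human reviewers preferred
reward_rejected = torch.tensor([0.2, 0.9, -0.5])  # responses reviewers marked as worse

# Bradley-Terry style pairwise loss: push preferred responses to score higher than
# rejected ones. Minimizing this trains the reward model, which in turn guides the
# main model during reinforcement-learning fine-tuning.
loss = -F.logsigmoid(reward_chosen - reward_rejected).mean()
print(f"preference loss: {loss.item():.3f}")
```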

There’s growing interest in having LLMs “self-learn” by generating their own synthetic data. For instance, Google DeepMind has explored “Socratic” self-learning. However, relying too heavily on self-generated data can reinforce biases or misinformation, risking what some call “model collapse” or dangerous misalignment.

Completely removing humans from the loop is risky. Even if your enterprise relies on a managed model like GPT-4 from Azure OpenAI, you’ll still need processes to fine-tune and maintain your retrieval-augmented generation (RAG) pipeline.

Where Humans Fit In

From a design strategy perspective, it’s crucial to recognize the role of response-level feedback in maintaining user trust. The core value proposition of a generative AI product depends on consistent, high-quality results.

Here’s what often happens in an enterprise setting:

A user gives an AI response a thumbs-down.

Researchers, product owners, or AI engineers review that feedback and try to diagnose the root cause. They may need to:

  • Fine-tune the model with better examples.
  • Make UX/UI adjustments so users understand how to prompt effectively.
  • Provide user training or example prompts that encourage clarity and specificity.
  • Improve the overall RAG pipeline (e.g., caching, indexing, using specialized models, etc.).
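
For readers less familiar with the plumbing, here is a minimal, hypothetical sketch of where those levers (indexing, caching, specialized models) sit in a RAG pipeline. The function names, knowledge-base entries, and placeholder model call are illustrative assumptions, not our actual implementation.

```python
from functools import lru_cache

# Toy in-memory "index": in a real pipeline this would be a vector store, and better
# chunking/indexing is one of the levers for improving response quality.
KNOWLEDGE_BASE = {
    "vpn setup": "To set up VPN access, install the approved client and ...",
    "expense policy": "Expenses over 75 dollars require a receipt and ...",
}

def retrieve(query: str, k: int = 2) -> list[str]:
    """Naive keyword-overlap retrieval; swap in embeddings or reranking to improve it."""
    scored = sorted(
        KNOWLEDGE_BASE.items(),
        key=lambda kv: len(set(query.lower().split()) & set(kv[0].split())),
        reverse=True,
    )
    return [text for _, text in scored[:k]]

def call_llm(prompt: str) -> str:
    # Placeholder for a managed model (e.g., via Azure OpenAI); a specialized model
    # could be substituted here for particular use cases.
    return f"[model response based on a prompt of {len(prompt)} characters]"

@lru_cache(maxsize=1024)  # caching lever: avoid re-answering identical questions
def answer(query: str) -> str:
    context = "\n".join(retrieve(query))
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
    return call_llm(prompt)

print(answer("how do i get vpn setup"))
```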

Subject matter experts (SMEs) or content owners may need to:

  • Add missing information to the knowledge base.
  • Remove or update outdated content.
  • Improve existing content guidelines.

Without human oversight and collaboration — including user researchers, designers, and SMEs — the AI’s performance can degrade. This mirrors the challenges we once saw with enterprise search, where unmanaged or outdated content turned search results into a minefield.

Translating Dissatisfaction into Actionable Recommendations

How do we handle the response-level feedback coming into our product?

A big challenge in my role as a UX researcher has been interpreting response-level negative feedback and proposing a systematic way to generate recommendations that positively affect response quality.

This aspect of user experience management is sometimes referred to as "LLM observability" and is part of an enterprise's broader "LLM operations."

There needs to be a well-staffed system of humans who ensure results from generative AI tools are accurate. To do this, we could:

  1. Ask humans (including SMEs) to evaluate outputs based on established categories. This is effective but expensive.
  2. Use automated evaluation metrics or frameworks such as RAGAS, BLEU, or ROUGE to compare current outputs against a reference set of outputs, which requires some development knowledge (a minimal illustration follows this list).
  3. Use a model trained on "good outputs" to judge current outputs, which requires much more development knowledge.
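
As a rough illustration of option 2, the toy sketch below computes a ROUGE-1-style unigram-overlap F1 between a current output and an SME-curated reference answer. Real frameworks such as RAGAS, or the standard BLEU/ROUGE implementations, are more sophisticated; the example text and scores here are invented for illustration.

```python
def rouge1_style_f1(candidate: str, reference: str) -> float:
    """Toy ROUGE-1-style score: unigram-overlap F1 between a candidate and a reference."""
    cand_tokens = set(candidate.lower().split())
    ref_tokens = set(reference.lower().split())
    overlap = len(cand_tokens & ref_tokens)
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

# Hypothetical reference answer curated by SMEs, compared against a live output.
reference = "submit expenses over 75 dollars with a receipt within 30 days"
current_output = "expenses over 75 dollars need a receipt and must be filed within 30 days"
print(f"ROUGE-1-style F1 vs. reference: {rouge1_style_f1(current_output, reference):.2f}")
```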

While we may be able to tackle option 1 initially, the cost doesn't scale to the volume of feedback coming in from across the entire enterprise.

Busy humans need a clearer, more streamlined way to address inaccurate or outdated content. Without their timely intervention, our AI tool’s quality could slip.

But as user researchers without the necessary development skills, we hit a roadblock trying to implement options 2 and 3.

To reach a middle ground, we took on option 1 (categorizing feedback and enlisting human reviewers) for a sample set of incoming feedback, when time allows between research projects. But we have our eyes set on more automated review once we have more development resources.
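
In practice, option 1 can start very simply: draw a manageable sample of the incoming thumbs-down feedback and have reviewers tag each item against a small failure taxonomy. The categories and data shape below are hypothetical assumptions, not our production schema.

```python
import random

# Illustrative review taxonomy; tune the categories to your product's observed failure modes.
CATEGORIES = ["retrieval miss", "outdated content", "hallucination",
              "unclear prompt", "formatting/UX issue"]

def sample_for_review(thumbs_down_events: list[dict], n: int = 50, seed: int = 7) -> list[dict]:
    """Draw a reproducible sample of negative-feedback events for human review."""
    random.seed(seed)
    sample = random.sample(thumbs_down_events, min(n, len(thumbs_down_events)))
    # Reviewers fill in 'category' (from CATEGORIES) and 'recommended_fix' during triage.
    return [{**event, "category": None, "recommended_fix": None} for event in sample]

# Usage: route the tagged results to AI engineers, designers, and content SMEs.
events = [{"query": f"question {i}", "response": "...", "rating": "down"} for i in range(500)]
review_queue = sample_for_review(events)
print(len(review_queue), "items queued for human review")
```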


Why User Experience Leaders Should Care

LLM observability is, at its heart, a user experience issue.

For example, quarter over quarter, one of the top two reasons for user dissatisfaction with our tool is "lack of accurate responses."

If we notice a drop in task completion (for example, a 10-point drop in users able to complete a high-stakes task such as "finding internal product information"), we need LLM observability in place to improve response quality. No amount of UI improvement or AI learning and enablement alone can compensate for poor response quality.

For design strategists and AI leaders, the takeaway is:

  • Invest in feedback loops that make it easy for users to report poor responses.
  • Staff a review system with AI engineers/data scientists, UX staff, and content SMEs who can analyze and correct errors.
  • Design AI products with a service design lens, focusing on documented collaboration across teams and clear governance for adding, updating, or removing enterprise content.

This ensures that as our AI product evolves, it remains accurate, trustworthy, and aligned with business goals. At the end of the day, if our end users lose trust in AI response quality, the overall experience is at stake.

Up next

In Parts 3 and 4, I'll continue my reflections as the lead user experience researcher for our enterprise-wide employee AI experience and cover the measures and metrics that seem to matter most.

Stay tuned for more ideas on keeping humans front and center in AI product development.
