Bringing Humans into the Loop: User Researchers and User Feedback's role in LLM Observability
Humans in the loop, generated using DALL·E 3


Introduction

In Part 1 of this series, I described how our team at Cisco built an internal generative AI assistant—our “safe and authorized” way to boost productivity and creativity at work. Guided by user research insights and a robust prioritization framework, we set a clear roadmap for our generative AI tool.

Now, in Part 2, I’ll discuss the next critical step of our journey: ensuring the quality of AI responses after launch and designing a feedback loop that people across our enterprise will actually use.

Why Feedback Loops Matter in Generative AI

When you get a subpar response from a tool like ChatGPT or Claude, what do you typically do?

  • Try a different prompt?
  • Hit the thumbs-down button?
  • Type an exasperated follow-up message?

When I first started learning about generative AI chat experiences, I often heard the refrain: “gen AI learns from user input.” Many people around me held a vague belief that if they provided a negative reaction – be it a thumbs-down or a reprimanding follow-up prompt – their generative AI tool would improve immediately, or at least overnight.

It’s natural to assume that your input alone will instantly “teach” the AI to do better. But in an enterprise context, improving generative AI isn’t that simple. Yes, generative AI chat tools implemented inside a company can improve – but how?

The Reality of “Self-Learning”

Many large language models (LLMs) already incorporate Reinforcement Learning from Human Feedback (RLHF)—a process where human reviewers label generated responses as “good” or “bad,” which trains a secondary model to improve outputs. Yet this training is extremely resource-intensive.
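
To make the RLHF idea concrete, the reward-modeling step at its core can be sketched in a few lines of PyTorch. This is a minimal illustration of the standard pairwise preference loss, using made-up stand-in scores; it is not any vendor's actual training code.

```python
import torch
import torch.nn.functional as F

# Hypothetical reward-model scores for pairs of candidate responses to the same prompts.
# In a real RLHF pipeline these come from a learned reward model; here they are stand-ins.
reward_chosen = torch.tensor([1.8, 0.4, 2.1])     # responses human reviewers preferred
reward_rejected = torch.tensor([0.2, 0.9, -0.5])  # responses reviewers marked as worse

# Bradley-Terry style pairwise loss: push preferred responses to score higher than
# rejected ones. Minimizing this trains the reward model, which in turn guides the
# main model during reinforcement-learning fine-tuning.
loss = -F.logsigmoid(reward_chosen - reward_rejected).mean()
print(f"preference loss: {loss.item():.3f}")
```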

There’s growing interest in having LLMs “self-learn” by generating their own synthetic data. For instance, Google DeepMind has explored “Socratic” self-learning. However, relying too heavily on self-generated data can reinforce biases or misinformation, risking what some call “model collapse” or dangerous misalignment.

Completely removing humans from the loop is risky. Even if your enterprise relies on a managed model like GPT-4 from Azure OpenAI, you’ll still need processes to fine-tune and maintain your retrieval-augmented generation (RAG) pipeline.

Where Humans Fit In

From a design strategy perspective, it’s crucial to recognize the role of response-level feedback in maintaining user trust. The core value proposition of a generative AI product depends on consistent, high-quality results.

Here’s what often happens in an enterprise setting:

A user gives an AI response a thumbs-down.

Researchers, product owners, or AI engineers review that feedback and try to diagnose the root cause. They may need to:

  • Fine-tune the model with better examples.
  • Make UX/UI adjustments so users understand how to prompt effectively.
  • Provide user training or example prompts that encourage clarity and specificity.
  • Improve the overall RAG pipeline (e.g., caching, indexing, using specialized models, etc.).
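
For readers less familiar with the plumbing, here is a minimal, hypothetical sketch of where those levers (indexing, caching, specialized models) sit in a RAG pipeline. The function names, knowledge-base entries, and placeholder model call are illustrative assumptions, not our actual implementation.

```python
from functools import lru_cache

# Toy in-memory "index": in a real pipeline this would be a vector store, and better
# chunking/indexing is one of the levers for improving response quality.
KNOWLEDGE_BASE = {
    "vpn setup": "To set up VPN access, install the approved client and ...",
    "expense policy": "Expenses over 75 dollars require a receipt and ...",
}

def retrieve(query: str, k: int = 2) -> list[str]:
    """Naive keyword-overlap retrieval; swap in embeddings or reranking to improve it."""
    scored = sorted(
        KNOWLEDGE_BASE.items(),
        key=lambda kv: len(set(query.lower().split()) & set(kv[0].split())),
        reverse=True,
    )
    return [text for _, text in scored[:k]]

def call_llm(prompt: str) -> str:
    # Placeholder for a managed model (e.g., via Azure OpenAI); a specialized model
    # could be substituted here for particular use cases.
    return f"[model response based on a prompt of {len(prompt)} characters]"

@lru_cache(maxsize=1024)  # caching lever: avoid re-answering identical questions
def answer(query: str) -> str:
    context = "\n".join(retrieve(query))
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
    return call_llm(prompt)

print(answer("how do i get vpn setup"))
```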

Subject matter experts (SMEs) or content owners may need to:

  • Add missing information to the knowledge base.
  • Remove or update outdated content.
  • Improve existing content guidelines.

Without human oversight and collaboration — including user researchers, designers, and SMEs — the AI’s performance can degrade. This mirrors the challenges we once saw with enterprise search, where unmanaged or outdated content turned search results into a minefield.

Translating Dissatisfaction into Actionable Recommendations

How do we handle the response-level feedback coming into our product?

A big challenge in my role as a UX researcher has been interpreting response-level negative feedback and proposing a systematic way to generate recommendations that positively affect response quality.

This aspect of user experience management is sometimes referred to as "LLM observability" and is part of an enterprise's broader "LLM operations."

There needs to be a well-staffed system of humans who ensure results from generative AI tools are accurate. To do this, we could:

  1. Ask humans (including SMEs) to evaluate outputs based on established categories. This is effective but expensive.
  2. Use automated evaluation metrics or frameworks such as RAGAS, BLEU, or ROUGE to compare current outputs against a reference set of outputs, which requires some development knowledge (a minimal illustration follows this list).
  3. Use a model trained on "good outputs" to judge current outputs, which requires much more development knowledge.
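
As a rough illustration of option 2, the toy sketch below computes a ROUGE-1-style unigram-overlap F1 between a current output and an SME-curated reference answer. Real frameworks such as RAGAS, or the standard BLEU/ROUGE implementations, are more sophisticated; the example text and scores here are invented for illustration.

```python
def rouge1_style_f1(candidate: str, reference: str) -> float:
    """Toy ROUGE-1-style score: unigram-overlap F1 between a candidate and a reference."""
    cand_tokens = set(candidate.lower().split())
    ref_tokens = set(reference.lower().split())
    overlap = len(cand_tokens & ref_tokens)
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

# Hypothetical reference answer curated by SMEs, compared against a live output.
reference = "submit expenses over 75 dollars with a receipt within 30 days"
current_output = "expenses over 75 dollars need a receipt and must be filed within 30 days"
print(f"ROUGE-1-style F1 vs. reference: {rouge1_style_f1(current_output, reference):.2f}")
```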

While we may be able to tackle option 1 initially, the cost doesn't scale to the volume of feedback coming in from across the entire enterprise.

Busy humans need a clearer, more streamlined way to address inaccurate or outdated content. Without their timely intervention, our AI tool’s quality could slip.

But as user researchers without the necessary development skills, we hit a roadblock trying to implement options 2 and 3.

To reach a middle ground, we took on option 1 (categorizing feedback and enlisting human reviewers) for a sample set of incoming feedback, when time allows between research projects. But we have our eyes set on more automated review once we have more development resources.
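
In practice, option 1 can start very simply: draw a manageable sample of the incoming thumbs-down feedback and have reviewers tag each item against a small failure taxonomy. The categories and data shape below are hypothetical assumptions, not our production schema.

```python
import random

# Illustrative review taxonomy; tune the categories to your product's observed failure modes.
CATEGORIES = ["retrieval miss", "outdated content", "hallucination",
              "unclear prompt", "formatting/UX issue"]

def sample_for_review(thumbs_down_events: list[dict], n: int = 50, seed: int = 7) -> list[dict]:
    """Draw a reproducible sample of negative-feedback events for human review."""
    random.seed(seed)
    sample = random.sample(thumbs_down_events, min(n, len(thumbs_down_events)))
    # Reviewers fill in 'category' (from CATEGORIES) and 'recommended_fix' during triage.
    return [{**event, "category": None, "recommended_fix": None} for event in sample]

# Usage: route the tagged results to AI engineers, designers, and content SMEs.
events = [{"query": f"question {i}", "response": "...", "rating": "down"} for i in range(500)]
review_queue = sample_for_review(events)
print(len(review_queue), "items queued for human review")
```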


Why User Experience Leaders Should Care

LLM observability is, at its heart, a user experience issue.

For example, quarter over quarter, one of the top two reasons for user dissatisfaction with our tool is "lack of accurate responses."

If we notice a drop in task completion (for example, a 10-point drop in users able to complete a high-stakes task such as "finding internal product information"), we need LLM observability in place to improve response quality. No amount of UI improvement or AI learning and enablement alone can compensate for poor response quality.

For design strategists and AI leaders, the takeaway is:

  • Invest in feedback loops that make it easy for users to report poor responses.
  • Staff a review system with AI engineers/data scientists, UX staff, and content SMEs who can analyze and correct errors.
  • Design AI products with a service design lens, focusing on documented collaboration across teams and clear governance for adding, updating, or removing enterprise content.

This ensures that as our AI product evolves, it remains accurate, trustworthy, and aligned with business goals. At the end of the day, if our end users lose trust in AI response quality, the overall experience is at stake.

Up next

In Parts 3 and 4, I'll continue my reflections as the lead user experience researcher for our enterprise-wide employee AI experience and cover the measures and metrics that seem to matter most.

Stay tuned for more ideas on keeping humans front and center in AI product development.
