Building an AI chatbot with real business value: notes from a non-developer

Amidst the AI revolution, there's a sweet spot between high-level theory and deep technical documentation that often goes unfilled. After spending months building an AI chatbot that halved report creation time for a financial services firm, I wanted to share what I learned – the successes and the occasional stumbles along the way.


What this guide is and isn't

This is a guide for people like me: tech-literate designers, consultants and leaders who are navigating the world of AI and chatbots, trying to figure out how to unlock value for their businesses.

Most beginner guides to AI chatbots are either too high-level or too detailed for people who want to know what's involved without necessarily building a solution themselves. This post is intended to sit somewhere in the middle. Developers will find it too basic; for everyone else, I hope it gives you a sense of what's required to bring an AI-powered chatbot to life.

I’ll cover:

  1. Considerations when choosing a use case
  2. Laying the groundwork for AI adoption
  3. The steps I took to build an AI chatbot using RAG (retrieval augmented generation)
  4. Levers that can be tweaked to improve performance
  5. My approach to evaluation and optimisation


Background to the project

Over recent months, I've been exploring how to take a user-centred approach to solving business problems with generative AI. While chatbots haven't been the only focus, they're part of the mix. So when a client was brave enough to join me on a journey of experimentation – with real business challenges and real stakes – I was all in.

I had two main goals in mind when taking on the project:

  1. Work directly with the material of AI: Learn about the challenges of implementing AI solutions at a practical level by getting my hands dirty on a contained and achievable project.
  2. Trial the consulting process for AI: Learn how to guide a client from a hazy sense of AI use cases to a working solution that would make their lives easier. What UX and service design principles could I apply? What parts of the process would I need to rethink entirely?


My prior experience with chatbots

With a background in design, I came into this project with no coding experience (unless you're counting basic HTML, which seems generous). I had experimented with drag-and-drop chatbot tools at the UX Design Institute, creating an AI tutor to answer student questions on Slack. While it was a fun prototype, it wasn't robust enough for real-world use.

Since then, I've explored several 'low-code' chatbot development platforms. In theory you can use these tools without knowing any code. But I quickly realised that a basic grasp of Python could help me get more out of them. So I completed two beginner-level Python courses: Python for Absolute Beginners (Udemy) and AI Python for Beginners (DeepLearning.AI).

While I'm certainly not a Python programmer, I can now set up environments, install packages, and decipher simple snippets of code. This proved invaluable when configuring the low-code tool I used to build the chatbot.


The process, step-by-step

[Diagram: My chatbot development process – the key steps in developing a RAG-based AI chatbot]

1. Identifying the use case

The first step was figuring out the right problem to solve. Using the service design and UX toolkit, I led a workshop with the client team to explore potential use cases. We focused on use cases that met four criteria:

  • Addresses a real business problem
  • Can be achieved with minimal technical overhead
  • Uses AI for what it's proven to be really good at
  • Poses limited risk to the business

We settled on a use case of speeding up the process for creating financial planning reports. These are documents that outline wealth management and investment strategies, tailored to the unique needs of each of the firm's clients.

The existing process involved drawing on information from hundreds of internal documents, emails, and PDFs – a task that could take up to 10 hours per report. Our goal was to reduce the preparation time by at least 20%.


2. Mapping out the journey

I worked closely with the team to map out the end-to-end process for creating a report. This step was critical. Understanding the current workflow would allow us to design a solution that fit seamlessly into the team’s day-to-day operations, increasing the chances of adoption.

[Diagram: Laying the foundations for AI adoption – journey mapping, behavioural understanding, team collaboration, uncovering nuances]

The report creation process was simple enough to map out. The real value was in the conversations it spurred with the client team, which uncovered important nuances about how they work.

A few examples:

1. Preserving the personal touch: Initially I had ideas about using AI to create templated reports. But the team emphasised the importance of providing a bespoke service to their clients. A cookie-cutter approach would be neither appropriate for this business, nor welcomed by the people doing the work.

2. Maintaining the advisor's voice: The team were rightly proud of the corpus of knowledge and insight that they had built up, and how it was expressed in each carefully-tailored report to their clients. This would have an impact on the level of expressiveness we allowed the AI in delivering its responses.

3. Making it useful on the go: The firm’s advisors often had their best insights right after client meetings. This led us to add speech-to-text capabilities to the chatbot through OpenAI's Whisper API, allowing them to capture and process these thoughts on the go.
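For the technically curious, the transcription step itself is only a few lines. Here's a minimal sketch using OpenAI's official Python client – the filename is hypothetical, and it assumes an OPENAI_API_KEY environment variable is set:

    # Minimal sketch: transcribe a voice memo with OpenAI's Whisper API.
    # Assumes OPENAI_API_KEY is set and "meeting_notes.m4a" exists locally.
    from openai import OpenAI

    client = OpenAI()

    with open("meeting_notes.m4a", "rb") as audio_file:
        transcript = client.audio.transcriptions.create(
            model="whisper-1",
            file=audio_file,
        )

    print(transcript.text)  # transcribed text, ready to pass to the chatbot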

The lesson? Just as with any service design or UX project, understanding how people actually work is critical for designing an AI system that's not just useful, but used.


3. Choosing RAG as an approach

The use case lent itself well to a Retrieval-Augmented Generation (RAG) system, which would query an internal knowledge base of documents to provide accurate, well-written answers.

[Diagram: How RAG works – a simplified view of retrieval augmented generation]

RAG combines two key capabilities: finding relevant information from a knowledge base (Retrieval) and generating tailored responses (Generation). The system first searches through the data to surface the most relevant snippets, then passes those snippets to an AI model that crafts a clear and informative answer.

Conceptually, RAG is fairly straightforward. But getting to a high-quality output requires a systematic process of testing and optimisation, which I'll get into later.
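To make those two stages concrete, here's a toy sketch in Python. The 'embeddings' are just word-count vectors and the 'generation' step simply assembles a prompt – a real system uses an embedding model, a vector database and an LLM – but the retrieve-then-generate shape is the same:

    # Toy RAG: retrieve the most relevant snippets, then build a grounded prompt.
    from collections import Counter
    import math

    KNOWLEDGE_BASE = [
        "Pension contributions are tax-deductible up to the annual allowance.",
        "Equity funds carry higher volatility than bond funds.",
        "A financial plan should be reviewed at least once a year.",
    ]

    def embed(text: str) -> Counter:
        """Stand-in embedding: a bag-of-words count vector."""
        return Counter(text.lower().split())

    def cosine(a: Counter, b: Counter) -> float:
        dot = sum(a[t] * b[t] for t in a)
        norm_a = math.sqrt(sum(v * v for v in a.values()))
        norm_b = math.sqrt(sum(v * v for v in b.values()))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

    def retrieve(query: str, k: int = 2) -> list[str]:
        """Retrieval: rank snippets by similarity to the query."""
        q = embed(query)
        ranked = sorted(KNOWLEDGE_BASE, key=lambda doc: cosine(q, embed(doc)), reverse=True)
        return ranked[:k]

    def generate(query: str, snippets: list[str]) -> str:
        """Generation: in a real system this prompt would go to an LLM."""
        context = "\n".join("- " + s for s in snippets)
        return f"Answer '{query}' using only this context:\n{context}"

    question = "How often should a financial plan be reviewed?"
    print(generate(question, retrieve(question)))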


4. Building the solution

I designed the solution using a tool called Flowise, which is based on LangChain – a popular framework for connecting the different elements of an LLM-based product. Flowise provides a graph-based interface for configuring and connecting the various components.

Here’s a simplified version of my Flowise setup with the key components labelled:

[Screenshot: My Flowise setup, with each component of the chatbot labelled]

Each component above represents a step in the process. If you're curious about the technical details, here's what's involved:

  1. Gather and prepare the source data: The data for this project came from previous reports, publications from regulatory bodies, emails, and other internal documents. Preparation included removing personal information from the data. The cleaned data was then uploaded to a document store in Flowise.
  2. Split the data into smaller chunks: Documents are divided into smaller chunks for ease of processing. Chunks are up to 1,000 characters long by default, but this can be adjusted during the optimisation process.
  3. Convert the chunks into embeddings: The chunks are converted into vector embeddings using an embedding model (e.g. OpenAI's text-embedding-3-small). This process translates the data into numerical representations that capture the meanings, relationships and similarities between words. Once the data is stored as embeddings, the chatbot can perform a semantic search to retrieve the most relevant information for a given query.
  4. Store the embeddings in a vector database: This is a type of database purpose-built for storing and searching vector embeddings. After trialling several database vendors with varying degrees of success, I settled on Pinecone.
  5. Configure a conversation chain: This part of the process involves making a 'chain' that connects the user query, the retrieved data, the LLM, and the prompt that guides the LLM.
  6. Connect a chat memory module: To allow for ongoing, multi-turn conversations, I added a memory component so the chatbot could remember the context of previous questions. (A code sketch of the whole pipeline follows this list.)
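And here's the promised sketch of the pipeline, written directly against LangChain in Python rather than through Flowise's interface. Package and class names shift between LangChain versions, the index name and document content are made up, and it assumes OpenAI and Pinecone API keys in the environment – treat it as illustrative rather than my exact configuration:

    # Hedged sketch: the six steps above as a LangChain pipeline.
    from langchain_core.documents import Document
    from langchain_text_splitters import RecursiveCharacterTextSplitter
    from langchain_openai import OpenAIEmbeddings, ChatOpenAI
    from langchain_pinecone import PineconeVectorStore
    from langchain.memory import ConversationBufferMemory
    from langchain.chains import ConversationalRetrievalChain

    # 1. Cleaned source data (placeholder content)
    cleaned_documents = [
        Document(page_content="...", metadata={"source": "report-001.pdf"}),
    ]

    # 2. Split into overlapping chunks (up to 1,000 characters each)
    splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
    chunks = splitter.split_documents(cleaned_documents)

    # 3 & 4. Embed the chunks and store them in a Pinecone index
    vector_store = PineconeVectorStore.from_documents(
        chunks,
        embedding=OpenAIEmbeddings(model="text-embedding-3-small"),
        index_name="planning-reports",  # hypothetical index name
    )

    # 5 & 6. Chain the LLM, retriever and chat memory together
    memory = ConversationBufferMemory(memory_key="chat_history", return_messages=True)
    chain = ConversationalRetrievalChain.from_llm(
        llm=ChatOpenAI(model="gpt-4o", temperature=0.2),
        retriever=vector_store.as_retriever(search_kwargs={"k": 4}),  # Top K = 4
        memory=memory,
    )

    print(chain.invoke({"question": "Summarise our standard pension guidance."}))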


5. Testing the pipeline

With everything set up, I submitted my first test queries – that moment when you hit 'send' and hold your breath. Barring some teething problems that were relatively simple to resolve, the system worked.

The initial responses were not bad. But I knew that getting consistent results in a real-world scenario would require extensive testing and optimisation.


6. Evaluating and optimising response quality

This was the longest and most labour-intensive part of the project. Evaluating the RAG responses and adjusting the various levers involved (chunking, embeddings, prompts, etc.) is crucial for optimising quality and accuracy. I followed a four-step approach.

[Diagram: The four-step approach for RAG evaluation, described below]

1. Compile a set of common queries for testing

In a workshop, I asked the client team to provide a list of 30 queries that they might ask the chatbot for information on. We sought a mix of queries that covered the breadth of concepts contained in the source documents.

2. Define evaluation metrics

I used a subset of the evaluation metrics defined by RAGAs, a framework for evaluating RAG systems.

[Diagram: RAGAs evaluation metrics – Context Recall, Context Precision, Faithfulness, Answer Relevancy]

These four metrics allow you to test the two critical aspects of an effective RAG system: retrieval of data, and generation of high quality responses based on that data. Context Recall is hard to evaluate manually, so I focused primarily on the other three metrics, double-checking my scores later with the client team.
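I scored these by hand, but the metrics can also be computed programmatically with the Ragas library. A hedged sketch, following the 0.1-style Ragas interface (the API has evolved across versions, and the sample data is made up):

    # Hedged sketch: automating RAGAs scoring. Ragas uses an LLM as judge,
    # so an API key (e.g. OPENAI_API_KEY) must be set in the environment.
    from datasets import Dataset
    from ragas import evaluate
    from ragas.metrics import answer_relevancy, context_precision, faithfulness

    # One evaluation row: the query, the chatbot's answer, the retrieved
    # chunks, and a reference answer (needed by context_precision).
    eval_data = Dataset.from_dict({
        "question": ["How often are client plans reviewed?"],
        "answer": ["Plans are reviewed annually, or sooner if circumstances change."],
        "contexts": [["Financial plans should be reviewed at least once a year."]],
        "ground_truth": ["Plans are reviewed at least annually."],
    })

    scores = evaluate(eval_data, metrics=[faithfulness, answer_relevancy, context_precision])
    print(scores)  # e.g. {'faithfulness': 0.90, 'answer_relevancy': 0.87, ...}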

3. Test the query set against various ‘recipes’

There are several levers you can play with when optimising a RAG system, all of which have an impact on the responses. I systematically tested the query set against different configurations (what I termed 'recipes'), scoring each response against the RAGAs metrics.

The main levers I experimented with (a sketch of how these could be captured as test configurations follows the list):

  • Chunking strategy: The size of the snippets of data and the degree to which one chunk's content overlaps with the next.
  • Embedding model: The model you use to translate document snippets into numerical representations of meaning. I explored various models from OpenAI and Anthropic's Claude.
  • Top K: The number of chunks retrieved by the system for a given query. You need to balance thoroughness with efficiency.
  • Response LLM: The LLM used to generate the response.
  • Prompt: The instructions given to the LLM on how to generate the response.
  • Temperature: The degree of expressiveness or creativity that you want the LLM to apply. Higher temperatures allow for more creative expression or flourish in the generated response. The right choice will depend on your use case.
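As promised, here's a hypothetical sketch of two such recipes expressed as plain Python configurations for systematic testing (the values are illustrative, not my winning settings):

    # Hypothetical 'recipes': each one fixes a value for every lever.
    recipes = [
        {
            "name": "recipe-01-baseline",
            "chunk_size": 1000, "chunk_overlap": 200,
            "embedding_model": "text-embedding-3-small",
            "top_k": 4,
            "response_model": "gpt-4o",
            "temperature": 0.2,
            "prompt": "Answer using only the provided context...",
        },
        {
            "name": "recipe-02-bigger-chunks",
            "chunk_size": 2000, "chunk_overlap": 400,
            "embedding_model": "text-embedding-3-large",
            "top_k": 6,
            "response_model": "gpt-4o",
            "temperature": 0.5,
            "prompt": "Answer using only the provided context...",
        },
    ]

    for recipe in recipes:
        print(f"Testing {recipe['name']} against the 30-query set...")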

4. Record and score responses

I recorded the queries, the responses and the metric scores in a spreadsheet, using a separate worksheet for each recipe. I tested 13 recipes in total.

Since I didn't have access to a production-grade testing pipeline, I performed the evaluation and optimisation manually. While time-consuming, this hands-on approach gave me an intimate understanding of how the different elements affected the output. It does of course inject a degree of subjectivity into the process.
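A spreadsheet was fine at this scale, but the record-keeping could equally be scripted. A minimal sketch, assuming each recipe's scored responses are held as a list of dicts:

    # Minimal sketch: write one recipe's scored responses to its own CSV file.
    import csv

    def log_recipe_results(recipe_name: str, rows: list[dict]) -> None:
        fields = ["query", "response", "faithfulness", "answer_relevancy", "context_precision"]
        with open(f"{recipe_name}.csv", "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=fields)
            writer.writeheader()
            writer.writerows(rows)

    # Hypothetical usage with a single scored row
    log_recipe_results("recipe-01-baseline", [{
        "query": "How often are client plans reviewed?",
        "response": "Plans are reviewed annually...",
        "faithfulness": 0.90, "answer_relevancy": 0.85, "context_precision": 0.80,
    }])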


7. Pilot phase and results

You can only truly test a system through day-to-day use. I ran a two-month pilot phase with the client team, having them use the chatbot weekly to help draft reports. This allowed us to see how the system performed in real-world scenarios and gauge user adoption.

The results exceeded our expectations. Instead of the 20% reduction we had targeted, the chatbot was delivering more than a 50% time saving – with no impact on the quality or bespoke nature of the final reports.

Equally impressive was the level of user adoption. The team embraced the chatbot and was using it on a daily basis, which was a testament to how well the solution fit into their existing workflows.


8. Keeping content up to date

Before deploying the solution into production (a process that merits its own blog post), I had one more challenge to overcome. How do you keep the content in the knowledge base from going stale?

I took the simple approach of providing a way for the client to upload or delete documents through a document loader on the Flowise back-end. I connected a Postgres database as a record manager (thanks to this guide on YouTube, one of many helpful video tutorials I relied on from Leo Van Zyl). The record manager checks any proposed changes against the current knowledge base to avoid issues like duplication of data.
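Under the hood, this kind of record manager builds on LangChain's indexing API. A hedged sketch of the idea in Python – the connection string is a placeholder, and updated_documents and vector_store stand in for the client's uploaded files and the Pinecone store from the earlier pipeline sketch:

    # Hedged sketch: deduplicated updates via LangChain's indexing API.
    from langchain.indexes import SQLRecordManager, index

    record_manager = SQLRecordManager(
        namespace="pinecone/planning-reports",
        db_url="postgresql://user:password@localhost:5432/records",  # placeholder DSN
    )
    record_manager.create_schema()  # one-off table setup

    # 'incremental' cleanup skips unchanged documents and removes stale
    # versions, which is what prevents duplication on re-upload.
    result = index(
        updated_documents,  # newly uploaded documents (placeholder name)
        record_manager,
        vector_store,       # the Pinecone store from the pipeline sketch
        cleanup="incremental",
        source_id_key="source",
    )
    print(result)  # counts of chunks added, updated, skipped and deleted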


Lessons learned

While I don’t claim to have it all figured out, these are the insights that stayed with me.

  1. Service design and UX have a big role to play. Though the technical challenges were new, my background in design and user research proved invaluable. Understanding users' needs and behaviours was key to ensuring the chatbot would deliver value.
  2. Co-creation drives adoption. Success with AI requires deep partnership with end-users - they're the experts in their domain. By involving them in shaping how AI can enhance their existing workflows, you create solutions that feel natural and intuitive. The alternative – relying on users to change their behaviour – is a tougher hill to climb.
  3. Low-code tools are democratising the path to solution delivery. You'll still need experts in data science, software development, compliance and more for enterprise-grade applications. But thanks to tools like Flowise, small-scale AI products and POCs can be achieved in ways that weren't possible before. And these tools are only getting better.
  4. Agentic AI has potential, but it's not quite there yet. I experimented with using different "agent" roles (researcher, writer, editor) to improve the outputs. While the promise of agents is exciting, I didn't see a significant qualitative difference compared to a well-crafted prompt. I expect this to change as the technology progresses.
  5. Evaluation and optimisation require substantial effort. Building the chatbot was just the first step. Ensuring it worked well enough for real-world use demanded extensive manual testing and iteration. Tools are emerging to help with this process, but it remains a critical and time-consuming part of any RAG project.


Final thoughts

The results speak for themselves: a 50% reduction in the time needed to create tailored financial planning reports, with genuine user adoption across the team.

Perhaps the biggest lesson was that tried-and-tested design principles – understanding user needs, iterating based on feedback, and focusing on real business problems – remain crucial when working with AI. The technology is powerful, but success still hinges on understanding how people actually work.
