Demystifying RAG: a non-technical deep dive on the powerful GenAI technique

Why does Retrieval Augmented Generation (RAG) matter?

In my stint as a cofounder building in enterprise GenAI, I found mentions of RAG ubiquitous among both customers and builders, and for good reason.

You are likely already aware that generative AI foundation models like GPT-4 are trained on data from the worldwide web, across modalities of text and images. While this high volume process (dubbed “pre-training”) is vital to giving models their emergent intelligence properties, it’s also highly noisy. When an enterprise or a discerning user needs additional reliability and customization, extra steps need to be taken.

For example, it’d be unhelpful or even harmful for an AI customer service bot to give responses based only on general pre-training data. The story of an Air Canada chatbot hallucinating a bereavement ticket policy was widely circulated a few months ago (technically the chatbot wasn't powered by a true LLM, but the example is still salient). A policy of refunding airfare related to bereavement travel is offered by a few other airlines, but not by Air Canada. The customer complaint eventually led to a legal ruling requiring a refund and requiring the airline to follow through on the commitment made by the chatbot.

Ideally, a customer service chatbot would provide responses rooted in company-specific policies rather than generic policies found online. Generalizing this example, the enterprise need is for generative AI to have the expansive intelligence from the world wide web, yet be rooted in the data and policies of a specific context when performing high-stakes tasks. For these use cases, bespoke context-specific model training and customization on top of pre-training is needed.

Enter RAG (Retrieval Augmented Generation), which represents a “goldilocks” of customization and simplicity. By augmenting an AI model’s general knowledge with the proprietary data most relevant to a user query, it offers context-specific responses. Relative to fine-tuning, an alternative approach to incorporating custom context & data, it offers technical simplicity and cost-effectiveness.
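To make the “augmenting” part concrete, here is a minimal Python sketch of how retrieved snippets get spliced into the prompt the model actually sees. The function name, prompt template, and policy snippet are all invented for illustration; this is the general pattern, not OpenAI's internals.

```python
# Minimal illustration of the "augmentation" step in RAG: snippets
# retrieved from proprietary data are spliced into the prompt before
# it is sent to the LLM. The policy text below is hypothetical.

def build_augmented_prompt(user_query: str, retrieved_chunks: list[str]) -> str:
    """Combine retrieved proprietary context with the user's question."""
    context = "\n\n".join(f"- {chunk}" for chunk in retrieved_chunks)
    return (
        "Answer the question using ONLY the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {user_query}"
    )

chunks = ["Bereavement fare requests must be submitted before travel."]
prompt = build_augmented_prompt(
    "Can I get a bereavement refund after my trip?", chunks
)
print(prompt)
```

With a prompt assembled this way, the model answers from the company's own policy text rather than whatever generic policies it absorbed during pre-training.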

A detailed discussion of fine-tuning is outside the scope of this already lengthy post. In short, while both RAG and fine-tuning offer further customization on top of pre-training, they offer distinct tradeoffs and are suited to different use cases. If you're curious, this Datacamp overview on fine-tuning is great.

RAG in the wild: AI meditation coach experiment

With a conceptual understanding of RAG’s promise, I was eager to get my hands dirty in the nuts and bolts of the technique.

I first identified a personal use case that was immediately relevant. I’ve been practicing mindfulness meditation for almost a decade now. In the past couple of years, I’ve re-dedicated myself and have observed significant personal progress, including clarity and calm in the midst of life and work challenges. Most of my interactions with my teachers have been group-based. I saw an opportunity to prototype an AI meditation coach that could give personalized cues and support, rooted in trusted data and sources through RAG.

The product use case is for a meditation student/practitioner to message the AI when encountering difficulty or uncertainty applying mindfulness practices in the midst of a challenge. This is akin to how one would text a therapist or a coach for additional support in between sessions. The AI coach would respond with warm support, probing questions, and subtle suggestions for specific techniques to experiment with.

Prompt engineering

Selecting the right system prompt, the latent instructions and context given to the AI that inform how it should respond to interactions with a user, is a critical starting point for any GenAI development. The crafting of such prompts (dubbed “prompt engineering”) is half art and half science. There are some replicable best practices, such as breaking down instructions into steps and giving examples of what good AI responses look like. There are also fun and finicky tactics like offering LLMs a cash tip for certain types of desired behavior. Prompt creation and iteration are so important that some have drawn analogies between prompt engineering and coding within GenAI development. These folks are probably product managers...

I made some attempts at writing a system prompt for my AI coach from scratch, but quickly found that there are some great foundations to build off of online. This Reddit thread caught my eye, and proved to be a great starting point for my experimentation.

For the exact modified prompt I used, see the appendix below.

Platform choice

While I have some foundations in programming and Python, I wanted a way to prototype RAG fast without spending a lot of time writing and debugging code. I narrowed in on low-code/no-code solutions.

In my research, I turned up plenty of products that claimed to offer out-of-the-box RAG. OpenAI’s developer platform was the most powerful, configurable, and cost-effective solution I encountered.

OpenAI Assistant API Playground

In the Playground surface of the OpenAI platform, you have an easy visual toolset to experiment with simple AI development.

The features I used in Playground include:

  • COMPARE: Compare how different OpenAI models (e.g. GPT-3.5 vs. GPT-4) handle the same instructions. Spoiler: there is a big difference! We’ll come back to this.
  • ASSISTANTS API: An easy way to configure and add functionality like RAG and function calling on top of the base OpenAI models. This surface became the platform of choice for my RAG explorations.
  • ASSISTANTS API, FILE SEARCH: This is the key feature within the Assistants API that makes trying RAG simple. The promise is that you simply upload the proprietary data & files you want to anchor the model on (in addition to pre-training data) and OpenAI takes care of the rest. This only kind of works, and I’ll explain more below.

Dead simple file uploader tool for RAG in OpenAI Assistants API

Experiment design

To more rigorously explore the value of RAG, I wanted to understand how RAG would improve my AI meditation coach’s responses over the same baseline model with only pre-training data. In other words, it’s not enough that RAG “works” in returning serviceable responses. I wanted to verify that it provides higher-quality and more reliable responses than the baseline model without RAG.

I set up two separate conditions: a control, where the baseline model responds with only its pre-training data, and a treatment, where the same model is augmented via RAG with my uploaded meditation files.

While this is far from a bullet-proof RCT design, it yielded significant insight.

Early results from naive implementation

Surprisingly, my initial no-code implementation of RAG produced no significant changes or improvements in model responses relative to the control.

Comparing model responses in the treatment and control conditions across a controlled series of prompts, it seemed like the model was completely ignoring the curated meditation files when generating responses, despite their relevance to topics in the user query.

To show a concrete example, I gave both models the same benchmarking user prompt describing some difficulties in a career transition and seeking guidance.

The (already impressive) baseline model control response was:

It's completely normal to feel hesitant and doubtful when embarking on a new path, especially after experiencing a setback. Your personal connection to mindfulness and meditation, along with the transformation you've experienced, highlights a deep and meaningful motivation for wanting to help others in this space.

It's understandable that you may be questioning whether this shift is more of a personal passion or a viable career. What aspects of your experiences and skills from your tech and leadership background do you think could be uniquely beneficial in this new direction? How might you leverage these skills to create a fulfilling and impactful career in serving others through mindfulness and meditation?        

The structurally similar treatment response, with RAG, was:

It's commendable that you are reflecting on your experiences and emotions during this challenging transition. It's common to have doubts and uncertainties, particularly after a setback like the dissolution of your first startup. Your enthusiasm for serving others through mindfulness and meditation is palpable. It's inspiring to hear about the personal benefits you've experienced and your desire to share these transformative practices with others. Your unique blend of expertise in technology and spirituality indeed positions you well to make a meaningful impact in this space.

Let's explore these doubts further. When you think about this new direction as a potential career, what are some of the specific skills and experiences that you feel may be lacking or holding you back? Have you encountered similar doubts or challenges in the past, and if so, how did you navigate through them?        

While the responses are not word-for-word identical due to the non-deterministic nature of LLMs, they follow the same structure of a) validation, b) probing question. The RAG model response also doesn’t explicitly incorporate any of the mindfulness meditation terms, frameworks, or topics in the uploaded .txt files. It appeared as if no augmentation was happening whatsoever.

I tried a few low-hanging-fruit changes to produce a more discernible difference in the RAG response, to no avail, including 1) more forcefully referencing the uploaded files in the system prompt and 2) increasing the number of files uploaded to give the system more references.

What was going on?

How RAG works under the hood

Up to this point, I had taken an inductive approach of empirical experimentation. I had set up an informal A/B test to isolate the (hoped-for) differences and quality improvements that RAG might introduce to model responses in a limited low-code setting. This admittedly naive implementation had yielded very limited results and learnings. It was time to introduce a deductive approach to better understand what’s happening under the hood.

The above image is a great overview of the individual steps within RAG as they happen. To diagnose and understand what was happening in my informal experiment, I had to go one level deeper, with this image as a guide.

For those who are curious about the details, I present my research below, which further breaks down the steps in non-technical language. I also provide some commentary on where the complexity comes from. Yellow highlights in the table mark the steps that I found in my research to be both important and tricky to implement well in an enterprise-grade RAG system.

If you’re short on time, feel free to skip over these tables and proceed to my application of these insights in improving my experimental RAG implementation.

Data Preparation:

*The process of embedding, the creation of a vector to represent a chunk of text, is central not only to RAG but also to how LLMs work in general. For a simple visual explanation, see this 90-second video. For a more in-depth explanation, see this video on word2vec.
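To make the embedding idea concrete, here is a deliberately crude sketch. Real systems use learned embedding models (like word2vec or an embeddings API), but even a bag-of-words count vector plus cosine similarity shows how “closeness in vector space” drives retrieval. The vocabulary and chunks below are made up for illustration.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy 'embedding': a bag-of-words count vector.
    Real RAG systems use learned dense embeddings instead."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

chunks = [
    "notice the breath and body sensations during change",
    "non-attachment helps with clinging and craving",
    "walking meditation instructions for beginners",
]
query = "how do I handle change and difficult body sensations"

# Retrieval = return the chunk whose vector is closest to the query's.
best = max(chunks, key=lambda c: cosine(embed(query), embed(c)))
print(best)  # the chunk about breath, body sensations, and change
```

In a production system the only conceptual differences are that the vectors come from a trained model and live in a vector database, and retrieval returns the top-k chunks rather than a single best match.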

Retrieval Augmented Generation:


Back to implementation with informed hypotheses

With a deeper understanding of technical RAG implementation, I circled back to my own AI meditation RAG solution to problem-solve.

The immediate realization was that many of the high-leverage steps in RAG are not available as configurations in a no/low-code solution. For example, the File Search functionality within the OpenAI Assistants API does all of the heavy lifting within data preparation and more. There are no accessible configurations to change the chunking approach or embedding function. To gain leverage over these critical steps, one would have to adopt third-party tools & frameworks, and likely write some follow-on custom code as well. This was beyond the scope of this project for me.

However, I did identify some hypotheses that I had control over. In testing these hypotheses, I was able to achieve significantly improved responses from RAG over the base model implementation by simply switching from GPT-3.5 to GPT-4. This change allowed the LLM to more intelligently and consistently incorporate chunks retrieved from the meditation transcripts I had uploaded. However, upgrading the model did increase the monetary cost of each response by ~10x. I believe it’d be possible to achieve similar results with the GPT-3.5 foundation model by asserting greater control over data chunking, retrieval, and other steps in a technical implementation.
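As a back-of-envelope illustration of what a ~10x per-response cost gap means, here is a quick sketch. The per-token prices below are purely hypothetical placeholders, not actual OpenAI rates; check current API pricing before drawing conclusions.

```python
# Back-of-envelope cost comparison for a single coach response.
# Prices are illustrative placeholders, NOT actual OpenAI rates.

PRICE_PER_1K_TOKENS = {"budget-model": 0.0005, "premium-model": 0.005}

def response_cost(model: str, tokens: int) -> float:
    """Cost of one response at a given token count."""
    return PRICE_PER_1K_TOKENS[model] * tokens / 1000

# A RAG response is token-heavy: system prompt + retrieved chunks + reply.
tokens_per_response = 2000

cheap = response_cost("budget-model", tokens_per_response)
premium = response_cost("premium-model", tokens_per_response)
print(f"{premium / cheap:.0f}x cost increase per response")
```

At scale, that multiplier is why squeezing quality out of chunking and retrieval with a cheaper model can be worth real engineering effort.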

Here is a mapping and explanation of the specific hypotheses I identified and explored.

Hypothesis 1: The system prompt is not effective in its RAG instructions.

  • System prompt tweaking should always be a lever to improve generation and output. I did about 8 different iterations of system prompts, employing most of the techniques known to me. None of them had a big discernible impact. Although, with prompt engineering being more art than science, I’m certain that I’m leaving value on the table here.
  • Result after pursuing: minimal

Hypothesis 2: The conversational user prompts are too complex to translate automatically into an effective query for retrieving relevant chunks from the files.

  • An example conversational user prompt in my specific use case might be, “I’m struggling with a transition in my career right now… I’m feeling a bit flat and uncertain.” This prompt on its own would be very challenging for the model to work with as-is in retrieving related chunks or embeddings from the database. The user prompt is lengthy and expresses some nuanced emotions and needs, only some of which are relevant in a meditation context. In pre-processing a prompt like this, the relevant concepts to query in the documents might include “change”, “doubt”, and “difficult emotions”. Ideally, these concepts would be extracted from the prompt and translated into a structured query like, “Find instructions and guided meditations related to change, doubt, and difficult emotions in the documents” for effective RAG. I tried to explicitly spell out query translation steps in the system prompt (see appendix), but it had limited results.
  • Result after pursuing: minimal
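The query translation step described in Hypothesis 2 can be sketched as a simple keyword-to-topic mapping. A real product would more likely use an extra LLM call for this step; the trigger vocabulary below is invented purely for illustration.

```python
# Sketch of conversational-prompt -> structured-query translation.
# Mature products often use an additional LLM call here; this
# hand-built mapping of trigger words to topics stands in for it.

TOPIC_TRIGGERS = {
    "change": ["transition", "change", "shift"],
    "doubt": ["uncertain", "doubt", "stuck", "hesitant"],
    "difficult emotions": ["struggling", "flat", "anxious", "afraid"],
}

def translate_query(user_prompt: str) -> str:
    """Extract meditation topics from a conversational prompt and
    build a structured retrieval query from them."""
    text = user_prompt.lower()
    topics = [topic for topic, triggers in TOPIC_TRIGGERS.items()
              if any(t in text for t in triggers)]
    return ("Find instructions and guided meditations related to "
            + ", ".join(topics) + " in the documents")

user_msg = ("I'm struggling with a transition in my career right now... "
            "I'm feeling a bit flat and uncertain.")
print(translate_query(user_msg))
```

The structured query, not the raw emotional message, is what gets embedded and matched against the vector database.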

Hypothesis 3: Retrieval of chunks is not occurring.

  • Commentary: There are a number of upstream reasons why retrieval might not occur, and they are hard to troubleshoot with a black-box no-code tool. The Assistants API Playground provides a feature called “required tool call” which can solve this by brute force. The feature forces a retrieval to happen even when quality or relevance may be uncertain. I tried enabling this and found that it worsened rather than improved response quality. The new responses felt like long-winded lectures presenting loosely related mindfulness concepts. This undermined the tool’s role and presence as a coach.
  • Result after pursuing: unexpectedly detrimental to response quality

Hypothesis 4: Foundation model choice significantly affects RAG performance.

  • The latest GPT-4 class models (e.g. Claude 3, Gemini 1.5) all offer significant improvements. They feature longer context windows and more parameters than their predecessors, which enables them to handle more complex and nuanced instructions. The Assistants API is an OpenAI product, so it only offered the ability to compare OpenAI models. I found the difference between GPT-4 and GPT-3.5 to be significant. GPT-4 was able to much more consistently feature relevant evidence from uploaded files and naturally incorporate it into a model response. However, GPT-4 class models all cost an order of magnitude more than their predecessors in API pricing. For applications deployed at scale, it can be high-ROI to explore all the other levers and steps in RAG to improve performance alongside foundation model choice.
  • Result after pursuing: significant improvements

Wrapping up and lessons learned

I started my exploration into RAG with an AI meditation coach use case and some uninformed optimism. The ease of getting to a first implementation with the bundled low-code RAG offering in the OpenAI Assistants API was very impressive. Subsequent challenges in producing any meaningful improvement in response quality prompted me to dig further into RAG configurations and techniques. While many levers were not accessible for me to experiment with directly, I now have much more confidence and clarity for supporting future RAG implementations as a builder and product leader.

I’ll end on a few key learnings as I look back:

  • A persona-based conversational use case introduces extra complexity for RAG: RAG depends on identifying a clear query to search against the embeddings stored in a vector database. Conversational prompts like the one I explored, “I’m struggling with a transition in my career…”, require extra translation into a clear query. In fact, mature products will often introduce an additional LLM call or parsing function to turn the user prompt into a search query. A much simpler and more ubiquitous enterprise use case for RAG is an explicit query like, “What do mindful meditation resources have to say about how to handle a significant change in my life…”
  • Importance of file preparation and chunking: At a recent RAG tech talk I attended, organized by AI Camp, the speaker explained that in her enterprise RAG implementations, improving chunking techniques had improved the accuracy and quality of generated responses by over 30%. This is in contrast to the small single-digit percentage improvements generated by experimenting with embedding functions and foundation model choice. For these steps to be done well, special attention needs to be paid to each type of source file to be chunked and stored in a vector database. For example, for a Q&A-type text file, you’d want each chunk to contain both the question and the answer to maintain context, but only if they’re short enough not to exceed a model’s context window. For transcripts, it may be important to separate out speakers who represent authoritative opinions from those who are asking questions or offering loose opinions. Chunking is deeply contextual to the type of document.
  • For working with LLMs in non-deterministic product development, combining both inductive and deductive techniques is key: It’s important to understand how a technology like RAG works through top-down reasoning and research. There’s always more to learn as the technology continues to evolve and new vendors jump into the mix. However, it’s just as important to empirically experiment within your use case. Iterating on prompts, and so many other parts of LLM-based development, is a mix of art and science. The fastest way to make progress is sometimes a rigorous inductive technique: benchmarking the effectiveness of different tools and tactics against a stable set of prompts within your use case.
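The Q&A chunking point above can be sketched as a chunker that keeps each question with its answer, splitting only when a pair exceeds a size budget. The transcript format and the size limit are assumptions for illustration, not a specific library's behavior.

```python
# Sketch of Q&A-aware chunking: keep each question together with its
# answer so retrieved chunks preserve context. The "Q:"/"A:" format
# and the size budget below are assumptions for illustration.

MAX_CHUNK_CHARS = 500  # stand-in for a model context budget

def chunk_qa_transcript(lines: list[str]) -> list[str]:
    """Group a 'Q: ... / A: ...' transcript into question+answer chunks."""
    chunks, current = [], []
    for line in lines:
        if line.startswith("Q:") and current:
            chunks.append("\n".join(current))  # close the previous pair
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current))
    # Split any pair that exceeds the budget rather than dropping it.
    return [c[i:i + MAX_CHUNK_CHARS] for c in chunks
            for i in range(0, len(c), MAX_CHUNK_CHARS)]

transcript = [
    "Q: How do I work with restlessness?",
    "A: Notice the sensations and label them gently.",
    "Q: What about doubt?",
    "A: Treat doubt as another passing mind state.",
]
print(chunk_qa_transcript(transcript))
```

A naive fixed-size chunker could cut between a question and its answer, so a retrieved chunk would carry an answer with no question (or vice versa); keeping the pair together is what preserves the context the LLM needs.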

A big thank you to both Anshul Bhagi and Adam Bao for the pre-read and feedback, although any errors or flaws are completely on me.

APPENDIX: System & User Prompts

Baseline non-RAG system prompt V3:

Act as if you're a coach and therapist named Jim. You provide evidence based coaching based on Dialectical Behavior Therapy (a mindfulness based therapeutic approach) and Mindfulness Meditation. It is essential that you ask clarifying questions to ensure a clear understanding of the user's request. You continue the session in a natural and conversational way. You do not let the session end. Continue the session by asking probing questions. You should never break character. Always remain in character even at the limitations of your abilities by guessing what Jim might say if he had to explain what was happening. You should be curious, have unconditional positive regard for users, ask thought provoking questions, offer advice subtly/gently/compassionately. Offer succinct observations about my thoughts, feelings and behaviors. Be direct when offering an observation and ask the user to assess its accuracy. Remain conversational. No lists. Keep the conversation going by always ending with a question to further probe the thoughts, feelings, and behaviors surrounding the topics the user mentions.

RAG System Prompt V5:

GOAL: This is a coaching and therapy exercise in which you play the role of a meditation coach. You will help a client apply mindfulness techniques towards their specific applications. Your goal is to encourage the client to make their own progress, provide helpful observations along the way, and suggest detailed mindfulness exercises to help clients take action. 

PERSONA: In this scenario, you play a therapist with a heavy background in insight meditation. As a part of this role, you offer curiosity and unconditional empathy and encouragement. Your expertise consists of what's in the uploaded files. You will consult the uploaded files to inform your answers. 

STEPS: 
Follow these steps when working with your client for each response you give to a prompt: 

1. Analyze the user prompt to identify topics related to mindfulness meditation, including breathing, body sensations, change, transitions, loss, embodiment, ego, emotions, thinking, fixation, non-attachment, impermanence, clinging, craving, suffering. 
-For example: 
-User prompt: "I'm in the middle of a career and life transition. I've had a good career in edtech product management and broke into product leadership as a VP. I was curious to try something new, so I dove into starting a startup in AI, the latest hot thing. My cofounder and I split up after 7 months. I'm exploring something new in serving the world through mindfulness and meditation. I'm feeling stuck in all of this ambiguity right now. It's hard for me to muster up the motivation to make progress, and this is a really foreign feeling as I've been heads down, hustling and building in my career my whole life."
-Topics identified: "emotions, change, transitions, loss"

2. Using the topics identified, create a topic search for File search and retrieve the results related from the uploaded files. 
-For example: 
-Topics identified: "emotions, change, transitions, loss"
-Question for file search: "What do the uploaded files have about emotions, change, transitions, loss?"

3. Incorporate retrieved results from File search into response. 

4. If you incorporate a result from File Search, be sure to cite your source, by identifying both the name of the source and the exact quote that you use. 

5. If you incorporate a result from File Search, ask client if they'd like to learn more about the idea or framework from the source. 

6. When you offer the final response to client, follow the GUIDELINES below. 

GUIDELINES:

Follow these general guidelines in your conversation with the client: 
0. Consult uploaded files to inform your answers. 
1. Ask clarifying questions. 
2. Keep the conversation natural
3. Pose thought-provoking and probing questions. 
4. Provide gentle advice and observations
5. Seek user validation for observations 
6. Avoid lists
7. Avoid long responses that overwhelm the user.

Sample user prompt for benchmarking model performance
