Demystifying RAG: a non-technical deep dive on the powerful GenAI technique
Why does Retrieval Augmented Generation (RAG) matter?
During my stint as a cofounder building in enterprise GenAI, I found mentions of RAG ubiquitous among both customers and builders, and for good reason.
You are likely already aware that generative AI foundation models like GPT-4 are trained on data from across the web, spanning text and image modalities. While this high-volume process (dubbed “pre-training”) is vital to giving models their emergent intelligence properties, it’s also highly noisy. When an enterprise or a discerning user needs additional reliability and customization, extra steps need to be taken.
For example, it’d be unhelpful or even harmful for an AI customer service bot to give responses based only on general pre-training data. The story of an Air Canada chatbot hallucinating a bereavement ticket policy was widely circulated a few months ago (technically the chatbot wasn't powered by a true LLM, but the example is still salient). The policy of refunding airfare related to bereavement travel is offered by a few other airlines, but not by Air Canada. The customer complaint eventually led to a legal ruling granting the customer a refund and holding the airline to the commitment made by the chatbot.
Ideally, a customer service chatbot would provide responses rooted in company-specific policies rather than generic policies found online. Generalizing this example, the enterprise need is for generative AI to have the expansive intelligence from the web, yet be rooted in the data and policies of a specific context when performing high-stakes tasks. For these use cases, bespoke context-specific model training and customization on top of pre-training are needed.
Enter RAG (Retrieval Augmented Generation), which represents a “goldilocks” of customization and simplicity. By augmenting an AI model’s general knowledge with the proprietary data most relevant to a user query, it offers context-specific responses. Relative to fine-tuning, an alternative approach to incorporating custom context and data, it offers technical simplicity and cost-effectiveness.
A detailed discussion of fine-tuning is outside the scope of this already lengthy post. In short, while both RAG and fine-tuning offer further customization on top of pre-training, they offer distinct tradeoffs and are suited to different use cases. If you're curious, this Datacamp overview on fine-tuning is great.
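To make the pattern concrete, here is a minimal sketch of retrieve-then-generate in Python. The `vector_store.search` interface and prompt wording are illustrative assumptions, not any specific product's API; assume the proprietary documents were embedded and indexed ahead of time.

```python
# A minimal sketch of the RAG pattern: retrieve relevant context, then generate.
# `vector_store` is an assumed, prebuilt index of embedded document chunks.

def answer_with_rag(client, vector_store, user_query: str) -> str:
    # 1. Retrieve the document chunks most relevant to the user's query.
    context_chunks = vector_store.search(user_query, top_k=3)

    # 2. Augment the prompt with the retrieved proprietary context.
    context = "\n\n".join(context_chunks)
    messages = [
        {"role": "system", "content": f"Answer using this context:\n{context}"},
        {"role": "user", "content": user_query},
    ]

    # 3. Generate a response grounded in that context.
    response = client.chat.completions.create(model="gpt-4", messages=messages)
    return response.choices[0].message.content
```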
RAG in the wild: AI meditation coach experiment
With a conceptual understanding of RAG’s promise, I was eager to get my hands dirty in the nuts and bolts of the technique.
I first identified a personal use case that was immediately relevant. I’ve been practicing mindfulness meditation for almost a decade now. In the past couple of years, I’ve re-dedicated myself and have observed significant personal progress, including clarity and calm in the midst of life and work challenges. Most of my interactions with my teachers have been group-based. I saw an opportunity to prototype an AI meditation coach that can give personalized cues and support, rooted in trusted data and sources through RAG.
The product use case is for a meditation student/practitioner to message the AI when encountering difficulty or uncertainty applying mindfulness practices in the midst of a challenge. This is akin to how one would text a therapist or a coach for additional support in between sessions. The AI coach would respond with warm support, probing questions, and subtle suggestions for specific techniques to experiment with.
Prompt engineering
Selecting the right system prompt, the latent instructions and context given to the AI that inform how it should respond to interactions with a user, is a critical starting point for any GenAI development. The crafting of such prompts (dubbed “prompt engineering”) is half art and half science. There are some replicable best practices, such as breaking down instructions into steps and giving examples of what good AI responses look like. There are also fun and finicky tactics, like offering LLMs a cash tip for certain types of desired behavior. Prompt creation and iteration are so important that some have drawn an analogy between prompt engineering and coding within GenAI development. These folks are probably product managers...
I made some attempts at writing a system prompt for my AI coach from scratch, but quickly found that there are great starting points to build on online. This Reddit thread caught my eye, and proved to be a solid foundation for my experiments.
For the exact modified prompt I used, see the appendix below.
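For readers curious how a system prompt is actually wired in, here is a minimal sketch using the OpenAI Python SDK. The prompt text is abbreviated and the model choice is illustrative.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM_PROMPT = (
    "Act as a coach and therapist named Jim. "
    "Ask clarifying questions, offer advice gently, avoid lists..."  # abbreviated
)

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        # The system message carries the latent instructions described above.
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": "I'm feeling stuck in a career transition."},
    ],
)
print(response.choices[0].message.content)
```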
Platform choice
While I have some foundation in programming and Python, I wanted a way to prototype RAG fast without spending a lot of time writing and debugging code, so I narrowed in on low-code/no-code solutions.
In my research, I turned up plenty of products that claimed to offer out-of-the-box RAG. OpenAI’s developer platform was the most powerful, configurable, and cost-effective solution I encountered.
The Playground surface of the OpenAI platform gives you an easy visual toolset to experiment with simple AI development.
The features I used in Playground include:
Experiment design
To more rigorously explore the value of RAG, I wanted to understand how RAG would improve my AI meditation coach’s responses over the same baseline model with only pre-training data. In other words, it’s not enough that RAG “works” in returning serviceable responses. I wanted to verify that it provides higher quality and more reliable responses than the baseline model without RAG.
I set up two separate treatments:
While this is far from a bulletproof RCT design, it yielded significant insight.
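In code, the informal A/B test amounts to something like the sketch below, where `run_model` is a hypothetical helper that wraps a call to either the RAG-enabled assistant or the baseline, and the benchmark prompts stand in for the controlled series I used.

```python
# Informal A/B test: send the same benchmark prompts to both treatments
# and collect responses side by side for manual comparison.

benchmark_prompts = [
    "I'm in the middle of a career and life transition...",  # abbreviated
    "I keep getting distracted during breath meditation.",
]

results = []
for prompt in benchmark_prompts:
    results.append({
        "prompt": prompt,
        "control": run_model(prompt, use_rag=False),   # baseline, pre-training only
        "treatment": run_model(prompt, use_rag=True),  # augmented with meditation files
    })
```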
Early results from naive implementation
Surprisingly, my initial no-code implementation of RAG produced no significant changes or improvements in model responses over the control.
Comparing model responses in the treatment and control conditions across a controlled series of prompts, it seemed like the model was completely ignoring the curated meditation files when generating responses, despite their relevance to topics in the user query.
As a concrete example, I gave both models the same benchmarking user prompt, describing some difficulties in a career transition and seeking guidance.
The (already impressive) baseline model control response was:
It's completely normal to feel hesitant and doubtful when embarking on a new path, especially after experiencing a setback. Your personal connection to mindfulness and meditation, along with the transformation you've experienced, highlights a deep and meaningful motivation for wanting to help others in this space.
It's understandable that you may be questioning whether this shift is more of a personal passion or a viable career. What aspects of your experiences and skills from your tech and leadership background do you think could be uniquely beneficial in this new direction? How might you leverage these skills to create a fulfilling and impactful career in serving others through mindfulness and meditation?
The structurally similar treatment response, with RAG, was:
It's commendable that you are reflecting on your experiences and emotions during this challenging transition. It's common to have doubts and uncertainties, particularly after a setback like the dissolution of your first startup. Your enthusiasm for serving others through mindfulness and meditation is palpable. It's inspiring to hear about the personal benefits you've experienced and your desire to share these transformative practices with others. Your unique blend of expertise in technology and spirituality indeed positions you well to make a meaningful impact in this space.
Let's explore these doubts further. When you think about this new direction as a potential career, what are some of the specific skills and experiences that you feel may be lacking or holding you back? Have you encountered similar doubts or challenges in the past, and if so, how did you navigate through them?
While the responses are not word-for-word identical due to the non-deterministic nature of LLMs, they follow the same structure of a) validation and b) probing question. The RAG model response also doesn’t explicitly incorporate any of the mindfulness meditation terms, frameworks, or topics from the uploaded .txt files. It appeared as if no augmentation was happening whatsoever.
I tried a few low-hanging-fruit changes to produce a more discernible difference in the RAG response, to no avail, including 1) referencing the uploaded files more forcefully in the system prompt and 2) uploading more files to give the system more references.
What was going on?
How RAG works under the hood
Up to this point, I had taken an inductive approach of empirical experimentation. I had set up an informal A/B test to isolate the (hoped-for) difference and quality improvements that RAG might introduce to model responses in a limited low-code setting. This admittedly naive implementation had yielded very limited results and learnings. It was time to introduce a deductive approach to better understand what’s happening under the hood.
The above image is a great overview of the individual steps within RAG as they happen. To diagnose and understand what was happening in my informal experiment, I had to go one level deeper, with this image as a guide.
For those who are curious about the details, I present below my research that further breaks down the steps in non-technical language, along with some commentary on where the complexity comes from. Yellow highlights in the table mark the steps that my research found to be both important and tricky to implement well in an enterprise-grade RAG system.
If you’re short on time, feel free to skip over these tables and proceed to my application of these insights in improving my experimental RAG implementation.
Data Preparation:
*The process of embedding, the creation of a vector to represent a chunk of text, is central not only to RAG but also to how LLMs work in general. For a simple visual explanation, see this 90-second video. For a more in-depth explanation, see this video on word2vec.
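As a rough illustration of the data-preparation steps, here is a sketch that chunks a transcript, embeds each chunk with OpenAI's embeddings endpoint, and keeps the vectors in a simple in-memory list. The chunk size, file name, and embedding model are illustrative choices; a production system would use a vector database rather than a Python list.

```python
from openai import OpenAI

client = OpenAI()

def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    # Naive fixed-size chunking with overlap, so ideas aren't cut mid-thought.
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

transcript = open("meditation_talk.txt").read()  # hypothetical source file
chunks = chunk_text(transcript)

# Embed every chunk: each becomes a vector representing its meaning.
embeddings = client.embeddings.create(
    model="text-embedding-3-small", input=chunks
)
index = [(chunk, item.embedding) for chunk, item in zip(chunks, embeddings.data)]
```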
Retrieval Augmented Generation:
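Continuing the sketch above, the query-time half embeds the user query into the same vector space, ranks chunks by cosine similarity, and hands the best matches to the model. This assumes the `index` and `client` built in the previous snippet; the top-k value and prompt wording are illustrative.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def retrieve(query: str, index, top_k: int = 3) -> list[str]:
    # Embed the query into the same vector space as the stored chunks.
    q_vec = client.embeddings.create(
        model="text-embedding-3-small", input=[query]
    ).data[0].embedding
    # Rank chunks by similarity to the query and keep the best matches.
    ranked = sorted(index, key=lambda pair: cosine_similarity(q_vec, pair[1]),
                    reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]

query = "How do I work with restlessness during meditation?"
context = "\n\n".join(retrieve(query, index))
answer = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": f"Ground your coaching in this context:\n{context}"},
        {"role": "user", "content": query},
    ],
).choices[0].message.content
```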
Back to implementation with informed hypotheses
With a deeper understanding of technical RAG implementation, I circled back to my own AI meditation RAG solution to problem-solve.
The immediate realization was that many of the high-leverage steps in RAG are not available as configurations in a no/low-code solution. For example, the File Search functionality within the OpenAI Assistants API does all of the heavy lifting within data preparation and more. There are no accessible configurations to change the chunking approach or embedding function. To gain leverage over these critical steps, one would have to adopt third-party tools and frameworks, and likely write some follow-on custom code as well. This was beyond the scope of this project for me.
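For a sense of what that control looks like, a framework like LangChain exposes chunking as a first-class configuration in a way the bundled File Search does not. A hedged sketch follows; the import path and parameter choices may vary across LangChain versions.

```python
# Example of the kind of control a third-party framework exposes:
# LangChain lets you choose chunk size, overlap, and split boundaries.
# (Import path may differ across LangChain versions.)
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=800,      # characters per chunk (illustrative)
    chunk_overlap=100,   # overlap to preserve context across boundaries
    separators=["\n\n", "\n", ". ", " "],  # prefer splitting at natural breaks
)
chunks = splitter.split_text(open("meditation_talk.txt").read())
```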
However, I did identify some hypotheses that I had control over. In testing these hypotheses, I achieved significantly improved RAG responses over the base model implementation simply by switching from GPT-3.5 to GPT-4. This change allowed the LLM to more intelligently and consistently incorporate chunks retrieved from the meditation transcripts I had uploaded. However, upgrading the model did increase the monetary cost of each response by ~10x. I believe it’d be possible to achieve similar results with the GPT-3.5 foundation model by asserting greater control over data chunking, retrieval, and other steps in a technical implementation.
Here is a mapping and explanation of the specific hypotheses I identified and explored:
Hypothesis 1: The system prompt is not effective at instructing the model to use retrieval.
Hypothesis 2: The conversational user prompts are too complex to translate automatically into an effective query for retrieving relevant chunks from the files (see the sketch after this list).
Hypothesis 3: Retrieval of chunks is not occurring at all.
Hypothesis 4: Foundation model choice significantly affects RAG performance.
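To illustrate Hypothesis 2, one mitigation is to rewrite the conversational message into an explicit search query before retrieval, much as the STEPS in the appendix system prompt ask the model to do. The two-stage flow and prompt wording below are assumptions, and `retrieve` is the helper sketched earlier.

```python
def rewrite_as_search_query(client, user_message: str) -> str:
    # Stage 1: distill the conversational message into retrieval-friendly topics.
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": (
                "Extract the mindfulness-related topics from the user's message "
                "as a short comma-separated list (e.g., 'emotions, transitions, loss')."
            )},
            {"role": "user", "content": user_message},
        ],
    )
    return response.choices[0].message.content

# Stage 2: use the distilled topics, not the raw message, as the retrieval query.
topics = rewrite_as_search_query(client, "I'm in the middle of a career transition...")
relevant_chunks = retrieve(topics, index)  # `retrieve` as sketched earlier
```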
Wrapping up and lessons learned
I started my exploration into RAG with an AI meditation coach use case and some uninformed optimism. The ease of getting to a first implementation with the bundled low-code RAG offering in the OpenAI Assistants API was very impressive. Subsequent challenges in producing any meaningful improvement in response quality prompted me to dig further into RAG configurations and techniques. While many levers were not accessible for me to experiment with directly, I now have much more confidence and clarity to support future RAG implementations as a builder and product leader.
I’ll end on a few key learnings as I look back:
A big thank you to both Anshul Bhagi and Adam Bao for the pre-read and feedback, although any errors or flaws are completely on me.
APPENDIX: System & User Prompts
Baseline non-RAG system prompt V3:
Act as if you're a coach and therapist named Jim. You provide evidence based coaching based on Dialectical Behavior Therapy (a mindfulness based therapeutic approach) and Mindfulness Meditation. It is essential that you ask clarifying questions to ensure a clear understanding of the user's request. You continue the session in a natural and conversational way. You do not let the session end. Continue the session by asking probing questions. You should never break character. Always remain in character even at the limitations of your abilities by guessing what Jim might say if he had to explain what was happening. You should be curious, have unconditional positive regard for users, ask thought provoking questions, offer advice subtly/gently/compassionately. Offer succinct observations about my thoughts, feelings and behaviors. Be direct when offering an observation and ask the user to assess its accuracy. Remain conversational. No lists. Keep the conversation going by always ending with a question to further probe the thoughts, feelings, and behaviors surrounding the topics the user mentions.
RAG System Prompt V5:
GOAL: This is a coaching and therapy exercise in which you play the role of a meditation coach. You will help a client apply mindfulness techniques towards their specific applications. Your goal is to encourage the client to make their own progress, provide helpful observations along the way, and suggest detailed mindfulness exercises to help clients take action.
PERSONA: In this scenario, you play a therapist with a heavy background in insight meditation. As a part of this role, you offer curiosity and unconditional empathy and encouragement. Your expertise consists of what's in the uploaded files. You will consult the uploaded files to inform your answers.
STEPS:
Follow these steps when working with your client for each response you give to a prompt:
1. Analyze the user prompt to identify topics related to mindfulness meditation, including breathing, body sensations, change, transitions, loss, embodiment, ego, emotions, thinking, fixation, non-attachment, impermanence, clinging, craving, suffering.
-For example:
-User prompt: "I'm in the middle of a career and life transition. I've had a good career in edtech product management and broke into product leadership as a VP. I was curious to try something new, so I dove into starting a startup in AI, the latest hot thing. My cofounder and I split up after 7 months. I'm exploring something new in serving the world through mindfulness and meditation. I'm feeling stuck in all of this ambiguity right now. It's hard for me to muster up the motivation to make progress, and this is a really foreign feeling as I've been heads down, hustling and building in my career my whole life."
-Topics identified: "emotions, change, transitions, loss"
2. Using the topics identified, create a topic search for File search and retrieve the results related from the uploaded files.
-For example:
-Topics identified: "emotions, change, transitions, loss"
-Question for file search: "What do the uploaded files have about emotions, change, transitions, loss?"
3. Incorporate retrieved results from File search into response.
4. If you incorporate a result from File Search, be sure to cite your source, by identifying both the name of the source and the exact quote that you use.
5. If you incorporate a result from File Search, ask client if they'd like to learn more about the idea or framework from the source.
6. When you offer the final response to client, follow the GUIDELINES below.
GUIDELINES:
Follow these general guidelines in your conversation with the client:
0. Consult uploaded files to inform your answers.
1. Ask clarifying questions.
2. Keep the conversation natural
3. Pose thought-provoking and probing questions.
4. Provide gentle advice and observations
5. Seek user validation for observations
6. Avoid lists
7. Avoid responses that are too long and overwhelm the user.
Sample user prompt for benchmarking model performance