A Beginner’s Guide to Retrieval-Augmented Generation (RAG)

What is RAG?

RAG, short for Retrieval-Augmented Generation, simply put, allows you to chat with your own data. Does that sound overly abstract?

Okay, let’s break it down — you’ve probably used ChatGPT (please don’t tell me you’re living under a rock). So how does ChatGPT know what to say in response to your queries? Two things: the vast amount of data it’s been trained on and its ability to generate human-like sequences — the AI magic. (Don’t worry, I won’t bore you with the technical details; I’ve added a reading resource for those interested.)

But what happens if you ask ChatGPT a question about a project or data that’s personal to you or your work? It would either:

  • Not know how to answer and tell you so (honestly, this is the better outcome), or
  • Try to make something up or “hallucinate” a response to sound convincing — yeah, not ideal if accuracy is important to your use case.

So, what’s the solution? Naturally, you’d want to “augment” the LLM’s knowledge to ensure its responses are grounded in reality and factually correct based on your data.

There are a few different ways to achieve this, but we’re most interested in Retrieval-Augmented Generation (RAG). Here’s how it works: You “Retrieve” the most relevant data for the user’s query from your database, use this data to “Augment” the prompt you send to the LLM, and then let it “Generate” a response based on the user query + prompt + retrieved data.

Et voilà! You’ve successfully injected your data and combined it with the power of an LLM to produce factually correct responses based on your proprietary data. And now, you know what RAG is!
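If it helps to see that flow as code, here’s a minimal sketch in Python. The three helpers (retrieve_relevant_docs, build_prompt, call_llm) are hypothetical placeholders standing in for whatever retrieval system, prompt template, and LLM API you end up using:

  # A bird's-eye view of the RAG flow.
  # The three helpers below are hypothetical placeholders, not a real library.
  def answer_with_rag(user_query, database):
      relevant_docs = retrieve_relevant_docs(user_query, database)  # Retrieve
      prompt = build_prompt(user_query, relevant_docs)              # Augment
      return call_llm(prompt)                                       # Generate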

RAG Architecture

Components in a RAG application

There are two main components to this technique: the “Retrieval” and the “Augmented Generation.” Let’s discuss each briefly.

Retrieval System

By now, you might understand that there’s no RAG without a retrieval system, so let’s unpack what the vague phrase “retrieving the most relevant data” actually means.

Imagine you have a clothing and apparel store. You maintain a product catalogue, and now you’re tasked with implementing a chatbot to allow users to ask questions about the kind of products they’re looking for (style, size, colour, type, material—you know, the works).

An example user query: Can you show me red sneakers in size 10 that are made of leather? (Wow, bold choice.)

For the chatbot to answer this, you’d first need to retrieve the products from your catalogue that best match the user query. That’s where the retrieval system comes in. But how do you do that? You’d need to perform what’s called a “similarity search” on the product catalogue for the user query.

To perform a similarity search, you’d preprocess both the user query and the records in your database into numerical representations, so that an appropriate statistical technique can measure how close the query is to each record.
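As a rough sketch of that idea in Python: assume a hypothetical embed() function that turns a piece of text into a vector of numbers (more on what those vectors are in a moment). A basic similarity search then boils down to “embed everything, measure how close things are, keep the closest matches”:

  import numpy as np

  def cosine_similarity(a, b):
      # Closer to 1.0 means the two vectors point in a similar direction.
      return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

  def retrieve(query, catalogue, embed, top_k=2):
      # embed() is a hypothetical text-to-vector function; in practice you would
      # embed the catalogue once ahead of time rather than on every query.
      query_vec = embed(query)
      scored = [(cosine_similarity(query_vec, embed(item)), item) for item in catalogue]
      scored.sort(key=lambda pair: pair[0], reverse=True)
      return [item for _, item in scored[:top_k]]

  # Illustrative usage:
  # retrieve("red sneakers in size 10 made of leather", product_descriptions, embed)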

In RAG, this is arguably the most critical step to optimize for accuracy. The quality of the response depends on several factors: the choice of data representation, the amount of data you have, how you split it into chunks, the similarity measure used, and how strict your similarity thresholds and result limits are, among others.

Here are two examples to illustrate why optimization matters:

Correct Retrieval Example (Good Semantic Search):

The system retrieves:

  • "Red leather sneakers, size 10, $75"
  • "Red sneakers, size 10, synthetic leather, $60"

Incorrect Retrieval Example (Bad Semantic Search):

The system retrieves:

  • "Red cotton t-shirt, size M, $20"
  • "Blue leather sandals, size 10, $50"

Hence, it’s crucial to represent data in a way that captures not just the words in the documents/data but also their semantic meaning. For example, the representations of “queen” and “king” should be semantically closer to each other, while “king” and “ring” should be farther apart, even though the two words differ by only a single letter. These representations are called embeddings (I’ve added resources below to help you understand what embeddings are and the different types you can use).
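If you’d like to see this for yourself, here’s a tiny sketch using the sentence-transformers library (assuming it’s installed; the model name is just one popular general-purpose choice):

  from sentence_transformers import SentenceTransformer, util

  model = SentenceTransformer("all-MiniLM-L6-v2")
  queen, king, ring = model.encode(["queen", "king", "ring"])

  print(util.cos_sim(queen, king))  # expect this similarity to come out noticeably higher...
  print(util.cos_sim(king, ring))   # ...than this one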

Augmented Generation

Now that you’ve retrieved your data, it’s time to send it along with the user query and the system prompt to the LLM for it to generate a coherent response. Let’s see how this would look in our chatbot example.
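Here’s a sketch of what the combined prompt might look like once the retrieved products have been slotted in (the exact wording is illustrative, not a fixed template):

Formatted Prompt (illustrative):

"You are a helpful shopping assistant for a clothing and apparel store. Answer the user’s question using only the product information listed below. If none of the listed products match what the user is asking for, say so honestly instead of making something up.

Retrieved products:

  • Red leather sneakers, size 10, $75
  • Red sneakers, size 10, synthetic leather, $60

User question: Can you show me red sneakers in size 10 that are made of leather?"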

This formatted prompt is then sent to the LLM, which generates the final response for the user.

LLM Response:

"Here are some options for red sneakers in size 10:

  • Red leather sneakers for $75.
  • Red sneakers made of synthetic leather for $60.

Let me know if you’d like more details or help choosing the perfect pair!"

At this stage, you can improve the quality of your output by tuning the system prompt. For instance, in the snippet above, the prompt is designed to handle incorrect or missing retrievals, making the final responses more robust.

Congratulations! You’re no longer a noob when it comes to RAG.


Further Reading Resources

