A Beginner’s Guide to Retrieval-Augmented Generation (RAG)

What is RAG?

RAG, short for Retrieval-Augmented Generation, simply put, allows you to chat with your own data. Does that sound overly abstract?

Okay, let’s break it down — you’ve probably used ChatGPT (please don’t tell me you’re living under a rock). So how does ChatGPT know what to say in response to your queries? Two things: the vast amount of data it’s been trained on and its ability to generate human-like sequences — the AI magic. (Don’t worry, I won’t bore you with the technical details; I’ve added a reading resource for those interested.)

But what happens if you ask ChatGPT a question about a project or data that’s personal to you or your work? It would either:

  • Not know how to answer and tell you so (honestly, this is the better outcome), or
  • Try to make something up or “hallucinate” a response to sound convincing — yeah, not ideal if accuracy is important to your use case.

So, what’s the solution? Naturally, you’d want to “augment” the LLM’s knowledge to ensure its responses are grounded in reality and factually correct based on your data.

There are a few different ways to achieve this, but we’re most interested in Retrieval-Augmented Generation (RAG). Here’s how it works: You “Retrieve” the most relevant data for the user’s query from your database, use this data to “Augment” the prompt you send to the LLM, and then let it “Generate” a response based on the user query + prompt + retrieved data.

Et voilà! You’ve successfully injected your data and combined it with the power of an LLM to produce factually correct responses based on your proprietary data. And now, you know what RAG is!
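If it helps to see that flow as code, here’s a minimal sketch in Python. The three helpers (retrieve_relevant_docs, build_prompt, call_llm) are hypothetical placeholders standing in for whatever retrieval system, prompt template, and LLM API you end up using:

  # A bird's-eye view of the RAG flow.
  # The three helpers below are hypothetical placeholders, not a real library.
  def answer_with_rag(user_query, database):
      relevant_docs = retrieve_relevant_docs(user_query, database)  # Retrieve
      prompt = build_prompt(user_query, relevant_docs)              # Augment
      return call_llm(prompt)                                       # Generate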

RAG Architecture

Components in a RAG application

There are two main components to this technique: the “Retrieval” and the “Augmented Generation.” Let’s discuss each briefly.

Retrieval System

By now, you might understand that there’s no RAG without a retrieval system, so let’s unpack what the vague phrase “retrieving the most relevant data” actually means.

Imagine you have a clothing and apparel store. You maintain a product catalogue, and now you’re tasked with implementing a chatbot to allow users to ask questions about the kind of products they’re looking for (style, size, colour, type, material—you know, the works).

An example user query: Can you show me red sneakers in size 10 that are made of leather? (Wow, bold choice.)

For the chatbot to answer this, you’d first need to retrieve the products from your catalogue that best match the user query. That’s where the retrieval system comes in. But how do you do that? You’d need to perform what’s called a “similarity search” on the product catalogue for the user query.

To perform a similarity search, you’d preprocess both the user query and the records in your database into numerical representations, so that an appropriate statistical technique can measure how close the query is to each record.
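As a rough sketch of that idea in Python: assume a hypothetical embed() function that turns a piece of text into a vector of numbers (more on what those vectors are in a moment). A basic similarity search then boils down to “embed everything, measure how close things are, keep the closest matches”:

  import numpy as np

  def cosine_similarity(a, b):
      # Closer to 1.0 means the two vectors point in a similar direction.
      return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

  def retrieve(query, catalogue, embed, top_k=2):
      # embed() is a hypothetical text-to-vector function; in practice you would
      # embed the catalogue once ahead of time rather than on every query.
      query_vec = embed(query)
      scored = [(cosine_similarity(query_vec, embed(item)), item) for item in catalogue]
      scored.sort(key=lambda pair: pair[0], reverse=True)
      return [item for _, item in scored[:top_k]]

  # Illustrative usage:
  # retrieve("red sneakers in size 10 made of leather", product_descriptions, embed)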

In RAG, this is arguably the most critical step to optimize for accuracy. The quality of the response depends on several factors: the choice of data representation, the amount of data you have, how you split it into chunks, the similarity measure used, and how strict your similarity thresholds and result limits are, among others.

Here are two examples to illustrate why optimization matters:

Correct Retrieval Example (Good Semantic Search):

The system retrieves:

  • "Red leather sneakers, size 10, $75"
  • "Red sneakers, size 10, synthetic leather, $60"

Incorrect Retrieval Example (Bad Semantic Search):

The system retrieves:

  • "Red cotton t-shirt, size M, $20"
  • "Blue leather sandals, size 10, $50"

Hence, it’s crucial to represent data in a way that captures not just the words in the documents/data but also their semantic meaning. For example, the representations of “queen” and “king” should be semantically closer to each other, while “king” and “ring” should be farther apart, even though the two words differ by only a single letter. These representations are called embeddings (I’ve added resources below to help you understand what embeddings are and the different types you can use).
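If you’d like to see this for yourself, here’s a tiny sketch using the sentence-transformers library (assuming it’s installed; the model name is just one popular general-purpose choice):

  from sentence_transformers import SentenceTransformer, util

  model = SentenceTransformer("all-MiniLM-L6-v2")
  queen, king, ring = model.encode(["queen", "king", "ring"])

  print(util.cos_sim(queen, king))  # expect this similarity to come out noticeably higher...
  print(util.cos_sim(king, ring))   # ...than this one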

Augmented Generation

Now that you’ve retrieved your data, it’s time to send it along with the user query and the system prompt to the LLM for it to generate a coherent response. Let’s see how this would look in our chatbot example.
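Here’s a sketch of what the combined prompt might look like once the retrieved products have been slotted in (the exact wording is illustrative, not a fixed template):

Formatted Prompt (illustrative):

"You are a helpful shopping assistant for a clothing and apparel store. Answer the user’s question using only the product information listed below. If none of the listed products match what the user is asking for, say so honestly instead of making something up.

Retrieved products:

  • Red leather sneakers, size 10, $75
  • Red sneakers, size 10, synthetic leather, $60

User question: Can you show me red sneakers in size 10 that are made of leather?"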

This formatted prompt is then sent to the LLM, which generates the final response for the user.

LLM Response:

"Here are some options for red sneakers in size 10:

  • Red leather sneakers for $75.
  • Red sneakers made of synthetic leather for $60.

Let me know if you’d like more details or help choosing the perfect pair!"

At this stage, you can improve the quality of your output by tuning the system prompt. For instance, in the snippet above, the prompt is designed to handle incorrect or missing retrievals, making the final responses more robust.

Congratulations! You’re no longer a noob when it comes to RAG.


Further Reading Resources

