Retrieval Augmented Generation with LLMs - How?
A good design for the Retrieval Augmented Generation (RAG) component is key when implementing Generative AI use cases. The quality of the output depends on how good the input or user query (prompt) sent to the language model is. It is also important that the RAG system is scalable and delivers high processing performance.
So, how do you build a RAG?
In its simplest form, you can just attach the data to the user query before you send it to the LLM for processing. For example, if your user query should return results from the data available in a document, you could include all the text from that document as part of the query; this text then acts as the context that the LLM uses while generating the output. This may be fine if you have just one document. But would it work when your dataset is large and diverse? Your answer may reside in hundreds of policy documents, or in other unstructured formats like video and audio. Every LLM also has a limited context window, which defines the maximum length of input (query plus context) that it will accept. Larger queries can also be expensive, since pricing for many of the available LLMs is based on token count (roughly, the number of words in the request and response).
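As a rough illustration of this naive approach, the sketch below simply concatenates the full text of a document with the user query to form the prompt. The file name and the send_to_llm() helper are hypothetical placeholders, not part of any specific API:

```python
# Naive "stuff the whole document into the prompt" approach (illustrative sketch only).

def build_stuffed_prompt(document_path: str, user_query: str) -> str:
    # Read the entire document; this is the part that breaks down once the
    # combined text exceeds the model's context window or gets expensive per token.
    with open(document_path, "r", encoding="utf-8") as f:
        document_text = f.read()

    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{document_text}\n\n"
        f"Question: {user_query}\n"
    )

prompt = build_stuffed_prompt("policy_document.txt", "What is the refund policy?")
# response = send_to_llm(prompt)  # hypothetical call to whichever LLM API you use
```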
There are many reference architectures evolving to solve this issue. One such solution involves converting the data into "Embeddings". An embedding is a vector representation of a word (or words) that enables efficient searching when hosted in a vector database. Vectors can be plotted in a multi-dimensional semantic feature space. Words or text with similar embeddings will sit closer together in this semantic space and represent a similar context. The idea is to identify the very small subset of the data that is most relevant to the user query (i.e. whose embeddings are most similar to the embedding of the user query) and use that as the real context.
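As a minimal sketch of the "similar embeddings are close together" idea, the snippet below compares toy vectors with cosine similarity. The vectors are made up purely for illustration; real embeddings have hundreds or thousands of dimensions:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Close to 1.0 means the vectors point in the same direction (similar meaning);
    # values near 0 mean the texts are unrelated.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up 3-dimensional "embeddings", for illustration only.
king   = np.array([0.8, 0.6, 0.1])
queen  = np.array([0.7, 0.7, 0.1])
banana = np.array([0.1, 0.2, 0.9])

print(cosine_similarity(king, queen))   # high: semantically close
print(cosine_similarity(king, banana))  # low: semantically distant
```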
The following are the high-level steps used to build the RAG index.
Step 1: Chunking: In this step you split the data into smaller segments or "chunks". Chunks can be created with a simple technique such as splitting into equal pieces of a preset size, or with a more sophisticated technique based on data classification.
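A minimal fixed-size chunking sketch is shown below; the chunk size and overlap values are arbitrary choices for illustration, not recommendations:

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    # Split the text into fixed-size character chunks with a small overlap,
    # so sentences cut at a boundary still appear in the next chunk.
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

sample_text = "Your policy document text goes here..."  # placeholder document text
chunks = chunk_text(sample_text)
```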
Step 2: Embedding: The next step is to generate embeddings for these chunks. There are different embedding models available, including many open-source libraries (OpenAI also has its own embedding model, which can be accessed over an API). You may want to test a few of them against your data set to see which one gives the best search results.
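A minimal sketch using the open-source sentence-transformers library, continuing from the chunking example above. This is just one of many possible choices, and the model name is a commonly used small model rather than a recommendation from the article:

```python
from sentence_transformers import SentenceTransformer  # pip install sentence-transformers

# Load a small, general-purpose embedding model.
model = SentenceTransformer("all-MiniLM-L6-v2")

# Encode every chunk into a dense vector; the result is an array of
# shape [num_chunks, embedding_dim] (384 dimensions for this model).
chunk_embeddings = model.encode(chunks, normalize_embeddings=True)
```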
Step 3: Vector Store: Once you have the embeddings for the chunks, these embeddings are stored in a vector store (vector database).
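A minimal in-memory sketch using FAISS is shown below; a managed vector database would play the same role in a production setup:

```python
import numpy as np
import faiss  # pip install faiss-cpu

# Build a flat inner-product index; with normalized embeddings,
# inner product is equivalent to cosine similarity.
embedding_dim = chunk_embeddings.shape[1]
index = faiss.IndexFlatIP(embedding_dim)
index.add(np.asarray(chunk_embeddings, dtype="float32"))
```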
The above steps should be repeated whenever data changes or new data is added, so that the context stays current as the underlying data evolves.
The following steps are used before a prompt or user query is sent to the LLM.
Step 1: User Query Embedding: Generate the embedding of the user query, using the same embedding model that was used for the chunks.
Step 2: Search: Search the vector store to retrieve the chunks whose embeddings are closest to the query embedding.
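Continuing the earlier sketch, query embedding and search might look like this; the value of k (the number of chunks to retrieve) is an illustrative tuning parameter:

```python
user_query = "What is the refund policy?"

# Embed the query with the same model used for the chunks.
query_embedding = model.encode([user_query], normalize_embeddings=True)

# Retrieve the k most similar chunks from the FAISS index.
k = 3
scores, indices = index.search(np.asarray(query_embedding, dtype="float32"), k)
retrieved_chunks = [chunks[i] for i in indices[0]]
```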
Step 3: Merge: In this step you merge the retrieved chunks and combine them with the user query (prompt). The enriched prompt then includes not only the user query but also the context that the LLM should use to generate the results.
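A minimal sketch of assembling the enriched prompt is shown below. The template wording is just an example, and the actual LLM call depends on whichever API you use, so it is left as a hypothetical placeholder:

```python
# Combine the retrieved chunks into a single context block.
context = "\n\n".join(retrieved_chunks)

enriched_prompt = (
    "Answer the question using only the context below. "
    "If the answer is not in the context, say you don't know.\n\n"
    f"Context:\n{context}\n\n"
    f"Question: {user_query}\n"
)

# response = send_to_llm(enriched_prompt)  # hypothetical call to your LLM of choice
```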