Building a RAG Solution using Azure AI Studio - Part 2
Many companies and employees use Large Language Models (LLMs) such as ChatGPT to help with their work. These tools help them summarize and analyze documents, draft emails and articles, create presentations, summarize meetings, and write code.
However, a question arises: how can you use LLMs securely and efficiently so that they answer questions based on your company's confidential data? One answer is Retrieval Augmented Generation (RAG), built with Azure AI Studio and Azure OpenAI.
This will be a series of articles on how to build a RAG Solution using Azure AI Studio.
Part 1 of the series is covered in this article: Building a RAG Solution using Azure AI Studio - Part 1.
This article focuses on document chunking and saving the text embedding of the documents in a vector database - steps 1 and 2 of how RAG works as explained in Part 1.
Definitions
Text embedding is the process of representing text as a real-valued vector that encodes its meaning. Computers work well with numbers but struggle with raw text, so we transform the text into numeric vectors. An embedding model performs this conversion. Several embedding models exist; in our case, we will use text-embedding-ada-002, an embedding model from OpenAI.
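To make the idea concrete, here is a minimal sketch of how two embeddings can be compared. The three-dimensional vectors are invented for illustration (a real text-embedding-ada-002 vector has 1536 dimensions), and cosine_similarity is a helper name introduced here, not part of any Azure or OpenAI SDK.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Hypothetical, hand-written vectors standing in for real embeddings.
contract = [0.9, 0.1, 0.2]
agreement = [0.8, 0.2, 0.3]
recipe = [0.1, 0.9, 0.1]

# Texts with related meaning should score higher than unrelated ones.
print(cosine_similarity(contract, agreement) > cosine_similarity(contract, recipe))
```

With these made-up numbers the comparison prints True: the "contract" vector points in nearly the same direction as the "agreement" vector and in a very different direction from the "recipe" vector. This is the property a vector search relies on.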
Document chunking is important because it splits very large documents into smaller pieces. The documents we are talking about here are your company documents, such as contracts, images, financials, agreements, product manuals, and sales manuals. Some of your documents will be very large, but an embedding model accepts only a limited amount of text per request and always returns a vector of a fixed size. text-embedding-ada-002 accepts a limited number of input tokens (8,191 for version 2 of the model) and returns a vector of 1536 dimensions no matter how long the input is. Thus, large documents must be chunked into smaller pieces so that each piece fits within the input limit and gets its own vector. If document chunking is not done, a document that exceeds the limit cannot be embedded as a whole, and even a document that fits is compressed into a single 1536-dimensional vector, so a lot of context would be lost.
In layman's terms, imagine trying to summarize a 1000-page document into 1 page vs summarizing 10 pages into 1 page. The latter would contain more context while the former would lose a lot of information.
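The chunking idea can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: chunk_text is a hypothetical helper, sizes are counted in words rather than model tokens, and the overlap between neighboring chunks is one common way to avoid cutting a thought in half. Azure AI Search applies its own chunking during ingestion, which may differ from this sketch.

```python
def chunk_text(text, chunk_size=200, overlap=40):
    """Split text into word-based chunks that overlap their neighbors."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this chunk already reaches the end of the text
    return chunks

# A tiny ten-word "document" so the overlap is easy to see.
document = " ".join(f"word{i}" for i in range(10))
for chunk in chunk_text(document, chunk_size=4, overlap=1):
    print(chunk)
```

Each chunk would then be sent to the embedding model separately, so every piece of the document gets its own vector instead of the whole document being squeezed into one.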
Choose what storage solution to use
Going back to our RAG solution, documents must be chunked (broken into smaller pieces) and transformed into text embeddings (vector form) by an embedding model. The text embeddings are then saved in a vector database so they can be retrieved later. The question is, which storage solution should we use?
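As a rough mental model of what a vector database does with those embeddings, consider the toy sketch below. InMemoryVectorStore is a hypothetical class written for this article, not an Azure API, and the three-number vectors are invented; in a real solution they would come from the embedding model and the storage would be a managed service.

```python
import math

class InMemoryVectorStore:
    """Toy stand-in for a vector database: keeps (text, unit vector) pairs in a list."""

    def __init__(self):
        self._items = []

    @staticmethod
    def _normalize(vector):
        length = math.sqrt(sum(x * x for x in vector))
        if length == 0:
            raise ValueError("cannot store or search with a zero vector")
        return [x / length for x in vector]

    def add(self, text, vector):
        # Store the chunk text next to its normalized embedding.
        self._items.append((text, self._normalize(vector)))

    def search(self, query_vector, top_k=1):
        # For unit vectors, the dot product equals the cosine similarity.
        query = self._normalize(query_vector)
        scored = [
            (sum(q * v for q, v in zip(query, vector)), text)
            for text, vector in self._items
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [text for _, text in scored[:top_k]]

store = InMemoryVectorStore()
# Hypothetical chunks with made-up embeddings.
store.add("Clause 4: payment terms are net 30 days.", [0.9, 0.1, 0.0])
store.add("Chapter 2: replacing the printer cartridge.", [0.1, 0.9, 0.2])

# A made-up query embedding that lies close to the contract chunk.
print(store.search([0.8, 0.2, 0.1], top_k=1))
```

A production vector database adds indexing, filtering, and scale on top of this, but the core operations are the same: save each chunk with its vector, then return the chunks whose vectors are closest to the query vector.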
The answer would depend on whether your data is structured or unstructured. For structured data, my recommendation is to use Azure Cosmos DB for MongoDB vCore while for unstructured data we will use Azure AI Search.
In this tutorial, we will use Azure AI Search, as we are assuming your data consists of unstructured files such as contracts and PDFs.
Azure AI Search
Azure AI Search is an AI-powered information retrieval platform that helps developers build rich search experiences and generative AI apps. It integrates with Azure storage, Azure OpenAI Service, and other Azure AI services to provide semantic, vector, and hybrid search capabilities.
The beauty of Azure AI Search is that it handles document chunking, text embedding (vectorization, done during document ingestion), and vector storage for you.
With just a few clicks, it performs all of these operations for you. To learn more about how this is done, see: https://learn.microsoft.com/en-us/azure/search/search-get-started-portal-import-vectors?wt.mc_id=MVP_322781
Steps to vectorize your data
The following are the steps to upload your data into Azure AI Search:
Summary
In this article, we discussed why document chunking and text embeddings are important in RAG. We then looked at how to import and vectorize your data using Azure AI Search.
Next Step
Once you have completed the tasks above, we will proceed to connect this data to Azure AI Studio.