Start chatting with your enterprise knowledge base with an LLM

Key topics covered in this article:

  • The need for retrieval augmented generation
  • What retrieval augmented generation is
  • Advanced retrieval techniques for better response generation
  • Evaluation of responses on answer relevance, context relevance, and groundedness

Introduction

ChatGPT from OpenAI, Bard from Google, and the Llama-2 family of Large Language Models have fueled interest in GenAI. Many enterprises want to use their knowledge bases with available LLMs to generate responses that are specific to the enterprise and grounded in its own knowledge. Retrieval Augmented Generation (RAG) enables this use case. Let us first understand the limitations of an LLM and how RAG helps us overcome them.

Large Language Models are trained on publicly available information and have a knowledge cutoff date

A Large Language Model is pretrained on a corpus of text sourced from publicly available information on the internet. For example, Common Crawl, an open repository of web crawl data, is used extensively along with other publicly available resources. Once pretrained, an LLM goes through fine tuning, which involves Supervised Fine Tuning and RLHF coupled with safety training, to become usable for chat applications, summarization, content generation, and so on. As a result, an LLM trained on public data will not have information about an enterprise's internal knowledge base. An LLM also has a long training cycle and will not have information about any event after its knowledge cutoff date.

Retrieval augmented generation enables the use of enterprise data with an LLM. The whole process can be divided into three steps:

  • Transform the data to make it usable along with an LLM
  • Retrieve the relevant data chunks based on the query supplied by the user
  • Send the retrieved chunks and the query to an LLM to generate a relevant response

Transform the data to make it usable along with an LLM

In an enterprise, most of the knowledge base consists of documents such as PDFs, Word documents, Markdown files, HTML files, and so on. All such documents need to be split into smaller chunks, embedded, and stored in a vector database.

Document Transformation - Image by author

Splitting the documents

The first step is to load the data and split the documents into smaller chunks. These smaller chunks are searched to retrieve information relevant to the query. It is important to store a document's metadata along with these chunks to enable better contextual retrieval at a later stage. Chunks are overlapped so that context stays consistent from one chunk to the next.
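
As a minimal sketch in plain Python (no particular framework assumed; the chunk size, overlap, and file name are illustrative), splitting with overlap while attaching source metadata might look like this:

```python
def split_into_chunks(text, source, chunk_size=1000, overlap=200):
    """Split a document into overlapping chunks, keeping source metadata."""
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append({
            "text": text[start:end],
            "metadata": {"source": source, "start": start},
        })
        if end >= len(text):
            break
        start = end - overlap  # overlap keeps context consistent across chunks
    return chunks

# "annual_report.txt" is a hypothetical document used for illustration
chunks = split_into_chunks(open("annual_report.txt").read(), source="annual_report.txt")
```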

Embedding

Once a document is split into smaller chunks, these chunks need to be embedded. Embedding is the process of creating a numerical representation of a chunk; these numerical representations are called vectors. Cosine similarity is then used to retrieve vectors similar to the user query. There are many embedding models available that can be used to embed these chunks.
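
As a hedged illustration, assuming the sentence-transformers package and its all-MiniLM-L6-v2 model are available (any embedding model could be substituted), the chunks from the previous sketch can be embedded and compared with cosine similarity:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # one of many available embedding models

chunk_texts = [c["text"] for c in chunks]         # chunks from the splitting sketch above
chunk_vectors = model.encode(chunk_texts)         # numerical representations (vectors)
query_vector = model.encode(["What was the total revenue?"])[0]

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# rank chunks by similarity to the query
scores = [cosine_similarity(query_vector, v) for v in chunk_vectors]
best_chunk = chunk_texts[int(np.argmax(scores))]
```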

Vector DB

So far, we have broken text into smaller chunks and embedded them; we need to store these embeddings for eventual retrieval. A vector database is used to store them. A vector database uses k-nearest-neighbor and related algorithms to efficiently retrieve relevant vectors. Some of the vector database providers are Chroma, Pinecone, Weaviate, Elasticsearch, and Redis.
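
As one possible sketch using Chroma, one of the providers listed above (the collection name and chunk ids are illustrative), the embedded chunks can be stored like this:

```python
import chromadb

client = chromadb.Client()  # in-memory client; persistent clients are also available
collection = client.create_collection(name="enterprise_kb")

collection.add(
    ids=[f"chunk-{i}" for i in range(len(chunk_texts))],   # from the embedding sketch above
    documents=chunk_texts,
    embeddings=[v.tolist() for v in chunk_vectors],
    metadatas=[c["metadata"] for c in chunks],
)
```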

Retrieval

Once documents have been transformed and stored in a vector database, we need to retrieve them. A query is passed to the vector database and matching chunks are retrieved.

Retrieval process - Image by author
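
Continuing the hypothetical Chroma setup from above, a user query can be embedded and the most similar chunks retrieved (the query text and n_results value are illustrative):

```python
query = "What was the total revenue last quarter?"
query_embedding = model.encode([query])[0].tolist()   # embedding model from earlier sketch

results = collection.query(
    query_embeddings=[query_embedding],
    n_results=4,  # number of matching chunks to retrieve
)
retrieved_chunks = results["documents"][0]
```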

Similarity Search and Maximum Marginal Relevance Search

There are two approaches to retrieval: similarity search and maximum marginal relevance (MMR) search. Similarity search returns the document chunks that are most similar to the query. Maximum marginal relevance search retrieves a more diverse set, including chunks that are less similar to the query but still relevant. A parameter defines how many chunks are initially fetched; k chunks are then selected from this set by balancing similarity to the query against redundancy with the chunks already selected, as in the sketch below.
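
The selection step of maximum marginal relevance can be illustrated with a small hand-rolled NumPy sketch; this is not any particular library's implementation, and the lambda weight and k are illustrative defaults:

```python
import numpy as np

def mmr_select(query_vec, candidate_vecs, k=4, lambda_mult=0.5):
    """Select k diverse-but-relevant candidates from an initially fetched set."""
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    selected, remaining = [], list(range(len(candidate_vecs)))
    while remaining and len(selected) < k:
        best_idx, best_score = None, -np.inf
        for i in remaining:
            relevance = cos(query_vec, candidate_vecs[i])
            redundancy = max((cos(candidate_vecs[i], candidate_vecs[j]) for j in selected),
                             default=0.0)
            # trade off similarity to the query against similarity to chunks already chosen
            score = lambda_mult * relevance - (1 - lambda_mult) * redundancy
            if score > best_score:
                best_idx, best_score = i, score
        selected.append(best_idx)
        remaining.remove(best_idx)
    return selected  # indices of the chunks to pass to the LLM
```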

Use of metadata in retrieving a relevant document

At times, a query will be about a specific document, and retrieval improves if the metadata context can be provided while querying. One way of doing this is to add a filter to the query that defines which source should be searched. Another is to use an LLM to extract that context from the query and pass it along with the query, making efficient use of metadata.
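
With Chroma, for example, such a filter can be passed in the where clause; the source value below is illustrative:

```python
results = collection.query(
    query_embeddings=[query_embedding],
    n_results=4,
    where={"source": "annual_report.txt"},  # only search chunks from this document
)
```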

Response generation

Now any query posted by a user is passed, together with the retrieved context, to the LLM, which makes use of the supplied information in its response. A prompt template can be used to instruct the LLM to avoid hallucination, for example: "You are a helpful assistant, and if you do not know the answer, respond by saying that you do not know the answer."

Response generation - Image by author
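
A hedged sketch of this step using the OpenAI Python client (the model name, prompt wording, and variable names carried over from the earlier sketches are illustrative):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

context = "\n\n".join(retrieved_chunks)   # chunks from the retrieval sketch above
messages = [
    {"role": "system", "content": (
        "You are a helpful assistant. Answer only from the provided context. "
        "If you do not know the answer, say that you do not know the answer."
    )},
    {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"},
]

response = client.chat.completions.create(model="gpt-4o-mini", messages=messages)
print(response.choices[0].message.content)
```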

Improving retrieval by query expansion

Using similarity search to retrieve context from a vector database has a downside: it may miss chunks that are relevant to the answer because they are not similar to the query/question vector. We can overcome this shortcoming with the techniques below.

Query expansion by adding a hypothetical answer

One solution to the above problem is to ask the LLM to generate a hypothetical answer before sending the query for retrieval. The hypothetical answer generated by the LLM is appended to the user query, and the combined text is sent to the vector database to retrieve relevant document chunks. Suppose you are analyzing a financial report and pose a question about the total revenue or any sector-specific figure. We can generate a hypothetical answer from an LLM, append it to the query, and send it for retrieval to get better context. The retrieved documents are then sent to the LLM to produce the answer.
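
A rough sketch of this expansion, reusing the hypothetical client, embedding model, and collection from the earlier sketches (the prompt wording is illustrative):

```python
hyde_prompt = (
    "Write a short, plausible answer to the question below, even if you are not sure. "
    f"Question: {query}"
)
hypothetical = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": hyde_prompt}],
).choices[0].message.content

# append the hypothetical answer to the query before embedding and retrieval
expanded_query = f"{query}\n{hypothetical}"
expanded_embedding = model.encode([expanded_query])[0].tolist()
results = collection.query(query_embeddings=[expanded_embedding], n_results=4)
```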

Query expansion by adding similar queries

Another simple approach is to send the query to the LLM and instruct it to generate 'n' similar queries phrased as simple sentences. Once the similar queries are generated, use them to retrieve context and send that context to the LLM for response generation.
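
A similarly hedged sketch for generating a few related queries and pooling the retrieved context (again reusing the hypothetical client, model, and collection):

```python
expansion_prompt = (
    "Generate 3 alternative phrasings of this question, one per line, "
    f"in simple sentences: {query}"
)
alt_queries = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": expansion_prompt}],
).choices[0].message.content.splitlines()

# retrieve chunks for the original query plus each alternative, then de-duplicate
all_chunks = []
for q in [query] + [q for q in alt_queries if q.strip()]:
    emb = model.encode([q])[0].tolist()
    hits = collection.query(query_embeddings=[emb], n_results=3)
    all_chunks.extend(hits["documents"][0])
unique_chunks = list(dict.fromkeys(all_chunks))  # context to send to the LLM
```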


Evaluating responses generated by LLM

At its core, an LLM is a next-word-predicting machine, and there is no guarantee that a response will be relevant, contextual, or grounded. TruLens provides feedback functions that can be used to evaluate a response on all three parameters.

TruLens framework - Image by author

  • Answer Relevance

How relevant is the answer to the given query? Here an LLM evaluates the answer against the query and scores it.

  • Groundedness

Is the answer based on the retrieved context? Here an LLM evaluates the provided context and answer and scores it.

  • Context Relevance

How relevant is the context to the given query? Here an LLM evaluates the context against the query and scores it.
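
TruLens packages such checks as feedback functions; as a rough illustration of the underlying idea only (not the TruLens API), an LLM-as-judge scorer for answer relevance might look like the following, with the prompt and 0-10 scale being illustrative. Analogous prompts can score groundedness and context relevance.

```python
def score_answer_relevance(question, answer):
    """Ask an LLM to rate how relevant the answer is to the question (0-10)."""
    judge_prompt = (
        "On a scale of 0 to 10, how relevant is the ANSWER to the QUESTION? "
        "Reply with a single number.\n"
        f"QUESTION: {question}\nANSWER: {answer}"
    )
    reply = client.chat.completions.create(   # client from the response generation sketch
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": judge_prompt}],
    ).choices[0].message.content
    return float(reply.strip().split()[0])
```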

Conclusion

Retrieval augmented generation is evolving, and newer techniques to improve it keep being introduced. An advanced retrieval system can enable effective and user-friendly use of an internal knowledge base and thus initiate a move away from the keyword-based retrieval systems currently in place.
