My First RAG Use-Case - Key Insights & Lessons Learned


Introduction

Upon the introduction of ChatGPT, businesses were eager to discern its implications and explore ways to use it for a competitive edge. ChatGPT, despite its prowess, lacked innate knowledge of individual company data. This prompted organizations to ask the critical question: "How can we tailor a ChatGPT-like experience to our unique data sets?" The initial solution, fine-tuning, involved training the foundational GPT model on specific data. That is a task requiring substantial GPU resources, high-quality data, and a dedicated team of data scientists and machine learning engineers, a considerable investment of time and capital without guaranteed success, which presented a high barrier to entry for most organizations.

A much more accessible alternative emerged called "Retrieval Augmented Generation" or RAG. RAG operates by retrieving the most pertinent data in response to a query and integrating it with ChatGPT's context, enabling the model to generate precise, conversational responses based on proprietary data. This approach significantly reduces the barriers associated with fine-tuning.
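To make the pattern concrete, here is a minimal sketch of a RAG flow in Python. It is illustrative only: the client setup, the answer_with_rag helper, and the endpoint/deployment names are assumptions, not the code of any specific product.

from openai import AzureOpenAI

# Assumed configuration values; replace with your own endpoint, key, and deployment names.
client = AzureOpenAI(
    azure_endpoint="https://<your-openai-resource>.openai.azure.com",
    api_key="<api-key>",
    api_version="2024-02-01",
)

def answer_with_rag(question: str, search_client) -> str:
    # 1. Retrieve: pull the chunks most relevant to the question from the data store.
    results = search_client.search(search_text=question, top=5)
    sources = "\n\n".join(doc["content"] for doc in results)

    # 2. Augment + generate: hand the retrieved chunks to the model as grounding context.
    response = client.chat.completions.create(
        model="<gpt-4-deployment-name>",
        messages=[
            {"role": "system", "content": "Answer using ONLY the provided sources. Cite them."},
            {"role": "user", "content": f"Sources:\n{sources}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content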


While RAG is a potent tool, it comes with distinct challenges during implementation. In this post, I delve into the primary obstacles encountered during my first substantial RAG project for a major client and discuss the strategies we adopted to address them.



The Project Objective

The project aimed to develop a generative AI chatbot tailored for executives, finance teams, and the investor relations department. These groups often navigate various data sources to address two main types of inquiries:

Strategic, Macro-level Questions:

- Company performance insights and key metrics, along with market perspectives
- Strengths, weaknesses, and potential scenarios for the company's future
- Financial and market risks, and performance analysis by segment, industry, or region
- Market sentiment and competitor benchmarking
- Analysis of market trends and their potential impact on the business

Detailed Financial Inquiries:

- Revenue tracking over the past six quarters, segmented by industry and location
- Full fiscal year margins, earnings per share (EPS), and year-over-year growth metrics
- Analyst price targets and consensus forecast estimates
- Breakdown of operational costs and expenses over recent fiscal periods
- Capital structure analysis, including debt levels and financing activities

The chatbot is designed to expedite critical scenarios, such as preparing for investor meetings, streamlining the pre-earnings call process, and assessing competitor health and market strategy—all with speed and efficiency. This would replace the laborious, manual process of data retrieval and synthesis with an instantaneous solution, transforming hours or days of work into seconds.


As an example, let's say an executive needs to prepare for a meeting with an investor to talk about financing. The investor relations team would need to do research, put together a briefing, and work with the executive to make sure they are ready. All in all, this would take multiple people multiple days, if not more. With the chatbot, the executive can self-serve the information they need in a matter of minutes.

The data sources for the project included SEC filings, earnings call transcripts, analyst reports, and internal financial documents.


The Tech

Teams at Microsoft frequently release "Accelerators," which are pre-built, fully coded solutions designed for rapid and straightforward customization by clients. For this initiative, we utilized the Azure Search OpenAI Demo Accelerator - a fully functional web application that can be tailored with your proprietary data sets. The solution integrates the following Azure services:


Azure AI Search - Formerly known as "Azure Cognitive Search", this serves as our intelligent search engine and data store.

Azure OpenAI - This is the platform where we deploy our advanced language models, including GPT-4-32k and GPT-4-Turbo, alongside text-embedding-ada-002 for creating vector embeddings (explained below).

Azure App Service - This underpins our application, coordinating the backend processes and presenting the user interface of the web application. Python was the language of choice.


Before diving into the challenges discussed in this post, it's crucial to grasp the fundamentals of information retrieval from our data store. The traditional search approach hinged on keywords, searching for data containing specific terms from the query. However, a newer methodology known as "vector search" has gained popularity. This technique interprets the meaning behind words by transforming them into numerical patterns. This allows the computer to identify how close the content is in meaning to your query, offering results that are semantically related even without exact word matches.

Vector search is akin to a conversation where the search tool understands the essence of your request, while traditional search is comparable to referencing an index - it's dependent on exact word matches. Each search method has its advantages and drawbacks.

The default search algorithm utilized by this accelerator is "hybrid search," which concurrently conducts both vector and keyword searches, amalgamating the most effective results from each approach.
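Roughly, a hybrid query with the Azure AI Search Python SDK looks like the sketch below. This is a sketch, not the accelerator's code; the index field name "contentVector" and the embedding you pass in are assumptions.

from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient
from azure.search.documents.models import VectorizedQuery

search_client = SearchClient(
    endpoint="https://<your-search-service>.search.windows.net",
    index_name="<index-name>",
    credential=AzureKeyCredential("<query-key>"),
)

def hybrid_search(query: str, query_vector: list[float], top: int = 5):
    # Passing both search_text (keyword/BM25) and a vector query makes this a hybrid search;
    # the service fuses the two ranked lists into a single result set.
    return search_client.search(
        search_text=query,
        vector_queries=[
            VectorizedQuery(vector=query_vector, k_nearest_neighbors=50, fields="contentVector")
        ],
        top=top,
    )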


It is also important to understand the concept of “Document Chunking”. Why do we need to break our source documents into “chunks”?

1. LLMs have a finite context window; there is a limit on how many input tokens we can pass in, and most documents are longer than the max context window.

2. Chunking helps us make sure we only provide the most relevant content to the LLM, improving the overall response quality and relevance.


When documents are processed by Azure AI Search in our system, they are segmented into chunks represented as vectors. During a vector search, the engine attempts to align the user's query with these chunks, pinpointing those with the highest semantic resemblance. Optimal chunk size can vary, but typically, a range of 500 to 1000 words provides a balance for performance. In our case, documents were divided into chunks of 1000 words each.
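For illustration, a naive fixed-size chunker along these lines might look like the sketch below. It is a simplified sketch of the idea, not the accelerator's actual splitter.

def chunk_by_words(text: str, chunk_size: int = 1000, overlap: int = 0) -> list[str]:
    # Naive fixed-size chunking: split on whitespace and group into windows of
    # `chunk_size` words. Note that the boundaries ignore pages, headers, and
    # tables entirely, a limitation that matters later in this post.
    words = text.split()
    step = max(chunk_size - overlap, 1)
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
    return chunks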


Here is a visual representation of what a typical RAG architecture might look like:

Here is a quick recap of key concepts & terms before we proceed:

  1. LLM - large language model (GPT4 in our case)
  2. Context Window - The "short-term memory" of an LLM. An LLM accepts input text and then generates output. The context window is how many tokens (roughly 3/4 of a word per token) can fit in the input.
  3. Vector Embeddings/Vector Search - Vector embeddings are numerical representations of data, such as words or phrases, that capture their meaning and context. Vector search leverages these embeddings to find the most semantically similar items in a dataset by measuring the distance between their vectors. As an example, the vector representations of the words "bird" and "owl" are going to be very close together in the "vector space" because they are semantically similar (they both represent small flying animals), even though they are distinctly different words. In contrast, "owl" and "spoon" are going to be far apart in the vector space; they are not closely related in meaning.
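A quick way to see this in practice is to embed a few words and compare their cosine similarity. The snippet below is a hedged example; it assumes the client object from the earlier sketch and an embedding deployment named text-embedding-ada-002.

import numpy as np

def embed(text: str) -> np.ndarray:
    # text-embedding-ada-002 returns a 1536-dimensional vector for each input.
    resp = client.embeddings.create(model="text-embedding-ada-002", input=text)
    return np.array(resp.data[0].embedding)

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

bird, owl, spoon = embed("bird"), embed("owl"), embed("spoon")
print(cosine_similarity(bird, owl))    # expected to be noticeably higher...
print(cosine_similarity(bird, spoon))  # ...than this, because "bird" and "owl" are semantically close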



The Challenges

The beauty of the accelerator was that we were able to deploy it for the customer in a matter of days and deliver a “wow” moment for them right out of the gate. They were amazed at the ability to chat conversationally with their own data. However, they quickly started to notice some problems.


Challenge #1 - The right content, but the wrong context

A common question would often be "What is X company's latest commentary on Y company?". Let's look at the output of the chatbot when asked "What is Contoso Equity Research's latest commentary on us?" (company name replaced with XXX for privacy reasons):

At first glance this seems like a great reply. It is detailed and provides numerous citations. However, note citations #5 and #6. In the last line of the response, a comment made by Northwind Capital is being incorrectly attributed to Contoso. In the first line of the response, the 12% revenue growth number relates to YYY corporation, not XXX as we requested. What happened here? Let's look at citation #5:


This is a page of Northwind’s report on XXX. Why is our search engine returning this when we asked about Contoso, and why is GPT4 thinking that Contoso said this?

Recall our process for breaking the documents up into vectorized chunks. We are splitting the document not by page, but by an arbitrary number of tokens (words). This page is actually being split into two different chunks:


These chunks are stored separately in our search engine. We can see that the second chunk no longer contains the header "Northwind Capital". This header is critical to understanding the context. Without it, it's not clear who the author of the content is, and that is required to answer the question correctly (any content we return MUST have been said by Contoso). Our vector search is failing in this instance: it is finding a content chunk that has the right content (analyst commentary on XXX) but the wrong context (Northwind Capital is the author, not Contoso Equity Research). When we pass this content to GPT4, it does not have the proper context and incorrectly assumes that Contoso is the speaker.

So what happened with citation #6? Why are we including content from a report about YYY company in our response? Let's look at a page from the report that was cited:

Recall that we are using hybrid search, where both vector search results and full-text, keyword results are returned. This is an example of our full-text keyword search failing. Despite the fact that this page is talking about YYY company, the terms "Contoso" and "XXX" appear in the text (see the footer), so our keyword search returns it as a top-ranked result. The footer is essentially "junk", but it is being picked up by our keyword search. GPT4 sees this content chunk and incorrectly assumes it is talking about XXX corp.

We needed to figure out a way to make the system prioritize context over content. For these types of questions, the context (who is doing the speaking, who is the subject) is more important than the content (what is being said).


Challenge #2 – Finding the Correct Financials

Another common question was looking up specific financial metrics. These metrics often sit within tables inside of reports, filings, and internal PDFs. The users quickly noticed that the system was often getting the answer wrong when asked basic questions like “What is our total FY2023 revenue?”. Let’s look at an example table to understand why:

For this company, total revenue = service revenue + product revenue. For some reason the system was returning the service revenue total instead of the overall total revenue number at the bottom of the table. When we investigated, we found that only the top half of the table was being returned in the search results and passed to GPT4.

This was once again due to the fact that we were splitting up the document by an arbitrary number of tokens (words). The chunk that contained the bottom half of the table was not being returned in the search results as it didn’t have any column headers (it was just rows).

It was becoming clear that we needed to rethink the way we were doing our "chunking" of documents. The "default" approach that is commonly suggested as a starting point for RAG applications was not working for this use-case.


Challenge #3 – Vector Search vs Key Word Search

As we delved deeper into our search process, it became clear that both vector search and keyword search were not performing well. Keyword search was returning mostly pages that contained legal disclaimers, disclosures, and financial analyst certifications. For example, if we searched for Contoso Equity Research's commentary on a company, the page from the report with the most instances of the term "Contoso" would often be the disclaimers & certifications page. Here is an example:

Because it had the most relevant key word frequency (note the many instances of “Contoso”), it would be ranked highest in the search results even though it was completely devoid of any meaningful content.

For vector search, recall that we are attempting to match the user query against the content in terms of semantic similarity. Aside from the missing-context issue described above, in our scenario the question was often quite different from the answer. If we are searching for FY2023 revenue, the vector representation of a table (what we are searching for) is going to be quite different from the user input (the search query). If we are searching for Contoso's commentary on our company, the actual commentary we are looking for might be quite far from the search query in the vector space.


Challenge #4 – Understanding Time

The system did not have any grasp of what users meant when they asked it to "focus on the latest data". When analyzing a company, you typically care about the current state or the most recently published data. You do not typically care about numbers from three years ago. When asked to focus on a particular fiscal year or fiscal quarter it could comply, but users did not have the fiscal calendars of all their competitors memorized, and in many cases looking them up was tedious. We needed a way to prioritize search results based on date.


Challenge #5 – Questions about multiple companies

Users would often ask a question about multiple companies. "Tell me what the following analysts are saying about XXX: Contoso, Northwind, JP Morgan, Citigroup, Susquehanna, TD Cowen, Stifel, Wolfe Research, RBC Capital Markets, Bank of America". This type of question posed a challenge for the default architecture. By default, we return the top 5 most relevant content chunks for a given question and feed those to GPT4. When we ask about 10 analysts but are only returning 5 pieces of content, we are never going to get a full answer. If we bump the number of content chunks returned up to 15, 20, 25, and so on, we will likely hit the context window limit of GPT4. Even if not, there is no guarantee we will find content for each company in our search results.


Challenge #6 – Filtering

Users wanted the ability to "filter" on a specific category of data. In the finance world, certain sources are more reliable than others. A company's 10k filing, for example, would be considered a more reliable source for a company's financials than a report published by an analyst (the former is audited, the latter is not). Users also wanted to reference a particular document, and were frustrated that asking the chatbot to focus on a particular document did nothing. This was because the name of the file was not a "searchable" element in Azure AI Search; it was only searching on the content itself, which did not contain the file name.


Challenge #7 – Lack of Detail

Users had to continuously ask the chatbot to "be more detailed" or "please provide more detail". The default behavior of GPT4 was to answer with a concise paragraph. For this user base, that was not what they were looking for; they wanted detail up front.



The Solutions

Reworking our Chunking Strategy

Given that more than one of our problems stemmed from how we were breaking up the documents, we felt the logical first step would be to redesign our document chunking process. Instead of breaking documents up by an arbitrary number of words, we decided to break them up in a way that preserved important context. We opted to chunk by page, as important context was often captured in the header or footer of a given page of a document.
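A minimal sketch of page-based chunking, assuming pypdf for text extraction (the real pipeline may use a different extraction library, and header/footer quality will vary by document):

from pypdf import PdfReader

def chunk_by_page(pdf_path: str) -> list[dict]:
    # One chunk per page, so headers and footers (often the only clue about
    # author and subject) stay attached to the content they describe.
    reader = PdfReader(pdf_path)
    chunks = []
    for page_number, page in enumerate(reader.pages, start=1):
        text = page.extract_text() or ""
        chunks.append({"page": page_number, "content": text})
    return chunks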

This resulted in the following improvements:

1. Tables were no longer being broken up, and the ability to correctly retrieve specific financials improved drastically. There were some documents with multi-page tables, but they were infrequent enough that we decided not to address them given the time constraints and other priorities. Overall, the ability to find the right table and extract the right information from it was greatly improved.

2. Instances of incorrectly attributing a quote from one company to another, or incorrectly assuming a chunk of content was about one company when it was about another, went down. As we expected, the header and footer information on each page provided valuable context that improved the search results.


However, there were still some problems that persisted or were even worsened.

1. It exacerbated the problem of legal disclaimers & disclosures showing up at the top of the search results. Because these pages had the most keywords on them, they were now almost always showing up at the top of the search results, leading to poor answers.

2. Vector search was now essentially useless. The vector representation of a full page of content is too different from the vector representation of the user's query to ever provide a meaningful match. We were now effectively running a 100% keyword, full-text search.

3. The footer at the bottom of every page of every document that said "This report is intended for [email protected]. Unauthorized distribution is prohibited" continued to throw off the search results. Given that we were now putting even more emphasis on keyword search, pages about XXX company would show up even if we were searching for YYY company.


Pre-processing & Chunk Validation

The next two changes we implemented were a pre-processing step & a step we called “content chunk validation”.

Preprocessing – We added logic into our indexing process that removed certain footers from all pages before the indexing itself. The idea was that, since certain footers were polluting our search results, we could simply remove them. However, we realized that the footers were not constant; the next quarter's reports all had different footers. Ultimately, we decided this approach was not going to be successful long-term, so we abandoned it.

Content Chunk Validation – The idea behind content chunk validation was that we would use a separate call to GPT4 to “validate” that our search results were legitimate prior to passing them to the “main” GPT4 instance that synthesized the content into a final answer. Here was the prompt we used:

 CHUNK_VALIDATOR_PROMPT = """
    You are the chunk validator agent for a financial chatbot.
    You are given a question and N content "chunks", and it is your job to validate whether the chunks are relevant to the question. 


    # Instructions:
    -Follow these steps exactly. 
    -Step 1: Identify the "content chunks" you have been provided. These are extracted parts from one or multiple documents. Any time you see "Content: " , that delineates the start of a chunk. A source [] is the end of a chunk. 
    -Step 2: CHUNK_VALIDATION: For each content chunk, determine whether it is valid or not. Output your decision and reasoning. 
    To be considered valid, it must be both relevant to the user's question and attributable (can we verify who said it?). Some questions might not need to be checked for attribution, but all need to be checked for relevance.

    Your output formatting must be as follows:

    1: <reasoning>  - CHUNK_VALID
    2: <reasoning>   - CHUNK_INVALID
    N: <reasoning>   - CHUNK_VALID

    You MUST be 100% certain of your decision. If you are not sure, mark it as invalid. Take your time and think it through. For a chunk to be attributable, you MUST be able to verify who said it. 
    For example, if the user asks what <company A> is saying about <company B>, and you don't see <company A> mentioned anywhere in the content, you can't verify who said it, so it is not attributable.
    If attribution is important to the question, you MUST explain how you can tell who said it. Do not check for attribution if it is not important to the question.

    """        

Our first version of the prompt simply had it go through and output “Valid” or “Invalid”, and it surprisingly did not work well at all. We had to ask it to explain its reasoning prior to making its final decision in order to get reasonable accuracy. This was an interesting firsthand example of how important “chain of thought” prompting was in getting the right answer.
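For reference, the wiring around that prompt might look something like the sketch below. The parsing logic is an assumption, and it reuses the client object from the earlier sketch together with the CHUNK_VALIDATOR_PROMPT shown above.

import re

def validate_chunks(question: str, chunks: list[str]) -> list[str]:
    # One extra GPT4 call that reasons about each chunk before the main answer is generated.
    numbered = "\n".join(f"Content: {chunk} [source-{i + 1}]" for i, chunk in enumerate(chunks))
    decisions = client.chat.completions.create(
        model="<gpt-4-deployment-name>",
        temperature=0,
        messages=[
            {"role": "system", "content": CHUNK_VALIDATOR_PROMPT},
            {"role": "user", "content": f"Question: {question}\n\n{numbered}"},
        ],
    ).choices[0].message.content

    # Keep only chunks the validator marked CHUNK_VALID (the reasoning precedes the verdict).
    valid = []
    for line in decisions.splitlines():
        match = re.match(r"\s*(\d+):", line)
        if match and "CHUNK_INVALID" not in line and "CHUNK_VALID" in line:
            index = int(match.group(1)) - 1
            if 0 <= index < len(chunks):
                valid.append(chunks[index])
    return valid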

This approach helped filter out bad content, but it had its drawbacks. It added 15-20 seconds of latency between when the original question was asked and when the answer was returned to the user. It was also, in essence, a "band-aid" solution to mask poor search results. We were using extra compute and tokens that wouldn't be necessary if we could just get the right search results in the first place. Ultimately, we decided to abandon content chunk validation as a solution.



The Breakthrough – Re-thinking Indexing & Search

At this point in the project, we had only seen very marginal improvement in the quality and accuracy of responses. We needed to address the foundational problem: “How do we get the right search results?”.

The solution we came up with was the following:

1. Use GPT4 to generate a summary of each page.

2. Add the summary as a new, searchable field in the Azure AI Search index.

3. Vectorize the summary instead of the actual content.

4. Match the user's query against the summary field instead of the content field. (A sketch of this pipeline follows below.)
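Put together, the indexing step might look roughly like this. It is a sketch under assumptions: the field names, the key sanitization, and SUMMARY_PROMPT (standing in for the summarization prompt shown just below) are all illustrative, not our exact implementation.

import re

def index_page(page: dict, filename: str):
    # 1. Summarize the page with GPT4, giving it the filename for extra context.
    summary = client.chat.completions.create(
        model="<gpt-4-deployment-name>",
        temperature=0,
        messages=[
            {"role": "system", "content": SUMMARY_PROMPT},
            {"role": "user", "content": f"Filename: {filename}\n\n{page['content']}"},
        ],
    ).choices[0].message.content

    # 2. Embed the summary, not the raw page content.
    vector = client.embeddings.create(
        model="text-embedding-ada-002", input=summary
    ).data[0].embedding

    # 3. Upload the chunk with both fields; document keys must be URL-safe.
    key = re.sub(r"[^A-Za-z0-9_\-=]", "_", f"{filename}-page-{page['page']}")
    search_client.upload_documents([{
        "id": key,
        "filename": filename,
        "content": page["content"],   # kept for content-focused searches
        "summary": summary,           # new searchable summary field
        "summaryVector": vector,      # vector built from the summary
    }])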


Here is the prompt we used for our summarization (companies swapped out for privacy):

self.prompt = """You are a financial summary AI. It is your job to read a page of a financial report, and write a very brief summary. Pages typically fall into one of four categories:

        1. Commentary by one company on another company
        2. Disclaimers, disclosures, and analyst certifications
        3. A page from a company earnings transcript or filing
        4. A page from an internal financial document


        <Guidelines>

        1. For commentary, structure your response like this: "<Company> commentary on <company> for <fiscal year/quarter>. Key point 1, key point 2, key point 3." 
        2. For disclaimers, structure your response like this: "Legal disclaimers, disclosures, and analyst certifications for <company>"
        3. For pages from earnings transcripts or filings, structure your response like this: "<Company> <title/subject> <fiscal year/quarter>. Topic 1, topic 2, topic 3"
        4. For pages from an internal financial document, structure your response like this: "<Company> <title/subject> <fiscal year/quarter>. Key metrics 1, key metrics 2, key metrics 3." ". 
        
        Keep your responses extremely brief and concise. Do not name specific analysts who wrote the report, only the entities. Do not provide any specific details, only what is covered in the page. 
        Only 1 short sentence summary that includes important entities and the fiscal period. The pages are tyically either financial commentary, financial metrics, or legal disclaimers. 
        You should always mention the fiscal period if its mentioned, but you don't need to specify if you can't find one. 
        
        Try to match your output with what someone would search for in a search engine if they were looking for this page. Instead of "This page is Contoso's commentary on XXX's Q3 FY2024 results. 
        Includes price target and downside risk.", you should say "Contoso commentary XXX Q3 FY2024. price target and downside risk". 
        

    """        

Note the line towards the bottom that says "Try to match your output with what someone would search for in a search engine if they were looking for this page". This was the essence of our approach. When a user enters a search query, the vector representation of that query is going to be very close to the vector representation of the page summary, and it's going to return a very high search score.

Our initial testing of the approach was extremely promising, but there were still a few issues:

1. Searching on the summary field was performing extremely well for questions where context was important (analyst commentary, overall performance, macro trends). For specific financial metric retrieval, searching on the content field was still superior, so we were still searching on both fields.

2. Because of #1, disclaimers & disclosure pages were still showing up in the search results.

3. Some files, such as 10k filings or earnings transcripts, didn't have all the necessary context on a given page. Take Microsoft's earnings call transcript as an example: on a given page, the word "Microsoft" might not be mentioned at all. In that scenario, the page is missing context (who the subject is).


Our next set of enhancements aimed to address these points and further refine the approach. We implemented the following:

1. A decision "agent" that was responsible for determining what was more important for a given question: content or context. This "agent" was just a separate call to GPT4 with the following prompt:

Given a user query, you must determine what type of search to run. There are 3 types of searches:

    1. "Context" search - This is a search where context matters more than content. An example would be a question asking what a specific analyst is saying about a specific company for a specific time period. 
    Here, the context (what company is doing the speaking, which company is the subject) is more important than the content (what they are saying). Broad questions ("How is MSFT performing") are typically context searches.   
    2. "Content" search - This is a search where content matters more than context. An example would be a question asking for a specific financial metric or specific financials. (e.g. "What is MSFT's FY2024 Q3 Security Revenue?")
   

    Provide your output in this format: 

    Query_Type: <Context/Content/NA>
    Search_Query: <search query>
        

If the agent determined that context was more important, we would run a “context-focused query” (a search on the “summary” field of our index). If content was more important, we would run a “content-focused query” (a search on the “content” field of our index).
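In code, the routing might look like the sketch below. The field names, and the search_client/VectorizedQuery setup carried over from the earlier sketches, are assumptions.

def run_search(user_query: str, query_vector: list[float], query_type: str, top: int = 5):
    if query_type == "Context":
        # Context-focused: hybrid search against the GPT4-generated summary field and its vector.
        return search_client.search(
            search_text=user_query,
            search_fields=["summary"],
            vector_queries=[
                VectorizedQuery(vector=query_vector, k_nearest_neighbors=50, fields="summaryVector")
            ],
            top=top,
        )
    # Content-focused: full-text search against the raw page content.
    return search_client.search(search_text=user_query, search_fields=["content"], top=top)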

2. We added a filter on all searches that excluded results where the word "disclaimer" showed up in the "summary" field. This got rid of all the junk pages from our search results, greatly boosting answer quality & richness (see the sketch below this list).

3. We passed the filename into the summarization function for additional context. If we take Microsoft's earnings call transcript as an example, even if the word "Microsoft" doesn't appear on a given page, if we give GPT4 the filename (MSFT Earnings Q3 FY24.PDF), it can now understand the subject. We updated the prompt and told GPT4 to always check the filename, as it included important context.

4. We realized that more context is always better, so we decided to pass the filename, the summary, and the content of each "chunk" to the final GPT4 step that synthesizes the final response. This way, even if a bad chunk slips through somehow, GPT4 can determine from the filename and summary that a given chunk isn't relevant to the user's query, and it will leave that content out of its response.
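Two of those pieces are easy to show in a small sketch: the disclaimer filter uses standard Azure AI Search OData syntax, and the final grounding string carries filename, summary, and content for each chunk. Field names remain assumptions.

results = search_client.search(
    search_text=user_query,
    search_fields=["summary"],
    filter="not search.ismatch('disclaimer', 'summary')",  # drop legal boilerplate pages at query time
    top=5,
)

# Give the final GPT4 call the filename and summary alongside the raw content,
# so it can discard any chunk that slipped through but clearly isn't relevant.
grounding = "\n\n".join(
    f"[{doc['filename']}]\nSummary: {doc['summary']}\nContent: {doc['content']}"
    for doc in results
)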


These changes delivered another "wow" moment for the client. Almost overnight (from their perspective), the responses became more detailed, accurate, insightful, and higher quality. The users were extremely happy with the drastic improvement they were seeing. While there is of course still room for improvement, the fundamental challenge of "how do we get the right search results?" had been solved, and we officially removed the "content chunk validation" step that had been implemented earlier on.


Understanding Time

While the main challenge was resolved, there were still other challenges we aimed to address. In an effort to get the system to understand what users meant by "focus on the latest data", we initially tried simply providing the current date in the prompt that constructed the search query from the user input. This didn't have much effect, since the model did not know the mapping between the current date and the corresponding fiscal quarter for the company in question.

We tried providing the fiscal calendars of the company and its competitors and told it to do the calculation between current date -> fiscal quarter, then use the fiscal quarter in the search. GPT4 proved to be very bad at this, so we scrapped this idea entirely.

The solution we stuck with was to implement a custom scoring profile on our search index based on “creation date” of the source document. We extracted the creation date of the PDF during the indexing process and added that as a new field on the index. Our custom scoring profile “boosts” search results based on “freshness” of the new date field. Content with a date closer to the present ranks higher than content with a date further back in time.

We started with a boost of “2” which seemed to work quite well. We may experiment with adjusting this number up or down in the future. An added benefit of having a date field on the index is that we can now implement logic to filter on date ranges if the need arises in the future.
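For reference, a freshness boost of this kind can be declared with the Azure AI Search Python SDK roughly as follows. This is a sketch; it assumes a sortable DateTimeOffset field named "creationDate" and a one-year boosting window, neither of which is taken from our actual index definition.

from datetime import timedelta

from azure.search.documents.indexes.models import (
    FreshnessScoringFunction,
    FreshnessScoringParameters,
    ScoringProfile,
)

freshness_profile = ScoringProfile(
    name="boost-recent",
    functions=[
        FreshnessScoringFunction(
            field_name="creationDate",
            boost=2.0,  # the "2" boost we started with
            interpolation="linear",
            parameters=FreshnessScoringParameters(boosting_duration=timedelta(days=365)),
        )
    ],
)

# The profile is attached to the index definition (SearchIndex(scoring_profiles=[...]))
# and referenced at query time: search_client.search(..., scoring_profile="boost-recent").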

One consideration with this approach is the reliability of the "creation date" of the source documents. For instance, when a PowerPoint presentation is converted into PDF format, the file's creation date defaults to the date of conversion, not the original date of the PowerPoint document. This discrepancy may not accurately reflect the timeline we intend to capture. Although the client has accepted this limitation for the current period, it is a topic we may want to revisit in the future.


Questions about multiple companies

Users would often ask about multiple companies in the same question.

“Tell me what the following analysts are saying about XXX: Contoso Equity Research, Northwind Capital, JP Morgan, Citigroup, Susquehanna, TD Cowen, Stifel, Wolfe Research, RBC Capital Markets, Bank of America”.

In order to address these questions, we added another "decision agent" to the flow. This new call to GPT4 would be responsible for determining whether a question should be broken up into multiple, separate questions. The idea was to tackle each part of the question individually, then combine the results into one final, detailed answer. So, when the user asks the question above, under the hood it will be broken up into the following:

"What is Contoso Equity Research saying about XXX?"

"What is Northwind Capital saying about XXX?"

“What is JP Morgan saying about XXX?”

“What is Citigroup saying about XXX?”

“What is Susquehanna saying about XXX?”

“What is TD Cowen saying about XXX?”

“What is Stifel saying about XXX?”

“What is Wolfe Research saying about XXX?”

“What is RBC Capital Markets saying about XXX?”

“What is Bank of America saying about XXX?”

Each of these questions is run through the standard process: a search is executed, the results are passed to GPT4, and an answer is generated. All of the individual answers are then concatenated together and returned to the user.
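A rough sketch of that flow, reusing the answer_with_rag helper from the first sketch (the decomposition prompt below is illustrative, not the exact prompt we used):

DECOMPOSE_PROMPT = (
    "If the question asks about multiple companies, rewrite it as one self-contained "
    "question per company, one per line. Otherwise return the question unchanged."
)

def answer_multi_company_question(question: str) -> str:
    decomposition = client.chat.completions.create(
        model="<gpt-4-deployment-name>",
        temperature=0,
        messages=[
            {"role": "system", "content": DECOMPOSE_PROMPT},
            {"role": "user", "content": question},
        ],
    ).choices[0].message.content

    sub_questions = [line.strip() for line in decomposition.splitlines() if line.strip()]
    # Each sub-question runs through the standard search -> GPT4 pipeline; the answers
    # are simply concatenated rather than synthesized by a final GPT4 call.
    answers = [answer_with_rag(q, search_client) for q in sub_questions]
    return "\n\n".join(answers)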

We were on the fence as to whether to do a final "inference" with GPT4 to synthesize all the answers into one, or to simply concatenate them together. We opted for the latter as it was simpler, and we felt that the former would lead to much longer response times (almost double). If users wanted to further summarize, they could ask the chatbot to do so in a follow-up question. The client was aligned with this for the time being.

We initially structured our prompt instructions to break up questions with multiple "entities" instead of "companies". The thought process was that we could apply the same logic to questions about multiple fiscal quarters, multiple lines of business, and so on. If we search for each line of business separately and combine the answers, the final answer will be more detailed. However, we realized that this was actually producing a worse response in most cases. The reason was that, in the data, more often than not all of the information would be on a single page or in a single table. This was very often the case for the different lines of business, or for metrics by fiscal quarter. When we split the question up into five different questions, all of them would need to land on that same page during the search step. This introduces more opportunity for a miss and is generally a waste of tokens and compute. So, we quickly refined the prompt to only break up questions asking about multiple companies.


Another challenge with this approach was that sometimes synthesizing the different answers was necessary to address the original question. "Compare and contrast the performance of these 3 companies: XXX, YYY, ZZZ": in this scenario, we would return a comprehensive set of data points about the 3 companies, but there would be no actual comparison done by GPT4 (because we are just concatenating the answers together).

We would also sometimes hit Azure OpenAI API rate limits, as we were consuming a massive number of tokens to address questions about 10-12 companies at once.

Going forward, I believe the better approach would be to only break up the search step, not the inferencing by GPT4. For our example, we would perform a separate search for each company (only taking the top 2-3 results for each), and then pass all the results to GPT4 for inferencing. This would be much more efficient in terms of tokens/compute, would be more flexible for all question types, and I believe would generally yield a response that better addresses the original question.



Filtering

We implemented two changes that provided users the ability to filter down the scope of their queries and to better focus on certain types of data.

1. A "category" tag for each type of data. The users categorized all of the data into 5 distinct categories: SEC Filings, Earnings Transcripts, Internal Data, Analyst Reports, and FactSet Data Sheets. During the indexing process, the data was separated into individual folders and tagged with the appropriate category tag. We then implemented a drop-down menu on the front-end of the application that let users filter and select the different categories, with "All" as the default.

2. The "Summary" field of the index would now typically contain a comment about the source of the data; "Microsoft's 10k Filing" would be included in the summary, for example. If a user asks "Based on Microsoft's 10k filings, what were the latest revenue numbers?", we would return chunks from the 10k filings in the search results because we are now searching on the summary field. While not a perfect solution, the users now felt that the chatbot at least understood what they were looking for to some extent.


Achieving Detailed Responses

Once the search results were improved, the responses became more detailed organically, simply because there was now more content to synthesize into the final response. In parallel, we also bumped the max response token parameter of GPT4-Turbo up to 4000 and added "few-shot" examples to our main system prompt. These examples were highly detailed and comprehensive, and the model was able to use them to adjust its behavior accordingly.
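In terms of the API call, those two changes amount to something like this (a sketch; the prompt and example variables are placeholders, and grounding/user_query carry over from the earlier sketches):

response = client.chat.completions.create(
    model="<gpt-4-turbo-deployment-name>",
    max_tokens=4000,  # raised response ceiling
    messages=[
        {"role": "system", "content": MAIN_SYSTEM_PROMPT},            # placeholder name
        {"role": "user", "content": FEW_SHOT_EXAMPLE_QUESTION},       # few-shot example question
        {"role": "assistant", "content": FEW_SHOT_DETAILED_ANSWER},   # highly detailed example answer
        {"role": "user", "content": f"Sources:\n{grounding}\n\nQuestion: {user_query}"},
    ],
)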



Reflecting & Key Lessons Learned

We had reached a stopping point in the project and the initial roll-out was deemed a success. While the client was extremely happy with the results, we cut it very close from a timeline perspective and nearly failed at more than one point throughout the course of the project. Here were my main takeaways:


RAG is more of a search exercise than it is an AI exercise

Generally speaking, if you give GPT4 all the right content, it gives you a high-quality answer. Getting it the right content is the hard part. While we did spend time on prompt engineering and incorporating agents, the vast majority of the effort went into trying to improve the search results. We came in with a working knowledge of the basic Azure AI Search concepts, but we really needed a much deeper, 400-500 level understanding of the service. While we developed it over the course of the project, we spent a lot of time spinning our wheels on the search aspect.


Accelerator Pros & Cons

The beauty of the pre-built accelerator was that we were able to quickly deliver the "wow" moment to the client. Being able to have the ChatGPT experience on their own data was extremely impactful, and seeing it in a nice, clean web app UI even more so. The challenges arose when the default behavior was not as expected. In order to make a change, you needed a comprehensive understanding of the overall code base: how do all the files, functions, and application routes fit together? Unless you are an experienced application developer with a deep understanding of Python (I was neither), developing this comprehensive understanding is going to be a challenge.


The "defaults" are not always the best choice for your use-case

The commonly accepted best practice for vector search is that you break your document up into 512 token “chunks” with 25% overlap between chunks. According to some of the leading research groups, this approach leads to the best results on certain benchmarks. However, this does not necessarily mean that this is the best strategy for your use-case. In our scenario, we would have been better off architecting our indexing & search process from the ground-up, instead of trying to change an existing process to fit our needs. While it is great to deliver that “wow” moment quickly, if we had put extra time into the design & planning phase, the process would have gone more smoothly.


Understand Your Data

Developing a deep understanding of the data sources was critical. What context was important, how were they structured, what sort of thought process did the system need to have in order to get the desired output? What kinds of questions would be asked, what sort of answers were we looking for? Even taking a further step back and asking questions like “is a PDF really the best source for internal financial information? Would connecting to a database be a better approach?”. Or maybe “chunking” of documents shouldn’t have been the default strategy. Perhaps we could have considered summarizing each document separately, and then providing the answers in the form of “summary of summaries”.


Vector Search is a tool, not a silver bullet

It was important to fundamentally understand how vector search worked & what was happening under the hood in order for us to realize what was going wrong in our search process. Even after we had a fully re-architected, optimized vector search in place, a basic key word search still performed better for some questions.


Use LLMs with caution

LLMs are non-deterministic, and they are never going to be perfect. Each LLM call you add into the process introduces a chance for error, and you are also adding latency and cost. The increase in value you get from adding an LLM call needs to outweigh the drawbacks. Consider cheaper, lower-latency models for simple tasks; GPT4 might be overkill.


Iterate

We were successful because we were willing to try things and abandon them quickly if they weren't yielding good results. You also need an engaged stakeholder/user base who can give feedback, critique your ideas from a business perspective, and make sure what you are delivering aligns with their expectations. This is true of any software development, but even more so in gen AI app development, in my opinion.


Testing

Create a representative sample of your data set for testing purposes. Indexing in a search engine like Azure AI Search is fairly costly, and in a time-crunched POC you will likely end up re-indexing your data multiple times. You want your testing data to be big enough to be representative of the broader data set, but small enough to be manageable from a cost perspective.



Going Forward

The initial roll-out was deemed a success, and the client was extremely happy with what we achieved. Being an architect, though, I naturally started to think about what could still be improved or what should potentially be reworked. What would the next phase entail?


Indexing & Search

Our new method of indexing and search was a resounding success, but there is always room for improvement. In general, all of our top search results for a given question were relevant, but were they the most relevant? As I dug through the data, I often found that the "best" page of content for a certain question wasn't being returned in the search results. For example, if we asked for Contoso's opinion on a topic, it would return pages 2, 3, 4, and 5, but not page 1, even though page 1 had the most relevant content. What was happening here? I have two ideas:


1. We are searching on the GPT4-generated summary field of our index. These summaries are fairly concise. It could be the case that the vector representations of these summaries are all extremely similar to each other, and the search algorithm has a hard time picking out "the best" one out of N very similar, relevant chunks. Perhaps making the summaries a bit more detailed would yield better results.

2. My other thought is that vector search matches on semantic similarity, but just because something is semantically similar to something else doesn't necessarily mean it contains "the best" content to answer a given question. Perhaps as part of our GPT4-generated summary, we could have GPT4 "rate" a given page in terms of content richness, and then boost search results on that field via a custom scoring profile. Another idea might be to add an agent that determines where to focus the search. The agent might say to itself: "The user is asking for key points from Contoso. First I am going to locate all Contoso reports. The key points are usually on the first page of each report, so I am now going to pull all reports where page=1." In this manner we would be using an LLM to guide the search process instead of relying purely on the search algorithm itself.


There is also room for improvement in terms of how we prioritize the latest data. Right now we are using a custom scoring profile with a boost of “2” in order to boost results based on their creation date. We need to play around with the boost setting to determine the right balance between relevancy & recency. We may also want to “tag” data with a fiscal quarter instead of relying on the creation date of the PDF which could be error prone.

Another, perhaps more complicated, option would be to apply a pre-search filter that only considers data from the last N days unless an agent deems it necessary to search the full data set. This would make sense in the context of most questions (what is the current market sentiment, what were the latest revenue numbers, etc.).


Latency

There are many ways we can improve response time that would ultimately improve the user experience.

1. PTU – Azure OpenAI offers "Provisioned Throughput Units", which allow customers to purchase dedicated compute capacity. In the default billing model ("pay as you go"), you are essentially sharing compute infrastructure with other customers in that region (your data is completely isolated and secure; you just share the compute used for inferencing). This can lead to high latency during peak hours and fluctuations in response time, which can be frustrating to users. PTU gives you consistent, predictable response times.

2. Streaming – The idea behind streaming is that you return the response from the LLM in pieces as it is being generated (see the sketch after this list). While it still takes the same total time to generate the full response, this does wonders for the user experience. Say a response takes 60 seconds to fully generate. If you wait for the full response before returning anything to the front-end, the user needs to wait 60 seconds before they can start reading. With streaming, the user can start reading after 2-3 seconds and keep reading as the response generates.

3. Caching – We can store responses that we deem "high quality" in a cache. When a user asks a question and the answer is found in the cache, it is returned immediately. This saves time and compute cost.
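As a sketch of the streaming pattern with the Azure OpenAI Python client (illustrative only; a real backend would forward these deltas to the browser over something like server-sent events rather than printing them):

stream = client.chat.completions.create(
    model="<gpt-4-deployment-name>",
    stream=True,
    messages=[{"role": "user", "content": "Summarize the latest earnings call."}],
)

for chunk in stream:
    # Each chunk carries a small delta of the response; print (or forward) it immediately.
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)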


General Operational Improvements & Production Readiness

1. Testing framework / LLMOps – What is our baseline accuracy and quality? When we make a change, how does it affect that baseline? Testing and measuring this manually is laborious, time-consuming, and subjective. Establishing LLMOps is critical for any production-ready LLM application. See: An Introduction to LLMOps: Operationalizing and Managing Large Language Models using Azure ML (microsoft.com)

2. APIM – Azure API Management is commonly used together with Azure OpenAI to provide resiliency and scaling for LLM-based applications. See: Azure OpenAI scalability using API Management (youtube.com)

3. Automation of ingestion and indexing – How do we streamline the process so new data flows into the system automatically, without manual effort?

4. Security model – How do we ensure the system only cites sources the user has access to? How do we handle that situation? Do we have it say "I'm sorry, I couldn't find anything", or perhaps "I found relevant documents, but it appears you don't have access. Submit an access request <here>."?


Potential New Features

1. Bring your own documents – Allow users to upload their own documents and "chat" with them.

2. Whole-document summarization – For questions like "Summarize Microsoft's last earnings call", the current architecture will not suffice. We would only be summarizing the pages that were returned in the search, which is ultimately a subset of the overall document. We would need to revisit our strategy for full document summarization.

3. Incorporate vision – Can we give the system the ability to understand charts, graphs, and other financial visuals? Can we give it the ability to generate charts, graphs, and visuals from our data? Do we integrate GPT4 with Vision? Do we somehow tap into Power BI Copilot? Or perhaps the "Data Analysis" feature that OpenAI has released?

4. Database connectivity – If the internal financial reports are compiled from data in a database, perhaps we should go to the source directly. GPT4 is generally very good at constructing SQL queries if given the proper context.

5. Autonomous agents – The industry is moving towards the concept of "agents", which can create plans, use tools, and work through complex problems. As we seek to take this project to the next level, agents will likely play a key role.




Closing Thoughts

This first-hand experience has re-affirmed my view that LLMs are set to revolutionize every industry. Many of the creative solutions we employed in this project will very likely end up being eclipsed by new models & architecture paradigms that are poised to arrive in the near future. Although this technology stands as a game-changer, its successful integration hinges on careful planning, iterative development, and forward thinking. To truly capitalize on its potential, organizations must be prepared to adapt their people, processes, and technology in concert.
