Phasing out our v1 RAG engine

Insights from 12 months of running LLMs in production, and why we feel the current system is no longer needed.

We are switching to a new RAG-based system and copilot implementation, and as a result we are phasing out all the work we did on the previous version. I felt this would be a good opportunity to walk through what we were doing, especially in high-recall fields like Financial Services.

The three main components to look at in any RAG-based system:

  1. Information Indexing
  2. Query Understanding
  3. Search Mechanism

Let me go over our previous implementation; at the end we will cover why we are phasing it out, along with a quick snapshot of what comes next.


Information Indexing

In the previous version, we mostly relied on structured parsing of financial metrics and top-10 question generation per document, built on page-level chunking with 10% overlap. Whenever we tried anything more complex, customers uploaded PDFs that broke it in hilarious ways. A rough sketch of the chunking step is shown below.
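For illustration only, here is a minimal sketch of one reading of "page-level chunking with 10% overlap": each page becomes a chunk, padded with roughly the first 10% of the following page. The function name and the plain-text page input are assumptions, not our production pipeline.

def chunk_pages(pages: list[str], overlap: float = 0.10) -> list[str]:
    # Each page is one chunk, padded with ~10% of the next page for continuity
    chunks = []
    for i, page in enumerate(pages):
        chunk = page
        if i + 1 < len(pages):
            next_page = pages[i + 1]
            chunk += "\n" + next_page[: int(len(next_page) * overlap)]
        chunks.append(chunk)
    return chunks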


Query Understanding

The main problem you will always run into with any RAG system is that you will have indexed a question like the following from an annual report:

What is the median remuneration of board of directors of Infosys?

Whereas the user question will end up being:

What was the median salary of Indian IT company CEOs over the last 3 years?

So you need to ask follow-ups until you reach something that looks like this:

What was the median salary of the board of directors and key management personnel of Infosys, TCS, Tech Mahindra, Mphasis, Oracle financials in FY22, FY23 and FY24?

The easiest way to accomplish this is to parse the user question into a fixed Pydantic model and fill in the missing fields against that fixed schema. We used the following model to reliably fill in context for equity analysts:

from enum import Enum
from typing import List, Optional, Union

from pydantic import BaseModel, Field
from pydantic.json_schema import SkipJsonSchema

# StringOptions is an internal constrained-string type used for tickers (not shown here)


class StockExchanges(str, Enum):
    BSE = "BSE"
    NSE = "NSE"
    NYSE = "NYSE"
    NASDAQ = "NASDAQ"


class BasicAnswerContext(BaseModel):
    financial_sectors: Optional[List[str]] = Field(description="Extract all the relevant financial sectors for the companies, indices or funds mentioned in this question, for example Banking, Finance, Education, Agriculture, etc.")
    market: Optional[List[StockExchanges]] = Field(description="Extract the names of the relevant exchanges for all the companies, indices or funds mentioned in this question. If listed on multiple exchanges, give multiple items in the list, e.g. BSE, NSE, NASDAQ, NYSE.")
    entities: Optional[List[StringOptions]] = Field(description="The exact market ticker for all the companies mentioned in the extracted context. Just pick the company tickers, no extra exchange parameters.")
    num_days: int = Field(default=0, description="Number of days for which the research is to be done. If the user has not mentioned any time period, this should be 0.")
    research_timeframe_end: Optional[str] = Field(description="End day for the mentioned time-span in YYYY-MM-DD string format. Do not assume values, but use basic logic like mapping 'latest/newest/most recent', 'last month', 'last quarter', 'previous quarter', etc. from today's date.")
    research_timeframe_start: SkipJsonSchema[Union[str, None]] = None
    financial_metric: Optional[List[str]] = Field(default=[], description="All the quantitative financials reported about a company, for instance EPS, P/E Ratio, P&L, Debt to Equity ratio, etc. If the user has explicitly mentioned a certain financial metric, put only that in the list; otherwise provide the nearest related financial metrics. EXPAND ANY FULL FORMS.")
    non_financial_metric: Optional[List[str]] = Field(default=[], description="Qualitative information about the company, all the way from new CEOs to product launches.")


At the end of all your follow-up questions you can take the parsed context from this class and either build a standalone question from it or pass it as JSON context to the model. Both approaches work about equally well, depending on your search algorithm.
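As a rough sketch of that follow-up loop (not our production code), assuming the instructor library on top of the OpenAI client; the model name and the ask_user helper are placeholders:

import instructor
from openai import OpenAI

client = instructor.from_openai(OpenAI())

def parse_context(question: str) -> BasicAnswerContext:
    # Parse the question into the fixed schema defined above
    return client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        response_model=BasicAnswerContext,
        messages=[{"role": "user", "content": question}],
    )

def ask_user(prompt: str) -> str:
    # Stand-in for however the copilot collects follow-up answers
    return input(prompt + " ")

question = "What was the median salary of Indian IT company CEOs over the last 3 years?"
ctx = parse_context(question)
while not ctx.entities or not ctx.financial_metric:
    question += " " + ask_user("Which companies and which metric do you mean?")
    ctx = parse_context(question)

# Either rewrite a standalone question from ctx, or pass it as JSON context
search_context = ctx.model_dump_json(exclude_none=True)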

Using the above approach we were able to achieve a p99 recall of 0.90 at 2 copilot steps and 0.92 at 3 or more copilot steps.

Search Mechanism

This is where we made the most changes in this version: from vector-only, to full-text + reranker only, to structured extraction, before finally settling on a combination of all of them that looked something like this:

from typing import Any, Dict, List, Optional, Union

from pydantic import BaseModel, Field
from pydantic.json_schema import SkipJsonSchema

# DocType is our internal document-type enum (EarningsCall, AnnualReport, etc.)


class WebSearch(BaseModel):
    queries: List[str] = Field(default=[])
    start_date: str = Field(default="")
    end_date: str = Field(default="")


class ODMInput(BaseModel):
    financial_entities: List[str] = Field(default=[])
    financial_metrics: List[str] = Field(default=[])
    search_text: str = Field(default="")
    order_by: Optional[Dict[str, str]] = Field(default=None)
    projection: Optional[List[str]] = Field(default=None)
    limit: Optional[int] = Field(default=None)
    start_date_range: SkipJsonSchema[Union[str, None]] = None
    end_date_range: SkipJsonSchema[Union[str, None]] = None

    @classmethod
    def get_knowledge_cutoff(cls):
        return "18 July 2024"


class StructuredParsing(BaseModel):
    doc_type: DocType = Field(default=DocType.EarningsCall)
    financial_entities: List[str] = Field(default=[])
    financial_metrics: Optional[List[str]] = Field(default=None, description="All relevant financial metric names")
    financial_years: Optional[List[int]] = Field(default=None)
    financial_quarters: Optional[List[str]] = Field(default=None, description="This field always follows a format like: <Q4FY2024>")

    @classmethod
    def get_knowledge_cutoff(cls):
        return "1 August 2024"


class VectorSearch(BaseModel):
    search_text: str = Field(default="")
    doc_type: DocType = Field(default=DocType.EarningsCall)
    financial_entities: List[str] = Field(default=[])
    search_config: Dict[Any, Any] = Field(default={})

    @classmethod
    def get_knowledge_cutoff(cls):
        return "21 August 2024"


class GenerateAgenticSearchQuery(BaseModel):
    structured_parsing: Optional[StructuredParsing] = Field(default=None)
    vector_search: Optional[VectorSearch] = Field(default=None)
    nosql_search: Optional[ODMInput] = Field(default=None)
    web_search: Optional[WebSearch] = Field(default=None)

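To make the combination concrete, here is a hypothetical fan-out sketch (not our production code) of how one GenerateAgenticSearchQuery could be dispatched to the individual backends; run_structured, run_vector, run_nosql and run_web are placeholder stubs for the real search implementations.

# Placeholder backends; each returns a list of result dicts in the real system
def run_structured(q: StructuredParsing) -> List[dict]: return []
def run_vector(q: VectorSearch) -> List[dict]: return []
def run_nosql(q: ODMInput) -> List[dict]: return []
def run_web(q: WebSearch) -> List[dict]: return []


def run_search(query: GenerateAgenticSearchQuery) -> List[dict]:
    # Fan out to whichever search strategies the model chose to populate
    results: List[dict] = []
    if query.structured_parsing is not None:
        results.extend(run_structured(query.structured_parsing))
    if query.vector_search is not None:
        results.extend(run_vector(query.vector_search))
    if query.nosql_search is not None:
        results.extend(run_nosql(query.nosql_search))
    if query.web_search is not None:
        results.extend(run_web(query.web_search))
    return results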
While this is a nice extraction to visualize, it is overly complex and often confuses the model between extracting the relevant metadata and the more nuanced search-keyword extraction itself. When done over multiple steps, it can behave unstably.


Next Version

So, why are we moving on from this type of RAG system? There are a few reasons:

  1. A lot more of our current usage and most of our future usage will be based on known financial documents like Annual Reports, Earnings Calls, DRHP, BRSR, Analyst Ratings (credit/equity), Fund Factsheets, Bank Statements and, most importantly, regulatory documents. This makes question generation and embedding unnecessarily wasteful and often even counterintuitive as an information index.
  2. Our customers strongly prefer outputs in report or Excel formats, and want close-ended follow-up research and iteration on those documents themselves. This requires us to spend much more compute on query understanding than our previous approach could offer.
  3. Finally, we felt the search mechanism above was both too fragile and too strict to cover for any context missed during query understanding, so we have made the overall schema simpler.


Our newer approach focuses much more on persona-based document understanding as a pre-step to information indexing, based on the workflow and usage data we have gathered over the past 12 months. For example: what is the superset of all the factual information a credit manager would want from an SME's bank statement?
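Purely as a hypothetical illustration of that direction (the actual schema is still being built), a persona-scoped view of a bank statement might look like this; every field name below is an assumption, not our final design:

from typing import List, Optional

from pydantic import BaseModel, Field


class CreditManagerBankStatementView(BaseModel):
    # Hypothetical persona-scoped extraction target for a credit manager
    account_holder: str = Field(default="")
    average_monthly_balance: Optional[float] = Field(default=None)
    total_credits: Optional[float] = Field(default=None)
    total_debits: Optional[float] = Field(default=None)
    bounced_cheques: Optional[int] = Field(default=None, description="Count of returned/bounced instruments in the statement period")
    recurring_emis: Optional[List[str]] = Field(default=None, description="Descriptions of recurring loan/EMI debits")
    notes: Optional[str] = Field(default=None, description="Anything a credit manager would flag for follow-up")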
