The Matrix Retrieved : Part 3 - From RAGs To Riches
Joseph Diamand
VP, Manager, Engineering Lead, Developer. FinTech, AI/ML, BlockChain, Mobile Banking JPMorgan Chase & Co.
Large Language Models (LLMs) have amazing capabilities. They interact with and understand language with seemingly linguistics-on-steroids superpowers. But just like Superman had his vulnerability to kryptonite, LLMs have their own susceptibilities. These operational deficiencies include cut-off dates for their training data, an inability to perform precise mathematical calculations, and the generation of incorrect or nonsensical completions, known as hallucinations.
LLMs are empowered by the breadth and depth of, as well as constrained by the static nature of, their training data. While the world moves forward at the speed of a Bugatti Chiron, LLMs, once their data acquisition phase is complete, will remain static in nature and frozen in time. Retrieval-Augmented Generation (RAG) supplements the knowledge embedded in the model by providing access to up-to-date information as well as to data the LLM may not have been trained on. AI researchers are working to address and mitigate the limitations of LLMs. These efforts have provided solutions that help companies use LLMs in more productive and reliable ways. Before diving into those details, a short detour into the realm of futuristic cyberpunk tech noir follows.
Retrieving Reality: The Matrix Transformed
The Matrix series of films, a sci-fi cinematic marvel of special effects and mind-blowing plot twists, spins a quest-for-truth, good-triumphs-over-evil narrative set in a dystopian world where humanity is trapped in a simulated reality created by intelligent machines. These machines are highly advanced AIs with sophisticated capabilities for simulating an entire world, managing vast amounts of data, and controlling human minds through the Matrix. In the first film, The Matrix (1999), Neo, a highly skilled computer hacker, leads a double life, working a mundane job as a software developer by day while engaging in illegal hacking activities by night. Neo's instincts tell him that something is existentially wrong in the world, and his suspicions are confirmed when he meets Morpheus and Trinity and learns about The Matrix. They orchestrate a jailbreak for Neo, and he joins them in the battle against the AI as none other than The One. In The Matrix Reloaded (2003), Neo intensifies his fight against the AI, seeking help from the Oracle to uncover The Matrix's origins and purpose. In The Matrix Revolutions (2003), the battle between humans and AI reaches a breaking point, Agent Smith becomes a rogue program, and, in the end, Neo is able to negotiate a peace between humans and the AI. In the fourth film, The Matrix Resurrections (2021), the storyline advances 60 years, finding Neo back as a prisoner in the simulated Matrix world, this time as a successful video game developer. A group of rebels, led by the hacker Bugs, liberates Neo from the Matrix a second time.
Neo, in a RAG analogy, interacts with an LLM through prompts and completions, both of which stream out of and into Neo via sensory terminals attached directly to his cerebral neural pathways. The LLM responses are based on representations of the world it has been trained on, i.e., The Matrix, a manifestation of synthetic embeddings generated by the AI nemesis. The embodiment of The Matrix ensures that human captives are constrained in a catatonic state while the energy they generate is harvested by the AI for its survival. Morpheus and Trinity in the first film, and Bugs in the fourth film, along with their respective crews, represent RAG, augmenting the prompts to Neo's LLM and short-circuiting the simulation generated by the Matrix. RAG provides Neo's LLM with data from the real world, allowing him to see beyond the embedded illusions generated by the Matrix. With this new, RAG-infused perspective, Neo is first able to choose the Red pill, an act that liberates him from The Matrix, and second, to re-enter the Matrix in order to save Morpheus, confront Agent Smith, and save Zion.
Outline
This is the third and final article in the series on neural networks, LLMs, and RAG. The previous articles, available here and here, trace the origins of neural networks, explore the inner workings of LLMs, and include steps to train and adapt LLMs for application-specific deployment. An LLM application whose training and optimization have reached a reliable stage moves on to the next level on the path to production readiness. The focus of this article is to explore steps developers take to establish an LLM's efficacy, thereby providing their companies with competitive advantages in the marketplace. One popular approach to achieve these results is to implement a RAG solution. Then, to make LLMs reason better, several additional approaches are available, starting with Chain of Thought Prompting (COTP), then the Program-Aided Language (PAL) model, and finally ReAct.
LLM Constraints
Some of the primary concerns when deploying LLMs include the cut-off date problem and the hallucination problem. The cut-off date problem relates to the point in time when the model's pre-training is complete and its internal knowledge is captured and frozen. The model has no access to data created after the cut-off date unless it is retrained later or the data is provided in a prompt. In the hallucination problem, the model generates plausible-sounding text that has no factual basis.
Hallucinations can happen when prompts request information on data the LLM has never seen before, or on data the model has been trained on that contains conflicting information. Hallucinations can also occur when the prompt refers to events that never happened. Crafting clear, specific, and factual prompts will steer the model towards more reliable completions.
To improve model performance, human reviewers can check and correct the completions for accuracy. Automated solutions that implement post-processing filters and rules designed to detect and correct hallucinations are a useful complement to the human reviewer approach.
Another way to limit the impact of hallucinations and address the cut-off date problem is to implement a Retrieval-Augmented Generation solution. With RAG, the model is provided with relevant information from external sources.
Orchestration
When a user chats with an LLM, the user enters a prompt, the LLM receives the input, processes it, and generates a response, which is then returned and displayed at the chat terminal. Behind the scenes, LLMs are integrated with additional flows that optimize the completion to the user. An orchestration library, positioned between the prompt and the LLM, assembles multiple components to provide connections to external data sources, including the Internet, APIs, databases, PDF documents, and more. The library manages workflows by coordinating the interactions between different components, ensuring efficient data processing, addressing any errors or retries, and providing logging and analytics for system monitoring and evaluation. Orchestration libraries can be built with widely used frameworks such as LangChain and LlamaIndex. These frameworks are designed to simplify LLM application development by providing tools for seamless integration and management of various data sources and services.
LangChain is well-suited for a broad range of applications, including chatbots, question-answering systems, and text summarization. Developers can connect modular components to build custom LLM workflows that incorporate chains for combining multiple LLM calls and agents for interacting with external tools. LangChain provides for data preprocessing, model deployment, monitoring, and integration with various data sources, enabling seamless retrieval and usage of external data. LangChain is open-source, supported by an active community, and provides extensive documentation on usage and implementation. LangChain provides a great deal of flexibility and control, at the expense of additional custom development and integration work.
LlamaIndex, like LangChain, is a powerful framework designed to extend the capabilities of LLM applications. LlamaIndex manages large volumes of data, streamlines the process of indexing and querying document collections, and makes it easier to get started with RAG. The LlamaIndex library connects to numerous data sources, databases, and infrastructure, such as cloud storage and on-premises servers, and includes query engines to retrieve document sections relevant to the prompt. It supports a wide range of formats, including PDFs, text files, and HTML, and provides data preprocessing utilities, such as tokenization and filtering, that help to improve the accuracy of information retrieval. LlamaIndex offers capabilities for metadata management and customization options, so developers can tailor the indexing and retrieval processes to specific needs. Some would venture so far as to say that the learning curve for integrating with LlamaIndex is less steep than for LangChain.
RAG
The training knowledge for an LLM model, embedded in the parameters of its neural network, is known as its parametric memory. This is adjusted during training and fine-tuning, so the LLM learns to recall preferred patterns for making predictions or performing tasks. Retrieval-Augmented Generation (RAG) is a framework that operates orthogonally to the model training, providing the model with access to data that is not in its training set. RAG was first introduced in a paper presented at the Advances in Neural Information Processing Systems (NeurIPS) 2020 conference. The paper, from Facebook AI Research, proposed a hybrid model to improve the accuracy and relevance of LLM applications by combining parametric memory with external data sources represented as vector indices in a database. The paper centered on a neural retriever to access a dense vector index of external data sources, in this case Wikipedia. The neural retriever locates the parts of documents that are most related to the prompt, then combines the retrieved parts with the original prompt and sends that to the LLM.
RAG helps the model to overcome the knowledge date cutoff issue. It can also help to mitigate a model's hallucination problem. When RAG preprocesses documents, such as PDFs, it segments them into smaller chunks, making the documents more manageable and improving the efficiency of information retrieval. Smaller chunk sizes make it easier to work within the LLM's context windows, which have a set capacity for processing information. Next, chunks from the preprocessing stage are decomposed into tokens. The tokens are converted into vector representations in an embedding space. Embeddings convert words into numerical vectors that capture their meanings, which are later used to retrieve the chunks that are semantically related to the prompt.
Vector databases store vector representations, indexing the representations with a unique key derived from the tokens or phrases they represent. When a prompt is entered into the system, its relevant parts are converted into a vector representation using the same embedding space as the RAG documents. The vector database is used to search for vectors that closely match the prompt, using measures like cosine similarity to determine semantic relevance. The search locates the most relevant chunks of data, and then adds them to the original prompt. Because vectors are indexed by unique keys, these keys can be used to generate citations for the original documents, ensuring that the generated responses are traceable to their sources.
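As a minimal sketch of the similarity measure itself (the toy vectors and NumPy here stand in for a real embedding model and vector database), cosine similarity ranks stored chunks against the embedded prompt:

import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity: 1.0 means identical direction, values near 0 mean unrelated
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy embeddings keyed by chunk id (a real system would use a learned embedding model)
chunk_vectors = {
    "10k_cash_flow": np.array([0.92, 0.11, 0.31]),
    "10k_risk_factors": np.array([0.15, 0.88, 0.44]),
}
prompt_vector = np.array([0.90, 0.15, 0.35])  # embedding of the user's question

# Rank stored chunks by similarity and keep the best match to augment the prompt
best_id = max(chunk_vectors, key=lambda k: cosine_similarity(prompt_vector, chunk_vectors[k]))
print(best_id)  # "10k_cash_flow"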
Access Documents
RAG is designed to access text files, knowledge bases, internal wikis, research papers, and any other form of textual data relevant to the application. RAG systems interact with documents in various formats, including PDFs, Word documents, plain text files, Markdown files, HTML files, OCR scanned images, etc. Documents can be sourced internally from company resources like wikis, knowledge bases, and technical documentation, or externally from research papers, articles, news websites, and public datasets. The preprocessing phase involves text extraction, cleaning, chunking into smaller segments for efficient indexing, and converting text into tokens then into numerical vectors that can be stored in a vector database.
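As a rough sketch of the chunking step (the chunk size, the overlap, and the source file are illustrative assumptions rather than any particular library's defaults), a preprocessor might split extracted text into overlapping segments before embedding:

def chunk_text(text, chunk_size=500, overlap=50):
    # Split a cleaned document into overlapping character windows so that
    # sentences cut at a chunk boundary still appear intact in the next chunk.
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

document_text = open("employee_handbook.txt").read()  # hypothetical source file
chunks = chunk_text(document_text)
# Each chunk is then tokenized, embedded, and stored in the vector database with a unique id.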
Access Databases
RAG systems interact with databases to retrieve structured, semi-structured, and unstructured information for business use cases. A variety of database types are supported, including SQL databases like PostgreSQL and MySQL and NoSQL databases like MongoDB and Cassandra. Specialized databases designed for enterprise data warehouses (e.g., Snowflake, Redshift) are supported too. RAG can work with customer relationship management (CRM) systems (e.g., Salesforce, HubSpot), financial databases (e.g., Bloomberg, FactSet), and graph databases (e.g., Neo4j). Accessing data sources through RAG enables prompts to incorporate factual information, statistics, customer records, financial data, relational data, insights into relationships and connections, and other relevant data points.
Access APIs
RAG leverages Application Programming Interfaces (APIs) to tap into a variety of data sources. Internal data can include company-specific information such as purchase history, account details, or employee records. External APIs connect to encyclopedic knowledge bases like Wikipedia's API, specialized content like Google Knowledge Graph, news feeds, product catalogs, and other data sources. Real-time data APIs provide access to up-to-the-minute information such as market data, weather forecasts, flight data, and traffic updates. RAG's use of APIs helps the LLM access current and specialized information, and helps to improve the ability to provide accurate, real-time, and contextually relevant responses.
Access Internet
RAG systems overcome cut-off date limitations by integrating with web search engines like Google or Bing. The LLM can access up-to-date information, news articles, social media posts, online forums, and other publicly available web content, and provide responses that reflect the latest and most relevant developments and trends. This capability can be used, for example, to answer questions about recent events, analyze sentiment on social media, and track real-time market data.
Vector Database
A vector database is a specialized data storage system designed to efficiently manage and search high-dimensional vector representations. Vector databases were developed to support the growing demand for machine learning and AI applications, particularly in natural language processing and computer vision. Unlike traditional databases that store structured data, vector databases handle data as numerical representations that capture semantic meaning. Each vector corresponds to an item or chunk of data, such as a document or an image, and is stored along with a unique identifier or key for retrieval. These databases utilize advanced indexing techniques, such as KD-trees or approximate nearest neighbor (ANN) algorithms, to keep similarity searches fast as the collection grows.
To prepare a document for Retrieval-Augmented Generation (RAG), the document is partitioned into chunks. Chunking creates smaller and more manageable segments, enhancing the accuracy of information retrieval, and improving the efficiency of the augmented prompt submitted to the LLM. The words in the chunks are tokenized, dividing the text into individual tokens. Following tokenization, tokens are converted into high-dimensional vectors that capture their semantic meaning in the embedding process. These embeddings are what the vector database uses when searching for relevant documents.
During the retrieval phase, when a prompt is entered and needs to be augmented by RAG with a stored document, the prompt is tokenized and converted into vectors. The vector representations of both the prompt and document chunks enable efficient similarity searches within the vector database, allowing the system to find and retrieve the most relevant document chunks related to the prompt.
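A minimal end-to-end sketch using Chroma, one of the open-source vector databases discussed below, shows the add-then-query pattern (the collection name and document text are illustrative, and Chroma applies a default embedding function unless one is supplied):

import chromadb

client = chromadb.Client()  # in-memory client; persistent clients are also available
collection = client.create_collection("rag_chunks")

# Store document chunks; Chroma embeds them with its default embedding function
collection.add(
    ids=["10k-chunk-001", "10k-chunk-002"],
    documents=[
        "Cash and cash equivalents increased to $4.2B during the fiscal year.",
        "The company expects continued investment in data center capacity.",
    ],
)

# Embed the prompt and retrieve the most similar chunk to append to it
results = collection.query(query_texts=["How did the cash position change this year?"], n_results=1)
print(results["documents"][0])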
High-dimensional vector representations consider vectors located close to each other in the vector space as similar, indicating that the data they represent shares significant attributes or features. Metrics such as Euclidean distance or cosine similarity measure the distance between vectors, with smaller distances indicating greater similarity. For example, a prompt asks how the cash position of a company changed during the current fiscal year and what the future outlook looks like. In response, the vector database search engine locates the relevant chunks from the company's latest 10-K. RAG adds these chunks to the prompt and then submits the augmented prompt to the LLM. The LLM processes the cash flow information, builds a financial model that captures the changes in position, and uses the supplied data to forecast future cash flows.
Vector databases are available as commercial products as well as from open-source projects. Examples of open-source vector databases include Chroma, Weaviate, and Milvus. Pinecone, MongoDB Atlas Vector Search, and Vespa-ai are examples of commercial vector databases. Chroma, an open-source database, provides a Pythonic API that easily integrates with LangChain. Weaviate supports graph-based relationships and semantic understanding, has a steep learning curve, and requires extensive configuration and setup. Milvus is a high-performance solution, designed for large-scale datasets and real-time queries, requiring dedicated infrastructure for optimal performance.
Commercial vector databases include Pinecone, an easy-to-use, reliable, fully managed, scalable, high-performance solution for production workloads. MongoDB Atlas Vector Search is a convenient solution for MongoDB users; however, its dependence on the MongoDB ecosystem limits its adoption. Vespa-ai is a high-performance, flexible, real-time indexing and machine learning model, also with a steep learning curve for setup and configuration.
AI Conversational Agent
AI Conversational Agents (CAs) engage in meaningful dialogue, appear to understand user inquiries, and respond to requests by performing specific tasks. These agents can be deployed in customer service centers to connect users with services, answer questions, and respond like a human representative. Companies develop CAs to function as advanced virtual assistants, and to manage meeting schedules, agendas, participants, and summaries. CAs can assist with onboarding new employees, provide training on company policies, and address urgent employee questions, mirroring the role of an HR professional.
In a scenario where an employee interacts with a Conversational Agent that provides HR services, the CA can utilize the RAG Access Database subsystem to retrieve employee records as needed and include them with the employee's prompt. For a specific query regarding company policy, the CA uses the RAG Access Document subsystem to search for relevant document chunks and attach them to the prompt. If the request requires access to an internal API endpoint or a partner service, such as an insurance update, medical claim, or transit benefit, the CA uses the RAG Access API subsystem to securely transact with the endpoint, providing the necessary identification details retrieved previously. RAG could, for example, send the response from the API call back to the LLM; the LLM then generates SQL code, which is sent to the RAG Access Database subsystem to store the updates. If an employee query requires up-to-date web information, such as a weather forecast or operating hours, the request is handled by the RAG Access Internet subsystem, which performs the search, aggregates the details, and sends the information to the LLM for formatting.
LLMs act as the controller in the decision-making process for the CA. To trigger actions, the completions generated by the LLM provide a set of understandable instructions to guide the CA's actions. For example, an instruction in the completion may request the CA to collect information from the user for validation. This initiates a series of actions, using RAG to complete the required tasks.
An LLM CA for a specific industry, like an insurance AI agent, can be fine-tuned to understand context, identify necessary information, and generate appropriate instructions for the specific industry business domain use case. When flows require specific actions, such as the initial intake of an insurance claim, the LLM application is trained on the necessary steps to complete the process. The training includes examples of insurance claims, relevant policies, and typical user interactions. When the LLM generates "What is the policy number?", the application prompts the user for the requested information. The LLM guides the application through the necessary processes, from collecting initial claim information to validating details and initiating further actions. When the application performs multiple steps and reasons through multiple decision points, as in the above example, it is most likely using advanced reasoning techniques.
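A rough sketch of this controller pattern follows, in which the LLM's completion carries a structured instruction and the orchestration layer routes it to the matching RAG subsystem (the subsystem stubs and the JSON instruction format are hypothetical illustrations, not a specific product's interface):

import json

# Hypothetical stubs standing in for the RAG subsystems described above
def access_documents(query):
    return ["Parental leave: 16 weeks of paid leave for primary caregivers."]

def access_database(sql):
    return [{"employee_id": 42, "plan": "PPO"}]

def handle_instruction(instruction):
    # Route a structured instruction from the LLM completion to a RAG subsystem
    action = instruction["action"]
    if action == "search_documents":
        return access_documents(instruction["query"])
    if action == "query_database":
        return access_database(instruction["sql"])
    return {"error": "unknown action: " + action}

# Example completion from the LLM, expressed as JSON the orchestrator can parse
completion = '{"action": "search_documents", "query": "parental leave policy"}'
print(handle_instruction(json.loads(completion)))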
Enhancing LLM Reasoning
Enhancing LLM reasoning involves improving the logical and problem-solving capabilities of large language models, making them more effective at handling complex, multi-step tasks.
COTP and PAL
A successful approach to helping an application work through intricate, multi-step flows is called Chain of Thought Prompting (COTP). COTP simplifies and solves problems by breaking them down into individual steps, solving those steps one at a time, and then aggregating the outputs into a composite answer. When implementing COTP, the model is provided with examples of complex tasks and how to decompose them into a series of intermediate reasoning steps. By following these steps in succession, the model learns how to reason through similar problems and is more likely to take the same approach when needed. However, even COTP may not be enough on its own if, for example, one or more of the intermediate steps require precise mathematical calculations.
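The following small, purely illustrative one-shot example shows the shape of a chain-of-thought prompt: the worked example spells out its intermediate steps, nudging the model to reason the same way about the new question:

Q: A warehouse holds 120 laptops. It ships 45 and receives 30 more. How many laptops remain?
A: Start with 120. Shipping 45 leaves 120 - 45 = 75. Receiving 30 more gives 75 + 30 = 105. The answer is 105.
Q: A branch has 80 ATMs. 12 are taken offline for maintenance and 5 new ones are installed. How many are in service?
A:

The model is expected to complete the final answer by writing out similar intermediate steps (80 - 12 = 68, then 68 + 5 = 73) before stating 73.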
The Program-Aided Language (PAL) model is a framework in which the LLM offloads programmatic work, such as performing mathematical calculations, to a code interpreter. In conjunction with COTP, the model generates a script-like response that can perform the mathematical tasks and return the required results. PAL provides the LLM with example problems, the reasoning steps to solve the problems, and instructions on how to format those steps so they can be run by a code interpreter. The prompt sent to the LLM includes these PAL examples. The LLM generates the completion in the form of a script, complete with comments. The orchestration library, sitting between the LLM and the application, uses the comments in the completion as triggers; a trigger may, for example, instruct the orchestration library to send the script to the Python interpreter. The orchestration library then collects the output from the interpreter, appends it to the previous response, and sends it back to the LLM. The LLM pieces it all together and returns the correct completion. The COTP/PAL combo can handle simple flows, but when the application requires multiple decision points involving interactions with multiple data sources or external interfaces, an even more powerful framework is needed.
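Before moving on, here is a minimal sketch of the PAL hand-off described above (the generated script and the use of Python's exec are illustrative assumptions; a production orchestrator would sandbox the code and parse the completion more carefully):

# Script-like completion generated by the LLM in response to a PAL-formatted prompt
llm_completion = """
# Question: An account held $12,500, earned 4% interest, and then $1,200 was withdrawn.
principal = 12500
after_interest = principal * 1.04   # apply 4% interest
answer = after_interest - 1200      # subtract the withdrawal
"""

# The orchestration layer detects the script, runs it, and collects the computed answer
namespace = {}
exec(llm_completion, namespace)
print(namespace["answer"])  # 11800.0 is appended to the response and sent back to the LLM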
ReAct
ReAct, short for "Reason+Act," is a prompting strategy that combines reasoning and action planning for the LLM. Proposed by researchers at Princeton and Google in 2022, this framework enables the LLM to manage high-level plans and interact with external sources as needed. ReAct supplies the LLM with structured examples, allowing it to reason through problems. As the LLM progresses through the reasoning process, it makes decisions on actions that incrementally work towards a solution.
ReAct uses a trio of elements (thoughts, actions, and observations) in an interleaved manner to guide the model through the reasoning and action process. Thoughts represent the reasoning process and provide an outline of the steps for the model to take. Actions represent what needs to be done, such as interacting with an external system or searching a document. To interact with an external application or data source, the model must identify an action to take from a predetermined list. These actions are prepended to the example prompt text.
In the observation phase, the new information, such as that provided by the external search, for example, is brought into the context of the prompt and utilized by the model as feedback to determine the next steps. The trio of elements is integrated into the prompt, interleaved so that the model can follow the logical flow of the complex task, and the cycle is repeated as many times as needed until the final answer is obtained.
The following small example illustrates how the ReAct framework composes the elements of thought, action, and observation to help the LLM navigate through a multi-step task:
Thought: To determine the policy number, I need to look up the user's profile.
Action: Query the database for user profile information.
Observation: Retrieved user profile information with policy number: 123456.
Thought: Now, I need to verify the policy number against the claims database.
Action: Check the claims database for policy number 123456.
Observation: Policy number 123456 is valid and active.
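Translated into orchestration code, the cycle might look roughly like the sketch below (the tool functions, their names, and the hard-coded action sequence are hypothetical; a real ReAct agent parses each LLM completion to decide which action to run next):

# Hypothetical tools the agent is allowed to call
def query_user_profile(user_id):
    return "policy number: 123456"

def check_claims_database(policy_number):
    return "policy " + policy_number + " is valid and active"

TOOLS = {"query_user_profile": query_user_profile, "check_claims_database": check_claims_database}

prompt = "Question: Is the user's policy active?\n"
# Each cycle: the LLM emits a Thought and an Action; the orchestrator runs the matching
# tool and appends the Observation so the model can reason over it in the next turn.
for action_name, argument in [("query_user_profile", "user-42"), ("check_claims_database", "123456")]:
    observation = TOOLS[action_name](argument)
    prompt += "Action: " + action_name + "(" + argument + ")\nObservation: " + observation + "\n"
print(prompt)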
Reasoning Frameworks
Techniques like COTP and PAL provide effective methods for breaking down and solving complex tasks, but they have limitations as the reasoning tasks become more complicated. The ReAct framework demonstrates a way to use LLMs to power an application through reasoning and action planning. For more specific use cases, this strategy can be enhanced by integrating with frameworks like LangChain and LlamaIndex.
LangChain
The LangChain framework enhances the reasoning capabilities of LLMs by providing modular components such as prompt templates, pre-built tools, and memory integration, which can be connected into chains. When needed, agents can be added for additional control to work with external data sources, APIs, and tools. Prompt templates are available for many different use cases and can be used to format both input examples and model completions. LangChain includes pre-built tools to carry out a wide variety of tasks, including calls to external datasets and various APIs. When the individual LangChain components are connected, they resemble a chain, hence the name. LangChain offers a set of predefined chains optimized for different use cases, which can be used off-the-shelf to quickly get an app up and running. Sometimes an application workflow could take multiple paths depending on the context. In such cases, a predetermined chain may not be available, and an agent is used to interpret the input and determine which tool or tools to use to complete the task. LangChain includes agents for both PAL and ReAct, among others.
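As a minimal sketch of the chain pattern (class names and import paths shift between LangChain releases, and the OpenAI model here stands in for whichever LLM backend is configured with its API key):

from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain
from langchain.llms import OpenAI  # assumes an OpenAI API key is configured

# Template with a placeholder the chain fills in at run time
prompt = PromptTemplate.from_template(
    "Summarize the following policy section in two sentences:\n\n{policy_text}"
)

llm = OpenAI(temperature=0)          # deterministic completions for summarization
chain = LLMChain(llm=llm, prompt=prompt)

summary = chain.run(policy_text="Employees accrue 1.5 vacation days per month...")
print(summary)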
LlamaIndex
While LangChain focuses on enhancing the reasoning capabilities of LLMs by providing modular components for task execution and action planning, LlamaIndex specializes in optimizing data indexing and retrieval. Integrating LlamaIndex with the ReAct framework can enhance the capabilities of LLM applications by improving data retrieval, reasoning, and action processes. LlamaIndex bridges the gap between LLMs and external data sources, APIs, and tools, facilitating comprehensive reasoning tasks and offering modular components such as indexing templates, retrieval algorithms, and data integration tools. These components are combined to optimize the performance of LLMs in data-intensive applications, enabling efficient processing of large datasets, complex documents, and structured data.
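A comparable minimal sketch with LlamaIndex follows (the ./hr_policies folder is a placeholder, an embedding and LLM backend are assumed to be configured, and newer releases import from llama_index.core instead):

from llama_index import VectorStoreIndex, SimpleDirectoryReader

# Load and chunk every supported file in the folder (PDF, text, HTML, ...)
documents = SimpleDirectoryReader("./hr_policies").load_data()

# Build the vector index, then expose it as a query engine for RAG lookups
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()

response = query_engine.query("How many weeks of parental leave are offered?")
print(response)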
The LlamaIndex framework supports a range of applications, including question-answering systems, text generation, chatbots, virtual assistants, and data analysis. As an open-source project, LlamaIndex benefits from continuous community-driven contributions. When LlamaIndex is combined with LangChain, even more powerful LLM reasoning applications can be created.
Final Thoughts
LLMs have made significant strides in understanding and generating human-like text. This series of articles explored neural networks, Large Language Models, Retrieval-Augmented Generation, and advanced reasoning techniques for LLMs. This article highlighted RAG, which represents a significant approach to improving LLM responses. Techniques like COTP, the PAL model, and the ReAct framework provide structured approaches to tackling complex tasks, while frameworks like LangChain and LlamaIndex offer tools for integrating and retrieving data.
The journey from neural networks to LLMs to RAG and LLM reasoning underscores the benefits and importance of continuous innovation and collaboration in the field of artificial intelligence. By leveraging these advanced techniques and frameworks, public agencies, private companies, and individuals have the ability to build robust AI systems that respond to the complexities of the real world like never before. We can only imagine what the future will bring.
References
Berryman, J., & Ziegler, A. (2025). Prompt Engineering for LLMs: The Art and Science of Building Large Language Model-based Applications (Early Release, Raw & Unedited). O’Reilly Media.
Fregly, C., Barth, A., & Eigenbrode, S. (2023). Generative AI on AWS. O'Reilly Media, Inc.
Gao, L., Madaan, A., Zhou, S., Alon, U., Liu, P., Yang, Y., Callan, J., & Neubig, G. (2022). PAL: Program-aided Language Models. Carnegie Mellon University. arXiv. https://arxiv.org/abs/2211.10435
Lála, J., O'Donoghue, O., Shtedritski, A., Cox, S., Rodriques, S. G., & White, A. D. (2023). PaperQA: Retrieval-Augmented Generative Agent for Scientific Research. arXiv. https://ar5iv.org/abs/2312.07559
Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., Küttler, H., Lewis, M., Yih, W.-T., Rocktäschel, T., Riedel, S., & Kiela, D. (2020). Retrieval-augmented generation for knowledge-intensive NLP tasks. Advances in Neural Information Processing Systems, 33, 9459–9474. https://proceedings.neurips.cc/paper/2020/file/6b493230205f780e1bc26945df7481e5-Review.html
Silver, J. (Producer), & Wachowski, L., & Wachowski, L. (Directors). (1999). The Matrix [Motion picture]. United States: Warner Bros.
Silver, J. (Producer), & Wachowski, L., & Wachowski, L. (Directors). (2003). The Matrix Reloaded [Motion picture]. United States: Warner Bros.
Silver, J. (Producer), & Wachowski, L., & Wachowski, L. (Directors). (2003). The Matrix Revolutions [Motion picture]. United States: Warner Bros.
Silver, J. (Producer), & Wachowski, L. (Director). (2021). The Matrix Resurrections [Motion picture]. United States: Warner Bros.
Yao, S., Cao, Y., Narasimhan, K., & others. (2022). ReAct: Synergizing Reasoning and Acting in Language Models. Princeton University & Google Research. arXiv. https://arxiv.org/abs/2210.03629