Retrieval-Augmented Generation Basics for the Data Center Admin


With the help of ChatGPT, Large Language Models (LLMs) have captured the imagination of people all over the world. LLMs baked into products and services can help speed up most human interactions with the underlying systems.

Current LLM-enabled apps mostly use 'open-source' LLMs such as Llama 2, Mistral, Vicuna, and sometimes even Falcon 180B. These models are trained on publicly available data, allowing them to respond appropriately to most prompts (user questions or instructions). Yet suppose you or your organization want the LLM to provide a service on more domain-specific or private data. In that case, a data scientist needs to finetune the model and feed it a reasonable number of examples. Finetuning builds on top of the model's existing functionality; specific finetuning methods exist, such as LoRA, which freezes the current model weights and adds additional layers of weights (often called adapters) that focus on your domain-specific needs.

Training these additional weights takes less time and requires less data than training a model from the ground up. Hugging Face recently published an article comparing LoRA finetuning capabilities on different models; they state that LoRA introduces 0.12% of the Llama 2 7B model's parameters, resulting in a process that trains only 8.4 million parameters. Finetuning can easily be done with a pair of data center GPUs; there is no need for a supercomputer like Meta's AI Research SuperCluster. That's why we at VMware by Broadcom believe that combining open-source LLMs and finetuning is the path to building strategic business applications.

However, every time you launch a new product or introduce a new service, the data scientist needs to collect data about this new business entity, wrangle it into a proper data set, and start the finetuning process so that the LLM can answer truthfully to all the prompts that employees, or in some cases customers, generate.

The Retrieval-Augmented Generation technique is a faster, smarter, and more accurate alternative. Retrieval-Augmented Generation (isn't the term just a dance inside your mouth?), or RAG for short, adds database capabilities to an LLM. So, instead of baking data into the LLM every time you launch a new service or product, you give the LLM direct access to the relevant data while it generates an answer to the user's prompt.

Of course, it's more complex than just adding a database connection to an LLM-enabled app; more needs to be done. Still, with all the ongoing efforts within the data science community, it is becoming easier to integrate RAG functionality into your LLM-enabled app. And, of course, we are focused on this use case while we develop VMware Private AI Foundation to provide a scalable and resilient service. Let's dive deeper into the overall process RAG introduces and some of its components.

Let's start with a simple (non-RAG) LLM process. The user generates a prompt inside the LLM-enabled app (1). The app connects to the LLM and feeds the prompt as input (2). The LLM predicts the words for the output as accurately as possible (3) and feeds the 'prompt completion' back to the app to display to the user (4).
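To make this flow concrete, here is a minimal sketch in Python, assuming an OpenAI-compatible endpoint serving an open-source model locally; the server URL, API key, model name, and prompt are all placeholders:

```python
from openai import OpenAI

# (1) The prompt the user typed into the LLM-enabled app.
prompt = "Which vSphere version introduced DPU support?"

# (2) The app connects to the LLM and feeds the prompt as input;
# the base_url and model name below are placeholders.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

# (3) The LLM predicts the output words based solely on its trained weights.
response = client.chat.completions.create(
    model="llama-2-7b-chat",  # placeholder model name
    messages=[{"role": "user", "content": prompt}],
)

# (4) The 'prompt completion' is fed back to the app and shown to the user.
print(response.choices[0].message.content)
```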


Before diving into the RAG process, let's look at the critical component of RAG: the vector database. A vector database does not store data in rows and columns; instead, it stores data points and text as numerical values (numerical representations, to be exact). These numerical representations are called vector embeddings, and they are grouped (clustered) based on similarity. Why numerical representations?

In short, neural network models such as an LLM can only process numbers. So, the Natural Language Processing (NLP) pipeline converts a word into one or more tokens. A vector is a numerical representation of a token that allows the system to structure and analyze the word's meaning and how it relates to other words. If you want to learn more about tokens and vectors, Sasha Metzger published 'A Beginner's Guide to Tokens, Vectors, and Embeddings in NLP.' It is highly recommended!
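To see what these embeddings look like in practice, here is a small sketch using the open-source sentence-transformers library; the model name is one common choice rather than the only option, and the example phrases are purely illustrative:

```python
from sentence_transformers import SentenceTransformer, util

# Load an embedding model (one common choice; not the only option).
model = SentenceTransformer("all-MiniLM-L6-v2")

# Each piece of text becomes a fixed-length vector (384 floats for this model).
embeddings = model.encode(["vSphere cluster", "ESXi host", "chocolate cake"])
print(embeddings.shape)  # (3, 384)

# Similar meanings cluster together; cosine similarity makes that visible.
print(util.cos_sim(embeddings[0], embeddings[1]))  # relatively high
print(util.cos_sim(embeddings[0], embeddings[2]))  # relatively low
```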

With RAG, you allow the database to become the LLM's long-term memory. So, how do we use this vector database? First, we must feed it the information we want the LLM app to query. To do this, the data needs to be vectorized: we must convert the data into tokens and encode the tokens into vector embeddings. The most popular tools today are Word2Vec, fastText, and GloVe. A more comprehensive data framework is LlamaIndex, which provides data ingestion, orchestration, and retrieval tools.
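As an illustration of the ingestion step, here is a minimal sketch using Chroma, one of several open-source vector databases; the storage path, collection name, and documents are placeholders, and Chroma applies a default embedding model to vectorize each document on insert:

```python
import chromadb

# Open (or create) a persistent vector database and a collection in it.
chroma_client = chromadb.PersistentClient(path="./chroma")
collection = chroma_client.get_or_create_collection(name="product_docs")

# Tokenize, embed, and store the domain-specific data the LLM should draw on.
collection.add(
    ids=["doc-1", "doc-2"],
    documents=[
        "vSAN ESA requires NVMe-based TLC flash devices.",
        "VMware Private AI Foundation runs LLM workloads on VCF.",
    ],
)
```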

One of the benefits of RAG is that LLM-enabled apps do not have to go offline when you extend or expand your core business. You can 'asynchronously' vectorize data to feed the LLM-enabled app with the latest information. You do not need to retrain or finetune your model every time you release a new service or product; the data engineer introduces the new data to the database regardless of the 'versioning' of the LLM.
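Sketching what such an asynchronous update could look like, continuing the hypothetical Chroma collection from above; the new document and its id are placeholders:

```python
import chromadb

# A separate ingestion job reconnects to the existing collection.
collection = chromadb.PersistentClient(path="./chroma").get_or_create_collection("product_docs")

# Upsert the documentation for a newly launched product; the LLM itself is untouched.
collection.upsert(
    ids=["doc-3"],
    documents=["Product X, launched today, extends the portfolio with feature Y."],
)
```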

When looking at the process from a user perspective, the user generates a prompt inside the LLM-enabled app (1). The app redirects the prompt to the vector database instead of going directly to the LLM, and the vector database searches on similarity (2) and retrieves the appropriate data (words). The framework sends the data to the app (3), which augments the user prompt with the data retrieved from the vector database (4). The app instructs the LLM to generate a response based on the user's question and provides the retrieved data as context (5). The LLM-enabled app presents the answer to the user (6).
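Putting the pieces together, here is a sketch of that request path, reusing the hypothetical Chroma collection and OpenAI-compatible endpoint from the earlier snippets; the question and prompt template are illustrative:

```python
import chromadb
from openai import OpenAI

# Reconnect to the vector database and the LLM endpoint (placeholders).
collection = chromadb.PersistentClient(path="./chroma").get_or_create_collection("product_docs")
llm = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

# (1) The user's prompt.
question = "Which flash devices does vSAN ESA require?"

# (2) The vector database runs a similarity search on the embedded prompt.
results = collection.query(query_texts=[question], n_results=2)

# (3) The retrieved data (documents) comes back to the app.
context = "\n".join(results["documents"][0])

# (4) Augment the user prompt with the retrieved data.
augmented_prompt = (
    "Answer the question using only the context below.\n"
    f"Context:\n{context}\n\nQuestion: {question}"
)

# (5) Instruct the LLM to generate a response grounded in the provided data.
response = llm.chat.completions.create(
    model="llama-2-7b-chat",  # placeholder model name
    messages=[{"role": "user", "content": augmented_prompt}],
)

# (6) Present the answer to the user.
print(response.choices[0].message.content)
```

Note that the retrieved data travels inside the prompt; the LLM weights never change, which is exactly why launching a new product does not require retraining or finetuning.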


RAG behaves like a theater prompter (souffleur), providing cues to the LLM. Where the prompter provides cues to the actors performing the play, RAG keeps the LLM honest by augmenting the prompt with up-to-date and accurate data. The LLM can lean less on its internalized data, and its primary goal becomes formulating an excellent natural language response from the cued data. In essence, the vector database becomes the system of record, while the LLM and the app become the system of intelligence.

Stay tuned for more info about how to deploy vector databases and RAG-enabled apps onto the VCF platform.

What if the retrieval went to a digital twin instead of a database? (We've developed such an approach and can assist if you need it.) Grounding is critical in AI because it enables AI systems to understand and interact with the real world. Without grounding, an AI system might have a hard time understanding context, references, or the nuances of various systems. Here are a few reasons why grounding is so important:

- Reasoning: Grounding helps AI systems make sense of the physical world and develop a more "common sense" understanding of how things work, enabling them to better navigate and interact with their surroundings.
- Trust, safety, and reliability: Grounding helps AI systems avoid misunderstandings and errors, making them safer and more reliable for use in real-world applications like autonomous vehicles or medical diagnosis. More importantly, it significantly reduces hallucinations, as the environment prevents the AI from providing solutions outside its boundaries.
- Better human/machine collaboration: The user can use natural language as a main or complementary interface for complex collaboration. It allows for an interaction that is more accessible, more nuanced, and more generalized (no need for multiple specific UIs).

Patryk Wolsza

Cloud Systems Architect at Intel, vExpert | VCAP-CIA | MCSA | EMCCA

8 months ago

Frank Denneman, how big can the vector DB be? Are we talking about GB or TB?

