Retrieval-Augmented Generation Basics for the Data Center Admin
Frank Denneman
Chief Technologist for AI | AI and Advanced Services | VMware Cloud Foundation Division | Broadcom
With the help of ChatGPT, Large Language Models (LLMs) have captured the imagination of people around the world. LLMs baked into products and services can help speed up most human interactions with the underlying systems.
Current LLM-enabled apps mostly use 'open-source' LLMs such as Llama 2, Mistral, Vicuna, and sometimes even Falcon 180B. These models are trained on publicly available data, allowing them to react appropriately to most prompts (user questions or instructions). Yet, you or your organization may want the LLM to provide a service on more domain-specific or private data. In that case, a data scientist needs to finetune the model and feed it a reasonable number of examples. Finetuning builds on top of the model's existing functionality; specific finetuning methods exist, such as LoRA, which freezes the current model weights and adds additional layers of weights (often called adapters) that focus on your domain-specific needs. Training these additional weights takes less time and requires less data than training a model from the ground up. Hugging Face recently published an article comparing LoRA finetuning capabilities across different models; it states that LoRA introduces only 0.12% of the Llama 2 7B model's parameters, which results in a process that trains just 8.4 million parameters. Finetuning can easily be done with a pair of data center GPUs; there is no need for a supercomputer like Meta's AI Research SuperCluster. That's why we at VMware by Broadcom believe that combining open-source LLMs and finetuning is the path to building strategic business applications.
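To give a feel for how little code a LoRA setup involves, here is a minimal sketch using the Hugging Face transformers and peft libraries. The model name, rank, and target modules are illustrative assumptions, not a configuration prescribed by this article.

```python
# Minimal LoRA setup (sketch). The model name and hyperparameters below are
# illustrative assumptions, not recommendations from this article.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

lora_config = LoraConfig(
    r=8,                                  # rank of the adapter matrices
    lora_alpha=16,                        # scaling factor for the adapters
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

# Freeze the base weights and attach the small trainable adapter layers.
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # reports trainable vs. total parameters
```

Only the adapter weights are trainable afterwards, which is why the parameter count drops to a fraction of the full model.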
However, every time you launch a new product or introduce a new service, a data scientist needs to collect data about this new business entity, wrangle it into a proper data set, and start the finetuning process so that the LLM can answer truthfully to the prompts that employees, or in some cases customers, generate.
The Retrieval Augmented Generation technique offers a faster, smarter, and more accurate alternative. Retrieval Augmented Generation (isn't the term just a dance inside your mouth?), or RAG for short, adds database capabilities to an LLM. Instead of fitting new data into the LLM every time you launch a service or product, you give the LLM direct access to the relevant data while it generates an answer to the user's prompt.
Of course, it's more complex than simply adding a database connection to an LLM-enabled app; more needs to be done. Still, with all the ongoing efforts in the data science community, it is becoming easier to integrate RAG functionality into your LLM-enabled app. And, of course, we are focused on this use case while developing VMware Private AI Foundation to provide a scalable and resilient service. Let's dive deeper into the overall process RAG introduces and some of its components.
Let's start with a simple (non-RAG) LLM process. The user generates a prompt inside the LLM-enabled app (1). The app connects to the LLM and feeds the prompt as input (2). The LLM predicts the words for the output as accurately as possible (3) and feeds the 'prompt completion' back to the app to display to the user (4).
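In code, those four steps look roughly like the sketch below. It assumes a locally hosted open-source model served through the Hugging Face transformers pipeline; the model name is just an example.

```python
# Steps 1-4 of the plain (non-RAG) flow, assuming a local open-source model
# loaded with the Hugging Face transformers pipeline (model name is an example).
from transformers import pipeline

generator = pipeline("text-generation", model="meta-llama/Llama-2-7b-chat-hf")

prompt = "What is a vSphere cluster?"                 # (1) user prompt from the app
completion = generator(prompt, max_new_tokens=200)    # (2)+(3) the LLM predicts the output
print(completion[0]["generated_text"])                # (4) the app displays the completion
```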
Before diving into the RAG process, let's look at the critical component of RAG: the vector database. A vector database does not store data in rows and columns; it stores data points, such as text, as numerical values (numerical representations, to be exact). These numerical representations are called vector embeddings, and they are grouped (clustered) based on similarity. Why numerical representations?
In short, neural network models such as an LLM can only process numbers. So, the Natural Language Processing (NLP) pipeline converts a word into one or more tokens. A vector is a numerical representation of a token that allows the system to structure and analyze the word's meaning and how it relates to other words. If you want to learn more about tokens and vectors, Sasha Metzger published 'A Beginner's Guide to Tokens, Vectors, and Embeddings in NLP.' It is highly recommended!
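As a small illustration of that pipeline, the sketch below tokenizes a phrase and turns it into a vector embedding. The tokenizer and embedding model named here are common open-source choices picked purely for illustration, not components mandated by this article.

```python
# Sketch: from words to tokens to a vector embedding. The model names are
# common open-source choices used only for illustration.
from transformers import AutoTokenizer
from sentence_transformers import SentenceTransformer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
tokens = tokenizer.tokenize("Retrieval augmented generation")
token_ids = tokenizer.convert_tokens_to_ids(tokens)
print(tokens, token_ids)   # words split into tokens, then mapped to numbers

embedder = SentenceTransformer("all-MiniLM-L6-v2")
vector = embedder.encode("Retrieval augmented generation")
print(vector.shape)        # a fixed-length numerical representation, e.g. (384,)
```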
With RAG, you allow the database to become the LLM's long-term memory. So, how do we use this vector database? First, we must feed it the information we want the LLM-enabled app to query. To do this, the data needs to be vectorized: we must convert it into tokens and encode those tokens into vector embeddings. Popular embedding tools today are Word2Vec, fastText, and GloVe. A more comprehensive data framework is LlamaIndex, which provides data ingestion, orchestration, and retrieval tools.
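As a hedged example of that ingestion step, the sketch below stores a couple of documents in Chroma, an open-source vector database. This article does not prescribe a specific database or embedding function, so treat the library, collection name, and document texts as illustrative assumptions.

```python
# Sketch of the ingestion step: store documents in a vector database so they
# can be retrieved by similarity later. Chroma is used only as an example.
import chromadb

client = chromadb.Client()                         # in-memory instance for the example
collection = client.create_collection("product_docs")

# Chroma converts these documents into vector embeddings with its default
# embedding function; the texts are placeholders for your own data.
collection.add(
    ids=["doc-1", "doc-2"],
    documents=[
        "Product X supports feature Y as of release 2.0.",
        "Service A is available in regions B and C.",
    ],
)
```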
One of the benefits of RAG is that LLM-enabled apps do not have to go offline when you extend or expand your core business. You can 'asynchronously' vectorize data to feed the LLM-enabled app with the latest information. You do not need to retrain or finetune your model every time you release a new service or product; the data engineer introduces the new data to the database independently of the 'versioning' of the LLM.
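Continuing the illustrative Chroma sketch above, onboarding a new product is just another ingestion call against the existing collection; the running model and app are untouched.

```python
# Sketch: new product documentation is added to the existing collection without
# retraining or redeploying the LLM (names and text are illustrative).
collection.add(
    ids=["doc-3"],
    documents=["Product Z, released today, adds support for feature Q."],
)
```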
When looking at the process from a user perspective, the user generates a prompt inside the LLM-enabled app (1), and the app redirects the prompt to the vector database instead of going directly to the LLM (2). The vector database searches on similarity and retrieves the appropriate data (words). The framework sends that data to the app (3), which augments the user prompt with the data retrieved from the vector database (4). The app then instructs the LLM to generate a response based on the user's question and provides the retrieved data alongside it (5). Finally, the LLM-enabled app presents the answer to the user (6).
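Tying the earlier sketches together, the snippet below shows roughly what steps 2 through 6 could look like: query the vector database, augment the prompt with the retrieved text, and let the LLM generate the answer. It reuses the illustrative `collection` and `generator` objects from the previous examples and is a sketch, not a production pipeline.

```python
# Sketch of the RAG flow, reusing the example collection and generator above.
question = "Which feature did product Z add?"                     # (1) user prompt

results = collection.query(query_texts=[question], n_results=2)   # (2) similarity search
context = "\n".join(results["documents"][0])                      # (3) retrieved text to the app

augmented_prompt = (                                              # (4) augment the prompt
    "Answer the question using only the context below.\n"
    f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
)

completion = generator(augmented_prompt, max_new_tokens=200)      # (5) LLM generates the response
print(completion[0]["generated_text"])                            # (6) app shows the answer
```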
RAG behaves like a theater prompter (souffleur) and provides cues to the LLM. Where the prompter provides cues to the actors performing the play, RAG keeps the LLM honest by augmenting the prompt with up-to-date and accurate data. The model can lean less on its internalized data, and its primary goal becomes formulating an excellent natural language response from the cued data. In essence, the vector database becomes the system of record, while the LLM and the app become the system of intelligence.
Stay tuned for more info about how to deploy vector databases and RAG-enabled apps onto the VCF platform.