The misconception of self-learning capabilities of Large Language Models during Production
Frank Denneman
Distinguished Engineer | Chief Technologist for AI | AI and Advanced Services | VMware Cloud Foundation Division | Broadcom
Last week at Explore, I enjoyed engaging with many customers about bringing Gen-AI to the on-prem data center. A tremendous number of customers want to keep their data and IP between the four walls of their organization, and rightly so.
With VMware Private AI Foundation, we aim to utilize foundation models such as Llama 2, StarCoder, and Mistral 7B, building upon the great work of many smart data scientists. Instead of building and training a large language model (LLM) from the ground up, which can be time-consuming and computationally expensive, organizations can leverage foundation models pre-trained on a massive dataset of text and code. If necessary, organizations can further fine-tune a foundation model on specific tasks and data in a short period of time.
At VMware, we believe in using vector DBs with Retrieval Augmented Generation (RAG) to decouple the IP and data from the foundation model. Using RAG, you offload the knowledge updates to another system, so the Gen-AI application is always up to date. The vector DB memorizes the facts, while the foundation model provides the reasoning. If necessary, the foundation model can be replaced by a newer version. Typically, this doesn't happen, but if a data science team thinks it will improve the Gen-AI application's reasoning or generative capability, they can do that without losing any IP.
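To make that decoupling concrete, here is a minimal RAG sketch in Python. The embedding model and vector DB client (sentence-transformers and Chroma) are assumptions for illustration, not the specific VMware Private AI Foundation stack; the point is that the knowledge lives in the vector DB, and the foundation model is only called to reason over the retrieved context.

```python
# Minimal RAG sketch (illustrative; library and model choices are assumptions).
from sentence_transformers import SentenceTransformer  # assumed embedding model library
import chromadb                                         # assumed vector DB client

embedder = SentenceTransformer("all-MiniLM-L6-v2")
client = chromadb.Client()
collection = client.get_or_create_collection("company_docs")  # holds the IP and data

def answer(question: str, llm_generate) -> str:
    """Retrieve relevant chunks from the vector DB, then let a frozen foundation model reason over them."""
    query_vector = embedder.encode(question).tolist()
    hits = collection.query(query_embeddings=[query_vector], n_results=3)
    context = "\n".join(hits["documents"][0])
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
    return llm_generate(prompt)  # the foundation model itself is never modified
```

Swapping in a newer foundation model only means changing what llm_generate calls; the collection, and therefore the IP, stays untouched.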
This particular point, that no IP is lost when replacing the model, got me some pushback from the people I spoke to. Digging into the topic a bit more, I discovered a widespread misconception about the learning ability of a neural network (LLM).
When you use an LLM, i.e., ask it a question (a prompt), the model does NOT learn from your question. LLMs have no self-learning mechanism during the deployment phase (inference). Let's dive a bit deeper into the difference between inference and training.
Inference: When you ask a model a question, you ask it to make a prediction; the model feeds the input data through the network and its weights (parameters) to compute the output. This sequence is also called a forward pass.
Data scientists freeze the parameters when the neural network is accurate enough, and inference uses those same frozen parameters for every question.
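As a small illustration, this is what a forward pass against frozen weights looks like with a Hugging Face Transformers setup (the Llama 2 model name is just an example, not a prescription):

```python
# Inference sketch: frozen weights, forward pass only (model name is illustrative).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()                          # evaluation mode: no training-time behavior

prompt = "Who is the CEO of VMware?"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():                 # no gradients, so no parameter can change
    output_ids = model.generate(**inputs, max_new_tokens=50)

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
# The weights are identical before and after this call; nothing was "learned".
```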
Training: When training a neural network, the first step is called the forward pass, which is the same as inference. The forward pass calculates the model's prediction for the input data. After the forward pass, the loss is calculated by comparing the prediction to the expected result. The loss is used to calculate a gradient, which guides the framework to increase or decrease the value of each parameter. The backpropagation pass adjusts the parameters layer by layer to minimize the loss.
The training process repeats the forward pass and backpropagation until the model converges, meaning the loss no longer decreases. Once the model converges, a checkpoint is created, and the parameters are frozen. The Gen-AI application uses this version of the model to generate answers.
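For contrast, here is a minimal, generic PyTorch training loop. The tiny stand-in model and hyperparameters are placeholders rather than an LLM training recipe, but the steps are the same in principle: forward pass, loss, backpropagation, parameter update, and finally a frozen checkpoint.

```python
# Generic training loop sketch (placeholder model and data, not an LLM recipe).
import torch

model = torch.nn.Linear(128, 10)                       # stand-in for a much larger network
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
loss_fn = torch.nn.CrossEntropyLoss()

def train(data_loader, epochs=3):
    model.train()
    for _ in range(epochs):
        for inputs, targets in data_loader:
            predictions = model(inputs)                # forward pass (same as inference)
            loss = loss_fn(predictions, targets)       # compare prediction to expected result
            optimizer.zero_grad()
            loss.backward()                            # backpropagation computes gradients
            optimizer.step()                           # nudge parameters to minimize the loss
    torch.save(model.state_dict(), "checkpoint.pt")    # freeze this version as a checkpoint
```

None of this machinery (loss, gradients, optimizer) exists on the inference path, which is why a deployed model cannot update itself.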
I believe the misconception of self-learning capabilities stems from thinking about recommender systems (Netflix proposing which series to watch next, or Amazon telling you other customers bought item X along with item Y). But a recommender system uses a converged model (frozen weights) together with an online feature store. An online feature store provides real-time access to the features needed to generate accurate and personalized recommendations. Amazon, for example, uses an online feature store to store features about its products and users.
Suppose an Amazon customer has purchased many books in the past. In that case, the recommender system might recommend other books to the user based on collaborative or content-based filtering. With collaborative filtering, the algorithm identifies users with similar tastes to the target user and recommends items that those users have liked. With content-based filtering, the algorithm recommends items similar to items the target user has liked in the past. It seems like the model is learning while you use it. In reality, the model simply calculates predictions using its frozen parameters and data (features) from the online feature store (a database).
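A sketch of that serving path, with hypothetical feature_store and ranking_model objects, shows where the freshness actually comes from:

```python
# Recommender serving sketch: frozen model plus an online feature store.
# `feature_store` and `ranking_model` are hypothetical stand-ins for illustration.
def recommend(user_id, candidate_items, feature_store, ranking_model):
    user_features = feature_store.get_user_features(user_id)         # fresh data from a database
    scored = []
    for item_id in candidate_items:
        item_features = feature_store.get_item_features(item_id)
        score = ranking_model.predict(user_features, item_features)  # frozen weights, no learning
        scored.append((score, item_id))
    # The personalization comes from the features, not from the model updating itself.
    return [item for _, item in sorted(scored, reverse=True)]
```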
The context window is another example of why a Gen-AI application like ChatGPT seems to be learning while you use it. Models can appear to "learn" during inference within the same context, but only because the same information is still part of the context window. The best analogy for an LLM context window is the working memory of a human. For example, I ask Llama 2, "Who is the CEO of VMware?" and it answers "Pat Gelsinger." I respond with, "No, it's Raghu," and it replies, "Yes, that is correct, the CEO of VMware is Raghu Raghuram." If I then ask, "Who is the CEO of VMware?" it will respond with "Raghu Raghuram," but only because the answer is still present in the same context. Google's Bard supports up to 2048 tokens, and ChatGPT supports up to 4096 tokens. By default, Llama 2 has a context window of 4096 tokens, but Huge Llama 2 supports up to 32K tokens. A token is a unit of text that the model uses to process and generate text. Tokens can be words, subwords, or characters.
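A minimal chat-loop sketch makes this visible: the client resends the entire conversation on every turn, so the "memory" lives in the prompt, not in the model (the generate callable below stands in for any frozen LLM endpoint):

```python
# Chat sketch: the apparent "memory" is just the conversation resent as context.
conversation = []

def chat(user_message, generate):
    conversation.append({"role": "user", "content": user_message})
    # Pack the whole history back into the prompt; drop it and the "memory" is gone.
    prompt = "\n".join(f'{m["role"]}: {m["content"]}' for m in conversation)
    reply = generate(prompt)   # stateless call against frozen weights
    conversation.append({"role": "assistant", "content": reply})
    return reply
```

Once the history exceeds the context window, the oldest turns have to be dropped or summarized, which leads directly to the cost trade-off below.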
Everything comes at a price. The larger the context window, the more memory and compute inference consumes. Outside of the same conversational context, models are completely stateless and cannot permanently "learn" anything in inference mode. What most LLM-as-a-Service (LLMaaS) providers do is capture user prompts, which might contain sensitive data, and use them to form training data sets. On top of that, they use a technique called Reinforcement Learning from Human Feedback (RLHF), which allows them to update the model more frequently. Chip Huyen published an excellent article describing RLHF in detail while remaining understandable for non-data-scientists.
Why does this distinction about self-learning matter for non-data scientists? By understanding that default foundation models do not alter their state during use, VI admins, architects, and application developers can design an infrastructure and application that offers high availability along with a proper life cycle management strategy for the Gen-AI application.
For example, the checkpoint of a model can be loaded and exposed by an inference framework running in separate VMs on separate accelerated ESXi hosts, with an anti-affinity rule ensuring that the model API endpoints do not share the same physical infrastructure, reducing the blast radius. A load balancer in front of the inference frameworks offers the flexibility to take one inference framework offline during maintenance without any change in behavior. Because the model is frozen, no instance will have learned more than another during its service time.
We are currently building VMware Private AI Foundation to allow VMware customers to deploy LLMs on-prem and keep both their data and access to the Gen-AI application secure. Private data should stay private, and by combining that with state-of-the-art foundation models, organizations can safely build Gen-AI applications that support their business goals.
A special thanks goes out to @Steve Liang for making this article more informative.