Responsible LLMOps: Integrating Responsible AI practices into LLMOps
Debmalya Biswas
AI/Analytics @ Wipro | x- Nokia, SAP, Oracle | 50+ patents | PhD - INRIA
Abstract. While we see growing adoption of both LLMOps and Responsible AI practices in Gen AI implementations, the discussions often occur in different communities, with a big disconnect between the checklists and their downstream implementation / tooling. In this article, we aim to bring the two frameworks together with a unified approach to deploying use-cases in a scalable and responsible fashion. We highlight LLM-specific issues, deep dive into the relevant LLMOps architectural patterns, and propose solution approaches to establish an integrated LLMOps platform governed by Responsible AI practices.
In collaboration with Dipta Chakraborty and Bhargav Mitra
Introduction
We see growing excitement about Large Language Models (LLMs) and their potential to disrupt enterprise use-cases. However, this exploration is mostly limited to Proof-of-Concepts (POCs) today.
To truly scale Gen AI use-cases in the enterprise, there is a need to establish a scalable LLM platform with LLMOps capabilities and the right level of LLM Governance.
In parallel, we also see heightened emphasis on deploying AI products responsibly, as there have been many cases of LLMs producing toxic, biased, sexual, and violent content. LLMs have also been found to hallucinate, confidently providing factually incorrect and misguided responses.
To mitigate these issues, it is of utmost importance that we train and deploy LLMs responsibly: Responsible AI practices need to be integrated (by design) into LLMOps pipelines.
In this article, we introduce LLMOps and deep dive into the relevant LLMOps architectural patterns. We consider the different Responsible AI dimensions in terms of Data Quality, Data Privacy, Reproducibility, and Explainability to enable a well-governed LLM platform. We highlight LLM-specific issues and propose solution approaches to establish an integrated LLMOps platform governed by Responsible AI practices, leading to the establishment of a Gen AI CoE delivering strategic enterprise use-cases at scale.
MLOps vs LLMOps
In an enterprise context, scaling AI/ML use-cases requires enabling MLOps with the right tooling and frameworks. MLOps is the framework by which one can manage the end-to-end ML lifecycle at scale, enabling technology for the development, deployment, monitoring, and ongoing management of ML models.
However, LLMOps (MLOps for LLMs) is very different from MLOps, and it poses many challenges that are difficult to address with classic MLOps tooling, such as prompt management, hallucination monitoring, the compute cost of fine-tuning, and evaluating non-deterministic outputs; we return to these topics throughout the article.
LLMOps Architectural Patterns
Gen AI solutions are varied in scope, and we expect them to enter the enterprise landscape via different applications / platforms. For instance, this can be via direct usage of an LLM-based application, e.g., ChatGPT; LLMs embedded within a SaaS product or enterprise platform, e.g., Salesforce, ServiceNow; or a foundational model fine-tuned with enterprise data for strategic use-cases. So, as a first step, we identify and outline the five most prevalent Gen AI architectural patterns today.
From a Responsible AI perspective, we then identify the key challenges in integrating Responsible AI dimensions (e.g., Reliability, Explainability, Privacy & Security) into the outlined Gen AI architectural patterns.
For instance, explainability becomes more challenging given the large corpus of (generic) training data involved in the case of LLMs. Novel privacy challenges arise given that user inputs can potentially be used as training data to fine-tune the LLMs.
Hallucinations remain the key issue to be addressed given the generative nature of LLMs. To address the above challenges, we propose a consolidated Responsible AI framework for LLM platforms in this article, taking into consideration the best practices and design patterns to integrate necessary governance and guardrails at various stages of the LLMOps pipeline.
Black-box LLM APIs
This is the classic ChatGPT scenario, where we have black-box access to an LLM API/UI. Similar LLM APIs can be considered for other core Natural Language Processing (NLP) tasks, e.g., Knowledge Retrieval, Summarization, Auto-Correct, Translation, Natural Language Generation (NLG).
Prompts are the primary interaction mechanism here and can encompass user queries and tasks.
Prompting refers to adapting the user input, providing the right context and guidance to the LLM API, to maximize the chances of getting the 'right' response. This has led to the rise of Prompt Engineering as a professional discipline, where prompt engineers systematically perform trials, recording their findings, to arrive at the 'right' prompt eliciting the 'best' response.
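As an illustration, here is a minimal sketch of such prompt adaptation: a template wrapping the raw user query with role, context, and output guidance before it reaches the LLM API. The `openai` (v1) client usage is real, but the model name, template wording, and `ask` helper are illustrative assumptions, not a prescribed standard.

```python
# Minimal prompt-engineering sketch (illustrative; assumes the openai>=1.0 client
# and an OPENAI_API_KEY in the environment).
from openai import OpenAI

client = OpenAI()

PROMPT_TEMPLATE = """You are a helpful enterprise assistant.

Context: {context}

Task: {task}

Answer concisely. If the context does not contain the answer, say "I don't know"
instead of guessing."""

def ask(task: str, context: str = "No additional context.") -> str:
    # Adapt the raw user input into a structured prompt before calling the LLM.
    prompt = PROMPT_TEMPLATE.format(context=context, task=task)
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # hypothetical choice; any chat model works
        messages=[{"role": "user", "content": prompt}],
        temperature=0.2,      # lower temperature -> more deterministic answers
    )
    return response.choices[0].message.content

print(ask("Summarize our Q3 travel policy changes.",
          context="Q3 policy: economy class for flights under 6 hours."))
```

A prompt engineer would iterate over the template wording and the temperature, recording which variants elicit the 'best' responses.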
Embedded LLM Apps
In this section, we talk about LLMs embedded within enterprise platforms, such as Salesforce, SAP, and ServiceNow; or available as ready-to-use enterprise apps on the App Stores of LLM providers, such as OpenAI.
Enterprise LLM Apps have the potential to accelerate LLM adoption by providing an enterprise-ready solution. However, the same caution needs to be exercised as you would before using any 3rd-party ML model: validate the LLM / training data ownership, IP, and liability clauses.
Data ownership: Data is critical for supervised AI/ML systems, especially so for LLMs, which are often trained on public datasets whose data usage rights for AI/ML training are not well defined and can evolve in the future. For example, Reddit recently announced that it will start charging companies whose enterprise AI/ML models learn from its vast archives of human conversations.
Given this, negotiating ownership issues around not only training data, but also input data, output data, and other generated data is critical. On the other hand, it is also important to understand / assess how the Enterprise App Provider will use the data received / generated as a result of its interactions with the users.
LLM Fine-tuning / Domain-specific SLMs
LLMs are generic in nature. To realize the full potential of LLMs for Enterprises, they need to be contextualized with enterprise knowledge captured in terms of documents, wikis, business processes, etc.
This contextualization is achieved in most cases by fine-tuning an LLM with enterprise data, creating a domain-specific Small Language Model (SLM).
Fine-tuning entails taking a pre-trained LLM and retraining it on (smaller) enterprise data. Technically, this implies updating the weights of the last layer(s) of the trained neural network to reflect the enterprise data and task.
Given this, access to the base model weights is needed to perform fine-tuning, which is not possible for closed models, e.g., ChatGPT. This is where open-source pre-trained LLMs come to the rescue, e.g., Meta AI recently open-sourced their LLM, LLaMA. The Stanford Alpaca project showed that it is possible to fine-tune LLaMA for $600, to a model performance comparable with ChatGPT. So, fine-tuning an LLM does not necessarily need to be complex or expensive.
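As a sketch of how inexpensive this can be, the following shows parameter-efficient (LoRA) fine-tuning of an open base model with the Hugging Face transformers, peft, and datasets libraries. The model name, hyperparameters, and the assumption of a JSON instruction dataset with a `text` field are illustrative; note that LoRA trains a small set of adapter weights rather than literally the last layers.

```python
# Minimal LoRA fine-tuning sketch (illustrative; assumes the transformers, peft,
# and datasets libraries, a GPU, and a small Alpaca-style instruction dataset).
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          Trainer, TrainingArguments,
                          DataCollatorForLanguageModeling)
from peft import LoraConfig, get_peft_model
from datasets import load_dataset

base = "meta-llama/Llama-2-7b-hf"  # any open base model with accessible weights
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(base)

# LoRA adapts a small number of extra weights instead of the full network,
# which is what makes "$600-class" fine-tuning runs feasible.
model = get_peft_model(model, LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.05, task_type="CAUSAL_LM"))

# Hypothetical dataset file: one JSON record per instruction, with a "text" field.
data = load_dataset("json", data_files="enterprise_instructions.json")["train"]
data = data.map(lambda x: tokenizer(x["text"], truncation=True, max_length=512))

Trainer(
    model=model,
    args=TrainingArguments(output_dir="slm-out", num_train_epochs=3,
                           per_device_train_batch_size=4, learning_rate=2e-4),
    train_dataset=data,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
).train()

model.save_pretrained("slm-out")  # domain-specific SLM adapter weights
```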
Retrieval Augmented Generation (RAG)
Fine-tuning is a computationally intensive process. RAG provides a viable alternative to fine-tuning by providing additional context with the prompts, grounding the retrieval / responses in the given context.
This can be in the form of a set of documents that are first retrieved using an indexed document or vector search, and then provided as context with the prompt to constrain the responses. Most LLM platforms today allow prompts to be relatively long, so it is possible to embed this enterprise context as part of the prompt.
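A minimal RAG sketch under these assumptions: documents embedded with sentence-transformers, retrieved by cosine similarity, and injected as prompt context. The document snippets are toy data, and the hypothetical `ask` helper from the earlier prompt sketch is reused.

```python
# Minimal RAG sketch (illustrative; assumes the sentence-transformers package and
# the `ask` helper sketched earlier that calls an LLM with a context slot).
import numpy as np
from sentence_transformers import SentenceTransformer

docs = [
    "Travel policy: economy class for flights under 6 hours.",
    "Expense policy: meals are reimbursed up to 40 EUR per day.",
    "IT policy: laptops are refreshed every 36 months.",
]

encoder = SentenceTransformer("all-MiniLM-L6-v2")
doc_vecs = encoder.encode(docs, normalize_embeddings=True)

def retrieve(query: str, k: int = 2) -> list[str]:
    # Cosine similarity reduces to a dot product on normalized embeddings.
    q = encoder.encode([query], normalize_embeddings=True)[0]
    top = np.argsort(doc_vecs @ q)[::-1][:k]
    return [docs[i] for i in top]

query = "How long can a flight be before business class is allowed?"
context = "\n".join(retrieve(query))
answer = ask(query, context=context)  # response grounded in the retrieved context
```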
AI Agents: LLM Orchestration
This is the future where enterprises will be able to develop new Enterprise AI Agents by orchestrating / composing [1] multiple existing AI Apps [2].
The discussion around ChatGPT has evolved into AutoGPT. While ChatGPT is primarily a Chatbot that can generate text responses, AutoGPT is a more powerful AI Agent that can execute complex tasks, e.g., make a sale, plan a trip, make a flight booking, book a contractor to do a house job, order a pizza.
AI Agents follow a long history of research on Autonomous Agents, especially Goal-Oriented Agents. A high-level approach to solving such complex tasks involves: (a) decomposition of the given complex task into (a hierarchy or workflow of) simpler tasks, followed by (b) composition of agents able to execute the simple(r) tasks.
LangChain is a good example of a mature framework for composing LLMs today.
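Frameworks aside, the decompose-then-compose loop itself can be sketched in a few lines. Everything below is illustrative: `call_llm` is a stub standing in for a real LLM client, and the two tools are trivial placeholders.

```python
# Framework-agnostic sketch of the decompose-then-compose agent loop
# (illustrative; `call_llm` is a hypothetical stand-in for a real LLM client).
from typing import Callable

def call_llm(prompt: str) -> str:
    # Stub standing in for a real LLM call; returns a canned plan for the demo.
    return ("1. search_flights: GVA to LIS next Monday\n"
            "2. book_flight: cheapest option")

TOOLS: dict[str, Callable[[str], str]] = {
    "search_flights": lambda q: f"3 flights found for '{q}'",
    "book_flight":    lambda q: f"booking confirmed for '{q}'",
}

def run_agent(goal: str) -> list[str]:
    # (a) Decompose the complex goal into simple steps, one per line.
    plan = call_llm(
        f"Break the goal into numbered steps, each of the form "
        f"<tool>: <input>. Available tools: {list(TOOLS)}.\nGoal: {goal}")
    results = []
    # (b) Compose: execute each simple step with the tool/agent able to handle it.
    for line in plan.splitlines():
        tool, _, arg = line.partition(":")
        tool = tool.strip().lstrip("0123456789. ")
        if tool in TOOLS:
            results.append(TOOLS[tool](arg.strip()))
    return results

print(run_agent("Book me the cheapest flight from Geneva to Lisbon next Monday."))
```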
Responsible AI Framework for LLMs
The growing adoption of Generative AI, especially LLMs, has reignited the discussion around AI Regulations, to ensure that AI/ML systems are responsibly trained and deployed.
Unfortunately, this effort is complicated by different governmental organizations and regulatory bodies releasing their own guidelines and policies with little to no agreement on the definition of terms.
The specification language is also (sometimes intentionally) kept at such an abstract level that it is difficult to map it to an implementation / enforcement mechanism. For example, the EU AI Act mandates a different set of dos & don'ts depending on the 'risk level' of an AI application. However, quantifying the risk level of an AI application is easier said than done, as it depends on multiple parameters and basically requires one to classify how the capabilities of a non-deterministic system will impact users and systems that might interact with it in the future.
The key challenges and solutions in implementing Responsible AI for the different Gen AI architectural patterns can be summarized along four dimensions: data quality / reliability, model performance (reproducibility), model explainability, and data privacy.
We expand on these points in the rest of the article to enable an integrated LLMOps pipeline with Responsible LLM Governance.
Data Quality/Reliability
With respect to data quality / reliability, we need to consider certain dimensions of data quality to enhance the reliability of the overall LLM ecosystem. These dimensions matter both when training an LLM and when using it via prompt engineering; in both cases, the reliability of the data is of utmost importance for building trust in the LLM, and appropriate data quality checks should be applied while training or using LLMs.
Use Case
Let us consider a chatbot for an airport that provides the live status of flights and ticket availability. In this solution, the user can check the flight status as well as the ticket status based on the PNR, so it is extremely important to provide correct feedback to the user.
Without RAG:
User Input: What is the status of PNR 290049223?
LLM Output: Upon checking, we can confirm that your booking is confirmed for the upcoming flight ABC0012R.
In the output, both the flight number and the status are wrong. This will mislead the user and create confusion. To tackle this kind of challenge, we can leverage the RAG architecture, as the prompt context will then contain the correct PNR and flight status information.
With RAG:
Data stored in RAG: PNR 290049223 : Alan Smith : Pending : XYH9902K
User Input: What is the status of PNR 290049223?
LLM Output: Upon checking, we can confirm that your booking is in pending status for the upcoming flight XYH9902K.
In this output, the LLM provides an accurate answer with the correct status and flight details.
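A hedged sketch of how such grounding could be wired up: the PNR record is fetched from a structured store and injected as context, reusing the hypothetical `ask` helper from the earlier prompt sketch. The record format follows the example above; everything else is illustrative.

```python
# Grounding the airport chatbot with a structured lookup (illustrative; reuses
# the hypothetical `ask` helper from the earlier prompt sketch).
BOOKINGS = {  # toy stand-in for the booking system / vector store
    "290049223": {"name": "Alan Smith", "status": "Pending", "flight": "XYH9902K"},
}

def answer_pnr_query(user_input: str, pnr: str) -> str:
    record = BOOKINGS.get(pnr)
    if record is None:
        return "No booking found for that PNR."  # fail closed; don't let the LLM guess
    context = (f"PNR {pnr}: passenger {record['name']}, "
               f"status {record['status']}, flight {record['flight']}.")
    # The LLM is instructed to answer only from the retrieved record.
    return ask(user_input, context=context)

print(answer_pnr_query("What is the status of PNR 290049223?", "290049223"))
```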
Model Performance (Reproducibility)
With respect to model performance / reproducibility, it is important to measure the performance of the model during the training and inferencing phases to evaluate whether the model is performing as per expectation. Many approaches are available to measure LLM performance, and we can divide the corresponding metrics into two groups: metrics for the training phase and metrics for the inferencing phase.
Model Evaluation during Training:
To ensure model quality, it is good practice to measure the performance of the model during the training phase, before deploying it to production. Different metric scores are available that statistically quantify model performance, e.g., Perplexity, BLEU, and ROUGE.
Use Case
Let us consider an AI product that summarizes the user responses to a social media campaign, based on the input provided by the Marketing Manager.
User Input: Can you please provide insights on where we can do better for the upcoming campaign, especially for chocolates?
LLM Output: For your chocolate product campaign, highlight its unique flavor profiles through visually appealing and mouth-watering imagery on social media platforms. Consider partnering with popular food bloggers or influencers to create enticing content and reach a wider audience. Offer limited-time promotions or exclusive discounts to create a sense of urgency and drive sales.
For the above example, Perplexity will be low, while BLEU and ROUGE will be high.
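For concreteness, here is how these three metrics could be computed in practice. This assumes the nltk, rouge-score, torch, and transformers packages, uses GPT-2 as a stand-in scoring model (the actual production model is not specified here), and the reference / candidate strings are invented for illustration.

```python
# Computing the three metrics mentioned above (illustrative; assumes the
# nltk, rouge-score, torch, and transformers packages).
import torch
from nltk.translate.bleu_score import sentence_bleu
from rouge_score import rouge_scorer
from transformers import AutoModelForCausalLM, AutoTokenizer

reference = "Highlight unique flavor profiles and partner with food influencers."
candidate = "Highlight its unique flavor profiles and partner with influencers."

# BLEU / ROUGE: n-gram overlap between a reference and the generated summary.
bleu = sentence_bleu([reference.split()], candidate.split())
rouge = rouge_scorer.RougeScorer(["rouge1", "rougeL"]).score(reference, candidate)

# Perplexity: exponentiated language-modeling loss of the generated text;
# lower means the model finds its own output less "surprising".
tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2")
ids = tok(candidate, return_tensors="pt").input_ids
with torch.no_grad():
    loss = lm(ids, labels=ids).loss
perplexity = torch.exp(loss).item()

print(f"BLEU={bleu:.3f}  ROUGE-L={rouge['rougeL'].fmeasure:.3f}  PPL={perplexity:.1f}")
```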
Model Performance during Inferencing:
Monitoring model performance during the live run is an essential aspect of ensuring model reliability. Model performance can be monitored in two ways: 1) measuring the correctness of the model outcome, and 2) checking model performance in handling requests.
Measuring the correctness of the model outcome:
Regular monitoring of the model outcome is essential to make the model more robust and reproducible. Metrics like Perplexity, BLEU, and ROUGE can be used to measure model performance. If there is significant drift in model performance, the model needs to be retrained. For LLMs, it is essential to retrain the model from the last checkpoint of the trained model; otherwise, there is a chance the model will forget what it learned previously and be trained only on the newly fed data.
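A minimal sketch of such a drift-triggered retrain, under stated assumptions: `evaluate_rouge_l` and `resume_training` are hypothetical hooks into your evaluation and training jobs, and the baseline score and tolerance are invented numbers.

```python
# Drift-triggered retraining sketch (illustrative; `evaluate_rouge_l` and
# `resume_training` are hypothetical hooks into evaluation and training jobs).
BASELINE_ROUGE_L = 0.42   # score recorded when the model was last deployed
DRIFT_TOLERANCE = 0.05    # maximum acceptable drop before retraining

def evaluate_rouge_l(samples) -> float:
    ...  # hypothetical: score recent (prompt, reference) pairs, as in the metric sketch

def resume_training(checkpoint: str, new_data) -> None:
    ...  # hypothetical: resume the Trainer from `checkpoint` rather than from scratch

def check_drift(eval_samples) -> None:
    current = evaluate_rouge_l(eval_samples)  # rolling score on fresh traffic
    if BASELINE_ROUGE_L - current > DRIFT_TOLERANCE:
        # Resuming from the last checkpoint mixes old and new knowledge and
        # avoids catastrophic forgetting of what the model learned previously.
        resume_training("slm-out/checkpoint-last", eval_samples)
```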
Checking model performance in handling requests:
Model performance in production, i.e., while handling incoming requests, is an important aspect of evaluating model robustness and reproducibility, and can be measured with serving metrics such as request latency and throughput.
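As one illustration, serving latency and (sequential) throughput can be measured with a simple timing wrapper around the hypothetical `ask` helper from the earlier prompt sketch; the queries and the percentile estimate are deliberately simplistic.

```python
# Measuring serving latency and throughput (illustrative; wraps the hypothetical
# `ask` helper from the earlier prompt sketch).
import time

def timed_ask(task: str) -> tuple[str, float]:
    start = time.perf_counter()
    answer = ask(task)
    return answer, time.perf_counter() - start

latencies = []
for query in ["What is the status of PNR 290049223?"] * 10:
    _, seconds = timed_ask(query)
    latencies.append(seconds)

latencies.sort()
p95 = latencies[int(0.95 * len(latencies)) - 1]  # crude p95 estimate for small n
throughput = len(latencies) / sum(latencies)     # sequential requests per second
print(f"p95 latency: {p95:.2f}s, throughput: {throughput:.2f} req/s")
```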
Model Explainability
Explainability is an umbrella term for a range of tools, algorithms, and methods which accompany AI model inferences with explanations. Explainability and transparency of AI models clearly rank high among the 'non-functional' AI features to be considered first by enterprises. For example, this implies having to explain why an ML model profiled a user into a specific segment, which led them to receive an advertisement.
Chain of Thought (CoT) is a framework that addresses how an LLM solves a problem. CoT can be implemented using two approaches: user prompting and an automated approach.
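A minimal sketch of the user-prompting variant: a step-by-step instruction appended to the query so the model exposes its intermediate reasoning. The suffix wording and the reuse of the hypothetical `ask` helper are illustrative; the automated variant would inject this instruction for every request instead of relying on the user.

```python
# Chain-of-Thought via user prompting (illustrative; reuses the hypothetical
# `ask` helper from the earlier prompt sketch).
COT_SUFFIX = ("\n\nExplain your answer step by step, citing which input "
              "facts support each step, before giving the final answer.")

question = "A policy covers 80% of claims above 500 EUR. A claim is 900 EUR. Payout?"
explained_answer = ask(question + COT_SUFFIX)
# Expected shape of the response: numbered reasoning steps, then the final
# figure, making the model's path to the answer inspectable.
print(explained_answer)
```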
Use Case
Let us consider a chatbot for a medical insurance company, where the user can check the status of a claim by providing the claim id. The chatbot returns not only the status of the claim but also the supporting clarifications.
User Input: What is the status of my claim id: 123456789?
LLM Output: Upon checking, we can confirm that your claim is approved and 80% of the claim amount has been finalized, in accordance with the conditions mentioned in policy terms section 1.5.
Data Privacy
With respect to Data Privacy, we need to consider the privacy aspects of enterprise data provided as context (RAG) and/or enterprise data used to fine-tune the LLMs. In addition, the novel privacy aspect here is to consider the privacy risks of data (prompts) provided voluntarily by end-users, which can potentially be used as training data to re-train / fine-tune the LLMs.
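One common guardrail for this prompt-privacy risk is to redact obvious PII before a prompt leaves the enterprise boundary. The sketch below is deliberately simplistic, two regex patterns only, and is not a complete PII solution; real deployments typically use dedicated PII-detection services.

```python
# Simplistic PII redaction before prompts leave the enterprise boundary
# (illustrative; real deployments use dedicated PII-detection services).
import re

PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "PHONE": re.compile(r"\+?\d[\d\s-]{8,}\d"),  # 10+ chars, to spare short ids
}

def redact(prompt: str) -> str:
    # Replace each detected PII span with a typed placeholder token.
    for label, pattern in PII_PATTERNS.items():
        prompt = pattern.sub(f"<{label}>", prompt)
    return prompt

safe_prompt = redact("Email alan.smith@example.com about claim 123456789")
print(safe_prompt)  # -> "Email <EMAIL> about claim 123456789"
```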
Cloud providers and hyperscalers providing LLMs and enabling their fine-tuning on enterprise data also provide the necessary setup / landing zone for data privacy and controlling access to the data for specific use cases. A detailed discussion of the LLM data privacy controls is beyond the scope of this article.
Conclusion
Gen AI is a disruptive technology, and we are seeing it evolve faster than anything we have experienced before. So, it is important to understand the key architectural patterns, and the corresponding challenges and solutions in scaling Gen AI POCs to production. This includes enabling LLMOps to efficiently manage the mix of proprietary, open-source, and fine-tuned LLMs in an enterprise. Responsible usage of LLMs also requires assessing how the identified LLMOps architectural patterns impact the relevant Responsible AI dimensions.
To overcome these challenges, we have proposed in this article a blueprint for implementing a Responsible and Governed LLM platform, enabled by LLMOps frameworks and tooling. We believe that this will accelerate LLM adoption and enable enterprises to scale Gen AI use-cases in a responsible fashion. It also effectively future-proofs Gen AI investments, ensuring that the LLM platform will be able to cope as the Gen AI landscape evolves and new LLMs and new training / deployment architectures emerge in the future.
References