Responsible LLMOps: Integrating Responsible AI practices into LLMOps
Fig: Responsible AI integrated with LLMOps


Abstract. While we see growing adoption of both LLMOps and Responsible AI practices in Gen AI implementations, the discussions often occur in different communities, with a big disconnect between the checklists and their downstream implementation / tooling. In this article, we aim to bring the two frameworks together with a unified approach to deploying use-cases in a scalable and responsible fashion. We highlight LLM-specific issues, deep dive into the relevant LLMOps architectural patterns, and propose solution approaches to establish an integrated LLMOps platform governed by Responsible AI practices.

In collaboration with Dipta Chakraborty and Bhargav Mitra

Introduction

We see growing excitement about Large Language Models (LLMs) and their potential to disrupt enterprise use-cases. However, this exploration is mostly limited to Proof-of-Concepts (POCs) today.

To truly scale Gen AI use-cases in the enterprise, there is a need to establish a scalable LLM platform with LLMOps capabilities and the right level of LLM Governance.

In parallel, we also see heightened emphasis on deploying AI products responsibly, as there have been many cases of LLMs generating toxic, biased, sexual, and violent content. LLMs have also been found to hallucinate, confidently providing fabricated or misguided responses.

To mitigate these issues, it is of utmost importance that we train and deploy LLMs responsibly; there is a need for Responsible AI practices to be integrated (by design) into LLMOps pipelines.

In this article, we introduce LLMOps and deep dive into the relevant LLMOps architectural patterns. We consider the different Responsible AI dimensions in terms of Data Quality, Data Privacy, Reproducibility and Explainability to enable a well governed LLM platform. We highlight LLM-specific issues and propose solution approaches to establish an integrated LLMOps platform governed by Responsible AI practices, leading to the establishment of a Gen AI CoE delivering strategic enterprise use-cases at scale.

MLOps vs LLMOps

In an enterprise context, scaling AI/ML use-cases requires enabling MLOps with the right tooling and frameworks. MLOps is the framework for managing the end-to-end ML lifecycle at scale, enabling technology for the development, deployment, monitoring, and ongoing management of ML models.

However, LLMOps (MLOps for LLMs) is very different from MLOps, posing challenges that traditional MLOps struggles to address, such as:

  • Unstructured data: Supervised (Predictive) ML primarily deals with structured data, in the form of labeled relational data, time series data, etc., on which models are trained. In the case of LLMs, we are mostly dealing with unstructured data: documents and multi-modal content consisting of text, images, audio, video, and files.
  • Pre-trained foundational LLMs: Instead of training ML models from scratch, the most prevalent scenario is to fine-tune pre-trained foundational LLMs trained on a large corpus of generic data.
  • The generative nature of LLMs implies that new content is generated in real-time based on user prompts/responses. This raises the possibility of user inputs being used as training data, and creates the need for guardrails to prevent the generated responses from hallucinating.
  • (Human) Feedback loops are an essential part of the training (and continuous improvement) of LLMs to improve the quality of the responses. Reinforcement Learning helps to perform this improvement in a targeted fashion by leveraging a rewards strategy, e.g., Proximal Policy Optimization (PPO), to fine-tune the responses (a minimal feedback-loop sketch follows this list).
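To make the feedback loop concrete, below is a minimal, framework-agnostic sketch of collecting human ratings and converting them into reward signals that a PPO-style trainer (e.g., in a library such as trl) could consume. All names here are illustrative, not a specific library's API.

```python
from dataclasses import dataclass

@dataclass
class Feedback:
    prompt: str
    response: str
    rating: int  # human rating on a 1 (bad) to 5 (good) scale

def to_rewards(batch: list[Feedback]) -> list[float]:
    """Scale human ratings to [-1, 1] rewards for a PPO-style update."""
    return [(f.rating - 3) / 2.0 for f in batch]

# Collected human feedback on two generated responses
batch = [
    Feedback("Summarize the claims process", "Claims are settled in 5 steps: ...", 5),
    Feedback("Summarize the claims process", "Unrelated marketing text", 1),
]
print(to_rewards(batch))  # -> [1.0, -1.0], consumed by the RL trainer
```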

LLMOps Architecture patterns

Gen AI solutions are varied in scope, and we expect them to enter the enterprise landscape via different applications / platforms. For instance, this can be via direct usage of an LLM-based application, e.g., ChatGPT; LLMs embedded within a SaaS product or enterprise platform, e.g., Salesforce, ServiceNow; or a foundational model fine-tuned with enterprise data for strategic use-cases. So, as a first step, we identify and outline the five most prevalent Gen AI architectural patterns today.

From a Responsible AI perspective, we then identify the key challenges in integrating Responsible AI dimensions (e.g., Reliability, Explainability, Privacy & Security) into the outlined Gen AI architectural patterns.

For instance, explainability becomes more challenging given the large corpus of (generic) training data involved in the case of LLMs. Novel privacy challenges arise given that user inputs can potentially be used as training data to fine-tune the LLMs.

Hallucinations remain the key issue to be addressed given the generative nature of LLMs. To address the above challenges, we propose a consolidated Responsible AI framework for LLM platforms in this article, taking into consideration the best practices and design patterns to integrate necessary governance and guardrails at various stages of the LLMOps pipeline.

Black-box LLM APIs

This is the classic ChatGPT scenario, where we have black-box access to an LLM API/UI. Similar LLM APIs can be considered for other Natural Language Processing (NLP) core tasks, e.g., Knowledge Retrieval, Summarization, Auto-Correct, Translation, Natural Language Generation (NLG).

Prompts are the primary interaction mechanism here and can encompass user queries and tasks.

Prompting refers to adapting the user input, providing the right context and guidance to the LLM API, to maximize the chances of getting the ‘right’ response. This has led to the rise of Prompt Engineering as a professional discipline, where prompt engineers systematically perform trials, recording their findings, to arrive at the ‘right’ prompt to elicit the ‘best’ response.
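As an illustration, the sketch below runs the same input through a set of candidate prompts so that the findings can be recorded and compared. It assumes the OpenAI Python SDK (v1+) with an OPENAI_API_KEY set in the environment; the model name and prompt templates are illustrative, and any hosted LLM API could be substituted.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

CANDIDATE_PROMPTS = [
    "Summarize the text in one sentence: {text}",
    "You are a precise analyst. Summarize only the key facts: {text}",
]

def run_trial(template: str, text: str) -> str:
    """Send one prompt variant to the LLM and return its response."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model name
        messages=[{"role": "user", "content": template.format(text=text)}],
        temperature=0,  # reduce randomness so trials are comparable
    )
    return resp.choices[0].message.content

text = "Flight XYH9902K is delayed by 40 minutes due to weather at the hub."
for template in CANDIDATE_PROMPTS:
    print(template, "->", run_trial(template, text))  # record findings per prompt
```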

Fig: LLM APIs

Embedded LLM Apps

In this section, we talk about LLMs embedded within enterprise platforms, such as Salesforce, SAP, ServiceNow; or available as ready-to-use enterprise apps on the App Stores of LLM providers, such as OpenAI.

Enterprise LLM Apps have the potential to accelerate LLM adoption by providing an enterprise-ready solution. However, the same caution needs to be exercised as you would before using a third-party ML model: validate LLM/training data ownership, IP, and liability clauses.

Data ownership: Data is critical for Supervised AI/ML systems, especially so for LLMs, which are often trained on public datasets whose data usage rights for AI/ML training are not well defined and can evolve in the future. For example, Reddit recently announced that it will start charging Enterprise AI/ML models for learning from its extremely human archives.

Given this, negotiating ownership issues around not only training data, but also input data, output data, and other generated data is critical. On the other hand, it is also important to understand / assess how the Enterprise App Provider will be using the data received / generated as a result of its interactions with the users.

Fig: LLM apps embedded within Enterprise Apps / Platforms

LLM Fine-tuning / Domain-specific SLMs

LLMs are generic in nature. To realize the full potential of LLMs for Enterprises, they need to be contextualized with enterprise knowledge captured in terms of documents, wikis, business processes, etc.

This contextualization is achieved in most cases by fine-tuning a Large Language Model (LLM) with enterprise data, creating a domain-specific Small Language Model (SLM).

Fig: Enterprise LLM contextualization strategy

Fine-tuning entails taking a pre-trained Large Language Model (LLM) and retraining it with (smaller) enterprise data. Technically, this implies updating the weights of the last layer(s) of the trained neural network to reflect the enterprise data and task.

Given this, access to the base model weights is needed to perform fine-tuning, which is not possible for closed models, e.g., ChatGPT. This is where open-source pre-trained LLMs come to the rescue, e.g., from Meta AI, who recently open-sourced their LLM, LLaMA. The Stanford Alpaca project showed that it is possible to fine-tune LLaMA for about $600, to a model performance comparable with ChatGPT. So, fine-tuning an LLM does not necessarily need to be complex or expensive.
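As a rough illustration of what such fine-tuning looks like in practice, here is a minimal parameter-efficient (LoRA) sketch using the Hugging Face transformers, peft and datasets libraries. The model name (gated; any causal LM can be substituted), the dataset file with its assumed ‘text’ field, and the hyper-parameters are all illustrative.

```python
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)
from peft import LoraConfig, get_peft_model
from datasets import load_dataset

base = "meta-llama/Llama-2-7b-hf"  # illustrative; substitute any causal LM
tok = AutoTokenizer.from_pretrained(base)
tok.pad_token = tok.eos_token  # LLaMA tokenizers ship without a pad token
model = AutoModelForCausalLM.from_pretrained(base)

# LoRA: train small low-rank adapters instead of all base weights
lora = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05,
                  target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # typically well under 1% of base weights

# Assumption: a JSONL file of enterprise documents with a 'text' field
data = load_dataset("json", data_files="enterprise_corpus.jsonl")["train"]
data = data.map(lambda ex: tok(ex["text"], truncation=True, max_length=512),
                remove_columns=data.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="slm-out", per_device_train_batch_size=1,
                           num_train_epochs=1, learning_rate=2e-4),
    train_dataset=data,
    data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
)
trainer.train()
model.save_pretrained("slm-out/adapter")  # only the small adapter is saved
```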

Fig: LLM fine-tuning with Enterprise data

Retrieval Augmented Generation (RAG)

Fine-tuning is a computationally intensive process. RAG provides a viable alternative to fine-tuning by providing additional context with the prompts, grounding the retrieval / responses in the given context.

This can be in the form of a set of documents that are first retrieved using an indexed document or vector search, and then provided as context with the prompt to constrain the responses. Most LLM platforms today allow prompts to be relatively long, so it is possible to embed this enterprise context as part of the prompt.
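Below is a minimal sketch of this pattern, assuming the sentence-transformers package for embeddings; the documents, model name and prompt wording are illustrative, and a production setup would typically use a vector database rather than in-memory search.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Illustrative enterprise records to ground the responses
docs = [
    "PNR 290049223: Alan Smith: Pending: flight XYH9902K",
    "PNR 310011778: Jane Doe: Confirmed: flight ABC0012R",
]
encoder = SentenceTransformer("all-MiniLM-L6-v2")
doc_vecs = encoder.encode(docs, normalize_embeddings=True)

def retrieve(query: str, k: int = 1) -> list[str]:
    """Return the k documents most similar to the query (cosine similarity)."""
    q = encoder.encode([query], normalize_embeddings=True)[0]
    scores = doc_vecs @ q
    return [docs[i] for i in np.argsort(-scores)[:k]]

query = "What is the status of PNR 290049223?"
context = "\n".join(retrieve(query))
prompt = (f"Answer strictly from the context below. If the answer is not in "
          f"the context, say you don't know.\n\nContext:\n{context}\n\n"
          f"Question: {query}")
print(prompt)  # 'prompt' is then sent to the LLM API of choice
```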

Fig: Retrieval Augmented Generation — RAGs

AI Agents: LLM Orchestration

This is the future where enterprises will be able to develop new Enterprise AI Agents by orchestrating / composing [1] multiple existing AI Apps [2].

The discussion around ChatGPT has evolved into AutoGPT. While ChatGPT is primarily a Chatbot that can generate text responses, AutoGPT is a more powerful AI Agent that can execute complex tasks, e.g., make a sale, plan a trip, make a flight booking, book a contractor to do a house job, order a pizza.

Fig: AI Agents — LLM Orchestration

AI Agents follow a long history of research around Autonomous Agents, especially Goal-Oriented Agents. A high-level approach to solving such complex tasks involves: (a) decomposition of the given complex task into (a hierarchy or workflow of) simpler tasks, followed by (b) composition of agents able to execute the simpler tasks.

LangChain is a good example of a mature framework for composing LLMs today.
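As a framework-agnostic illustration of the decompose-then-compose loop described above, consider the sketch below; call_llm is a hypothetical stand-in for any LLM API, and the tool registry is illustrative.

```python
from typing import Callable

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM API of choice here")

# Registry of simple agents/tools the orchestrator can compose
TOOLS: dict[str, Callable[[str], str]] = {
    "search_flights": lambda arg: f"[flight results for {arg}]",
    "book_flight": lambda arg: f"[booking confirmation for {arg}]",
}

def plan(task: str) -> list[tuple[str, str]]:
    """(a) Decomposition: ask the LLM to break the task into tool calls."""
    prompt = (f"Decompose the task into steps using only these tools: "
              f"{list(TOOLS)}.\nTask: {task}\n"
              f"Return one 'tool: argument' pair per line.")
    steps = call_llm(prompt)
    return [tuple(line.split(": ", 1)) for line in steps.splitlines()]

def execute(task: str) -> list[str]:
    """(b) Composition: run the planned steps with the matching agents."""
    return [TOOLS[tool](arg) for tool, arg in plan(task)]
```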

Responsible AI Framework for LLMs

The growing adoption of Generative AI, especially LLMs, has reignited the discussion around AI regulations, to ensure that AI/ML systems are responsibly trained and deployed.

Unfortunately, this effort is complicated by different governmental organizations and regulatory bodies releasing their own guidelines and policies with little to no agreement on the definition of terms.

The specification language is also (sometimes intentionally) kept at such an abstract level that it is difficult to map it to an implementation / enforcement mechanism. For example, the EU AI Act mandates a different set of dos & don’ts depending on the ‘risk level’ of an AI application. However, quantifying the risk level of an AI application is easier said than done, as it depends on multiple parameters, and basically requires one to classify how the capabilities of a non-deterministic system will impact users and systems who might interact with it in the future.

The table below summarizes the key challenges and solutions in implementing Responsible AI for the different Gen AI architectural patterns.

Fig: Responsible AI integrated with LLMOps

We expand on the above points in the rest of the article to enable an integrated LLMOps pipeline with Responsible LLM Governance.

Data Quality/Reliability

With respect to data quality / reliability, we need to consider certain dimensions of data quality to enhance the reliability of the overall LLM ecosystem. These dimensions apply both when training (fine-tuning) an LLM and when using the LLM via prompt engineering; in both cases, the reliability of the data is of utmost importance for enhancing trust in the LLM. The following data quality checks should be considered while using/training LLMs:

  • Data Consistency: The data used for training (especially fine-tuning) the LLM should be accurate and precise, meaning only data relevant to the specific use-case should be used. For example, if the use-case is to generate summaries of medical prescriptions, the training data should consist of medical prescriptions and their corresponding summaries, not unrelated data such as diagnosis Q&As (see the data-consistency sketch after this list). Extra caution should be exercised when incorporating time-related data, ensuring mindfulness of frequency and temporal periods. Often, data pipelines need to be created to ingest the data and feed it to LLMs; in such scenarios, free-text fields need extra care, as they mostly hold inconsistent and incorrect data.
  • Bias/Fairness: With respect to model performance and reliability, it is difficult to control undesired biases in black-box LLMs, though they can be controlled to some extent by using fair and unbiased data to fine-tune the LLMs and/or contextualize the LLMs in a RAG architecture.
  • Accountability: To make LLMs more reliable, it is recommended that humans validate the LLMs’ output. Involving humans ensures that if an LLM hallucinates or provides a wrong response, a human can evaluate it and make the necessary corrections.
  • Hallucination: When using LLM APIs or orchestrating multiple LLMs, the likelihood of hallucination increases as there are no guardrails. The right prompts can help, but only to a limited extent. To further limit hallucination, LLMs need to be fine-tuned with curated data and/or the search space of responses needs to be limited to relevant and recent enterprise data.
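As an illustration of such checks, below is a minimal data-consistency gate for the prescription-summarization example above; the record fields and rules are hypothetical.

```python
def is_consistent(record: dict) -> bool:
    """Keep only records that look like a prescription with a usable summary."""
    text = record.get("prescription", "")
    summary = record.get("summary", "")
    return (
        bool(text.strip()) and bool(summary.strip())
        and len(summary) < len(text)                   # a summary should be shorter
        and record.get("doc_type") == "prescription"   # drop diagnosis Q&As, etc.
    )

def curate(records: list[dict]) -> list[dict]:
    kept = [r for r in records if is_consistent(r)]
    print(f"kept {len(kept)}/{len(records)} records after consistency checks")
    return kept
```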

Use Case

Let us consider a chatbot for an airport that provides the live status of flights and ticket availability. In this solution, users can check the flight status as well as the ticket status based on their PNR, so it is extremely important to provide correct feedback to the user.

Without RAG:

User Input: What is the status of PNR 290049223?

LLM Output: Upon checking, we can confirm that your booking is confirmed for the upcoming flight ABC0012R.

In the output, both the flight and the status are wrong. This will mislead the user and create confusion. To tackle these kinds of challenges, we can leverage the RAG architecture, as the context will then contain the information regarding the PNR and flight status.

With RAG:

Data stored in the RAG index: PNR 290049223: Alan Smith: Pending: XYH9902K

User Input: What is the status of PNR 290049223?

LLM Output: Upon checking, we can confirm that your booking is in pending status for the upcoming flight XYH9902K.

In this output, the LLM provides an accurate answer with the correct status and flight details.

Model Performance (Reproducibility)

With respect to model performance / reproducibility, it is important to measure the performance of the model during both the training and inferencing phases, to evaluate whether the model is performing as per expectation. We can accordingly divide the relevant metrics into two groups: metrics for the training phase and metrics for the inferencing phase.

Model Evaluation during Training:

To ensure the performance of the model, it is always prudent to measure it during the training phase, before deploying the model to production. There are different metric scores available that statistically quantify the model performance (a sketch computing them follows the list below):

  • Perplexity: This quantifies how well a model predicts a text; the lower the score, the better the model.
  • BLEU (Bilingual Evaluation Understudy): BLEU is a metric commonly used in machine translation tasks. It compares the generated output with one or more reference translations and measures the similarity between them. The higher the score, the better the model.
  • ROUGE (Recall-Oriented Understudy for Gisting Evaluation): ROUGE is a set of metrics used for evaluating the quality of summaries. It compares the generated summary with one or more reference summaries and calculates precision, recall, and F1-score. The higher the F1-score, the better the model.
  • Human Evaluation: Last but not least, human evaluation or manual testing of the model is essential to check the performance of the model across different scenarios.
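The sketch below shows how these scores can be computed in practice, assuming the Hugging Face transformers and evaluate packages; GPT-2 serves as a small illustrative stand-in, and the texts are examples.

```python
import math
import torch
import evaluate
from transformers import AutoModelForCausalLM, AutoTokenizer

# Perplexity of a text under a causal LM: exp(mean negative log-likelihood)
tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2")
ids = tok("Your booking is confirmed for flight XYH9902K.",
          return_tensors="pt").input_ids
with torch.no_grad():
    loss = lm(ids, labels=ids).loss
print("perplexity:", math.exp(loss.item()))  # lower is better

# BLEU / ROUGE of generated text against references
preds = ["the booking is confirmed for flight XYH9902K"]
refs = ["your booking is confirmed for flight XYH9902K"]
print(evaluate.load("bleu").compute(predictions=preds, references=[refs]))
print(evaluate.load("rouge").compute(predictions=preds, references=refs))  # higher is better
```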

Use Case

Let us consider an AI product that provides a summary of user responses to a social media campaign, based on the input provided by the marketing manager.

User Input: Can you please provide the insights where we can do better for the upcoming campaign, especially for chocolates?

LLM Output: For your chocolate product campaign, highlight its unique flavor profiles through visually appealing and mouth-watering imagery on social media platforms. Consider partnering with popular food bloggers or influencers to create enticing content and reach a wider audience. Offer limited-time promotions or exclusive discounts to create a sense of urgency and drive sales.

For the above example, Perplexity will be low, and BLEU and ROUGE will be high.

Model Performance during Inferencing:

Monitoring model performance during live runs is an essential aspect of ensuring the reliability of the model. Model performance can be monitored in two ways: 1) measuring the quality of the model outcome, and 2) checking model performance in handling requests.

Measuring the quality of the model outcome:

Regular monitoring of the model outcome is essential to make the model more robust and reproducible. Metrics like Perplexity, BLEU, ROUGE, etc. can be used to measure the model performance. If there is significant drift in model performance, the model needs to be retrained. For LLMs, it is essential to retrain the model from the last checkpoint of the trained model; otherwise, there is a chance the model will forget what it learned previously and be trained only on the newly fed data.
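A minimal drift-check sketch along these lines: track a rolling quality score (e.g., ROUGE-L of live responses against curated references) and flag when the average falls below a threshold; the window size and threshold are illustrative.

```python
from collections import deque

class DriftMonitor:
    """Flags model-quality drift once a rolling average drops below a threshold."""

    def __init__(self, window: int = 100, threshold: float = 0.35):
        self.scores = deque(maxlen=window)
        self.threshold = threshold

    def add(self, score: float) -> bool:
        """Record a per-response quality score; return True if drift is detected."""
        self.scores.append(score)
        avg = sum(self.scores) / len(self.scores)
        return len(self.scores) == self.scores.maxlen and avg < self.threshold

monitor = DriftMonitor()
if monitor.add(0.28):  # e.g., ROUGE-L of the latest response vs. its reference
    print("quality drift detected: retrain from the last checkpoint")
```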

Checking model performance in handling requests:

Model performance during productionization, i.e., in handling incoming requests, is an important aspect of evaluating the model’s robustness and reproducibility. The different metrics that can be used to evaluate the model performance (see the measurement sketch after this list) are:

  • Completed requests per minute: This measures how many requests the model dealt with in a certain period.
  • Time to first token (TTFT): TTFT is the time before the LLM returns its first token; in streaming applications this is a critical metric.
  • Inter-token latency (ITL): The average time between consecutive tokens.
  • End-to-end Latency: The end-to-end latency should approximately equal the average output length in tokens multiplied by the inter-token latency.
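The sketch below shows one way to measure TTFT, ITL and end-to-end latency around any streaming client; stream_tokens is a hypothetical stand-in for the provider's streaming API.

```python
import time
from typing import Iterator

def stream_tokens(prompt: str) -> Iterator[str]:
    raise NotImplementedError("plug in your streaming LLM client here")

def measure(prompt: str) -> dict:
    """Time a single streamed completion and derive the latency metrics."""
    t0 = time.perf_counter()
    stamps = [time.perf_counter() for _ in stream_tokens(prompt)]
    ttft = stamps[0] - t0  # time to first token
    itl = ((stamps[-1] - stamps[0]) / (len(stamps) - 1)  # avg gap between tokens
           if len(stamps) > 1 else 0.0)
    return {"ttft_s": ttft, "itl_s": itl,
            "e2e_s": stamps[-1] - t0,  # end-to-end latency
            "tokens": len(stamps)}
```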

Model Explainability

Explainability is an umbrella term for a range of tools, algorithms and methods, which accompany AI model inferences with explanations. Explainability and transparency of AI models clearly rank high among the ‘non-functional’ AI features to be considered first by enterprises. For example, this implies having to explain why an ML model profiled a user to be in a specific segment, which led him/her to receive an advertisement.

Chain of Thought (CoT) is a framework that addresses how an LLM solves a problem. CoT can be implemented using two approaches: user prompting and an automated approach.

  • User Prompting: Here, during prompting, the user provides the logic of how to approach a certain problem, and the LLM solves similar problems using the same logic, returning the output along with the reasoning.
  • Automating Chain of Thought Prompting: Manually handcrafting CoT can be time consuming and yield sub-optimal solutions. Automatic Chain of Thought (Auto-CoT) can be leveraged to generate the reasoning chains automatically, eliminating human intervention. Auto-CoT basically relies on two processes: 1. Question Clustering: cluster the questions of a given dataset. 2. Demonstration Sampling: select a representative question from each cluster and generate its reasoning chain using zero-shot CoT. Auto-CoT works well for LLMs with approximately 100B parameters, but is less accurate for small-scale LLMs. (A CoT prompt sketch follows this list.)
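As an illustration of the user-prompting approach, here is a minimal few-shot CoT prompt; call_llm is a hypothetical stand-in for any LLM API, and the claim amounts echo the insurance use-case below.

```python
def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM API of choice here")

# The prompt demonstrates the reasoning once; the model is expected to
# answer the new question in the same step-by-step style.
COT_PROMPT = """\
Q: A claim of $1,000 is approved at 80% as per policy terms section 1.5.
   How much is paid out?
A: The policy pays 80% of the claim. 80% of $1,000 is $800.
   Therefore the payout is $800.

Q: A claim of $2,500 is approved at 80% as per policy terms section 1.5.
   How much is paid out?
A:"""

answer = call_llm(COT_PROMPT)  # expected: reasoning steps plus the final $2,000
```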

Use Case

Let us consider a chatbot for a medical insurance company, where users can check the status of a claim by providing the claim id. The chatbot returns not only the status of the claim but also the underlying clarification.

User Input: What is the status of my claim id: 123456789?

LLM Output: Upon checking, we can confirm that your claim is approved and 80% of the claim amount has been finalized, in accordance with the conditions mentioned in policy terms section 1.5.

Data Privacy

With respect to Data Privacy, we need to consider the privacy aspects of enterprise data provided as context (RAG) and/or enterprise data used to fine-tune the LLMs. In addition, the novel privacy aspect here is to consider the privacy risks of data (prompts) provided voluntarily by end-users, which can potentially be used as training data to re-train/fine-tune the LLMs.

Cloud providers and hyperscalers providing LLMs and enabling their fine-tuning on enterprise data also provide the necessary setup / landing zone for data privacy and controlling access to the data for specific use cases. A detailed discussion of the LLM data privacy controls is beyond the scope of this article.

Conclusion

Gen AI is a disruptive technology, and we are seeing it evolve faster than anything we have experienced before. So, it is important to understand the key architectural patterns, and the corresponding challenges and solutions in scaling Gen AI POCs to Production. This includes enabling LLMOps to efficiently manage the mix of proprietary, open-source and fine-tuned LLMs in an enterprise. Responsible usage of LLMs also requires assessing how the identified LLMOps architectural patterns impact the relevant Responsible AI dimensions.

To overcome these challenges, we have proposed in this article a blueprint for implementing a Responsible and Governed LLM platform, enabled by LLMOps frameworks and tooling. We believe that this will accelerate LLM adoption and enable enterprises to scale Gen AI use-cases in a responsible fashion. This also effectively future-proofs Gen AI investments and ensures that the LLM platform will be able to cope as the Gen AI landscape evolves and new LLMs and new training / deployment architectures emerge in the future.

References

  1. D. Biswas. Compositional AI: The Future of Enterprise AI. In Towards Data Science, https://towardsdatascience.com/compositional-ai-the-future-of-enterprise-ai-3d5289dfa888
  2. D. Biswas. Constraints Enabled Autonomous Agent Marketplace: Discovery and Matchmaking. In proc. of the 16th International Conference on Agents and Artificial Intelligence (ICAART), 2024.
