Monitoring Generative AI Applications

As the adoption of Generative AI applications continues to grow, so does the necessity for observability using robust telemetry. These applications, powered by intricate data models and algorithms, aren't exempt from the challenges faced by any other software system. Yet, their unique nature makes their monitoring requirements distinctive. Generative AI apps interact with a vast array of data, generate varied outputs, and often operate under tight performance constraints. The quality, performance, and efficiency of these applications directly impact user experience and operational costs. Therefore, a structured approach to monitoring and telemetry isn't only beneficial but critical.

Observability offers a real-time lens into an application's health, performance, and functionality. For Generative AI, this means observing the model's accuracy, understanding user interactions, optimizing costs, and more. Telemetry provides the raw data necessary for such monitoring, encompassing everything from logs and traces to specific metrics.

This guide will walk you through the essentials of monitoring Generative AI applications. It will offer a roadmap for capturing, analyzing, and acting on telemetry data to help your AI services run efficiently. We focus on key operational telemetry across the entire Generative AI application.

Why Monitor Generative AI Applications

Generative AI applications are reshaping how industries operate, making them invaluable assets. Yet, without the right monitoring, even the most sophisticated Generative AI application can stumble. Here's why it's paramount to keep a close watch on these systems:

  • Ensuring model accuracy and reliability: Models evolve, and with evolution can come drifts in accuracy. Continuous monitoring ensures the outputs remain aligned with expectations and standards. Furthermore, as these models learn and adapt, monitoring helps in verifying the consistency and reliability of their predictions.
  • Detecting anomalies and performance issues: Generative AI can occasionally produce unexpected results or behave erratically due to unforeseen data scenarios or underlying system issues. Monitoring can identify such anomalies, enabling quick mitigation.
  • Understanding user interactions and feedback: Monitoring user interactions gives insights into how well the application meets user needs. By observing user queries, feedback, and behavior patterns, one can make iterative improvements to enhance the user experience.
  • Validating costs and optimizing operations: Running AI models, especially at scale, can be resource-intensive. Monitoring provides visibility into resource consumption and operation costs, aiding in optimization and ensuring the most efficient use of available resources.

Basic Concepts in Telemetry

You can read more about the three pillars of observability in my earlier article.

Telemetry is the process of collecting and transmitting data from remote sources to receiving stations for analysis. In the realm of Generative AI applications, telemetry involves capturing key operational data to monitor and improve the system's performance and user experience. Here are some foundational concepts:

  1. Logs: Records of events that occur within an application. For Generative AI, logs can capture information such as user input, model responses, and any errors or exceptions that arise.
  2. Traces: Traces offer a detailed path of a request as it moves through various components of a system. Tracing can be invaluable in understanding the flow of data from embeddings to chat completions, pinpointing bottlenecks, and troubleshooting issues.
  3. Metrics: These are quantitative measures that give insights into the performance, health, and other aspects of a system. In AI, metrics can encompass everything from request rate and error percentages to specific model evaluation measures.

Telemetry serves as the backbone of a well-monitored AI system, offering the insights necessary for continuous improvement.

Logging

In Generative AI applications, logging plays a pivotal role in shedding light on interactions, system behavior, and overall health.

Here are some recommended logs for Generative AI applications:

  • Requests: This includes tracking parameters such as response times, stop reasons, and specific model parameters to understand both the demand and performance of the system.
  • Input prompts: Capturing user inputs helps developers grasp how users are engaging with the system, paving the way for potential model refinements.
  • Model-generated responses: Logging these outputs facilitates auditing and quality checks, ensuring that the model behaves as intended.

Developer teams should collect all errors and anomalies for diagnostic purposes. To control log volume, throttle informational logs by applying a sampling rate or by setting appropriate log levels.

Be sure to anonymize and secure sensitive data to uphold user privacy and trust.
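
As a minimal sketch of what request logging could look like, the Python snippet below emits one structured record per chat completion using the standard logging module. The field names and the redact helper are illustrative assumptions, not a prescribed schema.

```python
import hashlib
import json
import logging

logger = logging.getLogger("genai.requests")
logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")

def redact(text: str) -> str:
    """Hypothetical helper: store a short hash instead of raw user text to protect privacy."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]

def log_chat_request(prompt: str, response: str, model: str,
                     latency_ms: float, finish_reason: str,
                     prompt_tokens: int, completion_tokens: int) -> None:
    # Emit one structured record per request; prompt and response are hashed here,
    # but teams with an approved secure store may choose to log the full text.
    logger.info(json.dumps({
        "event": "chat_completion",
        "model": model,
        "prompt_hash": redact(prompt),
        "response_hash": redact(response),
        "latency_ms": latency_ms,
        "finish_reason": finish_reason,
        "prompt_tokens": prompt_tokens,
        "completion_tokens": completion_tokens,
    }))
```

Hashing the prompt and response keeps records comparable and correlatable without storing raw user text, which also supports the privacy point above.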

Tracing

In Generative AI applications, tracing offers a granular, step-by-step view of a request's journey through the system. Each of these individual steps or operations is a "span." A collection of spans forms a trace that represents the complete path and lifecycle of a request.

Here are the primary spans you might typically see in AI workflows:

  • API Call Span: This represents the inception and duration of an API request. It provides insights into entry points, initial user intentions, and the overarching time taken for the entire request to process.
  • Service Processing Span: This covers the time and operations when the request navigates through services. It's especially useful to highlight potential bottlenecks or areas in the system needing optimization.
  • Model Inference Span: This critical span captures the actual time taken by the AI model to process the input and make a prediction or generate a response. It helps gauge the model's efficiency and performance. These spans can also be updated to capture evaluation metrics, whether user-driven or AI-driven.
  • Data Fetching Span: Before model processing, there might be a need to fetch supplementary data from databases or other storage using embeddings or other methods of search. This span traces the duration and operation of that data retrieval and can capture accuracy metrics.

Remember to embed privacy and data protection principles when implementing tracing, so that user data stays confidential and the system remains compliant with regulations.
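
To make these spans concrete, here is a minimal sketch using the OpenTelemetry Python SDK. The span names, attributes, and the two stand-in helpers are illustrative assumptions; prompt text is deliberately kept out of span attributes, in line with the privacy note above.

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Console exporter keeps the example self-contained; production systems would
# export to a collector or observability backend instead.
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("genai.demo")

def fetch_documents(question: str) -> list[str]:
    return ["doc-1", "doc-2"]  # stand-in for a real embedding search

def call_model(question: str, documents: list[str]) -> tuple[str, str]:
    return "example answer", "stop"  # stand-in for a real chat completion call

def answer_question(question: str) -> str:
    with tracer.start_as_current_span("api_call") as api_span:
        api_span.set_attribute("request.interaction_type", "query")
        with tracer.start_as_current_span("data_fetch") as fetch_span:
            documents = fetch_documents(question)
            fetch_span.set_attribute("retrieval.document_count", len(documents))
        with tracer.start_as_current_span("model_inference") as infer_span:
            response, finish_reason = call_model(question, documents)
            infer_span.set_attribute("llm.finish_reason", finish_reason)
        return response

answer_question("What is telemetry?")
```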

Metrics

Metrics serve as quantifiable measures that shed light on various performance, health, and usage aspects of the system.

Here are some key metrics for Generative AI applications:

  • Request Rates (Requests Per Second): This metric provides insights into the load and demand on the system, enabling scalability planning and indicating popular usage times.
  • Error Rates: Keeping tabs on the percentage of requests that result in errors is essential. A spike in error rates can indicate problems with the model, the infrastructure, or both.
  • Latency Metrics: These measure the time taken to process a request. Teams typically segment them into percentiles such as P50 (median), P95, and P99 to show the range of response times users experience. Monitoring these ensures users receive timely responses.
  • Model-specific Metrics: Depending on the application, metrics such as BLEU score for translation quality or perplexity for language models might be essential. These offer a gauge of the model's predictive performance.
  • Token and Cost Metrics: Especially relevant when deploying models in cloud environments, metrics like cost per transaction or API call offer insights into operational expenses. This includes monitoring the number of tokens in prompts and completions, as these can affect costs.

Reviewing and acting upon these metrics facilitates proactive system tuning, ensures user satisfaction, and helps in maintaining cost-efficiency.
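
As a sketch of how these measures could be declared with the OpenTelemetry metrics API in Python, the snippet below creates counters and a histogram and records them per request; the instrument names and attribute keys are assumptions for illustration.

```python
from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import ConsoleMetricExporter, PeriodicExportingMetricReader

reader = PeriodicExportingMetricReader(ConsoleMetricExporter())
metrics.set_meter_provider(MeterProvider(metric_readers=[reader]))
meter = metrics.get_meter("genai.demo")

request_counter = meter.create_counter("genai.requests", description="Number of model requests")
error_counter = meter.create_counter("genai.errors", description="Number of failed requests")
latency_histogram = meter.create_histogram("genai.request.duration", unit="ms",
                                           description="End-to-end request latency")
token_counter = meter.create_counter("genai.tokens", description="Prompt and completion tokens")

def record_request(model: str, latency_ms: float, prompt_tokens: int,
                   completion_tokens: int, failed: bool) -> None:
    attrs = {"model": model}
    request_counter.add(1, attrs)
    latency_histogram.record(latency_ms, attrs)
    token_counter.add(prompt_tokens, {**attrs, "token.type": "prompt"})
    token_counter.add(completion_tokens, {**attrs, "token.type": "completion"})
    if failed:
        error_counter.add(1, attrs)
```

Request rates and error rates can then be derived in the backend from the counters, while the histogram supports the percentile views described above.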

Tags

Tags, in the telemetry world, offer context to the data collected, enriching the metrics, logs, or traces with metadata that can give deeper insights or better filtering capabilities.

Here are some valuable tags for Generative AI applications:

Correlation IDs: These unique identifiers enable correlation of every piece of telemetry from a single user interaction, even across different components or services. These IDs are useful at both the request and session level and are invaluable for troubleshooting and understanding user journeys.

Model Tags: These give context on which specific AI model version or configuration was used for a given interaction. Useful tags include:

  • Max Tokens: Max number of tokens specified for a response.
  • Frequency Penalty: Penalties associated with frequent output tokens.
  • Presence Penalty: Penalties related to the presence of specific output tokens.
  • Temperature: Determines the randomness of the model's response.
  • Model: Identifier for the model version or variant.
  • Prompt Template: The structure or pattern followed by the user's prompt.
  • Finish Reason: Indicates why the model finished the response.
  • Language: Language or locale information for the request.

Operational Tags: These can include details about the application, infrastructure, or environment, such as App Id, Server Id, Server Location, or Deployment Version. They help pinpoint issues and explain performance variances between different deployments or regions of the application or of the Generative AI services it uses.

User Interaction Tags: While ensuring privacy, tags that give insights into user behavior or type of interaction can be beneficial. For example, Interaction Type could be 'query', 'command', or 'feedback'.

Always be sure that tagging respects user privacy regulations and best practices.
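
As an illustrative sketch, the snippet below attaches several of these tags as OpenTelemetry span attributes. The attribute keys and example values are assumptions; align them with whatever naming convention your team (or an emerging community standard) adopts.

```python
from opentelemetry import trace

tracer = trace.get_tracer("genai.demo")

def tag_completion_span(span: trace.Span, *, correlation_id: str, model: str,
                        temperature: float, max_tokens: int, finish_reason: str,
                        prompt_template: str, language: str) -> None:
    # Attribute keys below are illustrative, not an established convention.
    span.set_attribute("app.correlation_id", correlation_id)
    span.set_attribute("llm.model", model)
    span.set_attribute("llm.temperature", temperature)
    span.set_attribute("llm.max_tokens", max_tokens)
    span.set_attribute("llm.finish_reason", finish_reason)
    span.set_attribute("llm.prompt_template", prompt_template)
    span.set_attribute("request.language", language)

with tracer.start_as_current_span("chat_completion") as span:
    tag_completion_span(span, correlation_id="c-123", model="gpt-4",
                        temperature=0.2, max_tokens=512, finish_reason="stop",
                        prompt_template="qa-v1", language="en-US")
```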

ChatCompletions

ChatCompletions, as a fundamental part of conversational AI, present unique challenges and opportunities in monitoring. These completions, being dynamic and tailored to individual user inputs, can vary widely in quality and relevance. Operational monitoring therefore requires specific considerations to ensure the system remains effective during live interactions.

Here are some areas of emphasis:

User Satisfaction Metrics:

  • Session Lengths: Monitoring the duration of user sessions can offer insights into engagement levels. Extended interactions may indicate user satisfaction, while abrupt session ends might hint at issues or frustrations.
  • Repeat Interactions: Tracking how often users return for multiple sessions can serve as a direct indicator of the perceived value and reliability of the chat system.

Abandoned vs. Completed Interactions: Keeping tabs on interactions where users drop off before receiving a response, or immediately after getting one, can help identify potential pitfalls or shortcomings in the AI's response quality or relevancy. Analyzing reasons for abandonment (whether due to long response times, unsatisfactory answers, or system errors) can provide actionable insights for improvements.

Context Switching Frequencies and Metrics: Context is vital in conversations. Monitoring how often the AI system switches contexts within a session can offer clues about its ability to maintain topic consistency. High context-switching might point to issues in the AI's understanding of user intent or its ability to support a coherent conversational flow.
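
As one possible sketch, the following Python snippet derives these session-level signals (session length, completed vs. abandoned, and context-switch rate) from a simple list of chat turns; the ChatTurn fields and the topic classifier they imply are assumptions for illustration.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class ChatTurn:
    timestamp: datetime
    topic: str            # e.g. produced by an upstream intent/topic classifier
    answered: bool        # did the model return a response for this turn?

def session_signals(turns: list[ChatTurn]) -> dict:
    """Compute illustrative user-satisfaction signals for one session."""
    if not turns:
        return {}
    duration_s = (turns[-1].timestamp - turns[0].timestamp).total_seconds()
    abandoned = not turns[-1].answered
    context_switches = sum(1 for prev, cur in zip(turns, turns[1:]) if prev.topic != cur.topic)
    return {
        "session_length_s": duration_s,
        "turns": len(turns),
        "abandoned": abandoned,
        "context_switch_rate": context_switches / max(len(turns) - 1, 1),
    }
```

Signals like these can be emitted as metrics or tags on a session-level trace and then aggregated across users.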

RAG / Embedding Telemetry

Telemetry for embeddings in an operational context is vital for ensuring that AI systems are providing accurate, relevant, and efficient responses to user queries in real time. Capturing specific metrics related to embeddings can offer actionable insights into the system's behavior during live user interactions.

Here are key metrics tailored to operational telemetry for embeddings:

  1. Distance and Similarity Measures: These metrics provide insight into how closely user queries relate to the results the system fetches. Monitoring these measures in real time can help identify whether the system is returning highly relevant, diverse, or irrelevant content to users. For instance, consistently close embedding distances for varied user queries might indicate a lack of diversity in results.
  2. Frequency of Specific Embedding Uses: By keeping tabs on which embeddings are accessed most frequently during live interactions, operators can discern current user preferences and system trends. Frequent access to certain embeddings might indicate high relevance and popularity of specific content. On the flip side, rarely accessed embeddings might hint at content that isn't resonating with users or potential issues with the recommendation or search algorithm.

Incorporating telemetry for these embedding metrics in an operational setting facilitates swift adjustments, ensuring users consistently receive relevant and accurate content. Regular reviews of this telemetry also assist in fine-tuning AI systems to better align with evolving user needs and preferences.
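
The sketch below records cosine similarity between a query embedding and each retrieved document, along with per-document access counts, as OpenTelemetry metrics. The instrument and attribute names are assumptions, and high-cardinality attributes such as document IDs should be used with care.

```python
import math

from opentelemetry import metrics

meter = metrics.get_meter("genai.retrieval")
similarity_histogram = meter.create_histogram("retrieval.similarity",
                                              description="Cosine similarity of retrieved results")
embedding_use_counter = meter.create_counter("retrieval.document.hits",
                                             description="How often each document is retrieved")

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def record_retrieval(query_embedding: list[float],
                     results: list[tuple[str, list[float]]]) -> None:
    # results: (document_id, document_embedding) pairs returned by the vector store
    for doc_id, doc_embedding in results:
        similarity_histogram.record(cosine_similarity(query_embedding, doc_embedding),
                                    {"document.id": doc_id})
        embedding_use_counter.add(1, {"document.id": doc_id})
```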

Monitoring Infrastructure and Tools

Instrumenting applications with telemetry data is critical for understanding and optimizing system performance. A combination of OpenTelemetry and Azure Monitor provides a comprehensive framework for capturing, processing, and visualizing this telemetry. Here's a breakdown of the components and their functionalities:

OpenTelemetry

  • Client SDKs: OpenTelemetry offers client SDKs tailored for many programming languages and environments, such as C#, Java, and Python. These SDKs make it easy for developers to seamlessly integrate telemetry collection into their applications.
  • Collector: Serving as an intermediary, the OpenTelemetry collector orchestrates the telemetry data flow. It consolidates, processes, possibly redacts sensitive data, and then channels this telemetry to designated storage solutions.

Azure Monitor

  • Metrics: Beyond merely storing metrics, Azure Monitor enriches them with visualization tools and alerting capabilities, ensuring teams are always cognizant of system health and performance.
  • Traces: Logs and traces ingested by Azure Monitor undergo a detailed analysis, making it simpler to query and dissect the journey of requests/responses within the system. More on Azure Monitor Traces.
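
As a minimal sketch of wiring these pieces together, the snippet below configures the OpenTelemetry Python SDK to export traces over OTLP, assuming a collector is listening on the default local gRPC endpoint; the collector can then redact sensitive data and forward telemetry to Azure Monitor or another backend.

```python
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Assumes an OpenTelemetry collector is reachable on the default gRPC port.
resource = Resource.create({"service.name": "genai-chat-service"})
provider = TracerProvider(resource=resource)
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4317", insecure=True))
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("genai.chat")
with tracer.start_as_current_span("startup_check"):
    pass  # spans emitted anywhere in the app now flow through the collector
```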

Open Source Tools

  • Prometheus: Renowned for its monitoring capabilities, Prometheus is an open-source system that provides insights by scrutinizing events and metrics. Its versatility allows integration with a range of platforms, including Azure.
  • Grafana: An open-source platform for monitoring and observability, Grafana meshes flawlessly with both OpenTelemetry and Azure Monitor, offering developers advanced visualization tools they can tailor to specific project needs.
  • ElasticSearch: A search and analytics engine, ElasticSearch is often chosen by teams who want a scalable search solution combined with log and event data analytics.

Data Analysis and Insights

Effective monitoring is just the first step. Extracting insights from the deluge of data is what drives meaningful improvements. Here's a brief overview of how to harness this data:

  • Analyze Across Telemetry Types: Dive into logs, traces, and metrics to discern patterns and irregularities. This analysis paves the way for holistic system insights and decision-making.
  • Automated Alerting: Set up automated alerts that notify the team of anomalies or potential issues, ensuring rapid response and mitigation.
  • Correlate Metrics: Correlating disparate metrics can unveil deeper insights, spotlighting areas for enhancement that the team might have otherwise overlooked.
  • Telemetry-driven Feedback Loop: By understanding how models interact with live data and user queries, data scientists and developers can enhance accuracy and user experience.

If you are looking for an end-to-end solution, you can check out tools like Arize and the open-source Phoenix project, which include many of these capabilities out of the box.
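
As a simple illustration of automated alerting, the sketch below flags error-rate spikes against a rolling baseline. The window size and threshold are arbitrary assumptions; in practice, teams would typically rely on the alerting features built into Azure Monitor, Prometheus, or Grafana instead.

```python
from collections import deque
from statistics import mean, pstdev

class ErrorRateAlert:
    """Flag minutes whose error rate is far above the recent baseline."""

    def __init__(self, window: int = 60, sigma: float = 3.0):
        self.history = deque(maxlen=window)  # last `window` per-minute error rates
        self.sigma = sigma

    def observe(self, error_rate: float) -> bool:
        """Return True if this observation should trigger an alert."""
        triggered = False
        if len(self.history) >= 10:  # need a minimal baseline before alerting
            baseline, spread = mean(self.history), pstdev(self.history)
            triggered = error_rate > baseline + self.sigma * max(spread, 0.001)
        self.history.append(error_rate)
        return triggered

alert = ErrorRateAlert()
for minute, rate in enumerate([0.01] * 30 + [0.2]):
    if alert.observe(rate):
        print(f"minute {minute}: error rate {rate:.2f} is anomalous")
```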

Conclusion

Monitoring Generative AI applications isn't just about system health. It's a gateway to refinement, understanding, and evolution. By embedding telemetry and leveraging modern tools, teams can illuminate the intricate workings of their AI systems. This insight, when acted upon, results in applications that are not only robust and efficient but also aligned with user needs. Embrace these practices to ensure your AI applications are always at the forefront of delivering exceptional value.

If this article was helpful, please subscribe to the newsletter and share it with your network on LinkedIn. I'm always looking for feedback and pointers for my own learning, so please comment if you have ideas or insights.

