A Practitioner's Guide to Monitoring Machine Learning Applications

How to prioritize your monitoring efforts while avoiding alert fatigue

The machine learning monitoring landscape is evolving fast. You may be tempted to adopt the latest tool and hope it works out of the box. However, this can lead to many false alerts or missed issues.

In this article, I present an easy-to-implement prioritization approach that you can use with either your own backend monitoring or a vendor monitoring tool. It is based on more than 30 large-scale models I have run in production over the last ten years.

Note: As the image below shows, machine learning monitoring should be added on top of typical backend monitoring. For engineers and data scientists without production experience, see this article, which provides a hands-on introduction to backend monitoring.

Diagram: machine learning monitoring added on top of traditional DevOps monitoring

Traditional Software Monitoring is not Sufficient for Machine Learning Applications


The added complexity of MLOps over DevOps (Source)

When you apply only traditional backend monitoring to machine learning applications, you will experience silent failures. These failures have a massive negative impact on the quality of your application's response, adversely affecting user experience or the company's revenue.

Some examples of silent failures I've personally observed:

  • Changes in input data: A client changed the unit of an input to a fraud model from seconds to milliseconds.
  • Overly aggressive business rules: A filter rule that works well when it is created becomes unexpectedly aggressive during the sale season.
  • Bugs in our code: computing "the last ten orders" instead of "the last ten bought articles".
  • Model performance: A model was automatically retrained and released, but its performance was worse.
  • Dependency updates: We got a faulty version of TensorFlow because we failed to pin the version.
  • The client changes how the product works without notifying us: The wishlist no longer required users to be logged in.

These examples are by no means exhaustive, but they highlight the need for additional monitoring. To make matters worse, many of these bugs are permanent, silently degrading performance by as much as 10 or 15%, while only massive problems (more than 50% degradation) are noticed by users or stakeholders.

How to prioritize monitoring for impact and avoid alert fatigue

"We introduced a machine learning observability tool, and now we get several alerts each week that an input field's distribution has changed. The reasons are mostly upstream business changes or unexplained changes in the input data. We did not take any action on the alerts."

Source: Data scientist working at a market-leading financing platform

Alerts that don't prompt clear actions will be ignored. Many tools and vendors offer input data monitoring, and while monitoring input data is valuable, I advise against it as a first step or a standalone measure.

Instead, I advocate taking a page out of site reliability engineering's book and recommend focusing on customer impact-based metrics. You prioritize backward from the output:


I will cover the top priorities in this article. For a discussion of the remaining steps, please see my presentation from #datalift22.

Measure Evaluation Metrics in Production

For some machine learning applications, you get to know the actual value of your prediction, usually with a delay.

For example: predicting the delivery time of food.

After the food arrives, you can compare your prediction to the actual observed value. The metrics are then calculated over many examples. You can compare them to metrics measured on historical data during model development.

To monitor the evaluation metrics in production, take the following steps:

  • Store the prediction for each request and, later, the observed actual value.
  • Run a job that joins predictions and actual values and calculates the same metrics used during model training and evaluation. Schedule the job every 10 minutes, hourly, or daily; shorter intervals enable faster detection.
  • Add the metrics to a dashboard and create an alert (see the article on Backend Monitoring Basics).
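The join-and-evaluate job above can be sketched as follows. This is a minimal illustration, not a production pipeline; the table columns, the MAE baseline, and the alert factor are assumptions you would replace with your own values from model development.

```python
import pandas as pd

# Hypothetical logs: predictions stored at request time, actuals stored
# once the true outcome (e.g. the real delivery time) is observed.
predictions = pd.DataFrame({
    "request_id": [1, 2, 3, 4],
    "predicted_minutes": [30.0, 45.0, 25.0, 40.0],
})
actuals = pd.DataFrame({
    "request_id": [1, 2, 3],            # request 4 has no actual yet
    "actual_minutes": [33.0, 41.0, 26.0],
})

# Inner join: only evaluate requests whose actual value has arrived.
joined = predictions.merge(actuals, on="request_id", how="inner")

# Use the same metric as during model development, e.g. MAE for regression.
mae = (joined["predicted_minutes"] - joined["actual_minutes"]).abs().mean()

# Compare against the offline evaluation baseline and alert on degradation.
BASELINE_MAE = 2.5    # assumed value from model development
ALERT_FACTOR = 1.5    # assumed: alert if production MAE is 50% worse
if mae > BASELINE_MAE * ALERT_FACTOR:
    print(f"ALERT: production MAE {mae:.2f} exceeds threshold")
```

Scheduled every 10 minutes or hourly, this job would push `mae` to your metrics backend instead of printing, so the dashboard and alert live in the same place as your backend monitoring.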


Monitor the distribution of your response

You should also monitor your application's response distribution. The response is the return value after all postprocessing steps and business rules. For classification models, this can be the prediction score; for regression models, it is the predicted numerical value.

The response value is an excellent proxy for quality monitoring. Unlike evaluation metrics, it does not measure how well the model fits its target function. However, it does change when quality deteriorates (e.g., an aggressive filter removes many high-quality predictions, or a change in an important input variable shifts the output score drastically).

Measuring the response distribution offers many significant benefits. For instance, it is:

  • available in real-time, allowing for fast detection, which is especially important during deployment
  • easy to collect compared to the actual outcome or downstream user interactions like clicks
  • less statistically noisy. For example, a click-rate reduction is only detectable for massive problems, while shifts in the output score are detectable for smaller changes like 5 or 10%

So how do you collect the response value?

  • Create a histogram from the request's returned scores using a Prometheus histogram or your preferred metrics library. Monitor all traffic and important segments by adding labels to your metrics (e.g., country, customer segment, browser).
  • Display the histograms over time. Choose a quantile like 80% to 95% depending on how much traffic you have and how stable it is. If you prefer to compare the whole distribution instead of a single quantile, you can use distribution comparison metrics like Google's D1 metric.
  • Set an alert for the metric.
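The core of the collection step is just bucketed counts per segment. Below is a minimal stdlib sketch of that idea, with assumed bucket edges and example scores; in practice you would use `prometheus_client`'s `Histogram` with label names instead, which exposes the same bucket counts to your dashboards.

```python
import bisect
from collections import defaultdict

# Assumed bucket edges for a score in [0, 1]; with Prometheus these
# correspond to the histogram's upper bucket bounds. Tune to your range.
BUCKETS = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]

# One counter vector per segment label (here: country).
histograms = defaultdict(lambda: [0] * (len(BUCKETS) + 1))

def observe(score: float, country: str) -> None:
    """Record one final response score in the per-country histogram."""
    # bisect_left picks the first bucket whose upper bound >= score,
    # matching Prometheus's "less than or equal" bucket semantics.
    histograms[country][bisect.bisect_left(BUCKETS, score)] += 1

# Example: record a few responses for two segments.
for score in (0.15, 0.42, 0.87):
    observe(score, country="DE")
observe(0.33, country="FR")
```

Graphing these counts over time per label gives you the histogram view; a single quantile (e.g. the 90th percentile) derived from them is what you would alert on.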

Bonus tip: Monitor negative user experiences in a separate metric (e.g., your service returns empty, a low certainty response, or a fallback). Brainstorm proxies for your use case. Create an alert on the percentage of bad responses. Downside monitoring is critical, so don't wait until your stakeholders or users notify you.
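A minimal sketch of the bonus tip, under assumed response fields (`fallback`, `items`) and an assumed 5% alert threshold; your definition of a "bad" response will depend on the proxies you brainstorm for your use case.

```python
# Track the share of bad responses: empty results, fallbacks,
# low-certainty answers, etc.
total = 0
bad = 0

def record(response: dict) -> None:
    """Count a response, flagging it as bad if it is empty or a fallback."""
    global total, bad
    total += 1
    if response.get("fallback") or not response.get("items"):
        bad += 1

# Example traffic: one good response, one empty, one fallback.
for r in [{"items": [1, 2]}, {"items": []}, {"items": [3], "fallback": True}]:
    record(r)

bad_ratio = bad / total
ALERT_THRESHOLD = 0.05    # assumed: alert above 5% bad responses
if bad_ratio > ALERT_THRESHOLD:
    print(f"ALERT: {bad_ratio:.0%} bad responses")
```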

The Limit of Today's Machine Learning Monitoring

It is worth mentioning that today's machine learning monitoring methods will not alert you to individual bad predictions; they work on the whole traffic or on segments. If you run anything where even a single failure is potentially catastrophic, like health-related predictions, consider measures such as easy-to-find objection mechanisms for end users, partial automation instead of full automation, and humans in the loop.

Key Takeaways

  • Prioritize monitoring output metrics (user impact!) like response monitoring and evaluation metrics in production.
  • You can use existing backend monitoring tools to get started and invest in a vendor tool once one reaches the maturity you need.

About the Author

Lina Weichbrodt is a machine learning consultant with 10+ years of experience developing scalable machine learning models for millions of users and running them in production. Follow her on LinkedIn for more insights.

AI Guild Announcements

Not a member of the AI Guild yet?

Apply online at https://www.theguild.ai/.

Do you have a use case you would like to share?

Pitch it to us here!

Special thanks to Evan Simpson for acting as editor of Deploy It Already.
