A Practitioner's Guide to Monitoring Machine Learning Applications

How to prioritize your monitoring efforts while avoiding alert fatigue

The machine learning monitoring landscape is evolving fast. You may be tempted to adopt the latest tool and hope it works out of the box. However, this can lead to many false alerts or missed issues.

In this article, I present an easy-to-implement prioritization approach that you can use with either your own backend monitoring or a vendor monitoring tool. It is based on more than 30 large-scale models I have run in production over the last ten years.

Note: As the image below shows, machine learning monitoring should be added on top of typical backend monitoring. For engineers and data scientists without production experience, see this article, which provides a hands-on introduction to backend monitoring.

Diagram: machine learning monitoring added on top of traditional DevOps monitoring

Traditional Software Monitoring is not Sufficient for Machine Learning Applications


The added complexity of MLOps over DevOps (Source)

When you apply only traditional backend monitoring to machine learning applications, you will experience silent failures. These failures have a massive negative impact on the quality of your application's response, adversely affecting user experience or the company's revenue.

Some examples of silent failures I've personally observed:

  • Changes in input data: A client changed the unit of an input to a fraud model from seconds to milliseconds.
  • Overly aggressive business rules: A filter rule that works well when it is created becomes unexpectedly aggressive during the sale season.
  • Bugs in our code: computing "the last ten orders" instead of "the last ten bought articles".
  • Model performance: A model was automatically retrained and released, but its performance was worse.
  • Dependency updates: We got a faulty version of TensorFlow because we failed to pin the version.
  • The client changes how the product works without notifying us: The wishlist no longer required users to be logged in.

These examples are by no means exhaustive, but they highlight the need for additional monitoring. To make matters worse, many of these bugs are permanent, silently degrading performance by as much as 10 or 15%, while only massive problems (more than 50% degradation) are noticed by users or stakeholders.

How to prioritize monitoring for impact and avoid alert fatigue

"We introduced a machine learning observability tool, and now we get several alerts each week that an input field's distribution has changed. The reasons are mostly upstream business changes or unexplained changes in the input data. We did not take any action on the alerts."

Source: Data scientist working at a market-leading financing platform

Alerts that don't prompt clear actions will be ignored. Many tools and vendors offer input data monitoring, and while monitoring input data is valuable, I advise against it as a first step or a standalone measure.

Instead, I advocate taking a page out of site reliability engineering's book and recommend focusing on customer impact-based metrics. You prioritize backward from the output:


I will cover the top priorities in this article. For a discussion of the remaining steps, please see my presentation from #datalift22.

Measure Evaluation Metrics in Production

For some machine learning applications, you get to know the actual value of your prediction, usually with a delay.

For example: predicting the delivery time of food.

After the food arrives, you can compare your prediction to the actual observed value. The metrics are then calculated over many examples. You can compare them to metrics measured on historical data during model development.

To monitor the evaluation metrics in production, take the following steps:

  • Store the prediction for each request and, later, the observed actual value.
  • Run a job that joins predictions and actual values and calculates the same metrics used during model training and evaluation. Schedule the job every 10 minutes, hourly, or daily; shorter intervals enable faster detection.
  • Add the metrics to a dashboard and create an alert (see the article on Backend Monitoring Basics).
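The join-and-evaluate job above can be sketched as follows. This is a minimal illustration, not a production pipeline; the table columns, the MAE baseline, and the alert factor are assumptions you would replace with your own values from model development.

```python
import pandas as pd

# Hypothetical logs: predictions stored at request time, actuals stored
# once the true outcome (e.g. the real delivery time) is observed.
predictions = pd.DataFrame({
    "request_id": [1, 2, 3, 4],
    "predicted_minutes": [30.0, 45.0, 25.0, 40.0],
})
actuals = pd.DataFrame({
    "request_id": [1, 2, 3],            # request 4 has no actual yet
    "actual_minutes": [33.0, 41.0, 26.0],
})

# Inner join: only evaluate requests whose actual value has arrived.
joined = predictions.merge(actuals, on="request_id", how="inner")

# Use the same metric as during model development, e.g. MAE for regression.
mae = (joined["predicted_minutes"] - joined["actual_minutes"]).abs().mean()

# Compare against the offline evaluation baseline and alert on degradation.
BASELINE_MAE = 2.5    # assumed value from model development
ALERT_FACTOR = 1.5    # assumed: alert if production MAE is 50% worse
if mae > BASELINE_MAE * ALERT_FACTOR:
    print(f"ALERT: production MAE {mae:.2f} exceeds threshold")
```

Scheduled every 10 minutes or hourly, this job would push `mae` to your metrics backend instead of printing, so the dashboard and alert live in the same place as your backend monitoring.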


Monitor the distribution of your response

You should also monitor your application's response distribution. The response is the return value after all postprocessing steps and business rules. For classification models, this can be the prediction score; for regression models, it is the predicted numerical value.

The response value is an excellent proxy for quality monitoring. Unlike evaluation metrics, it does not measure how well the model fits its target function. However, it does change when quality deteriorates (e.g., an aggressive filter removes many high-quality predictions, or a change in an important input variable shifts the output score drastically).

Measuring the response distribution offers many significant benefits. For instance, it is:

  • available in real-time, allowing for fast detection, which is especially important during deployment
  • easy to collect compared to the actual outcome or downstream user interactions like clicks
  • less statistically noisy. For example, a click-rate reduction is only detectable for massive problems, while shifts in the output score are detectable for smaller changes like 5 or 10%

So how do you collect the response value?

  • Create a histogram from the request's returned scores using a Prometheus histogram or your preferred metrics library. Monitor all traffic and important segments by adding labels to your metrics (e.g., country, customer segment, browser).
  • Display the histograms over time. Choose a quantile like 80% to 95% depending on how much traffic you have and how stable it is. If you prefer to compare the whole distribution instead of a single quantile, you can use distribution comparison metrics like Google's D1 metric.
  • Set an alert for the metric.
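The core of the collection step is just bucketed counts per segment. Below is a minimal stdlib sketch of that idea, with assumed bucket edges and example scores; in practice you would use `prometheus_client`'s `Histogram` with label names instead, which exposes the same bucket counts to your dashboards.

```python
import bisect
from collections import defaultdict

# Assumed bucket edges for a score in [0, 1]; with Prometheus these
# correspond to the histogram's upper bucket bounds. Tune to your range.
BUCKETS = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]

# One counter vector per segment label (here: country).
histograms = defaultdict(lambda: [0] * (len(BUCKETS) + 1))

def observe(score: float, country: str) -> None:
    """Record one final response score in the per-country histogram."""
    # bisect_left picks the first bucket whose upper bound >= score,
    # matching Prometheus's "less than or equal" bucket semantics.
    histograms[country][bisect.bisect_left(BUCKETS, score)] += 1

# Example: record a few responses for two segments.
for score in (0.15, 0.42, 0.87):
    observe(score, country="DE")
observe(0.33, country="FR")
```

Graphing these counts over time per label gives you the histogram view; a single quantile (e.g. the 90th percentile) derived from them is what you would alert on.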

Bonus tip: Monitor negative user experiences in a separate metric (e.g., your service returns empty, a low certainty response, or a fallback). Brainstorm proxies for your use case. Create an alert on the percentage of bad responses. Downside monitoring is critical, so don't wait until your stakeholders or users notify you.
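A minimal sketch of the bonus tip, under assumed response fields (`fallback`, `items`) and an assumed 5% alert threshold; your definition of a "bad" response will depend on the proxies you brainstorm for your use case.

```python
# Track the share of bad responses: empty results, fallbacks,
# low-certainty answers, etc.
total = 0
bad = 0

def record(response: dict) -> None:
    """Count a response, flagging it as bad if it is empty or a fallback."""
    global total, bad
    total += 1
    if response.get("fallback") or not response.get("items"):
        bad += 1

# Example traffic: one good response, one empty, one fallback.
for r in [{"items": [1, 2]}, {"items": []}, {"items": [3], "fallback": True}]:
    record(r)

bad_ratio = bad / total
ALERT_THRESHOLD = 0.05    # assumed: alert above 5% bad responses
if bad_ratio > ALERT_THRESHOLD:
    print(f"ALERT: {bad_ratio:.0%} bad responses")
```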

The Limit of Today's Machine Learning Monitoring

It is worth mentioning that today's machine learning monitoring methods will not alert you to individual bad predictions; they work on the whole traffic or on segments. If you run anything where even a single failure is potentially catastrophic, like health-related predictions, consider measures such as easy-to-find objection mechanisms for end users, partial automation instead of full automation, and humans in the loop.

Key Takeaways

  • Prioritize monitoring output metrics (user impact!) like response monitoring and evaluation metrics in production.
  • You can use existing backend monitoring tools to get started and invest in a vendor tool once one reaches the maturity you need.

About the Author

Lina Weichbrodt is a machine learning consultant with 10+ years of experience developing scalable machine learning models for millions of users and running them in production. Follow her on LinkedIn for more insights.

AI Guild Announcements

Not a member of the AI Guild yet?

Apply online at https://www.theguild.ai/.

Do you have a use case you would like to share?

Pitch it to us here!

Special thanks to Evan Simpson for acting as editor of Deploy It Already.
