A Practitioner's Guide to Monitoring Machine Learning Applications
How to prioritize your monitoring efforts while avoiding alert fatigue
The machine learning monitoring landscape is evolving fast. You may be tempted to use the latest tool and hope it works out of the box. However, this can leave you with many false alerts, or with real issues that go unnoticed.
In this article, I present an easy-to-implement prioritization approach that you can use with either your own backend monitoring or a vendor monitoring tool. It is based on more than 30 large-scale models I have run in production over the last ten years.
Note: As the image below shows, machine learning monitoring should be added on top of typical backend monitoring. For engineers and data scientists without production experience, see this article, which provides a hands-on introduction to backend monitoring.
Traditional Software Monitoring is not Sufficient for Machine Learning Applications
Figure: The added complexity of ML Ops over DevOps (Source)
When you apply only traditional backend monitoring to machine learning applications, you will experience silent failures. These failures have a massive negative impact on the quality of your application's response, adversely affecting user experience or the company's revenue.
Some examples of silent failures I've personally observed are
These examples are by no means exhaustive, but they do highlight the need for additional monitoring. To make matters worse, many of these bugs are permanent and degrade performance by as much as 10 or 15%, while users or stakeholders only notice massive problems (more than 50% worse).
How to prioritize monitoring for impact and avoid alert fatigue
"We introduced a machine learning observability tool, and now we get several alerts each week that an input field's distribution has changed. The reasons are mostly upstream business changes or unexplained changes in the input data. We did not take any action on the alerts."
Source: Data scientist working at a market-leading financing platform
Alerts that don't prompt clear actions will be ignored. Many tools and vendors offer input data monitoring, and while monitoring input data is valuable, I advise against it as a first step or a standalone measure.
Instead, I recommend taking a page out of site reliability engineering's book and focusing on metrics based on customer impact. Prioritize backward from the output:
I will cover the top priorities in this article. For a discussion of the remaining steps, please see my presentation from #datalift22.
Measure Evaluation Metrics in Production
For some machine learning applications, you eventually learn the actual value of what you predicted, usually with a delay.
For example: predicting the delivery time of food.
After the food arrives, you can compare your prediction to the actual observed value. The metrics are then calculated over many examples. You can compare them to metrics measured on historical data during model development.
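As a minimal sketch of this comparison for the food delivery example, the snippet below computes the mean absolute error over predictions whose actual delivery time has already arrived and checks it against the value measured during model development. The names `Prediction`, `production_mae`, and `degraded`, the sample data, and the 10% tolerance are illustrative assumptions, not part of any specific tool.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Prediction:
    order_id: str
    predicted_minutes: float


def production_mae(predictions, actual_minutes_by_order):
    """Mean absolute error over predictions whose actual value has arrived."""
    errors = [
        abs(p.predicted_minutes - actual_minutes_by_order[p.order_id])
        for p in predictions
        if p.order_id in actual_minutes_by_order  # the label may still be pending
    ]
    return mean(errors) if errors else None


def degraded(production_value, offline_value, tolerance=0.10):
    """True if the production error exceeds the offline error by more than
    `tolerance` (relative). The 10% default is an assumption to tune."""
    return production_value > offline_value * (1 + tolerance)


# Hypothetical data: order "c" has not been delivered yet, so it is skipped.
predictions = [
    Prediction("a", 30.0),
    Prediction("b", 45.0),
    Prediction("c", 20.0),
]
actuals = {"a": 36.0, "b": 41.0}

mae = production_mae(predictions, actuals)
alert = degraded(mae, offline_value=4.0)
```

In practice the join between predictions and delayed actual values would likely run as a scheduled job over a time window, and the resulting metric would be sent to whatever dashboard and alerting system you already use.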
To monitor the evaluation metrics in production, take the following steps:
Monitor the distribution of your response
You should also monitor your application's response distribution. The response is the value your application returns after all postprocessing steps and business rules. For classification models, this can be the prediction score; for regression models, it is the predicted numerical value.
The response value is an excellent proxy for quality monitoring. Unlike evaluation metrics, it does not measure how well the model fits its target function. However, it does change when the quality deteriorates (e.g., an aggressive filter removes many high-quality predictions, or a change in an important input variable shifts the output score drastically).
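One possible way to quantify a change in the response distribution is to bucket the response values and compare a reference window with the current window. The sketch below uses the population stability index (PSI) for this; the choice of PSI, the bucket edges, the sample data, and the alert threshold are all assumptions for illustration, and other distance measures would work as well.

```python
import bisect
import math


def bucket_shares(values, edges):
    """Share of values per bucket. Bucket i covers [edges[i], edges[i + 1]);
    values outside the edges are counted in the first or last bucket."""
    if not values:
        raise ValueError("need at least one response value")
    counts = [0] * (len(edges) - 1)
    for v in values:
        idx = bisect.bisect_right(edges, v) - 1
        idx = min(max(idx, 0), len(counts) - 1)
        counts[idx] += 1
    return [c / len(values) for c in counts]


def psi(reference, current, edges, eps=1e-6):
    """Population stability index between two samples of response values.
    Empty buckets are clipped to `eps` to avoid taking log(0)."""
    ref = [max(s, eps) for s in bucket_shares(reference, edges)]
    cur = [max(s, eps) for s in bucket_shares(current, edges)]
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))


# Hypothetical prediction scores in [0, 1].
edges = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
last_week = [0.1, 0.3, 0.5, 0.7, 0.9] * 20   # evenly spread scores
today = [0.1] * 80 + [0.9] * 20              # scores collapsed toward the extremes

PSI_ALERT_THRESHOLD = 0.2  # assumed value; tune it on your own traffic

unchanged = psi(last_week, last_week, edges)
shifted = psi(last_week, today, edges)
should_alert = shifted > PSI_ALERT_THRESHOLD
```

A PSI of zero means the bucket shares are identical; larger values mean a larger shift. Comparing against the same weekday or hour in the reference window can help avoid alerts caused by ordinary seasonality.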
Measuring the response distribution offers many significant benefits. For instance, it is:
So how do you collect the response value?
Bonus tip: Monitor negative user experiences in a separate metric (e.g., your service returns empty, a low certainty response, or a fallback). Brainstorm proxies for your use case. Create an alert on the percentage of bad responses. Downside monitoring is critical, so don't wait until your stakeholders or users notify you.
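The bonus tip can be sketched as a single "bad response" rate computed over a window of traffic. The response fields (`items`, `is_fallback`, `confidence`), the confidence cutoff, and the 5% alert threshold below are hypothetical; substitute whatever proxies fit your use case.

```python
def is_bad_response(response, min_confidence=0.5):
    """Proxy for a negative user experience: an empty result, a fallback,
    or a low-certainty response. All field names are hypothetical."""
    return bool(
        not response.get("items")
        or response.get("is_fallback", False)
        or response.get("confidence", 1.0) < min_confidence
    )


def bad_response_rate(responses, min_confidence=0.5):
    """Fraction of responses in the window that count as bad."""
    if not responses:
        return 0.0
    bad = sum(is_bad_response(r, min_confidence) for r in responses)
    return bad / len(responses)


def should_alert(rate, max_rate=0.05):
    """Alert when more than `max_rate` of responses are bad (assumed threshold)."""
    return rate > max_rate


# Hypothetical window of logged responses.
window = [
    {"items": ["pizza"], "confidence": 0.9},                       # fine
    {"items": [], "confidence": 0.9},                              # empty
    {"items": ["salad"], "confidence": 0.9, "is_fallback": True},  # fallback
    {"items": ["sushi"], "confidence": 0.2},                       # low certainty
]

rate = bad_response_rate(window)
alert = should_alert(rate)
```

Keeping this as one percentage makes the alert easy to act on: when it fires, a meaningful share of users is already receiving a degraded response.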
The Limit of Today's Machine Learning Monitoring
It is worth mentioning that today's machine learning monitoring methods will not alert you to every individual bad prediction. They work instead on the whole traffic or on segments of it. If you run anything where even a single failure is potentially catastrophic, like health-related predictions, consider measures like easy-to-find objection mechanisms for end-users, partial automation over full automation, and humans in the loop.
Key Takeaways
About the Author
Lina Weichbrodt is a machine learning consultant with 10+ years of experience developing scalable machine learning models for millions of users and running them in production. Follow her on LinkedIn for more insights.
AI Guild Announcements
Not a member of the AI Guild yet?
Apply online at https://www.theguild.ai/.
Do you have a use case you would like to share?
Special thanks to Evan Simpson for acting as editor of Deploy It Already.