Machine Learning Monitoring, Part 3: What Can Go Wrong With Your Data?
Emeli Dral
Co-founder and CTO Evidently AI | Machine Learning Instructor w/100K+ students
This blog is a part of the Machine Learning Monitoring series. Be sure to check Part 1 on Why Monitoring Matters and Part 2 on Who Should Care About ML Monitoring.
Now, let’s get into more detail.
What Can Go Wrong With The Data?
As the saying goes: garbage in, garbage out. Input data quality is the most crucial component of a machine learning system. Whether or not you have an immediate feedback loop, your monitoring always starts here.
There are two types of data issues one encounters. Put simply: 1) something goes wrong with the data itself; or 2) the data changes because the environment does.
Let us start with the first category. It alone has plenty.
#1 Data processing issues
A machine learning application usually relies on upstream systems to provide inputs. The most trivial, but frequent, case is when the production model does not receive the data at all. Or it receives corrupted or limited data due to some pipeline issue.
Let’s take a marketing example.
The data science team in a bank developed a mighty machine learning system to personalize promo offers sent to clients each month.
This system uses data from an internal customer database, clickstream logs from the internet banking and mobile app, and call center logs. Also, the marketing team manually maintains a spreadsheet where they add this month’s promo options.
All the data streams are merged and stored in a data warehouse. When the model is run, it calculates the necessary features on top of the joint table. The model then ranks the offers for each client based on the likelihood of acceptance and spits out the result.
A simplified pipeline jungle for the promo personalization use case.
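To make the failure modes below more concrete, here is what such a monthly batch run might look like in heavily simplified form. This is a hypothetical sketch, not the actual system: the warehouse interface, table names, join key, and feature columns are all stand-ins.

```python
import pandas as pd

# Hypothetical feature list the model expects.
FEATURE_COLUMNS = ["income_share", "calls_last_month", "app_sessions", "discount"]

def score_promo_offers(warehouse, model, offers: pd.DataFrame) -> pd.DataFrame:
    """Simplified monthly batch run: join the sources, build features, rank offers."""
    # Pull this month's snapshots from the warehouse (table names are made up).
    customers = warehouse.read("customer_profile")
    clickstream = warehouse.read("clickstream_aggregates")
    calls = warehouse.read("call_center_logs")

    # Merge everything into one joint table keyed by client.
    joint = (customers
             .merge(clickstream, on="client_id", how="left")
             .merge(calls, on="client_id", how="left"))

    # A feature with hard-coded column names -- exactly the kind of code
    # that breaks when an upstream rename happens.
    joint["income_share"] = joint["last_month_income"] / joint["total_income"]

    # Score every (client, offer) pair and keep the top-ranked offer per client.
    pairs = joint.merge(offers, how="cross")
    pairs["acceptance_score"] = model.predict_proba(pairs[FEATURE_COLUMNS])[:, 1]
    return (pairs.sort_values("acceptance_score", ascending=False)
                 .drop_duplicates("client_id"))
```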
This pipeline uses multiple data sources, and a different functional owner maintains each of them. Quite some opportunity to mess things up!
Here is an incomplete list of the nasty things that can happen:
- Wrong source. A pipeline points to an older version of the marketing table, or there is an unresolved version conflict.
- Lost access. Someone moved the table to a new location but did not update the permissions.
- Bad SQL. Or not SQL. Whatever you use to query your data. These JOINs and SELECTs might work well until the first complication. Say, a user shows up from a different time zone and performs an action “tomorrow”? The query might not hold up.
- Infrastructure update. You got a new version of a database and some automated spring cleaning. Spaces replaced with underscores, all column names in lowercase. All looks fine until your model wants to calculate its regular feature as “Last month income/Total income”. With hard-coded column titles. Ouch!
- Broken feature code. I dare say, data science code is often not production-grade. It can fail in corner cases. For instance, the promo discounts never exceeded 50% in training. Then marketing introduces a “free” offer and types in 100% for the first time. Some dependent feature code suddenly makes no sense and returns negative numbers.
When data processing goes bad, the model code can simply crash. At least you will learn about the issue fast. But if your Python code has a few broad “try...except” clauses, it might execute on incorrect and incomplete input. The consequences are all yours.
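One way to reduce the risk of failing silently is to validate the inputs explicitly before the feature code runs, instead of relying on broad exception handling. Below is a minimal sketch in Python with pandas; the expected columns and thresholds are invented for illustration and would come from your own training data in practice.

```python
import pandas as pd

# Columns the feature code relies on (hypothetical examples).
EXPECTED_COLUMNS = {"client_id", "last_month_income", "total_income", "promo_discount"}

def validate_input(df: pd.DataFrame) -> pd.DataFrame:
    """Fail loudly on broken input instead of silently scoring on partial data."""
    missing = EXPECTED_COLUMNS - set(df.columns)
    if missing:
        raise ValueError(f"Input is missing expected columns: {missing}")

    if df.empty:
        raise ValueError("Input table is empty: the upstream pipeline may have failed")

    # Corner cases the feature code never saw in training, e.g. a 100% "free" promo.
    if (df["promo_discount"] > 0.5).any():
        raise ValueError("promo_discount is above the 0-50% range seen in training")

    return df

# Usage: validate before computing features and scoring (build_features is illustrative).
# features = build_features(validate_input(raw_batch))
```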
The promo example we looked at uses batch inference, which is less dramatic. You have some room for error. If you catch the pipeline issue in time, you can simply repeat the model run.
In high-load streaming models, the data processing problems multiply (think e-commerce, gaming, or bank transactions).
#2 Data schema change
In other cases, data processing works just fine. But then a valid change happens at the data source. Whatever the reason, new data formats, types, and schemas are rarely good news for the model.
On top of this, the author of the change is often unaware of the impact, or even that a model exists downstream.
Let’s go back to the promo example.
One day, the call center’s operational team decides to tidy up the CRM and enrich the information they collect after each customer call.
They might introduce better, more granular categories to classify calls by the type of issue. They would also ask each client about their preferred communication channel and start to log this in a new field. And since we are here: let’s rename the fields and change their order to make things more intuitive for new users.
Now, that looks neat!
But not so to the model.
In technical terms, this all translates to lost signal.
Unless explicitly told to, the model will not match new categories to the old ones or process extra features. If there is no data completeness check, it will generate a response based on the partial input it knows how to handle.
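What could such a check look like? One simple option is to compare incoming data against a reference captured at training time: the expected columns and the known category values. Here is a rough sketch; the column and category names are hypothetical, not the actual CRM schema.

```python
import pandas as pd

# Reference captured at training time (hypothetical schema and categories).
REFERENCE_COLUMNS = {"client_id", "call_reason", "call_duration"}
KNOWN_CALL_REASONS = {"card_issue", "loan_question", "complaint", "other"}

def check_schema_and_categories(df: pd.DataFrame) -> list[str]:
    """Return warnings about columns or categories that differ from training."""
    warnings = []

    missing = REFERENCE_COLUMNS - set(df.columns)
    if missing:
        warnings.append(f"Columns missing vs. the training schema: {missing}")

    extra = set(df.columns) - REFERENCE_COLUMNS
    if extra:
        warnings.append(f"New columns the model will silently ignore: {extra}")

    if "call_reason" in df.columns:
        unseen = set(df["call_reason"].dropna().unique()) - KNOWN_CALL_REASONS
        if unseen:
            warnings.append(f"Categories not seen in training: {unseen}")

    return warnings
```

In practice, you would alert on these warnings, or block the model run rather than score on partial input.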
This pain is well-known to anyone who deals with catalogs.
For example, in demand forecasting or e-commerce recommendations. Often, you would have some complex features based on category type. Say, “laptop” or “mobile phone” is in “electronics.” That is expensive. Let’s make it a feature. “Phone case” is in “accessories.” That is sort of “cheap.” We’ll use that too.
Then, someone reorganizes the catalog. Now, “mobile phone” and “phone case” are both under “mobile.” A whole different category, with a different interpretation. The model will need to learn it all over again or wait until someone explains what happened.
No magic here. If catalog updates occur often, you’d better factor them into the model design. Otherwise, educate the business users and keep track of sudden changes.
Yes, real-world machine learning can be that brittle. (Image credit: Pixabay)
Some more examples:
- An update in the original business system leads to a change in units of measurement (think Celsius to Fahrenheit) or date formats (DD/MM/YY or MM/DD/YY?).
- New product features in the application add telemetry that the model was never trained on.
- There is a new third-party data provider or API, or an announced change in the format.
The irony is, domain experts can perceive the change as an operational improvement. For example, a new sensor allows you to capture high-granularity data at a millisecond rate. Much better! But the model is trained on the aggregates and expects to calculate them the usual way.
A lack of clear data ownership and documentation makes it harder. There might be no easy way to trace whom to inform about an upcoming data update inside an organization. Data quality monitoring becomes the only way to capture the change.
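One basic form of such monitoring is a range check on key features: a switch from Celsius to Fahrenheit, or a swapped date format, usually pushes values far outside the range seen in training. A minimal sketch with made-up feature names and bounds:

```python
import pandas as pd

# Value ranges observed in training (hypothetical features and bounds).
EXPECTED_RANGES = {
    "temperature_c": (-40.0, 50.0),   # a silent switch to Fahrenheit breaks this
    "promo_discount": (0.0, 0.5),
}

def check_ranges(df: pd.DataFrame) -> list[str]:
    """Flag features whose values fall outside the ranges seen in training."""
    alerts = []
    for column, (low, high) in EXPECTED_RANGES.items():
        if column not in df.columns:
            alerts.append(f"{column}: column is missing")
            continue
        out_of_range = ~df[column].between(low, high)
        if out_of_range.any():
            alerts.append(f"{column}: {out_of_range.mean():.1%} of values outside [{low}, {high}]")
    return alerts
```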
#3 Data loss at the source
The data not only changes. It can also be lost due to some failure at the very source.
Sometimes your pipelines may lead to nowhere. (Image credit: Unsplash)
For example, you lose the application clickstream data due to a bug in logging. A physical sensor breaks, and the temperature is no longer known. An external API is not available, and so on. We want to catch these issues early since they often mean an irreversible loss of future retraining data.
Such outages may affect only a subset of the data, for instance, users in one geography or on a specific operating system. This makes detection harder. Unless another (properly monitored!) system relies on the same data source, the failure can go unnoticed.
Even worse, a corrupted source might still provide the data. For example, a broken temperature sensor will return the last measurement as a constant value. That is hard to spot unless you keep track of “unusual” numbers and patterns.
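A rough way to catch both cases, a silent outage in one segment and a sensor stuck on its last reading, is to track the missing share and the number of distinct values over a recent window, segment by segment. The thresholds and column names below are arbitrary, for illustration only.

```python
import pandas as pd

def check_for_data_loss(df: pd.DataFrame, column: str,
                        max_missing_share: float = 0.1,
                        min_unique_values: int = 2) -> list[str]:
    """Flag a suspiciously high missing share or a near-constant signal."""
    alerts = []

    missing_share = df[column].isna().mean()
    if missing_share > max_missing_share:
        alerts.append(f"{column}: {missing_share:.1%} of values are missing")

    # A broken sensor often keeps repeating the last measurement as a constant.
    if df[column].dropna().nunique() < min_unique_values:
        alerts.append(f"{column}: values are nearly constant over the current window")

    return alerts

# Run it per segment (e.g., per geography or operating system) so that a local
# outage does not get averaged out in the global statistics:
# for segment, part in current_batch.groupby("os_version"):
#     print(segment, check_for_data_loss(part, "temperature_c"))
```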
As with physical failures, we can’t always resolve the issue immediately. But catching it in time helps us quickly assess the damage. If needed, we can update, replace, or pause the model.
#4 Broken upstream models
In more complex setups, you have several models that depend on each other. One model’s output is another model’s input.
This also means: one model’s broken prediction is another model’s corrupted feature.
Take a content or product recommendation engine.
It might first predict the popularity of a given product or item. Then, it makes recommendations to different users, taking into account the estimated popularity. These would be separate models, basically looped into each other. Once an item is recommended to the user, it is more likely to be clicked on, and thus more likely to be seen as “popular” by the first model.
A more tech-y example: a car route navigation system.
First, your system constructs possible routes. Then, a model predicts the expected time of arrival for each of them. Next, another model ranks the options and decides on the optimal route. Which, sort of, influences the actual traffic jams. Once cars follow the suggested routes, this creates a new road situation.
Other models in logistics, routing, and delivery often face the same issue.
These linked systems bear an obvious risk: if something is wrong with one of the models, you get an interconnected loop of problems.
Up Next
All these different issues call for a number of checks on input data quality. Some of these errors are trivial, but they are also the most painful to miss.
How to track them? In the next blog, we will go into detail on what exactly to monitor.
---
This post first appeared at Evidently AI Blog.
At Evidently AI, we build tools for ML model monitoring and performance analytics. Want to stay in the loop?