Key issues (Post-production) in an ML-based solution

In my last article, I talked about the key challenges in AI adoption. Even after organizations successfully build and deploy AI models, they often face problems down the line. For example, let’s say you identify a use case, select the right model, and deploy it, expecting it to bring value to your business.

However, over time, you may notice that the quality of the model’s predictions starts to decline, eventually making it unusable. At this point, you have three choices: update the model, rebuild it, or go through the entire development cycle again.

This is a hard lesson many organizations learn—the job isn’t done once the model is deployed. AI models degrade over time, and continuous monitoring is essential to detect and fix issues before they impact performance.

We'll start by exploring why ML models that work well during development often fail in production. Then, we'll focus on a common and tricky challenge faced by most ML models—data distribution shifts. This happens when the data the model sees in production is different from what it learned from during training. We'll also discuss how to monitor and detect these shifts to keep your model performing well.

Causes of ML System Failures

Before we dive into the causes of ML system failures, let’s first understand what failure means in this context. In traditional software, failure usually means the system doesn’t follow expected logic or operations.

For an ML system, failure is more than just operational—it also includes the model’s performance. Take an OCR system as an example. Its operational expectation is speed—it should process an image and return a result within 2 seconds. Its model expectation is accuracy—it should correctly interpret the text at least 99% of the time.

Now, if you upload an image and don’t get any output, that’s a system failure because it violates the system expectation. However, if you get an output with some errors, it may not be a failure right away, since some mistakes are expected. But if most results are consistently wrong, then the model’s accuracy expectation is broken, making it a failure.

Operational expectation violations are easier to detect because they come with an obvious breakage signal, such as a timeout, a 404 error, or an out-of-memory error. ML performance expectation violations, however, are harder to detect, because doing so requires measuring and monitoring the performance of ML models in production. For this reason, we say that ML systems often fail silently.
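To make the distinction concrete, here is a minimal monitoring sketch for the OCR example above. The run_ocr function and the stream of labeled samples are hypothetical placeholders, and the thresholds simply mirror the expectations described in the text: a latency violation is visible on every single request, while an accuracy violation only shows up in aggregate once you have ground-truth labels.

```python
import time

LATENCY_SLA_SECONDS = 2.0   # operational expectation: respond within 2 seconds
MIN_ACCURACY = 0.99         # model expectation: correct at least 99% of the time

def check_request(image, expected_text, run_ocr):
    """Run one OCR request and record both kinds of expectation."""
    start = time.monotonic()
    predicted_text = run_ocr(image)          # an exception here is a system failure
    latency = time.monotonic() - start
    return {
        "latency_violation": latency > LATENCY_SLA_SECONDS,  # detectable per request
        "correct": predicted_text == expected_text,          # needs a ground-truth label
    }

def accuracy_violated(results, min_accuracy=MIN_ACCURACY):
    """Silent failures only become visible in aggregate, over many labeled requests."""
    if not results:
        return False
    accuracy = sum(r["correct"] for r in results) / len(results)
    return accuracy < min_accuracy
```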


ML-Specific Failures

ML-specific failures are unique to machine learning systems. Some common examples include issues with data collection and processing, incorrect hyperparameters, mismatches between training and inference pipelines, data distribution shifts that degrade model performance over time, unexpected edge cases, and feedback loops that worsen predictions.

These failures are often more challenging than traditional software failures because they are harder to detect and fix. In some cases, they can even make the ML system unusable.

We’ll explore three common problems that arise after deployment:

  1. Production data differs from training data – The model struggles because real-world data looks different from what it was trained on.
  2. Edge cases – Unusual scenarios that the model wasn’t prepared for can lead to incorrect predictions.
  3. Degenerate feedback loops – The model’s own predictions influence future data, leading to a downward spiral in performance.


Production data differing from training data

When we say an ML model "learns" from training data, it means the model understands patterns in the data to make accurate predictions on new, unseen data. If a model can do this well, we say it generalizes to unseen data. The test data used during development is meant to simulate real-world scenarios and help estimate how well the model will perform after deployment.


However, creating a training dataset that truly represents the data a model will see in production is extremely difficult. The real world is complex, constantly changing, and nearly infinite, while training data is limited by time, computing power, and human effort. Biases in data selection and sampling can cause training data to diverge from real-world data, sometimes because of differences as small as a change in data encoding.

This type of divergence leads to a common failure mode known as train-serving skew: a model that does great in development but performs poorly when deployed.

Another key challenge is that the real world isn’t static. Data distributions shift over time. For example, in 2019, searching for “Wuhan” mostly returned travel information. After COVID-19, those same searches shifted towards pandemic-related content. A model trained before this shift would struggle to provide relevant results afterward.

Many ML models perform well when first deployed but degrade over time as data patterns evolve. That’s why continuous monitoring is essential—without it, models can become outdated and unreliable in production.
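A rough sketch of what such monitoring can look like, assuming you log the same numeric feature at training time and in production. The two-sample Kolmogorov-Smirnov test from SciPy is one simple choice among many; the synthetic data, window sizes, and significance threshold below are illustrative only.

```python
import numpy as np
from scipy.stats import ks_2samp

def detect_feature_drift(train_values, prod_values, alpha=0.01):
    """Flag drift when production values are unlikely to come from the
    training distribution (small p-value from the two-sample KS test)."""
    result = ks_2samp(train_values, prod_values)
    return {
        "statistic": result.statistic,
        "p_value": result.pvalue,
        "drift": result.pvalue < alpha,
    }

# Synthetic data standing in for a logged feature before and after a shift.
rng = np.random.default_rng(0)
train_window = rng.normal(loc=0.0, scale=1.0, size=5_000)   # training snapshot
prod_window = rng.normal(loc=0.4, scale=1.0, size=5_000)    # shifted production window
print(detect_feature_drift(train_window, prod_window))       # "drift": True
```

In practice, a check like this would run per feature on rolling windows of production traffic, with an alert when one or more features drift at once.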

Edge cases

Imagine you have an OCR-based solution that delivers 99.99% accuracy. Sounds great, right? But what about the 0.01% of cases where it makes errors? Now, let’s say this OCR system is used to extract data from financial statements that will be presented to senior stakeholders.

Even a tiny margin of error can be unacceptable in high-stakes scenarios like finance, healthcare, or legal documents. A misread figure, an extra zero, or a missing decimal point could lead to serious consequences: misreporting revenue, incorrect financial decisions, or regulatory issues.

Would you trust this solution?


If you’re thinking, "No, I wouldn’t use this OCR model," you’re not alone. An ML model that works well most of the time but fails in critical cases can be unusable—especially when those failures have serious consequences.

This is exactly why many organizations hesitate to fully digitize important documents. Even a small failure rate in processing financial statements, legal contracts, or medical records can lead to costly mistakes, compliance issues, or reputational damage.

What are edge cases? Edge cases are rare but extreme data samples that cause a model to fail catastrophically. They typically still belong to the same general data distribution, but a sudden spike in these failures can mean that the underlying distribution has shifted, a major warning sign that the model is no longer reliable.
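One common mitigation, sketched below, is to route low-confidence predictions to human review instead of trusting them, and to watch the review rate as an early warning signal. The ocr_model object and its predict_with_confidence method are hypothetical placeholders, and the threshold is illustrative and task-dependent.

```python
CONFIDENCE_THRESHOLD = 0.98  # illustrative; tune per task and risk tolerance

def extract_field(image, ocr_model):
    """Trust the OCR output only when the model is confident; otherwise escalate."""
    text, confidence = ocr_model.predict_with_confidence(image)  # hypothetical API
    if confidence < CONFIDENCE_THRESHOLD:
        # Likely edge case: do not auto-fill a financial figure from it.
        return {"text": None, "needs_human_review": True, "confidence": confidence}
    return {"text": text, "needs_human_review": False, "confidence": confidence}

def review_rate(results):
    """A sudden spike in this rate can itself signal a distribution shift."""
    if not results:
        return 0.0
    return sum(r["needs_human_review"] for r in results) / len(results)
```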

Degenerate feedback loops

A normal feedback loop helps improve ML models by collecting user feedback and using it to refine predictions in future iterations. Sometimes, however, feedback loops can go wrong, leading to degenerate feedback loops.

A degenerate feedback loop occurs when a system’s outputs are used to generate its future inputs, which in turn influence its future outputs. In ML, a model’s predictions can shape how users interact with the system, and because those interactions are often fed back as training data to the same model, the loop can reinforce itself and cause unintended consequences. Degenerate feedback loops are especially common in tasks with natural labels from users, such as recommender systems.

Degenerate feedback loops can make ML models worse over time instead of better. Monitoring these loops is crucial for keeping AI systems fair, unbiased, and effective.
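One standard way to weaken such loops in recommender systems is to reserve a small fraction of recommendation slots for random exploration, so that future training data is not shaped entirely by the model’s own ranking. The toy sketch below assumes a hypothetical list of candidate items and scoring function; it is one possible mitigation, not the only one (position debiasing and holdout traffic are common alternatives).

```python
import random

def recommend(candidates, score_fn, k=10, exploration_rate=0.1, seed=None):
    """Return up to k items: mostly top-scored, plus a few random exploration picks."""
    rng = random.Random(seed)
    ranked = sorted(candidates, key=score_fn, reverse=True)[:k]
    n_explore = int(round(exploration_rate * len(ranked)))
    explore_pool = [c for c in candidates if c not in ranked]
    explore_picks = rng.sample(explore_pool, min(n_explore, len(explore_pool)))
    # Replace the lowest-ranked slots with exploration items; interactions logged
    # on these slots give less biased feedback for the next training round.
    return ranked[:len(ranked) - len(explore_picks)] + explore_picks

# Hypothetical usage with toy items scored by past clicks.
items = [{"id": i, "clicks": i % 7} for i in range(100)]
print(recommend(items, score_fn=lambda item: item["clicks"], k=10, seed=42))
```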
