Key issues (Post-production) in an ML-based solution

In my last article, I talked about the key challenges in AI adoption. Even after organizations successfully build and deploy AI models, they often face problems down the line. For example, let’s say you identify a use case, select the right model, and deploy it, expecting it to bring value to your business.

However, over time, you may notice that the quality of the model’s predictions starts to decline, eventually making it unusable. At this point, you have three choices: update the model, rebuild it, or go through the entire development cycle again.

This is a hard lesson many organizations learn—the job isn’t done once the model is deployed. AI models degrade over time, and continuous monitoring is essential to detect and fix issues before they impact performance.

We'll start by exploring why ML models that work well during development often fail in production. Then, we'll focus on a common and tricky challenge faced by most ML models—data distribution shifts. This happens when the data the model sees in production is different from what it learned from during training. We'll also discuss how to monitor and detect these shifts to keep your model performing well.

Causes of ML System Failures

Before we dive into the causes of ML system failures, let’s first understand what failure means in this context. In traditional software, failure usually means the system doesn’t follow expected logic or operations.

For an ML system, failure is more than just operational—it also includes the model’s performance. Take an OCR system as an example. Its operational expectation is speed—it should process an image and return a result within 2 seconds. Its model expectation is accuracy—it should correctly interpret the text at least 99% of the time.

Now, if you upload an image and don’t get any output, that’s a system failure because it violates the system expectation. However, if you get an output with some errors, it may not be a failure right away, since some mistakes are expected. But if most results are consistently wrong, then the model’s accuracy expectation is broken, making it a failure.

Operational expectation violations are easier to detect because they come with an obvious breakage signal, such as a timeout, a 404 error, or an out-of-memory error. ML performance expectation violations, however, are harder to detect, because doing so requires measuring and monitoring the performance of ML models in production. For this reason, we say that ML systems often fail silently.
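To make the distinction concrete, here is a minimal monitoring sketch for the OCR example above. The run_ocr function and the stream of labeled samples are hypothetical placeholders, and the thresholds simply mirror the expectations described in the text: a latency violation is visible on every single request, while an accuracy violation only shows up in aggregate once you have ground-truth labels.

```python
import time

LATENCY_SLA_SECONDS = 2.0   # operational expectation: respond within 2 seconds
MIN_ACCURACY = 0.99         # model expectation: correct at least 99% of the time

def check_request(image, expected_text, run_ocr):
    """Run one OCR request and record both kinds of expectation."""
    start = time.monotonic()
    predicted_text = run_ocr(image)          # an exception here is a system failure
    latency = time.monotonic() - start
    return {
        "latency_violation": latency > LATENCY_SLA_SECONDS,  # detectable per request
        "correct": predicted_text == expected_text,          # needs a ground-truth label
    }

def accuracy_violated(results, min_accuracy=MIN_ACCURACY):
    """Silent failures only become visible in aggregate, over many labeled requests."""
    if not results:
        return False
    accuracy = sum(r["correct"] for r in results) / len(results)
    return accuracy < min_accuracy
```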


ML-Specific Failures

ML-specific failures are unique to machine learning systems. Some common examples include issues with data collection and processing, incorrect hyperparameters, mismatches between training and inference pipelines, data distribution shifts that degrade model performance over time, unexpected edge cases, and feedback loops that worsen predictions.

These failures are often more challenging than traditional software failures because they are harder to detect and fix. In some cases, they can even make the ML system unusable.

We’ll explore three common problems that arise after deployment:

  1. Production data differs from training data – The model struggles because real-world data looks different from what it was trained on.
  2. Edge cases – Unusual scenarios that the model wasn’t prepared for can lead to incorrect predictions.
  3. Degenerate feedback loops – The model’s own predictions influence future data, leading to a downward spiral in performance.


Production data differing from training data

When we say an ML model "learns" from training data, it means the model understands patterns in the data to make accurate predictions on new, unseen data. If a model can do this well, we say it generalizes to unseen data. The test data used during development is meant to simulate real-world scenarios and help estimate how well the model will perform after deployment.


However, creating a training dataset that truly represents the data a model will see in production is extremely difficult. The real world is complex, constantly changing, and nearly infinite, while training data is limited by time, computing power, and human effort. Biases in data selection and sampling can cause training data to diverge from real-world data, sometimes because of differences as small as a change in data encoding.

This type of divergence leads to a common failure mode known as train-serving skew: a model that does great in development but performs poorly when deployed.

Another key challenge is that the real world isn’t static. Data distributions shift over time. For example, in 2019, searching for “Wuhan” mostly returned travel information. After COVID-19, those same searches shifted towards pandemic-related content. A model trained before this shift would struggle to provide relevant results afterward.

Many ML models perform well when first deployed but degrade over time as data patterns evolve. That’s why continuous monitoring is essential—without it, models can become outdated and unreliable in production.
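A rough sketch of what such monitoring can look like, assuming you log the same numeric feature at training time and in production. The two-sample Kolmogorov-Smirnov test from SciPy is one simple choice among many; the synthetic data, window sizes, and significance threshold below are illustrative only.

```python
import numpy as np
from scipy.stats import ks_2samp

def detect_feature_drift(train_values, prod_values, alpha=0.01):
    """Flag drift when production values are unlikely to come from the
    training distribution (small p-value from the two-sample KS test)."""
    result = ks_2samp(train_values, prod_values)
    return {
        "statistic": result.statistic,
        "p_value": result.pvalue,
        "drift": result.pvalue < alpha,
    }

# Synthetic data standing in for a logged feature before and after a shift.
rng = np.random.default_rng(0)
train_window = rng.normal(loc=0.0, scale=1.0, size=5_000)   # training snapshot
prod_window = rng.normal(loc=0.4, scale=1.0, size=5_000)    # shifted production window
print(detect_feature_drift(train_window, prod_window))       # "drift": True
```

In practice, a check like this would run per feature on rolling windows of production traffic, with an alert when one or more features drift at once.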

Edge cases

Imagine you have an OCR-based solution that delivers 99.99% accuracy. Sounds great, right? But what about the 0.01% of cases where it makes errors? Now, let’s say this OCR system is used to extract data from financial statements that will be presented to senior stakeholders.

Even a tiny margin of error can be unacceptable in high-stakes scenarios like finance, healthcare, or legal documents. A misread figure, an extra zero, or a missing decimal point could lead to serious consequences: misreporting revenue, incorrect financial decisions, or regulatory issues.

Would you trust this solution?


If you’re thinking, "No, I wouldn’t use this OCR model," you’re not alone. An ML model that works well most of the time but fails in critical cases can be unusable—especially when those failures have serious consequences.

This is exactly why many organizations hesitate to fully digitize important documents. Even a small failure rate in processing financial statements, legal contracts, or medical records can lead to costly mistakes, compliance issues, or reputational damage.

What are edge cases? Edge cases are rare but extreme data samples that cause a model to fail catastrophically. They typically still belong to the same general data distribution, but a sudden spike in these failures can mean that the underlying distribution has shifted, a major warning sign that the model is no longer reliable.
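One common mitigation, sketched below, is to route low-confidence predictions to human review instead of trusting them, and to watch the review rate as an early warning signal. The ocr_model object and its predict_with_confidence method are hypothetical placeholders, and the threshold is illustrative and task-dependent.

```python
CONFIDENCE_THRESHOLD = 0.98  # illustrative; tune per task and risk tolerance

def extract_field(image, ocr_model):
    """Trust the OCR output only when the model is confident; otherwise escalate."""
    text, confidence = ocr_model.predict_with_confidence(image)  # hypothetical API
    if confidence < CONFIDENCE_THRESHOLD:
        # Likely edge case: do not auto-fill a financial figure from it.
        return {"text": None, "needs_human_review": True, "confidence": confidence}
    return {"text": text, "needs_human_review": False, "confidence": confidence}

def review_rate(results):
    """A sudden spike in this rate can itself signal a distribution shift."""
    if not results:
        return 0.0
    return sum(r["needs_human_review"] for r in results) / len(results)
```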

Degenerate feedback loops

A normal feedback loop helps improve ML models by collecting user feedback and using it to refine predictions in future iterations. Sometimes, however, feedback loops can go wrong, leading to degenerate feedback loops.

A degenerate feedback loop occurs when a system’s outputs are used to generate its future inputs, which in turn influence its future outputs. In ML, a model’s predictions can shape how users interact with the system, and because those interactions are often fed back as training data to the same model, the loop can reinforce itself and cause unintended consequences. Degenerate feedback loops are especially common in tasks with natural labels from users, such as recommender systems.

Degenerate feedback loops can make ML models worse over time instead of better. Monitoring these loops is crucial for keeping AI systems fair, unbiased, and effective.
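One standard way to weaken such loops in recommender systems is to reserve a small fraction of recommendation slots for random exploration, so that future training data is not shaped entirely by the model’s own ranking. The toy sketch below assumes a hypothetical list of candidate items and scoring function; it is one possible mitigation, not the only one (position debiasing and holdout traffic are common alternatives).

```python
import random

def recommend(candidates, score_fn, k=10, exploration_rate=0.1, seed=None):
    """Return up to k items: mostly top-scored, plus a few random exploration picks."""
    rng = random.Random(seed)
    ranked = sorted(candidates, key=score_fn, reverse=True)[:k]
    n_explore = int(round(exploration_rate * len(ranked)))
    explore_pool = [c for c in candidates if c not in ranked]
    explore_picks = rng.sample(explore_pool, min(n_explore, len(explore_pool)))
    # Replace the lowest-ranked slots with exploration items; interactions logged
    # on these slots give less biased feedback for the next training round.
    return ranked[:len(ranked) - len(explore_picks)] + explore_picks

# Hypothetical usage with toy items scored by past clicks.
items = [{"id": i, "clicks": i % 7} for i in range(100)]
print(recommend(items, score_fn=lambda item: item["clicks"], k=10, seed=42))
```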
