Machine Learning Model Deployment Challenges: Is a low average test set error enough?

Your Machine Learning model does well on the test set. The job of the Machine Learning Engineer would be much simpler if it were enough to get a low average test set error, but it isn't! In my blog from a few days ago, I spoke about the concepts of Data Drift and Concept Drift, but there are some additional challenges that must be addressed for a production-ready Machine Learning project.

A machine learning system may have a low test set error, but if its performance on certain disproportionately important examples, key slices of the data, each class of the data, etc. isn't good enough, then the machine learning system is not acceptable. Let us understand this statement through some examples:


Disproportionately important examples:

1) For example, you might have developed a machine learning model for document search and for extracting specific text from a huge database / document management system (this can be compared to a web/Google search). Let us say you search a query of the form:


“Best fatigue life estimation methods for the bore of an IP Compressor”


In such a case, the machine learning model might extract several good methods from several documents of the database / document management system that may be used to life the bore of an IP Compressor of an engine. Some of these methods might be relevant to your search and some might not. Such discrepancies might be acceptable to the stakeholder, as the search query was quite generic.


Now, suppose you type a search query of the form:

“Low Cycle Fatigue Life of IP Compressor bore of XYZ engine of flight profile ABC”

And if the Machine Learning model gives an irrelevant response, this might not be acceptable to the stakeholder, as the query was direct/very specific; your stakeholder might resort to manual intervention for the search, and your ML model might lose all the hype! One could think of assigning higher weights to training examples of the form above, but that might make things a bit complicated; see the sketch below.
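
As an illustration, here is a minimal sketch of up-weighting disproportionately important training examples, assuming a scikit-learn style relevance classifier. The queries, labels and weights below are purely hypothetical, not data from a real search system:

```python
# Minimal sketch: give more weight to the specific, high-stakes queries
# so mistakes on them cost more during training (illustrative data only).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

queries = [
    "best fatigue life estimation methods",          # generic query
    "low cycle fatigue life of IP compressor bore",  # very specific query
    "general overview of compressor design",
    "creep strain effects on disc bore life",
]
labels = [1, 1, 0, 1]  # 1 = retrieved document was relevant, 0 = irrelevant

# Higher weight on the disproportionately important (specific) queries.
sample_weights = [1.0, 5.0, 1.0, 5.0]

X = TfidfVectorizer().fit_transform(queries)
model = LogisticRegression()
model.fit(X, labels, sample_weight=sample_weights)
```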


Thus, evaluation of the model on disproportionately important examples becomes important. This is closely related to the performance of the model on key slices of the dataset, discussed next.


Model evaluation on key slices of the dataset

2) Another closely related example is the evaluation of the model on key slices of the dataset. For example, if one has built a machine learning model for classifying loan approvals for a financial organization or a bank (say: yes/no), then your model might have a high average test set score. But going deeper, you might realize the model is biased towards a gender or towards a certain ethnicity of the customer. This might ultimately affect the business and is surely not acceptable. Thus, the evaluation of the model on certain slices of the dataset relating to features such as ethnicity/gender (in such a case) is important; a sketch follows below.
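
Here is a minimal sketch of slice-based evaluation, assuming the predictions and the sensitive attribute are available in a pandas DataFrame; the column names and values are purely illustrative:

```python
# Minimal sketch: overall accuracy can hide poor performance on a slice,
# so we compute the metric separately per slice (illustrative data only).
import pandas as pd
from sklearn.metrics import accuracy_score

results = pd.DataFrame({
    "gender": ["F", "M", "F", "M", "F", "M"],
    "y_true": [1, 0, 1, 1, 0, 0],
    "y_pred": [1, 0, 0, 1, 0, 1],
})

print("Overall accuracy:",
      accuracy_score(results["y_true"], results["y_pred"]))

# Accuracy computed separately for each slice of interest.
for slice_value, group in results.groupby("gender"):
    acc = accuracy_score(group["y_true"], group["y_pred"])
    print(f"Accuracy for gender={slice_value}: {acc:.2f}")
```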


3) Another example, related to a stress / life estimation engineer (from the world I was born and brought up in!), might be the following: one might have developed a regression model to predict the life of a component. The average test set accuracy might be high, but the model might not be predicting correctly on certain examples involving creep – i.e. where the feature under consideration has been subjected to considerable creep strain, which might be detrimental or might add to life (e.g. because of retardation due to creep). Thus, evaluation of the machine learning model on certain slices of data relating to features such as creep becomes important; see the sketch below.
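
A minimal sketch of this idea for a regression model follows, assuming a pandas DataFrame holding true and predicted lives; the creep-strain threshold, column names and numbers are purely illustrative assumptions:

```python
# Minimal sketch: compare overall regression error with the error on the
# creep-affected slice of the data (illustrative data only).
import pandas as pd
from sklearn.metrics import mean_absolute_error

data = pd.DataFrame({
    "creep_strain":   [0.001, 0.002, 0.015, 0.020, 0.001, 0.018],
    "life_true":      [12000, 11500, 6000, 5200, 12500, 5800],   # cycles
    "life_predicted": [11800, 11700, 8500, 7900, 12300, 8200],
})

CREEP_THRESHOLD = 0.01  # assumed strain level above which creep matters

print("Overall MAE:",
      mean_absolute_error(data["life_true"], data["life_predicted"]))

# A low overall error can mask large errors on the creep-affected slice.
creep_slice = data[data["creep_strain"] > CREEP_THRESHOLD]
print("MAE on creep-affected slice:",
      mean_absolute_error(creep_slice["life_true"],
                          creep_slice["life_predicted"]))
```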


Skewed Datasets – Precision, Recall and F1-Score

4) I have discussed Skewed Datasets, wherein evaluation metrics other than accuracy become important / absolutely necessary. You might refer to my blog post here; thus, I am not repeating the content.


Evaluation metrics for each class:

5) Another example wherein a different handling of the evaluation metrics becomes important is when you are dealing with multi-class classification problems. Let's again revisit the manufacturing problem that I discussed in my blog related to Data Drift and Concept Drift here. Let's say we are identifying defects in a part of an engine and that this is a multi-class classification problem: the defects may be scratches, dents or pits.


[Image: surface of the defect – scratch, dents, traces of wear, micro-scratches]

Even though the overall Precision, Recall and F1-score (for definitions of Precision, Recall and F1-Score, see my blog here) might be satisfactory, we might want to ascertain that the Precision, Recall and F1-Score for each defect class is satisfactory. In that case, we will need to obtain the Precision, Recall and F1-Score for every class, as below:

[Image: example – evaluation metrics required for each class]
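
A minimal sketch of obtaining these per-class metrics with scikit-learn follows; the true and predicted defect labels below are purely illustrative:

```python
# Minimal sketch: per-class Precision, Recall and F1-Score for a
# multi-class defect classifier (illustrative labels only).
from sklearn.metrics import classification_report

y_true = ["scratch", "dent", "pit", "scratch", "pit", "dent", "scratch"]
y_pred = ["scratch", "dent", "dent", "scratch", "pit", "dent", "pit"]

# Prints precision, recall and F1-score for each class, plus averages.
print(classification_report(y_true, y_pred))
```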
