Machine Learning Model Deployment Challenges: Is a low average test set error enough?

Your Machine Learning model does well on the test set. The job of the Machine Learning Engineer would be much simpler if it were enough to get a low average test set error, but it isn't! In my blog from a few days ago, I spoke about the concepts of Data Drift and Concept Drift, but there are some additional challenges that must be addressed for a production-ready Machine Learning project.

A machine learning system may have a low test set error, but if its performance on certain disproportionately important examples, key slices of the data, each class of the data, etc. isn't good enough, then the machine learning system is not acceptable. Let us understand this statement through some examples:


Disproportionately important examples:

1) For example, you might have developed a machine learning model for document search and for extracting specific text from a huge database / document management system (this can be compared to a web/Google search). Let us say you search a query of the form:


“Best fatigue life estimation methods for the bore of an IP Compressor”


In such a case, the machine learning model might extract several good methods from several documents of the database / document management system that may be used to life the bore of an IP Compressor of an engine. Some of these methods might be relevant to your search and some might not. Such discrepancies might be acceptable to the stakeholder, as the search query was quite generic.


Now, suppose you type a search query of the form:

“Low Cycle Fatigue Life of IP Compressor bore of XYZ engine of flight profile ABC”

And if the Machine Learning model gives an irrelevant response, this might not be acceptable to the stakeholder, as the query was direct/very specific; your stakeholder might resort to manual intervention for the search, and your ML model might lose all the hype! One could think of assigning higher weights to training examples of the form above, but that might make things a bit complicated; see the sketch below.
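
As an illustration, here is a minimal sketch of up-weighting disproportionately important training examples, assuming a scikit-learn style relevance classifier. The queries, labels and weights below are purely hypothetical, not data from a real search system:

```python
# Minimal sketch: give more weight to the specific, high-stakes queries
# so mistakes on them cost more during training (illustrative data only).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

queries = [
    "best fatigue life estimation methods",          # generic query
    "low cycle fatigue life of IP compressor bore",  # very specific query
    "general overview of compressor design",
    "creep strain effects on disc bore life",
]
labels = [1, 1, 0, 1]  # 1 = retrieved document was relevant, 0 = irrelevant

# Higher weight on the disproportionately important (specific) queries.
sample_weights = [1.0, 5.0, 1.0, 5.0]

X = TfidfVectorizer().fit_transform(queries)
model = LogisticRegression()
model.fit(X, labels, sample_weight=sample_weights)
```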


Thus, evaluation of the model on disproportionately important examples becomes important. This is closely related to the performance of the model on key slices of the dataset, discussed next.


Model evaluation on key slices of the dataset

2) Another closely related example is the evaluation of the model on key slices of the dataset. For example, if one has built a machine learning model for classifying loan approvals for a financial organization or a bank (say: yes/no), then your model might have a high average test set score. But going deeper, you might realize the model is biased towards a gender or towards a certain ethnicity of the customer. This might ultimately affect the business and is surely not acceptable. Thus, the evaluation of the model on certain slices of the dataset relating to features such as ethnicity/gender (in such a case) is important; a sketch follows below.
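
Here is a minimal sketch of slice-based evaluation, assuming the predictions and the sensitive attribute are available in a pandas DataFrame; the column names and values are purely illustrative:

```python
# Minimal sketch: overall accuracy can hide poor performance on a slice,
# so we compute the metric separately per slice (illustrative data only).
import pandas as pd
from sklearn.metrics import accuracy_score

results = pd.DataFrame({
    "gender": ["F", "M", "F", "M", "F", "M"],
    "y_true": [1, 0, 1, 1, 0, 0],
    "y_pred": [1, 0, 0, 1, 0, 1],
})

print("Overall accuracy:",
      accuracy_score(results["y_true"], results["y_pred"]))

# Accuracy computed separately for each slice of interest.
for slice_value, group in results.groupby("gender"):
    acc = accuracy_score(group["y_true"], group["y_pred"])
    print(f"Accuracy for gender={slice_value}: {acc:.2f}")
```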


3) Another example, related to a stress / life estimation engineer (from the world I was born and brought up in!), might be the following: one might have developed a regression model to predict the life of a component. The average test set accuracy might be high, but the model might not be predicting correctly on certain examples involving creep – i.e. where the feature under consideration has been subjected to considerable creep strain, which might be detrimental or might add to life (e.g. because of retardation due to creep). Thus, evaluation of the machine learning model on certain slices of data relating to features such as creep becomes important; see the sketch below.
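
A minimal sketch of this idea for a regression model follows, assuming a pandas DataFrame holding true and predicted lives; the creep-strain threshold, column names and numbers are purely illustrative assumptions:

```python
# Minimal sketch: compare overall regression error with the error on the
# creep-affected slice of the data (illustrative data only).
import pandas as pd
from sklearn.metrics import mean_absolute_error

data = pd.DataFrame({
    "creep_strain":   [0.001, 0.002, 0.015, 0.020, 0.001, 0.018],
    "life_true":      [12000, 11500, 6000, 5200, 12500, 5800],   # cycles
    "life_predicted": [11800, 11700, 8500, 7900, 12300, 8200],
})

CREEP_THRESHOLD = 0.01  # assumed strain level above which creep matters

print("Overall MAE:",
      mean_absolute_error(data["life_true"], data["life_predicted"]))

# A low overall error can mask large errors on the creep-affected slice.
creep_slice = data[data["creep_strain"] > CREEP_THRESHOLD]
print("MAE on creep-affected slice:",
      mean_absolute_error(creep_slice["life_true"],
                          creep_slice["life_predicted"]))
```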


Skewed Datasets – Precision, Recall and F1-Score

4) I have discussed Skewed Datasets, wherein evaluation metrics other than accuracy become important / absolutely necessary. You might refer to my blog post here; thus, I am not repeating the content.


Evaluation metrics for each class:

5) Another example wherein a different handling of the evaluation metrics becomes important is when you are dealing with multi-class classification problems. Let's again revisit the manufacturing problem that I discussed in my blog related to Data Drift and Concept Drift here. Let's say we are identifying defects in a part of an engine and that this is a multi-class classification problem: the defects may be scratches, dents or pits.


[Image: surface of the defect – scratch, dents, traces of wear, micro-scratches]

Even though the overall Precision, Recall and F1-score (for definitions of Precision, Recall and F1-Score, see my blog here) might be satisfactory, we might want to ascertain that the Precision, Recall and F1-Score for each defect class is satisfactory. In that case, we will need to obtain the Precision, Recall and F1-Score for every class, as below:

[Image: example – evaluation metrics required for each class]
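
A minimal sketch of obtaining these per-class metrics with scikit-learn follows; the true and predicted defect labels below are purely illustrative:

```python
# Minimal sketch: per-class Precision, Recall and F1-Score for a
# multi-class defect classifier (illustrative labels only).
from sklearn.metrics import classification_report

y_true = ["scratch", "dent", "pit", "scratch", "pit", "dent", "scratch"]
y_pred = ["scratch", "dent", "dent", "scratch", "pit", "dent", "pit"]

# Prints precision, recall and F1-score for each class, plus averages.
print(classification_report(y_true, y_pred))
```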
