Technical Debts in ML : MLOps
Only a small fraction of a real-world production ML system is composed of the actual ML code; the required surrounding infrastructure is vast and complex.


DataBuzz(4/n):-

This is the fourth article of the Data Buzz Series (Simplifying AI Research once a week). 4/n represents the fourth article of the n upcoming articles.

You can find my previous articles here

Hidden Technical Debt in Machine Learning Systems

If you have started your journey in the field of Machine Learning Operations (MLOps) as a student or an industry practitioner, this article will help you learn, unlearn or revise the fundamental steps to follow before productionalizing any ML model.

Technical debt refers to the grey areas we should focus on, which can otherwise make us pay a heavy price when an error surfaces in production.

In this article, I will briefly explain such debts in the context of building a Machine Learning model. All the references and a detailed explanation are available in the research paper here:



(I have personally highlighted the important sections and key areas and added references to this wonderful research paper on Hidden Technical Debt in Machine Learning Systems by Google, so do go through it! To save time, skim through the important areas first!)

To save you some time, here are the main highlights of the debts you are likely to pay in ML system development if you are unaware of them:

1. Model Complexity eroding model boundaries -

As the complexity of a Machine Learning model increases, its input signals become entangled: changing the distribution of one feature changes how the model weighs all the others, so no input is ever truly independent. This is called Entanglement, often summarized as CACE: Changing Anything Changes Everything.
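To make entanglement a bit more concrete, here is a minimal, hypothetical sketch (the synthetic data, feature names and choice of a linear model are my own illustration, not from the paper): when two input features are correlated, silently removing one of them changes the weight the model learns for the other, so no change to the inputs is ever truly isolated.

```python
# Minimal illustration of entanglement (CACE): with correlated features,
# changing how one feature is supplied shifts the weight learned for the other.
# Synthetic data; assumes numpy and scikit-learn are installed.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 5_000
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + 0.2 * rng.normal(size=n)          # x2 is correlated with x1
y = 3.0 * x1 + 1.0 * x2 + rng.normal(scale=0.1, size=n)

full = LinearRegression().fit(np.column_stack([x1, x2]), y)
print("weights with x1 and x2:", full.coef_)       # roughly [3.0, 1.0]

# An upstream team silently drops x1 from the feature pipeline.
reduced = LinearRegression().fit(x2.reshape(-1, 1), y)
print("weight with x2 only:   ", reduced.coef_)    # x2 now absorbs x1's signal (~4.5)
```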

It is also tempting to solve a slightly different problem by training a small correction model on top of an existing model's output instead of building a fresh model; this creates a chain of system dependencies known as a Correction Cascade.

And in a complex system, a model's predictions might be consumed by many services and applications, not all of which are known to the Machine Learning engineer or developer. These Undeclared Consumers can be affected by changes to the model and make it difficult to trace the source of a problem, so dependencies on the model's outputs should be checked and access-controlled regularly. Refer to Section 2 here on how to solve this issue.

2. Data Dependencies -

A data dependency is often more expensive than a code dependency and harder to detect. Unstable data dependencies (input signals whose behaviour changes over time) and underutilized data dependencies (inputs that add little value but still create risk) both affect the quality of the data ingested for model training. Refer to Section 3 here for a better understanding.
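A practical defence against unstable data dependencies is to validate the ingested data against an explicit, versioned expectation before training, and fail fast when an upstream feed changes. A minimal sketch using pandas; the column names and bounds below are made up for illustration:

```python
# Guard a training job against an unstable data dependency: fail fast if an
# upstream feed silently changes its schema or value ranges.
# Column names ("age", "income") and bounds are illustrative assumptions.
import pandas as pd

EXPECTED_SCHEMA = {"age": "int64", "income": "float64"}
VALUE_BOUNDS = {"age": (0, 120), "income": (0.0, 1e7)}

def validate_training_frame(df: pd.DataFrame) -> None:
    for col, dtype in EXPECTED_SCHEMA.items():
        if col not in df.columns:
            raise ValueError(f"missing expected column: {col}")
        if str(df[col].dtype) != dtype:
            raise ValueError(f"{col}: expected dtype {dtype}, got {df[col].dtype}")
    for col, (lo, hi) in VALUE_BOUNDS.items():
        if df[col].min() < lo or df[col].max() > hi:
            raise ValueError(f"{col}: values outside expected range [{lo}, {hi}]")

# This frame passes the checks; a changed upstream schema would raise instead.
validate_training_frame(pd.DataFrame({"age": [25, 40], "income": [52_000.0, 88_000.0]}))
```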

3. Feedback Loops -

Direct feedback loops and hidden feedback loops are the two ways a model can end up influencing the selection of its own future training data. While a direct loop is relatively easy to spot and assess, the real trick lies in finding and evaluating hidden feedback loops, where two systems influence each other indirectly through the outside world. Refer to Section 4 here to mitigate these issues.
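Direct loops are easier to show in code than hidden ones, which by definition run indirectly through the world. Here is a toy simulation I put together (not from the paper) of a direct feedback loop: a recommender retrains only on the clicks it logs for the items it chose to show, so an early misestimate of one item is never corrected.

```python
# Toy direct feedback loop: the model influences which data gets logged,
# then retrains on that biased log. Item names and click rates are made up.
import random

random.seed(0)
TRUE_CLICK_RATE = {"item_a": 0.30, "item_b": 0.60}   # item_b is genuinely better
shown = {"item_a": 50, "item_b": 10}                  # historical impressions
clicks = {"item_a": 25, "item_b": 1}                  # item_b got unlucky early on

for _ in range(10_000):
    estimates = {k: clicks[k] / shown[k] for k in shown}
    best = max(estimates, key=estimates.get)          # greedy: show the "best" item
    shown[best] += 1
    if random.random() < TRUE_CLICK_RATE[best]:
        clicks[best] += 1

print({k: round(clicks[k] / shown[k], 2) for k in shown})
# item_a converges to ~0.30 while item_b stays stuck at 0.10: the system never
# learns item_b is better, because it only sees the outcomes of its own choices.
```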

4. Anti-patterns in ML system development -

Some common anti-patterns in ML system development are: reusing a general-purpose open-source package and writing large amounts of supporting code around it (Glue Code), letting data preparation sprawl into more pipelines, scrapes and joins than required (Pipeline Jungles), leaving unused experimental branches lying around in the codebase (Dead Experimental Codepaths), and habits such as using multiple programming languages in one system, which make it slow and inconvenient to test (Common Smells). Refer to Section 5 here to solve such issues.
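One common way to keep glue code and pipeline jungles in check is to declare the whole preprocessing-plus-model flow as a single pipeline object instead of ad-hoc scripts and intermediate files. A minimal sketch using scikit-learn (my choice of library for illustration, not something prescribed by the paper):

```python
# Instead of glue scripts that impute in one place, scale in another and dump
# intermediate files in between, declare the flow as one pipeline that can be
# tested, versioned and deployed as a single unit.
from sklearn.datasets import load_breast_cancer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])
clf.fit(X_train, y_train)
print("held-out accuracy:", clf.score(X_test, y_test))
```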

I have added notes and highlighted all the above concepts in the paper so far, so do read it for a faster understanding!

Moving on,

5. System Configuration -

It often becomes difficult to manage and maintain a model unless a systematic, unified configuration process is used across development, productionalization and maintenance. A uniform configuration system makes it easier to specify data ingestion, reproduce model runs, compare configurations side by side and review changes before they reach production. Refer to Section 6 of the paper here to know more.
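A lightweight way to move in that direction is to treat configuration as code: one versioned, validated object that is logged with every run, rather than flags scattered across scripts. A minimal sketch with a Python dataclass; the field names and defaults are illustrative assumptions, not a tool recommended by the paper.

```python
# Treat configuration as a single, versioned, validated artifact.
# Field names and defaults below are illustrative.
from dataclasses import dataclass, asdict
import json

@dataclass(frozen=True)
class TrainConfig:
    train_data_path: str
    learning_rate: float = 1e-3
    batch_size: int = 64
    num_epochs: int = 10

    def __post_init__(self):
        # Basic validation so a bad config fails at load time, not mid-training.
        if self.learning_rate <= 0:
            raise ValueError("learning_rate must be positive")
        if self.batch_size <= 0 or self.num_epochs <= 0:
            raise ValueError("batch_size and num_epochs must be positive")

config = TrainConfig(train_data_path="data/train.parquet", learning_rate=3e-4)
print(json.dumps(asdict(config), indent=2))   # log the exact config for reproducibility
```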

6. Dynamic changes in the external world -

The actual data going into the model in a production environment can change considerably and may have a completely different distribution than the data the model was trained on. This is called Data Drift, and it is highly probable in a dynamic world. Steps to mitigate such issues have been highlighted in the paper; kindly go through Section 7 here to make the model robust against such fluctuating data.
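As a minimal sketch of monitoring for such drift (my own illustration, not taken from the paper), one can compare the distribution of a live feature against the training distribution with a two-sample test and alert when they diverge:

```python
# Simple drift check: compare a production feature's distribution against the
# training distribution with a two-sample Kolmogorov-Smirnov test.
# The synthetic data and the alert threshold are illustrative assumptions.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)
train_feature = rng.normal(loc=0.0, scale=1.0, size=10_000)   # what the model saw
live_feature = rng.normal(loc=0.5, scale=1.2, size=2_000)     # what production now sends

stat, p_value = ks_2samp(train_feature, live_feature)
if p_value < 0.01:   # the threshold is a monitoring policy choice, not a law
    print(f"drift detected (KS statistic={stat:.3f}, p={p_value:.2e}) - "
          "consider investigating the upstream data or retraining the model")
else:
    print("no significant drift detected")
```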

To summarize, these are the key areas to be wary of before productionalizing any Machine Learning model. Please feel free to comment on any other steps that could be added to the list!

I'll go through one research paper related to AI and Machine Learning every week and highlight key areas and add notes, so that readers can skim through the important sections quickly.

Do share my article if you like it, and subscribe to my newsletter to stay updated on AI/ML research!

Any suggestions/discussions are most welcome in the comments!


Bonus Tip - For beginners in MLOps with prior experience in building a Machine Learning model, this is a great course from Andrew Ng.


Happy Learning!





