MLOps

Rajeev M A

Enterprise Architect at Tata Consultancy Services Focused on Artificial Intelligence

Published: April 2, 2023

Why do many Machine Learning (ML) projects fail? Another way to ask it is: why do many software projects fail? Can it be attributed to how the end-to-end lifecycle of software systems is managed? The synergy between people, process, and technology is critical to successfully implementing and maintaining projects. As I always say, technology is just a means to achieve a business goal or outcome.

The shelf life of many software systems is a decade or more. Typically, development takes 2 to 3 years, and the system is then maintained for more than 7 years with minor enhancements and bug fixes. The people who developed it might not be the ones maintaining it. DevOps started gaining popularity around 2008 in response to a problem the software community faced: development and operations were disconnected. The problem existed long before 2008. DevOps is a journey rather than a destination. We also hear of extensions such as DevSecOps, which adds security to development and operations.

Most of us are familiar with the Software Development Life Cycle (SDLC) or the Software Release Life Cycle (SRLC). Are we familiar with the Software Operations Life Cycle (SOLC)? If the maintenance or operations life cycle is much longer than the development life cycle, why don't we hear more about it? This question is especially critical in ML projects, because the operations life cycle matters a great deal in ML. We learn patterns from data that is subject to constant change, which means the model needs to be rebuilt and monitored continually during its operations life cycle. This article covers end-to-end operations in machine learning.

[Figure: End to End Ops in Machine Learning]

Business Ops: When it comes to business, there are four levels of value: adding value to an existing business process; optimizing an existing business process; creating value by defining a new business process; and creating higher value by synergizing multiple systems through an ecosystem play. The biggest question we need to answer is: how can ML add value to a business process? Whether a business process should be modified or reengineered to use ML should be decided case by case. The Fear Of Missing Out (FOMO) should not drive such decisions; the deciding factor should be the value ML can add to the work.

DataOps: Data is a first-class citizen in machine learning systems because we learn patterns from it. This is different from traditional software development, in which a given set of inputs always produces the same outputs. Traditional software systems are largely deterministic, whereas machine learning systems are stochastic in nature. This makes development, testing, and operations challenging for ML systems. Such systems need to address data governance (quality, lineage, missing data, security, etc.), automated feature engineering, synthetic data generation, data confidentiality, and more.
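
To make the data governance point concrete, here is a minimal sketch of an automated data quality check in Python. It assumes records arrive as a list of dictionaries and that the required columns and valid ranges are known in advance; the function `check_quality`, the `QualityReport` type, and the example columns `age` and `income` are hypothetical and not taken from any particular tool. It counts missing and out-of-range values, two of the simplest checks a DataOps pipeline can run before data reaches training.

```python
from dataclasses import dataclass, field


@dataclass
class QualityReport:
    total_rows: int
    missing_counts: dict = field(default_factory=dict)
    out_of_range: dict = field(default_factory=dict)

    @property
    def passed(self) -> bool:
        # The data passes only if no check found a problem.
        return not any(self.missing_counts.values()) and not any(
            self.out_of_range.values()
        )


def check_quality(rows, required, ranges):
    """Count missing and out-of-range values.

    rows: list of dicts, one per record.
    required: column names that must be present and not None.
    ranges: column name -> (low, high) inclusive bounds for numeric columns.
    """
    report = QualityReport(total_rows=len(rows))
    for col in required:
        report.missing_counts[col] = sum(1 for r in rows if r.get(col) is None)
    for col, (low, high) in ranges.items():
        # Missing values are already counted above, so skip them here.
        report.out_of_range[col] = sum(
            1 for r in rows if r.get(col) is not None and not low <= r[col] <= high
        )
    return report


# Hypothetical records: one missing age, one impossible age.
rows = [
    {"age": 34, "income": 52000},
    {"age": None, "income": 48000},
    {"age": 210, "income": 61000},
]
report = check_quality(rows, required=["age", "income"], ranges={"age": (0, 120)})
print(report.missing_counts, report.out_of_range, report.passed)
```

In a pipeline, a failing report would typically stop the run before training starts. Lineage, security, and confidentiality need their own controls; this sketch covers only quality.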

Supervised learning is predominantly used today, even though unsupervised and semi-supervised learning are gaining traction. Supervised learning requires data annotation, which is a major task in itself, depending on the data to be annotated, its type, and its domain.

ML/Model Ops: Recently I saw an article stating that "MLOps is 98% Data Engineering". MLOps, Data Engineering, and Software Engineering do overlap. The question is: how much? 98% looks too high in my opinion, but the real question is what makes MLOps unique.

Many practical situations require periodic or frequent model building (also called training), because in some tasks and domains the data changes frequently. Automation is the best way to achieve this, so automated Continuous Integration / Continuous Deployment / Continuous Delivery (CI/CD/CD) is a must. Models need to be versioned, and their accuracy validated, continually. Like any other artefact, they need to be managed and governed. General-purpose hardware and accelerators share this space roughly equally, although accelerators are nearly dominant when training extremely large models or training on extremely large datasets.
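
A minimal sketch of the validation step in such a CI/CD pipeline is below. It assumes a held-out labelled dataset, a currently deployed (production) model, and a newly trained candidate; the function names, the 0.80 minimum-accuracy threshold, and the hash-based version id are illustrative assumptions, not the API of any specific MLOps product. The candidate is promoted only if it clears an absolute floor and does not regress against production.

```python
import hashlib
import json


def accuracy(predictions, labels):
    """Fraction of predictions that match the labels."""
    if not labels or len(predictions) != len(labels):
        raise ValueError("predictions and labels must be non-empty and equal length")
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def model_version(params):
    """Content-addressed version id: identical parameters give an identical id."""
    blob = json.dumps(params, sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()[:12]


def should_promote(candidate_acc, production_acc, min_acc=0.80, tolerance=0.0):
    """Promote only if the candidate clears the floor and does not regress."""
    return candidate_acc >= min_acc and candidate_acc >= production_acc - tolerance


# Hypothetical held-out labels and predictions from the two models.
labels = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
prod_preds = [1, 0, 1, 0, 0, 1, 0, 1, 1, 1]
cand_preds = [1, 0, 1, 1, 0, 1, 0, 1, 1, 1]

prod_acc = accuracy(prod_preds, labels)
cand_acc = accuracy(cand_preds, labels)
candidate_params = {"weights": [0.4, -0.2, 0.1], "bias": 0.05}  # stand-in model
if should_promote(cand_acc, prod_acc):
    print(f"promote {model_version(candidate_params)}: {cand_acc:.2f} vs {prod_acc:.2f}")
else:
    print("keep production model")
```

Accuracy is only one possible gate; the same pattern applies to precision, recall, or latency budgets. Deriving the version id from a hash of the parameters means identical models always get the same id, which helps with governance and reproducibility.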

IT Model Ops: There are multiple ways to deploy a model for inference (serving predictions) and multiple devices to deploy it to. On many edge devices, detached intelligence means that no internet connectivity is required once the model is deployed; the device converts data to information and information to knowledge locally. Server-side deployments are mostly exposed as web services for applications to consume. Models are deployed to different hardware depending on their purpose, and the choice of hardware is defined by the 4 P's: Purpose, Performance, Power, and Price. CPUs still dominate the field even though they are general-purpose hardware. Accelerators such as GPUs, FPGAs, or ASICs are used when performance is a critical criterion. For the most part, CPUs are sufficient.
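
Server-side deployment as a web service can be sketched with Python's standard library alone. In this sketch the "model" is a hypothetical fixed linear scorer (`WEIGHTS`, `BIAS`), and the JSON request shape `{"features": [...]}` and port 8080 are assumptions. Request handling is kept in a plain function, `handle_request`, so it can be tested without starting a server; `InferenceHandler` only adapts it to HTTP.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical stand-in for a trained model: a fixed linear scorer.
WEIGHTS = [0.4, -0.2, 0.1]
BIAS = 0.05


def predict(features):
    if len(features) != len(WEIGHTS):
        raise ValueError(f"expected {len(WEIGHTS)} features, got {len(features)}")
    score = BIAS + sum(w * x for w, x in zip(WEIGHTS, features))
    return {"score": score, "label": int(score > 0)}


def handle_request(body):
    """Map a JSON request body to (HTTP status, JSON response bytes)."""
    try:
        payload = json.loads(body)
        result = predict(payload["features"])
        return 200, json.dumps(result).encode("utf-8")
    except (ValueError, KeyError, TypeError) as exc:
        # Malformed JSON, a missing key, or wrong shapes are client errors.
        return 400, json.dumps({"error": str(exc)}).encode("utf-8")


class InferenceHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        status, response = handle_request(self.rfile.read(length))
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(response)))
        self.end_headers()
        self.wfile.write(response)


def serve(host="127.0.0.1", port=8080):
    """Start the service; blocks until interrupted."""
    HTTPServer((host, port), InferenceHandler).serve_forever()
```

Calling `serve()` starts the service on localhost; an application would then POST JSON such as `{"features": [1.0, 0.0, 0.0]}` and receive a score and label back. A production deployment would more likely use a dedicated serving framework, but the contract (features in, prediction out) stays the same.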

There are techniques such as eXtreme Model Optimization (XMO), which include steps like:

Pruning
Quantization
Low-rank approximation and sparsity
Knowledge distillation
Neural Architecture Search (NAS)

Such techniques are used where optimizing the model reduces compute and memory requirements, sometimes at the cost of accuracy. Battery-powered edge devices benefit the most from XMO.
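
Two of the XMO steps listed above, pruning and quantization, are simple enough to sketch on a flat list of weights. The sketch below assumes unstructured magnitude pruning (zeroing the smallest-magnitude weights) and symmetric per-tensor int8 quantization; real toolkits offer many variants (structured pruning, per-channel scales, quantization-aware training), and the function names here are hypothetical.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the fraction `sparsity` of weights with the smallest magnitude."""
    if not 0.0 <= sparsity < 1.0:
        raise ValueError("sparsity must be in [0, 1)")
    k = int(len(weights) * sparsity)  # number of weights to zero
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    drop = set(order[:k])
    return [0.0 if i in drop else w for i, w in enumerate(weights)]


def quantize_int8(weights):
    """Symmetric per-tensor int8 quantization: w is approximated by scale * q."""
    max_abs = max((abs(w) for w in weights), default=0.0)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    return [scale * v for v in q]


w = [0.8, -0.05, 0.3, -0.6, 0.01, 0.2]  # hypothetical layer weights
print(magnitude_prune(w, 0.5))  # the three smallest |w| become 0.0
q, scale = quantize_int8(w)
print(q, dequantize(q, scale))
```

Storing int8 values instead of 32-bit floats cuts weight memory by roughly 4x, and zeroed weights can be skipped by sparse kernels. The price is the rounding error visible after `dequantize`, which is where the possible loss of accuracy comes from.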

Sustainable Ops: Net Zero ML is the ultimate goal of the industry. To achieve it, we need to take small steps such as code and model optimization, increasing the efficiency of traditional algorithms and neural network topologies, using transfer learning instead of full training, reducing overfitting by reducing the number of parameters to learn, and reducing ML's carbon footprint through mixed- or low-precision compute. Many advances in software and hardware are making it easier for organizations to move toward Net Zero ML. We are still far from it, but steady progress is being made.

Human Centered Ops: ML models and pipelines suffer from the same issues humans do: bias, ethics, fairness, interpretability, explainability, and responsibility. After all, machines learn from data collected by humans, through processes defined by humans. Many regulations mandate that such issues be addressed before a model is deployed. The goal is to address them as part of the ML pipeline. Progress is being made in each of these areas. To an outside practitioner, it might look like "Competence without comprehension".
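
Addressing fairness as part of the ML pipeline usually means computing a fairness metric alongside accuracy. The sketch below computes one common metric, the demographic parity difference: the gap between the highest and lowest positive-prediction rates across groups. The groups and predictions are made up for illustration, and which metric and threshold are appropriate depends on the use case and the applicable regulation.

```python
def selection_rate(preds, groups, group):
    """Share of positive predictions (1s) among members of one group."""
    members = [p for p, g in zip(preds, groups) if g == group]
    if not members:
        raise ValueError(f"no samples for group {group!r}")
    return sum(members) / len(members)


def demographic_parity_difference(preds, groups):
    """Highest minus lowest selection rate across groups; 0.0 means parity."""
    rates = {g: selection_rate(preds, groups, g) for g in set(groups)}
    return max(rates.values()) - min(rates.values()), rates


# Hypothetical binary predictions for two groups, "a" and "b".
preds = [1, 1, 0, 1, 0, 0, 1, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]
gap, rates = demographic_parity_difference(preds, groups)
print(rates, gap)
```

A pipeline could fail the build when the difference exceeds an agreed threshold, in the same way as an accuracy gate. Demographic parity is only one definition of fairness; others, such as equalized odds, also take the true labels into account.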

Sec Ops: Security is an integral part of any system, and AI/ML is no different. Since we learn patterns from data, data security is of utmost importance. Attacks such as information (data) poisoning should be monitored, since poisoning can happen gradually over a long period of time. We learn from many open source datasets, and they should be evaluated for relevance. Model security becomes an issue when we use techniques like transfer learning or federated learning: averaging weights from multiple sources can lead to unanticipated issues when low-trust partners are part of the ecosystem. Application and service security is the next level, and it must be protected against malicious actors.
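
The risk of averaging weights from low-trust partners can be illustrated with a small federated learning sketch. `federated_average` weights each client's parameters by its sample count, in the style of FedAvg; `coordinate_median` is one robust alternative that takes the per-parameter median instead. Both function names and the client updates below, including the single malicious one, are invented for illustration.

```python
from statistics import median


def federated_average(client_weights, client_sizes):
    """FedAvg-style aggregation: average parameters weighted by client sample count."""
    total = sum(client_sizes)
    dim = len(client_weights[0])
    return [
        sum(w[i] * n for w, n in zip(client_weights, client_sizes)) / total
        for i in range(dim)
    ]


def coordinate_median(client_weights):
    """Robust alternative: the median of each parameter across clients."""
    return [median(column) for column in zip(*client_weights)]


# Three honest clients with similar updates and one low-trust client.
honest = [[1.0, 2.0], [1.2, 1.8], [0.8, 2.2]]
malicious = [[100.0, -100.0]]
clients = honest + malicious
sizes = [100, 100, 100, 100]

print(federated_average(clients, sizes))
print(coordinate_median(clients))
```

With plain averaging, one malicious client drags the aggregate far from the honest updates (to about 25.75 and -23.5), while the median stays close to them (about 1.1 and 1.9). Robust aggregation reduces the impact of a minority of bad partners, but it does not replace vetting partners and monitoring their contributions.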

Drift Ops: Pattern drift happens all the time; its magnitude and direction differ by task and domain. We learn patterns from data, and data is subject to drift. Ultimately, we need adaptive systems that account for data drift, concept drift, and model drift in order to succeed.
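
Data drift can be monitored by comparing the distribution of live inputs with the distribution seen during training. A minimal sketch using the Population Stability Index (PSI) is below; the bin count, the `eps` smoothing value, and the clamping of out-of-range values into the edge bins are assumptions of this sketch. A commonly quoted rule of thumb treats PSI below 0.1 as stable and above 0.25 as a significant shift, but thresholds should be tuned per task and domain.

```python
import math


def population_stability_index(expected, actual, bins=10, eps=1e-6):
    """PSI between a reference sample (e.g. training data) and a live sample."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # guard against a constant reference feature

    def proportions(values):
        counts = [0] * bins
        for v in values:
            # Bin edges come from the reference; clamp values outside its range.
            idx = min(max(int((v - lo) / width), 0), bins - 1)
            counts[idx] += 1
        # eps keeps empty bins from producing log(0) or division by zero.
        return [max(c / len(values), eps) for c in counts]

    e, a = proportions(expected), proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


reference = [i / 100 for i in range(100)]  # hypothetical training-time feature values
live_same = list(reference)
live_shifted = [v + 0.5 for v in reference]  # simulated drift

print(population_stability_index(reference, live_same))     # 0.0: no drift
print(population_stability_index(reference, live_shifted))  # well above 0.25
```

PSI only detects data drift in a single feature. Concept drift, where the relationship between inputs and labels changes, generally needs labelled feedback, for example by tracking live accuracy with the same kind of gate used at training time.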

The end-to-end lifecycle of a machine learning system is similar to that of other software systems. The success of such systems depends on how well each of their parts is conceptualized, implemented, and operationalized. ML project failures can be attributed to a failure to understand the end-to-end lifecycle. The synergy between people, process, and technology is of utmost importance.
