MLOps
Rajeev M A
Enterprise Architect at Tata Consultancy Services Focused on Artificial Intelligence
Why do many Machine Learning (ML) projects fail? Put another way, why do many software projects fail? Can the failures be attributed to the end-to-end lifecycle management of software systems? The synergy between people, process, and technology is essential for successfully implementing and maintaining projects. As I always say, technology is just a means to achieve a business goal or outcome.
The shelf life of many software systems is a decade or more. That means development happens over 2 to 3 years, and the system is then maintained for 7 years or more with minor enhancements and bug fixes. The people who developed it might not be the ones maintaining it. DevOps started gaining popularity around 2008 in response to a problem the software communities faced: a disconnect between development and operations. The problem existed long before 2008. DevOps is a journey rather than a destination. We now hear of extensions like DevSecOps, which adds security to development and operations.
Most of us are familiar with the Software Development Life Cycle (SDLC) or the Software Release Life Cycle (SRLC). Are we familiar with the Software Operations Life Cycle (SOLC)? If the maintenance or operations life cycle of a system is much longer than its development life cycle, why do we not hear much about it? This question is critical in ML projects because the operations life cycle matters even more in ML. We learn patterns from data that is subject to constant change, which means the model needs to be rebuilt and monitored continually during its operations life cycle. This article talks about end-to-end ops in machine learning.
Business Ops: There are four levels of value when it comes to business: adding value to an existing business process, optimizing the existing business process, creating value by defining a new business process, and creating higher value by synergizing multiple systems through ecosystem play. The biggest question we need to answer is, how can ML add value to a business process? Whether a business process should be modified or reengineered to use ML should be decided on a case-by-case basis. The Fear Of Missing Out (FOMO) should not drive such decisions; the deciding factor should be the value ML can add as part of the work.
DataOps: Data is a first-class citizen in machine learning systems because we learn the patterns from it. This is different from traditional software development, in which any given set of inputs always produces the same outputs. Traditional software systems are largely deterministic, whereas machine learning systems are stochastic. This makes development, testing, and operations challenging in ML systems. Such systems need to address data governance (quality, lineage, missing data, security, etc.), automated feature engineering, synthetic data generation, data confidentiality, etc.
Supervised learning is predominantly used today, even though unsupervised and semi-supervised learning are gaining traction. Supervised learning requires annotated data, and annotation is a major task in itself; the effort depends on the data to be annotated, its type, and its domain.
ML/Model Ops: Recently I saw an article which states "MLOps is 98% Data Engineering". There are overlaps between MLOps, Data Engineering, and Software Engineering. The question is, how much of an overlap? 98% looks too high in my opinion, but the real question is, what makes MLOps unique?
Many practical situations require periodic or frequent model building (also called training) because the data changes frequently in some tasks and domains. Automation is the best way to achieve this. Automated Continuous Integration / Continuous Deployment / Continuous Delivery (CI/CD/CD) is a must. Models need to be versioned and constantly validated for accuracy. Just like any other artefact, they need to be managed and governed. General-purpose hardware and accelerators have an equal share in this space, although accelerators are close to dominant for extremely large models or extremely large-scale training data.
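The versioning and validation step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `ModelRegistry`, the 0.90 accuracy threshold, and the rule that a new model must match or beat the current production model are all hypothetical choices, not a reference to any particular MLOps tool.

```python
import hashlib
import json
from dataclasses import dataclass, field


@dataclass
class ModelRegistry:
    """Hypothetical in-memory registry; a real one would persist artefacts."""
    min_accuracy: float = 0.90  # assumed promotion threshold
    versions: list = field(default_factory=list)

    def production(self):
        """Return the most recently promoted version, or None."""
        promoted = [v for v in self.versions if v["promoted"]]
        return promoted[-1] if promoted else None

    def register(self, params: dict, accuracy: float) -> dict:
        """Version a freshly trained model and decide whether to promote it."""
        # A content hash ties the version to the exact parameters used.
        digest = hashlib.sha256(
            json.dumps(params, sort_keys=True).encode("utf-8")
        ).hexdigest()[:12]
        current = self.production()
        # Promote only if the model clears the threshold and is not worse
        # than the model currently in production.
        promoted = accuracy >= self.min_accuracy and (
            current is None or accuracy >= current["accuracy"]
        )
        entry = {
            "version": len(self.versions) + 1,
            "digest": digest,
            "accuracy": accuracy,
            "promoted": promoted,
        }
        self.versions.append(entry)
        return entry


registry = ModelRegistry()
first = registry.register({"lr": 0.1, "depth": 3}, accuracy=0.92)
second = registry.register({"lr": 0.05, "depth": 4}, accuracy=0.88)
third = registry.register({"lr": 0.05, "depth": 5}, accuracy=0.95)
```

In this run the second model is recorded but not promoted because it falls below the assumed threshold, so every retraining leaves an auditable version while only validated models reach production.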
IT Model Ops: There are multiple ways to deploy the model (also called inference) and multiple devices to which we deploy it. Detached intelligence means that, once the model is deployed to an edge device, no internet connectivity is required to convert data to information and information to knowledge. Server-side deployments are mostly exposed as web services for applications to consume. Models are deployed to varying hardware depending on purpose. What defines the hardware is the 4 P's: Purpose, Performance, Power, and Price. CPUs still dominate the field even though they are general-purpose hardware. Accelerators like GPUs, FPGAs, or ASICs are used when performance is a critical criterion. For the most part, CPUs are sufficient.
There are techniques like eXtreme Model Optimization (XMO) which includes steps like
Such techniques are used in situations where model optimization reduces compute and memory requirements, at times at the cost of accuracy. Edge devices that depend on battery power can benefit the most from XMO.
Sustainable Ops: Net Zero ML is the ultimate goal of the industry. To achieve it we need to take small steps such as code/model optimization, increasing the efficiency of traditional algorithms or neural network topologies, using transfer learning as opposed to full training, reducing overfitting by reducing the number of parameters to learn, and reducing the ML carbon footprint through mixed/low-precision compute. Many advances in software and hardware are making it easier for organizations to move towards the goal of Net Zero ML. We are far away from it, but steady progress is being made.
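As an illustration of what low-precision compute trades away, here is a minimal sketch of symmetric 8-bit quantization in plain Python. The function names and the sample weights are hypothetical, and this is one common scheme chosen for illustration rather than a method the article prescribes. Stored as 8-bit integers, weights take a quarter of the space of 32-bit floats, at the cost of a bounded rounding error.

```python
def quantize_int8(weights):
    """Symmetric linear quantization of floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    # One scale factor for the whole tensor; fall back to 1.0 for all zeros.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [round(w / scale) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Map the integers back to approximate float values."""
    return [q * scale for q in quantized]


weights = [0.5, -1.27, 0.003, 1.0]  # hypothetical layer weights
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)

# Rounding to the nearest integer step bounds the error by half a step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

The small weight 0.003 rounds to zero here, which shows where the accuracy loss comes from: values much smaller than the largest weight lose most of their resolution.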
Human Centered Ops: ML models and pipelines suffer from the same issues humans suffer from, namely bias, ethics, fairness, interpretability, explainability, and responsibility. After all, machines learn from human-collected data, human-defined processes, etc. Many regulations mandate that such issues be addressed prior to deploying the model. The goal is to address such issues as part of the ML pipeline. Progress is being made in each of these areas. To an outside practitioner, it might look like "Competence without comprehension".
Sec Ops: Security is an integral part of any system, and AI/ML is no different. Since we learn patterns from data, data security is of the utmost importance. Threats like information poisoning should be monitored for, since poisoning can happen over a long period of time. We learn from many open source datasets, and they should be evaluated for relevance. Model security becomes an issue when we use techniques like transfer learning or federated learning. Weight averaging from multiple sources can lead to unanticipated issues if there are low-trust partners in the ecosystem. App/services security is the next level, which needs to be protected against malicious actors.
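The weight-averaging risk can be made concrete with a small sketch. The numbers below are invented for illustration, and the coordinate-wise median is a swapped-in robust alternative shown only for contrast; it is not something the article itself proposes.

```python
from statistics import mean, median


def aggregate(updates, reducer):
    """Combine per-partner weight vectors coordinate by coordinate."""
    return [reducer(column) for column in zip(*updates)]


# Hypothetical weight updates from three well-behaved partners.
honest = [[0.10, -0.20], [0.12, -0.18], [0.11, -0.22]]
# One low-trust partner submits an extreme update.
poisoned = honest + [[50.0, 50.0]]

averaged = aggregate(poisoned, mean)    # plain weight averaging
robust = aggregate(poisoned, median)    # coordinate-wise median, for contrast
```

With plain averaging, a single extreme update drags the first coordinate from roughly 0.11 to above 12, while the median stays close to the honest values. That is why the trust level of each participant matters before their weights are blended into a shared model.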
Drift Ops: Pattern drift happens all the time. Its magnitude and direction differ based on task and domain. We learn patterns from data, and data is subject to drift. Ultimately we need adaptive systems that account for data drift, concept drift, and model drift to be successful.
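One simple way to monitor data drift on a single numeric feature is the Population Stability Index (PSI), which compares the binned distribution of live data against the training data. The sketch below is a minimal version under stated assumptions: equal-width bins derived from the reference data, a small floor to avoid taking the logarithm of zero, and synthetic sample values. Any alert threshold applied to the score would be a further assumption to tune per task and domain.

```python
import math


def psi(reference, current, bins=10):
    """Population Stability Index between two samples of one feature."""
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0  # guard against a constant feature

    def shares(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            # Values outside the reference range fall into the edge bins.
            counts[min(max(idx, 0), bins - 1)] += 1
        # A small floor avoids log(0) for empty bins.
        return [max(c / len(values), 1e-6) for c in counts]

    ref, cur = shares(reference), shares(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))


training = [i / 100 for i in range(100)]       # synthetic reference data
live_same = list(training)                     # no drift
live_shifted = [0.5 + v for v in training]     # distribution moved upward

score_same = psi(training, live_same)
score_shifted = psi(training, live_shifted)
```

The identical sample scores zero, and the shifted one scores far higher because half of the reference bins are now empty. A check like this covers data drift only; concept drift and model drift need labelled outcomes or model performance metrics to detect.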
The end-to-end lifecycle of a machine learning system is similar to that of other software systems. The success of such systems depends on how well each of their parts is conceptualized, implemented, and operationalized. The failures of ML projects can be attributed to failures in understanding the end-to-end lifecycle. The synergy between people, process, and technology is of the utmost importance.