AI/ML at Scale in Production: Common pitfalls and how best to avoid them
Executive Summary:
Just about a year ago, I was presenting at the Global Big Data and AI conference. There were about 100 technologists attending my talk.
I started by asking, "Who in this virtual room has developed an ML or AI model for their business?" Being a technology conference, almost 90% of the hands shot up.
Then I asked, "How many of you have that code in production at scale?" Nearly every hand went down.
The obvious question is: why are these ML models not in production? What is the root cause? And if the models are not in production, what value are they providing to the business, beyond completing a proof-of-concept or a "science project"?
Simply put, for AI and ML models to have measurable business impact, they must be brought to production at scale to solve business problems. A completed proof-of-concept is not enough to cut through the AI hype.
AI/ML models must work in production at scale.
There are, of course, good reasons behind these dismal statistics.
The AI and ML model lifecycle is complex and cyclical, with a great deal of experimentation involved. Model creation is a small piece of the overall lifecycle and complexity, particularly when we consider what it takes to bring models to production at scale to solve real-world business problems. The complexity lies in everything else beyond having a good, well-trained model.
Here, "everything else" means the complexities associated with automation, production operations, real-time monitoring, rapid and automated deployment of AI/ML models, model drift, the quality of served data, and model governance, to name a few.
In fact, the real complexities of AI and ML only begin when models are about to be deployed to production with scale, performance, security, and auditability in mind.
Until we tackle these complexities head-on, models will continue to gather dust in the proof-of-concept stage in labs and will never see the light of day.
Five practical thoughts from the trenches of innovation on how to tackle and overcome these complexities
1. AI/ML strategy must align with enterprise business strategy and data strategy.
Like any enterprise-scale initiative, an AI/ML initiative must start with understanding the Why (strategy) and the What (business requirements) before embarking on the How (an ML solution, an NLP solution, or a combination including heuristics) and the Where (infrastructure and tooling; cloud, on-premise, or hybrid; AI on edge servers; and so on). This means establishing full alignment between the business and the cross-functional data science teams, and agreeing on business use cases and how to measure success. The initiative must be tied to business KPIs to garner executive support, not just from a budgetary perspective, but also through the usual thick and thin of the initiative itself. In all likelihood, the AI strategy also needs to be tied to your product strategy.
Our experience suggests that most AI and ML projects don't make it to production because expectations are not well communicated with the business and there is no agreed-upon definition of "what is good enough for the AI to be successful?"
Combining data science expertise with stakeholders' business understanding is essential to achieving tailored, actionable outcomes and creating real value for the business.
2. Focus on data understanding, quality, security, privacy, and governance
Data is the lifeblood of AI. There is no AI or ML without good quality data, and often you also need a lot of it, depending on your use case. The AI use case must be tied to your enterprise data strategy: do you understand the data, do you have the data you need to train your model, is the data already integrated and easily accessible, what is its quality, and is it governed? Technology and architecture also play a big role in data processing, enrichment, and integration, as well as in scale, performance, and reliability. Data governance plays a big role in ensuring sustained data quality.
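To make this concrete, here is a minimal sketch of an automated data quality gate, assuming a pandas DataFrame with hypothetical columns such as customer_id and age; the null-rate threshold and range rules are illustrative:

```python
# A minimal sketch of automated data quality checks; column names and
# thresholds are illustrative assumptions, not from the article.
import pandas as pd

def validate_training_data(df: pd.DataFrame) -> list[str]:
    """Return a list of data quality violations; an empty list means the batch passes."""
    issues = []
    # Completeness: no column should exceed a 5% null rate.
    null_rates = df.isna().mean()
    for col, rate in null_rates[null_rates > 0.05].items():
        issues.append(f"{col}: null rate {rate:.1%} exceeds 5% threshold")
    # Uniqueness: primary keys must not be duplicated.
    if df["customer_id"].duplicated().any():
        issues.append("customer_id contains duplicates")
    # Range checks: simple domain rules for individual fields.
    if not df["age"].between(0, 120).all():
        issues.append("age contains out-of-range values")
    return issues

# Usage: block the pipeline if the incoming batch fails validation.
batch = pd.DataFrame({"customer_id": [1, 2, 3], "age": [34, 45, 29]})
problems = validate_training_data(batch)
if problems:
    raise ValueError("Data quality gate failed: " + "; ".join(problems))
```

Gates like this are most valuable when they run automatically on every new batch of training or serving data, so quality regressions surface before they reach the model.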
Data security and privacy represent a very broad but important domain for any software application, and AI/ML-driven applications are no exception. The ML applications you are productizing will often involve personally identifiable information (PII) or, in health contexts, protected health information (PHI). In addition to making sure your data and environments are well protected, there are specific considerations for your deployed model. First, consider the risks of your model behaving badly. What would happen if your model produced the most erratic output you could imagine? What would be the impact on consumers of such predictions? What financial, reputational, security, or safety risks could result? Depending on the severity of those risks, you may want to implement extra guardrails against erroneous output.
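As one example, a simple guardrail can reject predictions that fall outside a plausible business range before they ever reach a consumer. The sketch below assumes a generic model with a predict method; the bounds and fallback behavior are illustrative assumptions:

```python
# A minimal sketch of a post-prediction guardrail; the bounds and fallback
# are illustrative assumptions, not from the article.
import logging

logger = logging.getLogger(__name__)

def guarded_predict(model, features, lower=0.0, upper=10_000.0, fallback=None):
    """Return the model's prediction only if it falls in a plausible range."""
    raw = float(model.predict([features])[0])
    if lower <= raw <= upper:
        return raw
    # Erratic output: log for investigation and return a safe fallback,
    # e.g., a rules-based estimate or a flag for human review.
    logger.warning("Prediction %s outside [%s, %s]; using fallback", raw, lower, upper)
    return fallback
```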
3. Focus on building skill sets in AI, data engineering, cloud engineering, and modern data architectures and platforms.
Data science is a team sport. It requires a strong cross-functional team with members from business, software engineering, data engineering, software quality engineering, ML engineering, operations, regulatory, compliance, and infrastructure. The skill sets required to pull off data science projects are also many, including knowledge and expertise in statistics, algorithms, data pipeline engineering, the business domain, cloud engineering, test automation, modern data integration techniques, modern SQL, NoSQL and big data technologies, APIs, and data visualization, among others. Knowledge of scaling software at enterprise scale plays a critical role, as does end-to-end automation of the ML lifecycle.
4. Focus on AI/ML model operations at scale - MLOps
Businesses don't realize the full benefits of AI and ML primarily because models are not deployed, or, if they are, they are not deployed at the speed or scale the business needs.
The level of automation of the Data, ML Model, and Code pipelines determines the maturity of the ML process.
Establish Continuous Integration (CI) - Remember that ML is about data, model, and code combined, unlike typical software development where you only need to worry about code. CI therefore needs to be extended to test and validate data and models, as in the sketch below.
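A minimal sketch of such CI checks, written as pytest-style tests; the my_pipeline module, column names, and AUC threshold are hypothetical:

```python
# A minimal sketch of CI checks covering data and model, not just code.
# my_pipeline and its helpers are hypothetical; run these under pytest.
from my_pipeline import load_training_data, train_model, evaluate  # hypothetical module

def test_training_data_schema():
    df = load_training_data()
    expected = {"customer_id", "age", "label"}
    assert expected.issubset(df.columns), "training data is missing required columns"

def test_model_quality_gate():
    df = load_training_data()
    model = train_model(df)
    metrics = evaluate(model, df)
    # Fail the build if the candidate model regresses below the agreed baseline.
    assert metrics["auc"] >= 0.80, f"AUC {metrics['auc']:.3f} below 0.80 quality gate"
```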
Establish Continuous Delivery (CD) - this delivers the ML pipeline that automatically deploys the model to production at scale.
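One simple safeguard in this step is a pre-promotion smoke test against the staged model. The endpoint URL, payload, and expected output range below are illustrative assumptions:

```python
# A minimal sketch of a pre-promotion smoke test in a CD step; the endpoint,
# payload, and output range are hypothetical.
import requests

STAGING_URL = "https://ml-staging.example.com/predict"  # hypothetical endpoint

def smoke_test_staged_model() -> bool:
    """Send a known-good payload to the staged model and sanity-check the response."""
    payload = {"customer_id": 123, "age": 42}
    resp = requests.post(STAGING_URL, json=payload, timeout=5)
    if resp.status_code != 200:
        return False
    prediction = resp.json().get("prediction")
    # The staged model must return a numeric, in-range prediction before promotion.
    return isinstance(prediction, (int, float)) and 0 <= prediction <= 1

if __name__ == "__main__":
    assert smoke_test_staged_model(), "staged model failed smoke test; aborting rollout"
```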
Establish Continuous Training (CT) - this is a property unique to ML systems: the pipeline automatically retrains ML models for re-deployment. This is complex and requires A/B testing, among other things.
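A minimal sketch of a retraining trigger, with a hypothetical pipeline helper and an illustrative tolerance:

```python
# A minimal sketch of a continuous training trigger; the helper and the
# tolerance value are illustrative assumptions.
def start_training_pipeline() -> None:
    # Placeholder: in practice this would submit the training job to an
    # orchestrator and register the new candidate model for A/B testing.
    print("Retraining triggered; new candidate will enter A/B testing")

def maybe_retrain(live_accuracy: float, baseline_accuracy: float,
                  tolerance: float = 0.05) -> bool:
    """Trigger retraining when live accuracy degrades beyond the tolerance."""
    if baseline_accuracy - live_accuracy > tolerance:
        start_training_pipeline()
        return True
    return False

# Usage: compare yesterday's live accuracy against the offline baseline.
maybe_retrain(live_accuracy=0.78, baseline_accuracy=0.86)
```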
Establish Continuous Monitoring (CM) - this establishes monitoring of production data and model performance metrics, which are bound to business metrics and performance.
Establish Model Versioning - In AI/ML projects, you need to version data, code, and model, which is far more complex than versioning code alone in typical software development. On the data side, you need to version data preparation pipelines, the feature store, datasets, and metadata. As part of the modeling phase, you need to version the ML model training pipeline, model objects, hyperparameters, and experiment tracking. On the code side, you need to version code and configurations.
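One way to tie these pieces together, sketched below with an illustrative hashing and metadata scheme, is to record a content fingerprint of the dataset and model artifact alongside the code commit and hyperparameters:

```python
# A minimal sketch of linking data, code, and model into one version record;
# the fields and file layout are illustrative assumptions.
import hashlib
import json

def fingerprint(path: str) -> str:
    """Content hash of a dataset or artifact file, for reproducible lineage."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

def record_model_version(dataset_path: str, model_path: str,
                         git_commit: str, hyperparams: dict) -> dict:
    """Bundle everything needed to reproduce the model into one metadata record."""
    record = {
        "dataset_sha256": fingerprint(dataset_path),
        "model_sha256": fingerprint(model_path),
        "code_commit": git_commit,
        "hyperparameters": hyperparams,
    }
    with open("model_version.json", "w") as f:
        json.dump(record, f, indent=2)
    return record
```

Dedicated experiment-tracking and model-registry tools serve the same purpose at scale; the point is that every deployed model should be traceable to the exact data, code, and parameters that produced it.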
Establish Model Monitoring - Once the AI/ML model has been deployed, it needs to be monitored to assure that it performs as expected. This is complex because model behavior depends on many factors, including data changes, changes in source systems, and upgrades in other dependencies. For example, monitoring model drift, that is, the degradation of the model's predictive quality on served data, fits into this category. Both dramatic and slow-leak regressions in prediction quality must be detected and rectified. It is important to identify the elements to monitor and create an actionable model monitoring strategy before deploying the model to production. Continuous monitoring alleviates common production concerns, including model drift, data quality issues, and unexpected changes in upstream systems.
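A common drift check is the population stability index (PSI), which compares a feature's distribution at training time against what the model sees in production. The sketch below uses an illustrative 0.2 alert threshold, a widely used rule of thumb:

```python
# A minimal sketch of drift detection via the population stability index (PSI);
# the synthetic data and 0.2 threshold are illustrative.
import numpy as np

def psi(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    """PSI between training-time (expected) and production (actual) values."""
    edges = np.histogram_bin_edges(expected, bins=bins)
    e_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    a_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    # Avoid division by zero and log(0) in sparse bins.
    e_pct = np.clip(e_pct, 1e-6, None)
    a_pct = np.clip(a_pct, 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

train_ages = np.random.normal(40, 10, 10_000)
live_ages = np.random.normal(48, 12, 10_000)   # the served distribution has shifted
if psi(train_ages, live_ages) > 0.2:
    print("Significant drift detected; investigate and consider retraining")
```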
5. Focus on measuring model fairness, trust, interpretability and explainability
Because of varied deployment processes, multiple modeling languages, and the lack of a centralized view of AI in production across the organization, businesses need time-consuming and costly audit processes to ensure compliance. Instituting model governance helps with production access control and traceable model results.
Model fairness, interpretability, and explainability are critical, not only for the business but for data scientists, researchers, and developers as well. This way, we can explain the models and understand the value and accuracy of their findings. Interpretability is also important for debugging machine learning models and making informed decisions about how to improve them.
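As a minimal sketch of one interpretability technique, permutation importance measures how much a model's performance drops when each feature is shuffled; large drops indicate features the model genuinely relies on. The toy dataset and feature names below are illustrative:

```python
# A minimal sketch of interpretability via scikit-learn's permutation
# importance on a synthetic dataset; feature names are illustrative.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Shuffle each feature in turn and measure how much held-out accuracy drops.
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)
for i in result.importances_mean.argsort()[::-1]:
    print(f"feature_{i}: importance {result.importances_mean[i]:.3f}")
```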
Looking Ahead
AI/ML is not hard. What is painfully hard is implementing and deploying it at scale in production in a methodical, disciplined manner, following sound practices of software engineering, data engineering and data integration, AI/ML model SDLC, DevOps, and MLOps, while ensuring data privacy, security, compliance, and model explainability.
This requires skill sets and expertise across a variety of functional and technical areas, beyond just having AI/ML engineers, as described in this article.
Data Science is a true team sport.
While cloud computing, distributed computing, and the availability of open source technologies, libraries, and pre-trained models have certainly lowered the barrier to entry for AI/ML, businesses have painfully realized over the years that AI/ML is not just another technology they can buy off the shelf, install, and start using for immediate value creation. Instead, they have realized that AI/ML requires continued executive-level support and the necessary funding and resources, tied to business, product, and data strategy.
Businesses are also starting to realize that they need to tolerate failure and encourage data science teams to be bold, using failures as learning opportunities so that the team can do better and succeed the next time. This is a different mindset from what is normally required for developing enterprise applications and platforms.
The practical tips laid out in this article as best practices will benefit you and your team as you start thinking about instituting AI/ML for value creation at scale by bringing your models from the lab to the wild.
As AI practitioners, that should be our goal, not just completing proof-of-concepts.