Machine Learning Ops for Public Sector

Introduction

Machine Learning DevOps (MLOps) is an organizational change that relies on a combination of People, Process, and Technology to deliver machine learning solutions in a robust, scalable, and reliable way.

MLOps is particularly important because it enables organizations to bring their models from experimentation into actual value-generating use.

More broadly, machine learning is becoming central to Singapore government agencies: the Singapore government has enacted the National AI Strategy, which is targeted across the whole of government and, in particular, at seven key pillars.

Let’s begin with why MLOps is important for the Singapore government, and how Microsoft can help with the adoption of machine learning capabilities more broadly.

Why MLOps?

Modern machine learning algorithms and frameworks are making it increasingly easy to develop models that can make accurate predictions.

You may have built a fantastic machine learning model that exceeds all your accuracy expectations and impresses your sponsors, so it is now time to deploy the model into production. Unfortunately, it is not as easy as you had anticipated: there are likely many things to put in place before your model can finally be put to use.

Over time, you or one of your colleagues may develop a new model that could perform better than the old one, but can you carefully implement it without potentially disrupting business? It may also be necessary for regulatory purposes to recreate the model and explain its predictions when unusual or biased predictions are made. The data fed to your training process and model can change over time, and it may be necessary to retrain the model periodically to maintain the accuracy of its predictions. Who will have responsibility for feeding the data, monitoring the performance, retraining the model, and fixing it should it fail?

If you experience these problems, you may want to consider implementing an MLOps strategy for your project. At a high level, MLOps refers to the application of DevOps principles to AI-infused applications.

Let’s consider one very common use case: suppose we have an application that serves a model’s predictions via an API. Even such a simple use case can face many issues in production. Some MLOps tasks fit well within the general DevOps framework, such as setting up unit tests and integration tests, or tracking changes through version control. Other tasks are specific to MLOps, such as:

  • how to enable continuous experimentation and comparison against a baseline model;
  • how to monitor the incoming data to detect data drift (see the sketch after this list);
  • how to trigger model retraining and set up a rollback just in case; and
  • how to create reusable data pipelines that can be leveraged for both training and scoring.
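
To make the drift item concrete, here is a minimal sketch of one way to flag drift on a single numeric feature, using a two-sample Kolmogorov-Smirnov test from SciPy. The feature values, sample sizes and significance threshold are illustrative assumptions, not part of any specific Azure ML API.

```python
# A minimal drift-check sketch, assuming a single numeric feature; the
# significance level and data are illustrative.
import numpy as np
from scipy.stats import ks_2samp

def detect_drift(reference: np.ndarray, incoming: np.ndarray,
                 alpha: float = 0.05) -> bool:
    """Two-sample Kolmogorov-Smirnov test: True if the incoming batch is
    statistically distinguishable from the training-time reference."""
    _statistic, p_value = ks_2samp(reference, incoming)
    return p_value < alpha

# Compare a feature's training distribution with a production batch.
rng = np.random.default_rng(42)
train_feature = rng.normal(loc=0.0, scale=1.0, size=10_000)
prod_feature = rng.normal(loc=0.4, scale=1.0, size=1_000)  # shifted mean

if detect_drift(train_feature, prod_feature):
    print("Drift detected: consider triggering retraining.")
```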

Ultimately, the goal of MLOps is to close the gap between development and production and deliver value faster. To achieve this, we need to rethink how things are done in development and in production. To what extent Data Scientists specifically are expected to be involved in MLOps is an organizational choice, as the role of Data Scientist itself is defined differently across organizations. We recommend you check out the MLOps maturity model to see where you are and where you want to be on the maturity scale.

How is Machine Learning DevOps different from DevOps?

Data Science projects differ from App Dev or Data Engineering projects: they may or may not make it to production. After an initial analysis, it might become clear that the business outcome cannot be achieved with the available datasets. For this reason, an exploration phase is usually the first step in a Data Science project.

The objective in this phase is to define and refine the problem and to run exploratory data analysis, in which statistics and visualizations are used to confirm or falsify the problem hypotheses. There needs to be a common understanding that the project may not extend beyond this phase, and it is important to make this phase as seamless as possible to allow a quick turnaround. Unless security requirements enforce specific processes and procedures, these should be kept to a minimum, and the Data Scientist should be allowed to work with the tools and data of their choice. Real data is needed for this exploration work.

The experimentation and development stage usually begins when there is enough confidence that the Data Science project is feasible and can provide real business value. This is the stage at which dev practices become increasingly important. It is good practice to capture metrics for all the experiments done at this stage, and to incorporate source control so that it is possible to compare models and move back and forth between various versions of the code if needed.

Development activities include the refactoring, testing and automation of exploration code into repeatable experimentation pipelines, as well as the creation of model serving applications and pipelines. Refactoring code into more modular components and libraries helps increase reusability and testability, and it allows for performance optimization.
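
As an illustration of that refactoring step, here is a minimal sketch that pulls a feature computation out of an exploration notebook into a small, testable function; the column names are hypothetical.

```python
# A minimal refactoring sketch: exploration code pulled into a reusable,
# testable featurizer. The column names are hypothetical.
import pandas as pd

def add_age_features(df: pd.DataFrame,
                     dob_column: str = "date_of_birth") -> pd.DataFrame:
    """Derive an 'age_years' feature from a date-of-birth column.

    Extracted from the exploration notebook so the same logic can be unit
    tested and shared by both the training and the scoring pipelines.
    """
    result = df.copy()
    dob = pd.to_datetime(result[dob_column])
    result["age_years"] = (pd.Timestamp.now() - dob).dt.days / 365.25
    return result
```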

Finally, what is deployed into staging and production environments is the model serving application or the batch inference pipelines. Beyond monitoring infrastructure reliability and performance, as is done for a regular application with traditional DevOps, the quality of the data, the data profile, and the model must be continuously monitored to guard against degradation or drift. ML models require retraining over time to stay relevant in a changing environment.

Seven principles of Machine Learning DevOps

When looking to adopt MLOps for your next machine learning project, consider applying the following core principles as the foundation to any project.

1. Version control code, data and experimentation outputs

Unlike traditional software, data has a direct influence on the quality of machine learning models. Besides versioning your experimentation code base, version your datasets to ensure reproducibility of experiments or inferencing results. Versioning experimentation outputs like models can save effort and the computational cost of recreation.
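
As a sketch of what dataset versioning can look like with the Azure Machine Learning (v1) Python SDK; the workspace configuration, datastore path and dataset name below are assumptions:

```python
# A minimal sketch of dataset versioning with the Azure ML (v1) Python SDK.
# The workspace config, datastore path and dataset name are assumptions.
from azureml.core import Workspace, Dataset

ws = Workspace.from_config()  # reads config.json for the target workspace
datastore = ws.get_default_datastore()

# Register a tabular dataset; create_new_version=True keeps every prior
# version retrievable, so experiments stay reproducible.
dataset = Dataset.Tabular.from_delimited_files(
    path=(datastore, "training-data/*.csv"))
dataset = dataset.register(workspace=ws,
                           name="citizen-services-training",
                           create_new_version=True)

# Later, pin an experiment to the exact version that produced a model.
pinned = Dataset.get_by_name(ws, name="citizen-services-training", version=1)
```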

2. Use multiple environments

To segregate development and testing from production work, replicate your infrastructure in at least two environments. Access control for users might differ in each environment.

3. Manage infrastructure and configurations as code

When creating and updating infrastructure components in your work environments, make use of infrastructure-as-code to prevent inconsistencies between environments. In addition, manage machine learning experiment job specifications as code, so that you can easily rerun and reuse a version of your experiment across environments.
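
A minimal sketch of keeping a job specification as code with the Azure ML (v1) Python SDK; the training script, compute cluster and conda file are assumed to exist:

```python
# A minimal sketch of a job specification kept as code with the Azure ML
# (v1) Python SDK; the script, compute cluster and conda file are assumed.
from azureml.core import Workspace, Experiment, ScriptRunConfig, Environment

ws = Workspace.from_config()

# Because the whole job definition lives in code, the same experiment can
# be rerun unchanged against dev, test, and prod workspaces.
env = Environment.from_conda_specification(name="train-env",
                                           file_path="environment.yml")
config = ScriptRunConfig(source_directory="./src",
                         script="train.py",
                         compute_target="cpu-cluster",
                         environment=env)

run = Experiment(ws, name="demand-forecast").submit(config)
```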

4. Track and manage machine learning experiments

Track the performance KPIs and other artifacts of your machine learning experiments. Keeping a history of job performance allows for a quantitative analysis of experimentation success and enables greater team collaboration and agility.
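
A minimal experiment-tracking sketch using the open-source MLflow API (an Azure ML workspace can also act as an MLflow tracking server); the parameter and metric names are illustrative:

```python
# A minimal experiment-tracking sketch with the open-source MLflow API.
# Parameter and metric names are illustrative.
import mlflow

with mlflow.start_run(run_name="baseline-logistic-regression"):
    mlflow.log_param("regularization", 0.1)
    mlflow.log_metric("val_auc", 0.87)
    mlflow.log_metric("val_f1", 0.81)
    # Plots and model files can be attached to the run as artifacts, e.g.:
    # mlflow.log_artifact("confusion_matrix.png")
```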

5. Test code, validate data integrity, model quality

Test your experimentation code base, including the correctness of data preparation functions and featurizers, with checks on data integrity as well as on obtained model performance.
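
A minimal pytest-style sketch of such tests, reusing the hypothetical `add_age_features` featurizer from earlier; the module name, data path and column names are assumptions:

```python
# A minimal pytest-style sketch: one test for a data preparation function
# and one data integrity check. Module and file paths are hypothetical.
import pandas as pd

from features import add_age_features  # hypothetical module with the featurizer above

def test_add_age_features_produces_positive_ages():
    df = pd.DataFrame({"date_of_birth": ["1980-01-01", "2000-06-15"]})
    result = add_age_features(df)
    assert (result["age_years"] > 0).all()

def test_training_data_has_no_missing_labels():
    df = pd.read_csv("data/training.csv")  # hypothetical dataset location
    assert df["label"].notna().all(), "training data contains missing labels"
```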

6. Machine Learning Continuous Integration and Delivery

Use continuous integration to automate test execution in your team. Include model training as part of continuous training pipelines, and include A/B testing as part of your release, to ensure that only a model of sufficient quality lands in production.
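
As one possible shape for that quality gate, here is a sketch of a CI step that registers a challenger model only when it beats the current champion; the model name, metric tag and metric values are assumptions, and a real gate would read the metric from the training run rather than hard-code it:

```python
# A minimal quality-gate sketch for a CI pipeline: register the challenger
# only if it beats the current champion. Model name, tag and metric values
# are assumptions.
from azureml.core import Workspace, Model

ws = Workspace.from_config()

champion = Model(ws, name="demand-forecast")  # latest registered version
champion_auc = float(champion.tags.get("val_auc", 0.0))
challenger_auc = 0.89  # produced by the CI training job

if challenger_auc > champion_auc:
    Model.register(workspace=ws,
                   model_path="outputs/model.pkl",
                   model_name="demand-forecast",
                   tags={"val_auc": str(challenger_auc)})
else:
    raise SystemExit("Challenger did not beat champion; failing the build.")
```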

7. Monitor Services, Models and Data

When serving machine learning models in an operationalized environment, it is critical to monitor these services for their infrastructure uptime and compliance, as well as for model quality. Set up monitoring to identify data and model drift, to understand whether retraining is required, or to set up triggers for automatic retraining.
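
A minimal sketch of a monitoring job that triggers a published Azure ML retraining pipeline when a tracked quality metric drops below an agreed threshold; the pipeline id, experiment name, metric value and threshold are assumptions:

```python
# A minimal retraining-trigger sketch: when a monitored quality metric
# drops below an agreed threshold, submit a published Azure ML pipeline.
# The pipeline id, experiment name and threshold are assumptions.
from azureml.core import Workspace
from azureml.pipeline.core import PublishedPipeline

ws = Workspace.from_config()

live_auc = 0.74           # produced by the monitoring job
RETRAIN_THRESHOLD = 0.80  # agreed with the business up front

if live_auc < RETRAIN_THRESHOLD:
    pipeline = PublishedPipeline.get(ws, id="<published-pipeline-id>")
    pipeline.submit(ws, experiment_name="demand-forecast-retraining")
```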

MLOps Best Practices with Azure Machine Learning

Azure Machine Learning offers several asset management, orchestration, and automation services to help you manage the lifecycle of your model training and deployment workflows. This section discusses best practices and recommendations for applying MLOps across the areas of People, Process and Technology, supported by Azure Machine Learning.

People

  • Work in project teams to best utilize specialist and domain knowledge in your organization. Organize and set up Azure ML Workspaces on a project basis to comply with use case segregation requirements.
  • Define a set of responsibilities and tasks in your organization as a role, where one team member on an MLOps project team could fulfill multiple roles. Use Custom Roles in Azure to define a set of granular RBAC operations for Azure Machine Learning that each role can perform.
  • Standardize on a project lifecycle and agile methodology. The Team Data Science Process provides a reference lifecycle implementation.
  • Balanced teams can execute all MLOps stages from exploration to development to operations.

Process

  • Standardize on a code template to allow for code reuse and to reduce ramp-up time at project start or when a new team member joins the project. Azure ML pipelines and job submission scripts, as well as CI/CD pipelines, lend themselves well to templatization.
  • Use version control. Jobs that are submitted from a Git-backed folder automatically track repo metadata with the job in Azure ML for reproducibility.
  • Version experiment inputs and outputs to enable reproducibility. Use Azure ML Datasets, Model management and Environment management capabilities to facilitate this.
  • Build up a run history of experiment runs to allow for comparison, planning and collaboration. Make use of an experiment tracking framework like MLflow for metric collection.
  • Continuously measure and control the quality of your team’s work through continuous integration on the full experimentation code base.
  • Terminate training early when a model does not converge. Use an experiment tracking framework in combination with the run history in Azure Machine Learning to monitor job execution.
  • Define an experiment and model management strategy. Consider using a name like “Champion” to refer to the current baseline model, and “Challenger” for candidate models that could outperform the “Champion” model in production. Leverage tags in Azure Machine Learning to mark experiments and models as appropriate. In some scenarios, such as sales forecasting, it can take months to determine whether the model’s predictions are accurate.
  • Elevate Continuous Integration to Continuous Training by including model training as part of the build. For instance, initiate model training on the full dataset with each pull request.
  • Shorten time-to-feedback on the quality of the machine learning pipeline by running the automated build on just a sample of the data. Use Azure ML Pipeline parameters to parameterize input Datasets (see the sketch after this list).
  • Use Continuous Deployment for Machine Learning models to automate the deployment and testing of real-time scoring services across your Azure environments (dev, test, prod).
  • In some regulated industries, model validation steps may be required before a machine learning model can be used in a production environment. By automating validation steps, at least to an extent, you may be able to accelerate time to delivery. When manual review or validation steps remain the bottleneck, consider whether it is possible to certify the automated model validation pipeline. Use resource tags in Azure Machine Learning to indicate asset compliance, candidates for review, or triggers for deployment.
  • Do not retrain in production and directly replace the production model without any integration testing. Even when model performance and functional requirements look good, a model might, among other potential issues, have grown in footprint and break the serving environment.
  • When production data access is only available in production, use RBAC and custom roles to give a select number of ML practitioners the read access they require, e.g. for data exploration. Alternatively, make a data copy available in the non-production environments.
  • Agree on naming conventions and tags for Azure Machine Learning Experiments to differentiate retraining baseline machine learning pipelines from experimental work.
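
Following up the sample-data item above, here is a sketch of parameterizing an Azure ML (v1) pipeline input so that CI runs can substitute a small sample dataset while scheduled builds use the full dataset; the dataset, script and compute names are assumptions:

```python
# A minimal sketch of a parameterized Azure ML (v1) pipeline input: CI can
# override the dataset with a small sample while scheduled builds use the
# full data. Dataset, script and compute names are assumptions.
from azureml.core import Workspace, Dataset
from azureml.data.dataset_consumption_config import DatasetConsumptionConfig
from azureml.pipeline.core import Pipeline, PipelineParameter
from azureml.pipeline.steps import PythonScriptStep

ws = Workspace.from_config()

full_dataset = Dataset.get_by_name(ws, "citizen-services-training")
ds_param = PipelineParameter(name="input_dataset", default_value=full_dataset)
ds_input = DatasetConsumptionConfig("training_data", ds_param)

train_step = PythonScriptStep(name="train",
                              script_name="train.py",
                              source_directory="./src",
                              inputs=[ds_input],
                              compute_target="cpu-cluster")

pipeline = Pipeline(workspace=ws, steps=[train_step])
# A CI run would submit this pipeline overriding input_dataset with a
# registered sample dataset to shorten the feedback loop.
```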

Technology

  • When jobs are typically submitted via the Studio UI or the CLI rather than the SDK, use the CLI or the Azure DevOps Machine Learning Tasks to configure automation pipeline steps. This can reduce code footprint by reusing the same job submissions directly from automation pipelines.
  • Use event-based programming. For instance, trigger an offline model testing pipeline using an Azure Function once a new model gets registered, or send a notification to an `Ops` email alias when a critical pipeline fails to run. Azure ML produces events to Event Grid that you can subscribe to (see the sketch after this list).
  • When using Azure DevOps for automation, make use of Azure DevOps Tasks for Machine Learning to use machine learning models as pipeline triggers.
  • When you are developing Python packages for your machine learning application, you can host them in an Azure DevOps repository as artifacts and publish them as a feed. This approach allows you to integrate the DevOps workflow for building packages with your Azure Machine Learning Workspace.
  • Consider using a staging environment for system integration testing of ML pipelines with upstream or downstream application components.
  • Create unit and integration tests for your Inference Endpoints for enhanced debuggability and accelerated time to deployment.
  • To trigger retraining, make use of Dataset Monitors and event-driven workflows: subscribe to data drift events and automate the triggering of machine learning pipelines for retraining.
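
As a sketch of the event-based pattern above, the following Azure Function (Python) reacts to the Azure ML `ModelRegistered` event delivered via Event Grid; the downstream action is a hypothetical placeholder:

```python
# A minimal event-driven sketch: an Azure Function (Python) subscribed via
# Event Grid to Azure ML events, reacting when a model is registered. The
# downstream action is a hypothetical placeholder.
import logging

import azure.functions as func

def main(event: func.EventGridEvent):
    if event.event_type == "Microsoft.MachineLearningServices.ModelRegistered":
        payload = event.get_json()
        logging.info("New model registered: %s", payload.get("modelName"))
        # Hypothetical next step: submit the offline model testing pipeline.
```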

Ethics

Finally, it is important to discuss ethics in AI. Ethics plays an instrumental role in the design of an AI solution, especially when it comes to government services. Without implementing ethical principles, trained models can exhibit the same bias present in the data they were trained on. This can result in the project being discontinued and, more importantly, it can risk the organization’s reputation.

To ensure that the key ethical principles the organization stands for are implemented across projects, provide a list of these principles along with ways of validating them from a technical perspective during the testing phase. Consider making use of the Responsible ML features in Azure Machine Learning.
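
As one concrete way to validate such principles during testing, here is a minimal sketch using the open-source Fairlearn library (one of the tools surfaced alongside the Responsible ML features); the labels, predictions and sensitive feature are illustrative:

```python
# A minimal fairness-check sketch with the open-source Fairlearn library.
# Labels, predictions and the sensitive feature are illustrative.
from fairlearn.metrics import MetricFrame
from sklearn.metrics import accuracy_score

y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1]
gender = ["F", "F", "F", "M", "M", "M"]  # hypothetical sensitive feature

mf = MetricFrame(metrics=accuracy_score,
                 y_true=y_true,
                 y_pred=y_pred,
                 sensitive_features=gender)

print(mf.by_group)      # accuracy per group
print(mf.difference())  # largest accuracy gap between groups
```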

Get started today:

Introduction to machine learning operations (MLOps) - Learn | Microsoft Docs
