How to orchestrate MLOps by using Azure Databricks

In this article, we explore how to orchestrate MLOps using Azure Databricks. We walk through an architecture and process that streamline the movement of machine learning models and pipelines from development to production. Whether you prefer automated or manual processes, this approach establishes a standardized framework for efficient MLOps.

Architecture

MLOps Databricks architecture (diagram)

Workflow

By leveraging Azure Databricks, this solution offers a powerful MLOps process that is both reliable and flexible. The architecture is designed with pluggability in mind, allowing seamless integration of various Azure and third-party services based on your specific requirements. This enables you to customize and enhance the architecture to meet your unique needs and leverage additional tools and services within the ecosystem.

  • Version Control: The code repository for this project serves as a centralized hub for notebooks, modules, and pipelines. Data scientists utilize development branches to test updates and new models. Leveraging Git, the code can be developed either within notebooks or in integrated development environments (IDEs), with seamless synchronization to Azure Databricks workspaces through Databricks Repos integration. This source control mechanism facilitates the progression of machine learning pipelines from the development stage to staging (for testing) and ultimately to production (for deployment).
  • Lakehouse - Production Data: Within the development environment, data scientists have read-only access to production data, ensuring data integrity. As an alternative, data can be mirrored or redacted to maintain privacy and security. Additionally, data scientists are granted read/write access to a dedicated dev storage environment, enabling them to experiment and iterate on their work. For optimal data management, we recommend adopting a Lakehouse architecture, which involves storing data in Delta Lake format within Azure Data Lake Storage. Access controls are established using Azure Active Directory credential passthrough or table access controls, ensuring robust security measures are in place.
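
As a rough sketch of the access pattern above, the snippet below reads a production Delta table and writes derived data to a dev storage area. The storage account, container names, paths, and columns are illustrative assumptions, not part of the reference architecture.

```python
# Minimal sketch: read-only production data in, derived data out to dev storage.
# All paths and column names here are hypothetical placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Production Delta tables are read-only for data scientists
prod_df = spark.read.format("delta").load(
    "abfss://prod@mydatalake.dfs.core.windows.net/tables/transactions"
)

# The dedicated dev storage environment allows read/write experimentation
sample_df = prod_df.select("customer_id", "amount", "event_date")
sample_df.write.format("delta").mode("overwrite").save(
    "abfss://dev@mydatalake.dfs.core.windows.net/sandbox/transactions_sample"
)
```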

Development

Within the development environment, data scientists and engineers collaborate on the creation of machine learning pipelines.

  1. Exploratory Data Analysis (EDA): Data scientists engage in an iterative and interactive process of exploring the data. This exploratory work may not necessarily be deployed to staging or production environments. Various tools such as Databricks SQL, dbutils.data.summarize, and AutoML can be employed for effective EDA.
  2. Model Training and Other Machine Learning Pipelines: Machine learning pipelines are developed as modular code in notebooks and/or IDEs. For example, the model training pipeline reads data from the Feature Store and other Lakehouse tables. During training and tuning, model parameters and metrics are logged to the MLflow tracking server. The Feature Store API logs the final model, linking it to its input features and the training code (a minimal training sketch follows this list).
  3. Code Commitment: To advance the machine learning workflow towards production, the data scientist commits the code responsible for featurization, training, and other pipelines to the source control repository. This step marks the progression of the code within the development lifecycle.
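
As a minimal sketch of the training pipeline in step 2, the snippet below logs parameters, metrics, and a model to the MLflow tracking server. The table, columns, and hyperparameters are hypothetical; a real pipeline would read features via the Feature Store API as described above, but plain MLflow calls suffice to illustrate the tracking flow.

```python
# Minimal sketch of a training pipeline that logs to MLflow.
# Table and column names are hypothetical placeholders.
import mlflow
import mlflow.sklearn
from pyspark.sql import SparkSession
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

spark = SparkSession.builder.getOrCreate()
pdf = spark.table("dev.transactions_features").toPandas()  # hypothetical feature table

X = pdf[["amount", "customer_tenure"]]  # assumed feature columns
y = pdf["demand"]                       # assumed label column
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

with mlflow.start_run(run_name="demand-forecast-dev"):
    params = {"n_estimators": 100, "max_depth": 8}
    model = RandomForestRegressor(**params).fit(X_train, y_train)
    mlflow.log_params(params)  # parameters recorded on the tracking server
    mlflow.log_metric("r2_test", model.score(X_test, y_test))
    mlflow.sklearn.log_model(model, artifact_path="model")
```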

Staging

In the staging environment, changes to machine learning pipelines undergo rigorous testing within an environment that closely resembles the production setup.

  1. Merge Request: When a merge or pull request is submitted against the staging (main) branch of the project in the source control repository, a robust continuous integration and continuous delivery (CI/CD) tool such as Azure DevOps is employed to initiate tests.
  2. Unit and CI Tests: Unit tests run in the CI infrastructure to verify that individual pipeline components function correctly, and integration tests then validate end-to-end workflows on Azure Databricks (see the test sketch after this list). If all tests pass, the code changes are merged.
  3. Release Branch Creation: Once machine learning engineers are confident in deploying the updated machine learning pipelines to the production environment, they proceed to build a new release. The CI/CD tool facilitates this process by orchestrating a deployment pipeline that redeploys the updated pipelines as new workflows in the production environment.
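
To make the testing step concrete, here is a minimal sketch of a unit test that a CI tool such as Azure DevOps could run. The compute_features function and its assertions are hypothetical illustrations, not part of the reference architecture.

```python
# Minimal sketch of a CI unit test for a featurization helper.
# compute_features is a hypothetical function for illustration only.
import numpy as np
import pandas as pd

def compute_features(df: pd.DataFrame) -> pd.DataFrame:
    """Toy featurization step: add a log-scaled amount column."""
    out = df.copy()
    out["amount_log"] = np.log1p(out["amount"])
    return out

def test_compute_features_adds_column():
    df = pd.DataFrame({"amount": [0.0, 10.0, 100.0]})
    result = compute_features(df)
    assert "amount_log" in result.columns
    assert (result["amount_log"] >= 0).all()
```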

Production

Production Environment Management: Machine learning engineers are responsible for overseeing the production environment, where machine learning pipelines directly serve end applications. Key pipelines in the production environment encompass tasks such as refreshing feature tables, training and deploying new models, conducting inference or serving, and monitoring model performance.

  1. Feature Table Refresh: This pipeline reads data, computes features, and updates the Feature Store tables. It can run continuously in streaming mode, on a predefined schedule, or in response to specific events.
  2. Model Training: In the production environment, the model training or retraining pipeline is either triggered or scheduled to train a new model using the latest production data. The trained models are registered within the MLflow Model Registry.
  3. Continuous Deployment: The registration of new model versions triggers the continuous deployment (CD) pipeline, which executes a series of tests to ensure the model's suitability for production deployment. These tests encompass performance evaluations, compliance checks, A/B comparisons against the current production model, and infrastructure validations. The Model Registry keeps track of the model's progress through various stage transitions, with the option to incorporate automation using registry webhooks. Test results and metrics are recorded in Lakehouse tables. Additionally, manual sign-offs can be implemented as an optional step before transitioning models to the production stage.
  4. Model Deployment: Upon entering the production stage, the model is deployed for scoring or serving. The most common deployment modes include:

  • Batch or Streaming Scoring: For use cases that tolerate higher latency, batch and streaming scoring are the cost-effective choices (a batch-scoring sketch follows this list). The scoring pipeline retrieves the latest data from the Feature Store, loads the most recent production model version from the Model Registry, and performs inference within a Databricks job. The resulting predictions can be published to Lakehouse tables, Java Database Connectivity (JDBC) connections, flat files, message queues, or other downstream systems.
  • Online Serving (REST APIs): Low-latency use cases typically necessitate online serving. MLflow enables model deployment to MLflow Model Serving on Azure Databricks, cloud provider serving systems, and other compatible platforms. In all cases, the serving system initializes with the latest production model from the Model Registry. For each request, it fetches features from the online Feature Store and generates predictions accordingly.
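
Here is a minimal batch-scoring sketch, assuming a hypothetical registered model named demand_forecast and illustrative table names.

```python
# Minimal sketch of batch scoring with a registered MLflow model.
# Model name and table names are hypothetical placeholders.
import mlflow
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Load the latest Production-stage model from the MLflow Model Registry
predict_udf = mlflow.pyfunc.spark_udf(
    spark, model_uri="models:/demand_forecast/Production"
)

# Score the latest features and publish predictions to a Lakehouse table
features = spark.table("prod.customer_features")
predictions = features.withColumn(
    "prediction", predict_udf("amount", "customer_tenure")
)
predictions.write.format("delta").mode("append").saveAsTable("prod.demand_predictions")
```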

  5. Monitoring: Continuous or periodic workflows monitor input data and model predictions for drift, performance, and other relevant metrics. Delta Live Tables can simplify the automation of monitoring pipelines, storing the metrics in Lakehouse tables; Databricks SQL, Power BI, and other tools can then query those tables for dashboards and alerts (a monitoring sketch follows this list).
  6. Retraining: This architecture accommodates both manual and automatic retraining. Scheduled retraining jobs are a straightforward way to keep models up to date and relevant.
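
A minimal monitoring sketch along these lines is shown below, with hypothetical table and column names; in practice, Delta Live Tables could manage such a pipeline.

```python
# Minimal sketch of a scheduled monitoring job: join predictions with
# actuals and append error metrics to a Lakehouse table for dashboards.
# Table and column names are hypothetical placeholders.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

preds = spark.table("prod.demand_predictions")
actuals = spark.table("prod.demand_actuals")

joined = preds.join(actuals, on=["customer_id", "event_date"])
metrics = joined.agg(
    F.mean(F.abs(F.col("prediction") - F.col("actual"))).alias("mae"),
    F.count("*").alias("n_rows"),
).withColumn("computed_at", F.current_timestamp())

# Databricks SQL, Power BI, and alerting tools can query this table
metrics.write.format("delta").mode("append").saveAsTable("prod.model_monitoring_metrics")
```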


Components

1. Data Lakehouse. A Lakehouse architecture unifies the best elements of data lakes and data warehouses, delivering the data management and performance typically found in data warehouses with the low-cost, flexible object stores of data lakes.

1.1. Delta Lake is the recommended open-source data format for a Lakehouse. Azure Databricks stores data in Data Lake Storage and provides a high-performance query engine.

2. MLflow is an open-source project for managing the end-to-end machine learning lifecycle. These are its main components:

2.1. Tracking allows you to track experiments to record and compare parameters, metrics, and model artifacts.

2.1.1. Databricks Autologging extends MLflow automatic logging to track machine learning experiments, automatically logging model parameters, metrics, files, and lineage information.

2.2. MLflow Models allows you to store and deploy models from any machine learning library to various model serving and inference platforms.

2.3. Model Registry provides a centralized model store for managing model lifecycle stage transitions from development to production.

2.4. Model Serving enables you to host MLflow models as REST endpoints.

3. Azure Databricks provides a managed MLflow service with enterprise security features, high availability, and integrations with other Azure Databricks workspace features.

3.1. Databricks Runtime for Machine Learning automates the creation of a cluster that's optimized for machine learning, preinstalling popular machine learning libraries like TensorFlow, PyTorch, and XGBoost along with Azure Databricks machine learning tools like the AutoML and Feature Store clients.

3.2. Feature Store is a centralized repository of features. It enables feature sharing and discovery, and it helps to avoid data skew between model training and inference. (A Feature Store usage sketch follows this list.)

3.3. Databricks SQL provides a simple experience for SQL queries on Lakehouse data, and for visualizations, dashboards, and alerts.

3.4. Databricks Repos provides integration with your Git provider in the Azure Databricks workspace, simplifying collaborative development of notebooks or code and IDE integration.

3.5. Workflows and jobs provide a way to run non-interactive code in an Azure Databricks cluster. For machine learning, jobs automate data preparation, featurization, training, inference, and monitoring.
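
As a small illustration of the Feature Store component (3.2), the sketch below registers and refreshes a feature table. The table and key names are hypothetical, and the client assumes Databricks Runtime for Machine Learning.

```python
# Minimal sketch of the Databricks Feature Store client.
# Table and key names are hypothetical placeholders.
from pyspark.sql import SparkSession
from databricks.feature_store import FeatureStoreClient

spark = SparkSession.builder.getOrCreate()
fs = FeatureStoreClient()

# Features computed by an upstream featurization pipeline (hypothetical table)
features_df = spark.table("dev.transactions_features")

# Register the feature table so it can be shared and discovered
fs.create_table(
    name="prod.customer_features",
    primary_keys=["customer_id"],
    df=features_df,
    description="Customer transaction features for demand forecasting",
)

# Subsequent refresh runs can merge new rows into the same table
fs.write_table(name="prod.customer_features", df=features_df, mode="merge")
```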

Scenarios

MLOps plays a crucial role in mitigating the risks associated with failures in machine learning and AI systems while enhancing collaboration efficiency and tooling. For a comprehensive understanding of MLOps and an overview of the discussed architecture, refer to the "Architecting MLOps on the Lakehouse" resource.

By adopting this architecture, you gain the following benefits:

  1. Bridging the Gap: This architecture facilitates seamless collaboration between business stakeholders and machine learning/data science teams. Data scientists can leverage notebooks and IDEs for development, while business stakeholders can access metrics and dashboards through Databricks SQL, all within the unified Lakehouse framework.
  2. Data-centric Infrastructure: This architecture treats machine learning data, including data from feature engineering, training, inference, and monitoring, on par with other types of data. By adopting a unified approach, it leverages existing tooling for production pipelines, dashboarding, and general data processing, promoting consistency and reuse across the entire data ecosystem.
  3. Modularized Pipelines and Code: Following the best practices of software engineering, this architecture employs modularized pipelines and code. This approach allows for efficient testing of individual components and reduces the cost of future refactoring, ensuring scalability and maintainability.
  4. Automation Flexibility: The architecture supports automating MLOps processes to whatever degree suits your needs. Azure Databricks offers a range of automation options, so individual steps can be automated as required. Automation improves productivity and reduces the risk of human error, but manual processes and UI-driven interactions remain fully supported.

By adopting this architecture, organizations can effectively implement MLOps practices, fostering robustness, collaboration, and efficiency in their machine learning and AI endeavors.

Potential Use Cases

Here are three potential use cases that showcase the MLOps capabilities of the Azure Databricks architecture:

1. Automated Model Retraining and Deployment:

  • Use Case: A retail company wants to continuously improve its demand forecasting model using the latest data and deploy updated models seamlessly.
  • Solution: The company can leverage the Azure Databricks architecture for ML Ops to automate the model retraining and deployment process. They can set up a scheduled job that triggers the model training pipeline periodically. The pipeline reads the latest data from the Lakehouse architecture, performs feature engineering, trains the model, and registers it in the MLflow Model Registry. The CD pipeline is then triggered, running tests to ensure the model's performance in production. If the model passes the tests, it is automatically deployed as a new version in the serving environment, making it available for real-time inference.
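
One piece of this flow can be sketched with hypothetical model and stage names: after the CD tests pass, the pipeline promotes the newly registered version to Production in the Model Registry.

```python
# Minimal sketch of the promotion step after CD tests pass.
# The model name "demand_forecast" is a hypothetical placeholder.
from mlflow.tracking import MlflowClient

client = MlflowClient()
candidate = client.get_latest_versions("demand_forecast", stages=["Staging"])[0]
client.transition_model_version_stage(
    name="demand_forecast",
    version=candidate.version,
    stage="Production",
    archive_existing_versions=True,  # retire the previous production version
)
```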

2. Continuous Model Monitoring and Alerting:

  • Use Case: A financial institution needs to monitor the performance and accuracy of their credit risk model to ensure compliance and timely detection of any anomalies.
  • Solution: Using the ML Ops capabilities of Azure Databricks, the financial institution can implement continuous model monitoring and alerting. They can set up workflows and jobs to periodically execute a monitoring pipeline that fetches new data, performs inference using the deployed model, and compares the predicted outcomes with the actual outcomes. Metrics such as accuracy, precision, and recall are computed and stored in the Lakehouse tables. If any metric falls below a specified threshold or if there are significant deviations, alerts are triggered to notify the relevant stakeholders, enabling them to take immediate action to address any issues.
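
The threshold check at the heart of this use case can be sketched as follows, with a hypothetical metrics table and an assumed accuracy floor; production alerting would more likely go through Databricks SQL alerts or a notification service.

```python
# Minimal sketch of a metric threshold check for alerting.
# Table name, columns, and threshold are hypothetical placeholders.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Most recent monitoring row for the credit risk model
latest = (
    spark.table("prod.credit_model_metrics")
    .orderBy(F.col("computed_at").desc())
    .limit(1)
    .collect()[0]
)

ACCURACY_THRESHOLD = 0.90  # assumed compliance floor
if latest["accuracy"] < ACCURACY_THRESHOLD:
    # A real pipeline would notify stakeholders via a webhook, queue, or alert
    print(f"ALERT: accuracy {latest['accuracy']:.3f} below {ACCURACY_THRESHOLD}")
```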

3. Collaborative Development and Experimentation:

  • Use Case: A healthcare organization wants to foster collaboration between data scientists and domain experts to develop and experiment with various machine learning models.
  • Solution: Leveraging the Azure Databricks architecture, the healthcare organization can create a collaborative environment for data scientists and domain experts. They can utilize Databricks Repos to enable version control integration with Git, simplifying collaborative development of notebooks and code. Data scientists can leverage notebooks and IDEs for developing machine learning pipelines, while domain experts can use Databricks SQL to explore and analyze data, create visualizations, and derive insights. The Lakehouse architecture ensures that all stakeholders can access consistent and up-to-date data for their experimentation and analysis, promoting effective collaboration and knowledge sharing.

These use cases demonstrate how the MLOps capabilities of Azure Databricks can automate model retraining and deployment, enable continuous monitoring and alerting, and foster collaborative development and experimentation within organizations.

