MLOps Part 1: Building an End-to-End Training Pipeline with Vertex AI Pipelines
Fathima Rashadha
MACHINE LEARNING and DATA SCIENCE ENTHUSIAST | Associate AI/ML Engineer at NCINGA | Undergraduate | Artificial Intelligence and Data Science | Robert Gordon University
Introduction to MLOps
Machine Learning Operations (MLOps) is essential for efficiently deploying, monitoring, and managing machine learning models in production environments. In this article, we will delve into the core components of MLOps using Google Cloud's Vertex AI, with a focus on training, deployment, and operationalizing machine learning models. Additionally, we will explore the integration of Kubeflow components within Vertex AI's serverless pipelines for continuous training. Future articles will cover advanced topics such as Vertex ML Metadata and feature store schema validations.
Choosing MLOps Techniques
When deciding which MLOps techniques to employ, it's crucial to consider the nature of your ML workflows and data processing requirements:
- Vertex AI Pipelines: Ideal for running pipelines built using Kubeflow Pipelines SDK v1.8.9 or higher, or TensorFlow Extended (TFX) v0.30.0 or higher. This setup is recommended for ML workflows handling large-scale structured or text data. TFX, by default, constructs a directed acyclic graph (DAG) of your ML pipeline and leverages Apache Beam for scalable, distributed execution across platforms like Google Cloud Dataflow and Apache Flink.
- TFX with Apache Beam and Cloud Dataflow: While powerful, this combination can be complex to configure and maintain. Hence, the use of orchestrators becomes pivotal.
- Orchestrators like Kubeflow: Simplify the configuration, operation, monitoring, and maintenance of ML pipelines. Kubeflow Pipelines, equipped with intuitive GUIs, provides robust scheduling and orchestration capabilities, particularly suitable for managing TFX pipelines.
- Choosing Kubeflow Pipelines SDK: For broader use cases beyond structured data processing, Kubeflow Pipelines SDK offers flexibility and ease of use in managing diverse ML workflows.
- Other Orchestrators: While alternatives like Cloud Composer (based on Apache Airflow) exist, Vertex AI Pipelines stands out with its native support for common ML operations and comprehensive metadata tracking. This lineage tracking feature is critical for validating pipeline integrity and performance in production environments.
By leveraging Vertex AI Pipelines and Kubeflow integrations, organizations can streamline their MLOps practices, ensuring efficient model deployment and reliable operationalization across various ML applications.
Install Necessary Libraries:
To install the necessary libraries (google-cloud-aiplatform, kfp, google-cloud-pipeline-components) in your Python environment, you can execute the following commands either in a Jupyter notebook cell or in your terminal (if you are working outside of a notebook environment):
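The exact cells from the original notebook are not reproduced here, but a typical install step looks like the following (run in a Jupyter cell; drop the leading ! when running in a terminal):

```
! pip install --upgrade google-cloud-aiplatform
! pip install --upgrade kfp
! pip install --upgrade google-cloud-pipeline-components
```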
These commands will download and install the latest versions of google-cloud-aiplatform, kfp (Kubeflow Pipelines SDK), and google-cloud-pipeline-components, enabling you to use them in your development and deployment workflows for machine learning operations on Google Cloud Platform.
Set Project Variables:
Sets up project-specific variables such as PROJECT_ID, REGION, and BUCKET_URI for storing pipeline artifacts.
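A minimal sketch of these variables; the names mirror those used later in the article, and the values are placeholders you would replace with your own:

```python
PROJECT_ID = "my-gcp-project"          # your Google Cloud project ID (placeholder)
REGION = "us-central1"                 # region used throughout this walkthrough
BUCKET_URI = "gs://my-mlops-bucket"    # GCS bucket for pipeline artifacts (placeholder)
```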
Create Google Cloud Storage Bucket:
Creates a Google Cloud Storage (GCS) bucket with specified region and project ID.
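The article's exact command isn't shown; one common way to create the bucket from a notebook cell is with gsutil, where IPython expands the $ variables defined above:

```
! gsutil mb -l $REGION -p $PROJECT_ID $BUCKET_URI
```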
Configure Service Account:
Sets up the service account name for authentication and authorization purposes.
Authenticate Service Account:
Checks and sets the service account dynamically based on the environment (local or Colab).
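A hedged sketch of this step, following the pattern common in Vertex AI sample notebooks: if no service account was set explicitly, fall back to the project's default Compute Engine service account (in Colab, authenticate with google.colab.auth first):

```python
# Assumes SERVICE_ACCOUNT was declared (possibly empty) in the previous step.
if SERVICE_ACCOUNT is None or SERVICE_ACCOUNT == "":
    shell_output = ! gcloud projects describe $PROJECT_ID --format="value(projectNumber)"
    PROJECT_NUMBER = shell_output[0]
    SERVICE_ACCOUNT = f"{PROJECT_NUMBER}-compute@developer.gserviceaccount.com"
print(SERVICE_ACCOUNT)
```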
Set IAM Permissions for the Bucket:
Grants IAM roles (storage.objectCreator and storage.objectViewer) to the service account for the GCS bucket.
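A sketch of the corresponding notebook cells, granting the two roles named above on the bucket:

```
! gsutil iam ch serviceAccount:{SERVICE_ACCOUNT}:roles/storage.objectCreator $BUCKET_URI
! gsutil iam ch serviceAccount:{SERVICE_ACCOUNT}:roles/storage.objectViewer $BUCKET_URI
```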
Import Necessary Libraries:
Imports essential libraries for interacting with Google Cloud AI Platform, Kubeflow Pipelines, and related components.
Set Pipeline Root Directory:
Defines the root directory in the GCS bucket where pipeline artifacts will be stored.
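A sketch covering these two steps. The exact import list and the pipeline_root sub-path are assumptions; the google_cloud_pipeline_components modules shown are the v1 operators used later in the pipeline:

```python
import google.cloud.aiplatform as aip
import kfp
from kfp import compiler, dsl              # on KFP SDK 1.8.x these live under kfp.v2
from kfp.dsl import Metrics, Output
from google_cloud_pipeline_components.types import artifact_types
from google_cloud_pipeline_components.v1.custom_job import CustomTrainingJobOp
from google_cloud_pipeline_components.v1.endpoint import EndpointCreateOp, ModelDeployOp
from google_cloud_pipeline_components.v1.model import ModelUploadOp

# Root folder in the bucket where Vertex AI Pipelines writes run artifacts.
PIPELINE_ROOT = f"{BUCKET_URI}/pipeline_root/ct_training"
```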
Dockerfile
Base Image and Working Directory
- FROM: Specifies the base image. Here, it uses a Google Container Registry (GCR) image that includes TensorFlow 2 with GPU support (tf2-gpu.2-8). This base image comes with the pre-configured CUDA and cuDNN libraries required for GPU acceleration.
- WORKDIR: Sets the working directory to /root inside the container, which is where the following commands will be executed.
Installing TensorFlow
RUN pip install tensorflow: Ensures that the latest version of TensorFlow is installed. This might seem redundant because the base image already includes TensorFlow, but it is useful when you need to upgrade or pin TensorFlow to a specific version.
Installing Additional Python Packages
- RUN pip3 install: Installs additional Python packages using pip3:
- gcsfs: For accessing Google Cloud Storage (GCS) filesystems.
- google-cloud-storage: Google Cloud client library for GCS.
- scikit-learn: Machine learning library for Python.
- pandas: Data manipulation and analysis library.
- numpy: Numerical computing library.
- argparse: For command-line option parsing in Python scripts.
Downloading and Installing libtpu
- RUN curl -L: Downloads the libtpu.so library from a specified URL.
- -o /lib/libtpu.so: Saves the downloaded libtpu.so file to the /lib directory in the container.
- libtpu.so: This shared library provides TPU support, enabling TensorFlow to run computations on TPU hardware if available.
Copying the Training Script
COPY train.py /root/train.py: Copies the local train.py file to the /root directory inside the container. This script contains the code to execute the training job.
Setting the Entry Point
- ENTRYPOINT: Defines the command that will be run when the container starts. The assembled Dockerfile is sketched below.
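Assembling the directives described above gives a Dockerfile along these lines. The deeplearning-platform-release repository path, the libtpu URL placeholder, and the ENTRYPOINT command are assumptions; substitute the values used in your own setup:

```dockerfile
FROM gcr.io/deeplearning-platform-release/tf2-gpu.2-8

WORKDIR /root

# Ensure TensorFlow is present (or pin a specific version here).
RUN pip install tensorflow

# Additional Python dependencies used by the training script.
RUN pip3 install gcsfs google-cloud-storage scikit-learn pandas numpy argparse

# TPU runtime library; replace <LIBTPU_URL> with the URL used in your setup.
RUN curl -L <LIBTPU_URL> -o /lib/libtpu.so

# Training code and the command executed when the container starts (assumed).
COPY train.py /root/train.py
ENTRYPOINT ["python3", "train.py"]
```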
Docker Image Creation
1. Enabling Artifact Registry Service
This command enables the Artifact Registry service in Google Cloud, which is required for storing and managing Docker images. Artifact Registry is a unified solution for container images and language packages, simplifying the management of these artifacts in Google Cloud projects.
2. Group and User Setup for Docker (Optional for Certain Environments)
This section checks if the script is running in a testing environment (IS_TESTING), and if so, updates various components of the Google Cloud SDK and Docker-related tools. Additionally, it creates a docker group and adds the current user to this group, which is a typical setup to allow non-root users to run Docker commands.
3. Creating a Docker Repository
This command creates a new Artifact Registry repository named ct-training-repository-final in the us-central1 region with the format set to docker. This repository will be used to store Docker images.
4. Building and Submitting the Docker Image
- TRAIN_IMAGE: Constructs the full Docker image URI using the region, project ID, repository name, and image name. This follows the format us-central1-docker.pkg.dev/{PROJECT_ID}/{REPOSITORY}/{IMAGE}:latest.
- gcloud builds submit: This command submits the current directory's contents to Google Cloud Build to build a Docker image. The --tag flag specifies the image's tag, which in this case is latest, and --region=us-central1 ensures the build happens in the specified region. The corresponding commands are sketched below.
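The corresponding notebook cells might look like this (the image name ct-train is a placeholder, and the optional testing-environment setup from step 2 is omitted):

```python
# Enable Artifact Registry and create the Docker repository described above.
! gcloud services enable artifactregistry.googleapis.com
! gcloud artifacts repositories create ct-training-repository-final --repository-format=docker --location=us-central1

# Build and push the training image with Cloud Build.
TRAIN_IMAGE = f"us-central1-docker.pkg.dev/{PROJECT_ID}/ct-training-repository-final/ct-train:latest"
! gcloud builds submit --region=us-central1 --tag $TRAIN_IMAGE .
```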
Vertex AI Training and Deployment Setup
1. Initialize Vertex AI
First, initialize Vertex AI with your Google Cloud project ID and the staging bucket where Vertex AI will store temporary files during training and deployment.
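A minimal initialization call, assuming the variables defined earlier:

```python
import google.cloud.aiplatform as aip

aip.init(project=PROJECT_ID, location=REGION, staging_bucket=BUCKET_URI)
```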
2. Import Required Libraries
Import the Vertex AI client library.
3. Define GPU and Machine Types
Specify the type and number of GPUs for both training and deployment. In this example, NVIDIA Tesla K80 GPUs are used.
4. Set Deployment Image
Define the container image for deploying the model. This image is pulled from the Artifact Registry and used for prediction.
5. Define Machine Types for Training and Deployment
Specify the machine types for training and deployment. Here, an n1-standard-16 machine type is used for training and n1-standard-4 for deployment.
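Collecting steps 3-5, the configuration might look like this. The prebuilt TensorFlow 2 GPU serving image shown for DEPLOY_IMAGE is an assumption; swap in the image you use for prediction:

```python
# Accelerators for training and deployment (NVIDIA Tesla K80, one per replica).
TRAIN_GPU = aip.gapic.AcceleratorType.NVIDIA_TESLA_K80
TRAIN_NGPU = 1
DEPLOY_GPU = aip.gapic.AcceleratorType.NVIDIA_TESLA_K80
DEPLOY_NGPU = 1

# Serving container used when the model is deployed for prediction.
DEPLOY_IMAGE = "us-docker.pkg.dev/vertex-ai/prediction/tf2-gpu.2-8:latest"

# Machine types for training and deployment.
TRAIN_COMPUTE = "n1-standard-16"
DEPLOY_COMPUTE = "n1-standard-4"
```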
6. Training Strategy
In this example, the training strategy is configured as a single-node training setup. Single-node training implies that the entire training process is carried out on a single machine or instance. This approach is suitable for smaller datasets and less complex models, where the computational resources of a single machine are sufficient to handle the training workload.
Training Strategies are critical in machine learning as they determine how the training workload is distributed across the computational resources available. This distribution can significantly impact the efficiency, speed, and scalability of training. Here’s an overview of different training strategies:
1. Single-Node Training
- Description: The model is trained on a single machine, which may have multiple GPUs. All training data and computations are localized to this single node.
- Use Cases: Ideal for smaller datasets or less complex models where the computational load can be managed by a single machine. It simplifies setup and reduces the complexity of managing multiple machines.
- Advantages: Easier to configure and manage, often with lower overhead for smaller tasks.
- Disadvantages: Limited by the capacity of a single machine, which may not be sufficient for larger datasets or more complex models.
2. Distributed Training
- Description: The model training process is distributed across multiple machines or nodes. This can involve splitting the data (data parallelism) or the model itself (model parallelism).
- Use Cases: Required for large datasets or highly complex models where a single machine cannot handle the computational requirements or memory constraints.
- Advantages: Can significantly reduce training time and handle larger datasets by leveraging multiple machines.
- Disadvantages: More complex to configure and manage. It requires careful handling of synchronization and communication between nodes to ensure efficient training.
Distributed training strategies will be discussed in detail in future articles, where we will cover techniques like Data Parallelism and Model Parallelism.
In the single-node training setup, we define the following key parameters:
- EPOCHS: The number of times the entire dataset is passed through the model during training. More epochs allow the model to learn better from the data but also increase training time.
- STEPS: The number of batches of samples to be propagated through the model per epoch.
These parameters are passed to the training script using the TRAINER_ARGS list:
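For example (the values and the flag names consumed by train.py via argparse are illustrative assumptions):

```python
EPOCHS = 20    # full passes over the dataset
STEPS = 100    # batches per epoch

# Command-line arguments forwarded to train.py.
TRAINER_ARGS = [f"--epochs={EPOCHS}", f"--steps={STEPS}"]
```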
7. Working Directory and Model Display Name
- WORKING_DIR: This variable sets the directory where the model and other output files will be stored.
- MODEL_DISPLAY_NAME: A human-readable name for the model, used in Vertex AI.
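A short sketch, assuming a run-specific suffix to keep outputs from different runs separate (the naming scheme is an assumption):

```python
UUID = "0001"  # any unique run identifier
WORKING_DIR = f"{PIPELINE_ROOT}/{UUID}"
MODEL_DISPLAY_NAME = f"ct-train-deploy-{UUID}"
```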
8. Define Worker Pool Specifications
Configure the worker pool for the training job, specifying the machine type, number of replicas, and container image for the training code.
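A sketch of a single-node worker pool built from the values defined earlier; the structure follows the Vertex AI worker_pool_specs format:

```python
WORKER_POOL_SPECS = [
    {
        "machine_spec": {
            "machine_type": TRAIN_COMPUTE,            # n1-standard-16
            "accelerator_type": "NVIDIA_TESLA_K80",
            "accelerator_count": TRAIN_NGPU,
        },
        "replica_count": 1,                           # single-node training
        "container_spec": {
            "image_uri": TRAIN_IMAGE,                 # image built with Cloud Build earlier
            "args": TRAINER_ARGS,
        },
    }
]
```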
For more details and advanced configurations on Vertex AI, refer to the following Google Cloud documentation:
- Vertex AI Custom Training (https://cloud.google.com/vertex-ai/docs/training/overview)
- Vertex AI Deploying Models (https://cloud.google.com/vertex-ai/docs/general/deployment)
Introduction to Kubeflow Pipelines
Kubeflow Pipelines is a platform for building and deploying scalable machine learning (ML) workflows on Kubernetes. It provides an easy way to orchestrate, automate, and manage end-to-end ML workflows. By leveraging Kubeflow, you can create reusable components, automate model training, evaluation, and deployment, and ensure a robust CI/CD process for ML applications.
In a Kubeflow pipeline, you define a sequence of steps (components) that take data and produce outputs. These components are modular, allowing you to reuse them across different pipelines.
Creating Components in Kubeflow Pipelines
Kubeflow components are the building blocks of a pipeline. Each component performs a specific task, such as data preprocessing, model training, or evaluation. Components can be created using the @component decorator from the kfp library.
Here’s a breakdown of how to create and use components in a Kubeflow pipeline:
- Define the Component: Use the @component decorator to define the component, specifying any required packages and the base image.
- Implement the Function: Implement the function that performs the desired operation. This function receives input parameters and outputs results.
- Return Outputs: Use a NamedTuple to specify the outputs of the component, if there are any (see the minimal sketch below).
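A minimal, hypothetical component to illustrate the pattern (the function and its inputs are made up for this example; with KFP SDK 1.8.x the decorator is imported from kfp.v2.dsl):

```python
from typing import NamedTuple
from kfp import dsl

@dsl.component(base_image="python:3.9", packages_to_install=["pandas"])
def summarize_csv(csv_path: str) -> NamedTuple("Outputs", [("n_rows", int), ("n_cols", int)]):
    """Toy component: returns the shape of a CSV file."""
    from collections import namedtuple
    import pandas as pd

    df = pd.read_csv(csv_path)
    outputs = namedtuple("Outputs", ["n_rows", "n_cols"])
    return outputs(len(df), len(df.columns))
```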
Check Skewness and Deploy Component
Purpose: This component evaluates the skewness of a specified column in a dataset and makes a deployment decision based on the skewness results. If the data is skewed, the component returns a decision to proceed with model training.
Explanation:
- Inputs:
  - project, location, api_endpoint: GCP project and endpoint details.
  - bucket_name, file_path: Location of the CSV file in Google Cloud Storage.
  - target_column: Column to check for skewness.
  - metrics: Artifact for logging evaluation metrics.
- Functionality: Loads data from GCS, checks the skewness of the specified target column, logs the skewness result, and makes a deployment decision based on the skewness value (a hedged sketch of such a component follows).
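A hedged sketch of such a component, with an illustrative skewness threshold (0.5) and a string dep_decision output that the pipeline condition can compare against "true":

```python
from typing import NamedTuple
from kfp import dsl
from kfp.dsl import Metrics, Output

@dsl.component(base_image="python:3.9", packages_to_install=["pandas", "gcsfs"])
def check_skewness_and_deploy(
    project: str,
    location: str,
    api_endpoint: str,
    bucket_name: str,
    file_path: str,
    target_column: str,
    metrics: Output[Metrics],
) -> NamedTuple("Outputs", [("dep_decision", str)]):
    """Checks skewness of the target column and decides whether to proceed."""
    from collections import namedtuple
    import pandas as pd

    # Read the CSV straight from GCS (gcsfs backs the gs:// path for pandas).
    df = pd.read_csv(f"gs://{bucket_name}/{file_path}")

    # Compute and log the skewness of the target column.
    skew = float(df[target_column].skew())
    metrics.log_metric("skewness", skew)

    # Illustrative rule: proceed with training when the data is noticeably skewed.
    dep_decision = "true" if abs(skew) > 0.5 else "false"

    outputs = namedtuple("Outputs", ["dep_decision"])
    return outputs(dep_decision)
```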
Custom Training Job Task
Purpose: This component initiates a custom training job on Google Cloud AI Platform using predefined specifications.
Explanation:
- Inputs:
  - display_name: Human-readable name for the training job.
  - worker_pool_specs: Specifications for the training job, including machine types and Docker images.
- Functionality: Executes the custom training job as per the provided configurations.
Import Unmanaged Model Task
Purpose: Imports a model that has been trained outside the Vertex AI environment into Vertex AI for further operations.
Explanation:
- Inputs:
  - artifact_uri: Path to the directory where the model artifacts are stored.
  - artifact_class: Class of the artifact being imported.
  - metadata: Metadata including container specifications.
- Functionality: Imports the unmanaged model artifact for use in Vertex AI.
Model Upload Operation
Purpose: Upload the imported model to Vertex AI to make it available for deployment or further operations.
Explanation:
- Inputs:
  - project, display_name: Project ID and display name for the model.
  - unmanaged_container_model: Reference to the imported model artifact.
- Functionality: Uploads the model to Vertex AI, making it available for use in endpoints or further evaluations.
Regression Model Evaluation Metrics Component
Purpose: Evaluates the performance of the trained regression model by fetching metrics from Vertex AI or a specified GCS location and logs the results.
Explanation:
- Inputs:
  - project, location, api_endpoint: GCP project and endpoint details.
  - model: Reference to the model artifact.
  - metrics: Artifact for logging evaluation metrics.
  - gcs_input_path, gcs_output_path: Paths to input and output locations in GCS.
- Functionality: Fetches model evaluation metrics, logs them, and makes a deployment decision based on the evaluation results (a hedged sketch follows).
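A hedged sketch of this component. It takes the simpler of the two options described above: reading an evaluation-metrics JSON (assumed to be written by the training script to gcs_output_path) rather than calling the Vertex AI evaluation API, and gating deployment on an illustrative MAE threshold:

```python
from typing import NamedTuple
from kfp import dsl
from kfp.dsl import Artifact, Input, Metrics, Output

@dsl.component(base_image="python:3.9", packages_to_install=["google-cloud-storage"])
def evaluate_regression_model(
    project: str,
    location: str,
    api_endpoint: str,
    model: Input[Artifact],
    metrics: Output[Metrics],
    gcs_input_path: str,
    gcs_output_path: str,
) -> NamedTuple("Outputs", [("dep_decision", str)]):
    """Reads evaluation metrics produced by training and gates deployment."""
    import json
    from collections import namedtuple
    from google.cloud import storage

    # The training job is assumed to have written e.g. {"mae": 2.7} to this path.
    client = storage.Client(project=project)
    bucket_name, blob_path = gcs_output_path.replace("gs://", "").split("/", 1)
    eval_results = json.loads(client.bucket(bucket_name).blob(blob_path).download_as_text())

    mae = float(eval_results["mae"])
    metrics.log_metric("mae", mae)

    # Illustrative gate: deploy only if MAE is below a fixed threshold.
    dep_decision = "true" if mae < 5.0 else "false"

    outputs = namedtuple("Outputs", ["dep_decision"])
    return outputs(dep_decision)
```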
Endpoint Creation Operation
Purpose: Creates an endpoint in Vertex AI to deploy the trained model for serving predictions.
Explanation:
- Inputs:
  - project, display_name: Project ID and display name for the endpoint.
- Functionality: Creates an endpoint in Vertex AI where the model can be deployed and accessed for serving predictions.
Pipeline Definition
The pipeline orchestrates the execution of the above components based on the conditions set by the skewness check and model evaluation metrics.
- Check Skewness Task (check_skewness_and_deploy): This component checks the skewness of a specified target column in a CSV file located in a Google Cloud Storage (GCS) bucket. It determines whether the data is skewed based on predefined thresholds. Outputs a decision (dep_decision) on whether to proceed with further pipeline steps (true or false).
- Conditional Execution (proceed_with_training): This condition checks the output of the check_skewness_task. If dep_decision is true, it proceeds with the training steps; otherwise, it skips them.
- Custom Job Task (custom_job_task): Initiates a custom training job, such as training a TensorFlow model on a TPU (Tensor Processing Unit). Uses specified worker pool configurations (WORKER_POOL_SPECS).
- Import Unmanaged Model Task (import_unmanaged_model_task): Imports an unmanaged containerized model from a specified artifact URI (WORKING_DIR). This model is used for further evaluation and deployment.
- Model Upload (model_upload_op): Uploads the trained model (from custom_job_task) or imported model (from import_unmanaged_model_task) to Google Cloud's AI Platform. Specifies metadata like project ID, model display name, and model type (unmanaged container model).
- Model Evaluation Task (model_eval_task): Evaluates the model's performance metrics, such as Mean Absolute Error (MAE), using evaluation data stored in GCS. Logs these metrics for analysis and comparison with previous model versions. Outputs a decision (dep_decision) on whether to deploy the model based on the evaluation metrics.
- Conditional Deployment Decision (deploy_decision): This condition checks the output of model_eval_task. If dep_decision is true, it proceeds with deployment; otherwise, it skips deployment.
- Endpoint Creation (endpoint_create_op): Creates an endpoint on Google Cloud's AI Platform for serving the model predictions. Specifies project ID and endpoint display name.
- Model Deployment (ModelDeployOp): Deploys the model to the created endpoint. Configures deployment settings such as compute resources (CPU/GPU), replica counts, and accelerator type (if applicable).
How They Connect:
- The pipeline starts by checking data skewness (check_skewness_task).
- If the data is skewed (dep_decision is true), it proceeds with training (custom_job_task), model import (import_unmanaged_model_task), model upload (model_upload_op), and model evaluation (model_eval_task).
- If the evaluation metrics are satisfactory (dep_decision is true), it proceeds with endpoint creation (endpoint_create_op) and model deployment (ModelDeployOp). A condensed code sketch of this wiring is shown below.
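Putting the pieces together, a condensed sketch of such a pipeline might look like this. The prebuilt operators come from google-cloud-pipeline-components (v1) and kfp; the custom components, the pipeline name, the data-location placeholders, and the eval_metrics.json path are assumptions carried over from the earlier sketches:

```python
# Placeholder data location and target column for this sketch.
DATA_BUCKET = "my-mlops-bucket"
DATA_FILE = "data/train.csv"
TARGET_COLUMN = "target"

@kfp.dsl.pipeline(name="ct-training-pipeline", pipeline_root=PIPELINE_ROOT)
def pipeline():
    # 1. Gate the whole run on data skewness.
    check_skewness_task = check_skewness_and_deploy(
        project=PROJECT_ID,
        location=REGION,
        api_endpoint=f"{REGION}-aiplatform.googleapis.com",
        bucket_name=DATA_BUCKET,
        file_path=DATA_FILE,
        target_column=TARGET_COLUMN,
    )

    with dsl.Condition(check_skewness_task.outputs["dep_decision"] == "true",
                       name="proceed-with-training"):
        # 2. Custom training job using the worker pool defined earlier.
        custom_job_task = CustomTrainingJobOp(
            project=PROJECT_ID,
            location=REGION,
            display_name="ct-custom-train",
            worker_pool_specs=WORKER_POOL_SPECS,
        )

        # 3. Import the model artifacts written by the training job.
        import_unmanaged_model_task = dsl.importer(
            artifact_uri=WORKING_DIR,
            artifact_class=artifact_types.UnmanagedContainerModel,
            metadata={"containerSpec": {"imageUri": DEPLOY_IMAGE}},
        ).after(custom_job_task)

        # 4. Register the model in Vertex AI.
        model_upload_op = ModelUploadOp(
            project=PROJECT_ID,
            display_name=MODEL_DISPLAY_NAME,
            unmanaged_container_model=import_unmanaged_model_task.outputs["artifact"],
        )

        # 5. Evaluate the model and gate deployment on the metrics.
        model_eval_task = evaluate_regression_model(
            project=PROJECT_ID,
            location=REGION,
            api_endpoint=f"{REGION}-aiplatform.googleapis.com",
            model=model_upload_op.outputs["model"],
            gcs_input_path=f"gs://{DATA_BUCKET}/{DATA_FILE}",
            gcs_output_path=f"{WORKING_DIR}/eval_metrics.json",
        )

        with dsl.Condition(model_eval_task.outputs["dep_decision"] == "true",
                           name="deploy-decision"):
            # 6. Create an endpoint and deploy the model to it.
            endpoint_create_op = EndpointCreateOp(
                project=PROJECT_ID,
                display_name="ct-training-endpoint",
            )
            ModelDeployOp(
                endpoint=endpoint_create_op.outputs["endpoint"],
                model=model_upload_op.outputs["model"],
                deployed_model_display_name=MODEL_DISPLAY_NAME,
                dedicated_resources_machine_type=DEPLOY_COMPUTE,
                dedicated_resources_accelerator_type="NVIDIA_TESLA_K80",
                dedicated_resources_accelerator_count=DEPLOY_NGPU,
                dedicated_resources_min_replica_count=1,
                dedicated_resources_max_replica_count=1,
            )
```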
Documentation Links:
- Kubeflow Pipelines Documentation (https://www.kubeflow.org/docs/components/pipelines/)
Compile the Pipeline
This compiles the pipeline function defined with the Kubeflow Pipelines DSL (@kfp.dsl.pipeline) into the JSON format specified by package_path="ct_train.json". The resulting file (ct_train.json) contains the serialized definition of your pipeline.
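A one-cell sketch of the compile step (with KFP SDK 1.8.x, import the compiler from kfp.v2 instead):

```python
from kfp import compiler

compiler.Compiler().compile(
    pipeline_func=pipeline,
    package_path="ct_train.json",
)
```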
Define Pipeline Job
- Define a PipelineJob object (job) using the Vertex AI SDK (imported as aip).
- DISPLAY_NAME: Specifies the display name for the pipeline job.
- template_path: Points to the compiled pipeline template (ct_train.json).
- pipeline_root: Specifies the root directory for storing pipeline artifacts and metadata (PIPELINE_ROOT).
Run the Pipeline Job
- This method call (job.run()) triggers the execution of your compiled pipeline (ct_train.json) on Vertex AI Pipelines.
- The pipeline execution starts from the beginning of the pipeline function definition and follows the steps you've defined in it.
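A minimal sketch covering both the job definition and the run call; the display name is a placeholder, and passing the service account is optional if your default credentials already have the required roles:

```python
DISPLAY_NAME = "ct-training-pipeline-job"   # placeholder

job = aip.PipelineJob(
    display_name=DISPLAY_NAME,
    template_path="ct_train.json",
    pipeline_root=PIPELINE_ROOT,
)

# Submits the job to Vertex AI Pipelines and waits for it to finish.
job.run(service_account=SERVICE_ACCOUNT)
```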
Summary:
Here, we built a comprehensive end-to-end training pipeline that seamlessly integrates data skewness analysis into Vertex AI Pipelines (https://cloud.google.com/vertex-ai/docs/pipelines/introduction). This serverless approach automates the entire workflow from training to deployment. Future articles will delve into advanced topics such as Vertex ML Metadata, Feature Store, TensorFlow Data Validation, and Apache Beam pipelines, further enhancing our understanding and capabilities in machine learning operations.