Serverless MLflow Tracking in Google Cloud Run

Experiment tracking is one of the cornerstones of any serious machine learning project. Similar to a scientist’s lab notebook, it helps data scientists keep a record of all experiments, their parameters, and associated outcomes, like performance metrics. It frees them up to efficiently and reliably experiment with different feature and model engineering strategies and compare them easily. A multitude of free and commercial solutions have emerged in this space, each promising to make experiment tracking easier than ever.

However, there’s a catch: deploying and operating such (open-source) tools over the lifecycle of one or multiple machine learning projects is anything but trivial. While commercial SaaS solutions can alleviate this pain, they bring their own trade-offs, for example when it comes to data sovereignty and overall costs.

This tutorial is aimed at data scientists, machine learning engineers, or (Dev)Ops engineers in small teams. It provides a starting point for your MLOps efforts by demonstrating how to implement experiment tracking while adhering to good engineering practices for deployment and operations. Even if you don't intend to use experiment tracking in your ML workflow, you may find this information valuable: The European Union recently passed the AI Act, a regulation that could impact your use cases and may eventually require you to implement such practices.

MLflow is a widely used, feature-rich experiment tracking (and more) platform, and is distributed under a permissive open-source license. We are going to look at a cost-effective way of deploying an MLflow tracking server on Cloud Run in Google Cloud Platform (GCP). Highlights include:

  • Persistent state storage with automated backups
  • Authentication for access control
  • Flexible scaling based on request load with automatic load balancing
  • Automatic DNS record and SSL certificate management
  • Dependable and reproducible deployment through Pulumi as an infrastructure-as-code tool

By the end of this article, you’ll be equipped with the essential tools and resources to deploy your own MLflow tracking server in Google Cloud and embark on your journey towards Trustworthy AI by keeping a record of all machine learning experiments. While this particular solution focuses on Google Cloud Platform, it can easily be adapted for other cloud providers. No extensive knowledge of the tools involved is assumed; the tutorial offers explanations and links to appropriate resources throughout. The steps have been tested on recent versions of macOS and Linux.

The code for this project is available on GitHub if you prefer to follow along with the final result at hand, or want to play around with the code yourself.

The Big Picture

There are multiple ways of deploying an MLflow tracking server (refer to the MLflow docs for more information), which provides the service interface for the MLflow client library (utilized in your ML code) and a user-friendly web UI. The choice of deployment architecture depends on where you want to store tracking metadata and artifacts.

For the proposed architecture in Google Cloud, we are going for a remote tracking server that keeps metadata (and authentication information) in a Cloud SQL database and artifacts in a Cloud Storage bucket. Clients can access the tracking server via a secure HTTPS endpoint with load balancing and HTTP Basic Auth for added security.

Let’s have a look at the major components of the architecture which we will build up in the remainder of this tutorial:

Clients (whether humans using the web UI or the mlflow Python client package or command-line tool) access the tracking service deployed in Cloud Run via authenticated HTTPS requests (label 1). The tracking server stores tracking metadata (label 2) as well as user accounts and permissions (label 3) in a Cloud SQL database instance. GCP performs automatic incremental backups with point-in-time recovery of these databases on a nightly schedule (label 4). Logged artifacts are securely stored in a Cloud Storage bucket (label 5). The custom MLflow container image used to launch Cloud Run service instances is hosted in an Artifact Registry repository (label 6). The authentication configuration for the MLflow service is kept confidential in Secret Manager (label 7).

This article continues by demonstrating how to connect and deploy each component of the proposed architecture: After completing the prerequisites, we will see how to containerize MLflow and store the image in Artifact Registry. Then, we will declare the necessary infrastructure for storing tracking metadata and artifacts and configure authentication. Putting everything together, we will then deploy the actual MLflow tracking service in Cloud Run and expose it to the Internet. Finally, we will look at a demo ML experiment in Python to test the stack and discuss the costs for this setup.

Prerequisites

The architecture developed in this article builds on top of various tools and technologies. To follow along, you will need the following prerequisites on your machine:

Additionally, you need a Google Cloud Platform account and access to a project (with appropriate permissions). If you do not have an account already, you can get $300 in free credits when you sign up , which is more than enough to follow along with this tutorial (and even operate the deployed instance for quite some time afterward).

If you are creating a new Google Cloud project, you have to manually enable the Compute Engine API in Cloud Console before you can provision the rest of the infrastructure using Pulumi.

Setting up a new Pulumi project

Creating a trustworthy cloud infrastructure involves making the process consistent and repeatable. Infrastructure-as-code (IaC) software can help to separate the "what" from the "how" when it comes to defining and deploying resources to a cloud platform such as GCP.

In this article, we will use Pulumi, a popular open-source IaC tool, which offers runtimes for a variety of high-level programming languages, including TypeScript/JavaScript, Python, and Go. By employing Pulumi, we will define and set up all components of our infrastructure in a declarative manner, using the TypeScript runtime. If you want to get a better overview of Pulumi and its capabilities, please check out the official documentation.

Although it is possible to achieve the same goal using Terraform or even just the gcloud CLI tool, Pulumi offers a more enjoyable developer experience, with features such as rich IDE support. Additionally, having the flexibility to use multiple runtime languages can significantly boost productivity and facilitate the introduction of IaC practices within new teams.

As a first step, we will create an empty Pulumi project based on a ready-made template (not surprisingly called gcp-typescript). In an empty directory for your project, run the Pulumi project bootstrap command (pulumi new gcp-typescript) and answer the questions or accept the defaults:

$ mkdir <path-to-project> && cd <path-to-project>
$ pulumi new gcp-typescript
This command will walk you through creating a new Pulumi project.

Enter a value or leave blank to accept the (default), and press <ENTER>.
Press ^C at any time to quit.

project name (gcp-mlflow-cloud-run): <YOUR PROJECT NAME>
project description (A minimal Google Cloud TypeScript Pulumi program): Serverless MLflow in Google Cloud Run 
Created project 'gcp-mlflow-cloud-run'

Please enter your desired stack name.
To create a stack in an organization, use the format <org-name>/<stack-name> (e.g. `acmecorp/dev`).
stack name (dev): <YOUR STACK NAME>
Created stack 'dev'

gcp:project: The Google Cloud project to deploy into: <YOUR PROJECT ID>
Saved config

Installing dependencies...

[... omitted ...]

Your new project is ready to go!

To perform an initial deployment, run `pulumi up`
        

Pulumi uses your answers to generate the initial project structure, set up all required TypeScript dependencies, and create the configuration for a new stack, the basic building block of every Pulumi program.

Before we can deploy the stack to GCP, we need to set the default region (e.g., europe-west3) in the stack config, so that all regional cloud resources are created in the right location:

$ pulumi config set gcp:region "<region>"        

Running pulumi preview serves as a quick sanity check for your setup. If you see any errors here, address them before proceeding with the next steps.

$ pulumi preview
Previewing update (dev)

View in Browser (Ctrl+O): [...]

     Type                   Name                      Plan       
 +   pulumi:pulumi:Stack    gcp-mlflow-cloud-run-dev  create     
 +   └─ gcp:storage:Bucket  my-bucket                 create     

Outputs:
    bucketName: output<string>

Resources:
    + 2 to create        

That’s a promising start! Now, let’s remove the my-bucket resource from the index.ts file (which was put there by the project template) since we are ready to define our resources from scratch in the next steps.

Note: Throughout the rest of this article, all code snippets in the text refer to the index.ts file (add them at the end of the file), unless otherwise mentioned.

Enabling Google Cloud Service APIs

Before we can move ahead and define the actual resources for the MLflow deployment, we need to prepare the Google Cloud project: Only a few Google Cloud services are enabled by default in a newly created GCP project, most other Cloud APIs need to be enabled before these services and their functionality become usable.

For our project, we will need access to several additional Google Cloud service APIs: Compute Engine, Artifact Registry, Cloud Run, Cloud SQL Admin, and Secret Manager.

The following Pulumi snippet fetches the GCP provider configuration from the stack and enables these services. Execute pulumi up to apply your changes before proceeding (since enabling the APIs takes a few moments to take effect):

// Provider configuration
const gcpConfig = new pulumi.Config("gcp");
const project = gcpConfig.require("project");
const location = gcpConfig.require("region");

// Enable service APIs
const apis = [
  "compute",
  "artifactregistry",
  "run",
  "sqladmin",
  "secretmanager",
];
for (const api of apis) {
  new gcp.projects.Service(`${api} API`, {
    service: `${api}.googleapis.com`,
    disableDependentServices: true,
    disableOnDestroy: false,
  });
}        

Building an MLflow tracking server container image

Containerizing MLflow

Cloud Run is a container runtime platform, so we need to provide a Docker image in order to run MLflow there. Unfortunately, MLflow does not provide any official container images for running a remote tracking server. Instead, we will build our own image with the necessary Python dependencies to access the cloud resources for data storage.

Developing a Dockerfile for MLflow (located under docker/mlflow/Dockerfile) is straightforward: We install the required packages to access PostgreSQL databases (the type offered by Cloud SQL) and Cloud Storage buckets. Additionally, build arguments allow us to customize both the Python and MLflow versions if needed:

ARG PYTHON_VERSION=3.12
FROM python:${PYTHON_VERSION}-slim
ARG MLFLOW_VERSION=2.12.1
RUN pip --no-cache-dir install \
    mlflow==${MLFLOW_VERSION} \
    google-cloud-storage \
    psycopg2-binary        

Note: This Dockerfile is deliberately kept brief for this tutorial. You might want to apply best practices such as running as a non-root user, minimizing the image size, and applying labels to the image.

Building container images with Pulumi

To build container images with Pulumi, we must first include the @pulumi/docker dependency in our project. Since we will also need @pulumi/random later to generate random identifiers and passwords, let’s add it now as well:

$ npm add @pulumi/docker @pulumi/random        

In our Pulumi program, we can now add a local Docker image build (note that Cloud Run operates on a linux/amd64 platform, so we specify that explicitly) and an Artifact Registry repository resource in our GCP project as its destination:

import * as docker from "@pulumi/docker";
import * as random from "@pulumi/random";

// ...

// Artifact Registry repository for container images
const repo = new gcp.artifactregistry.Repository("repository", {
  repositoryId: "images",
  format: "DOCKER",
});
const repoUrl = pulumi.interpolate`${repo.location}-docker.pkg.dev/${repo.project}/${repo.repositoryId}`;

// MLflow container image
const image = new docker.Image("mlflow", {
  imageName: pulumi.interpolate`${repoUrl}/mlflow`,
  build: {
    context: "docker/mlflow",
    platform: "linux/amd64",
  },
});
export const imageDigest = image.repoDigest;        

We can now run pulumi up again, which will bring up the resources we just defined. Depending on your machine’s capabilities and your network connection, building and pushing the container image might take a few moments. The imageDigest stack output is set to the full image name (including the registry URL and SHA digest of the image).

Note: If you receive an authentication error during the image push operation, make sure you have correctly set up Docker authentication for Google Cloud (e.g., by running gcloud auth configure-docker for your Artifact Registry region).

Trying out the custom MLflow container image

After the Pulumi operation completes, we can run MLflow from the newly built container image in Artifact Registry to make sure it works as intended. The imageDigest stack output can be directly passed as an argument to docker run for that purpose:

$ docker run -it --rm \
    $(pulumi stack output imageDigest) \
    mlflow --version
mlflow, version 2.12.1        

Note: If you are working on a Mac with Apple Silicon, you might encounter a warning saying the image's platform does not match the host platform. This is expected since we built the image for Cloud Run’s target platform (linux/amd64), so you can safely ignore this warning.

Storing Data: Cloud Storage and Cloud SQL

Now that we have a custom MLflow image prepared, we need to establish a persistent storage location for the tracking server: This includes a Cloud Storage bucket for storing artifacts and a Cloud SQL PostgreSQL instance for managing tracking metadata and authentication information.

First, let’s define the Pulumi resources for a storage bucket with a unique name and public access disabled for security reasons. You may choose a different GCS bucket location if desired:

// Storage Bucket for artifacts
const bucketSuffix = new random.RandomId("artifact bucket suffix", {
  byteLength: 4,
});
const artifactBucket = new gcp.storage.Bucket("artifacts", {
  name: pulumi.concat("mlflow-artifacts-", bucketSuffix.hex),
  location: "EU",
  uniformBucketLevelAccess: true,
  publicAccessPrevention: "enforced",
});
export const bucket = artifactBucket.name;        

The bucket stack output will contain the name of the GCS bucket, allowing for easy identification in the Cloud Console.
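For intuition, the unique name generated by the RandomId resource can be mimicked in plain Python; the prefix and byte length below mirror the Pulumi snippet above, and the function itself is purely illustrative:

```python
import secrets


def unique_bucket_name(prefix: str = "mlflow-artifacts-", byte_length: int = 4) -> str:
    # Mirrors the RandomId resource: <byte_length> random bytes, hex-encoded,
    # yield globally unique names such as "mlflow-artifacts-fc7c8eed".
    return prefix + secrets.token_hex(byte_length)


print(unique_bucket_name())
```

Four random bytes (eight hex characters) are enough to avoid collisions with other buckets in the global GCS namespace for our purposes.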

Setting up Cloud SQL involves a few supporting resources, including user credentials and two databases for tracking and authentication data. The following code snippet creates these resources, as well as the server instance itself:

// Cloud SQL instance for tracking backend storage and authentication data
const instance = new gcp.sql.DatabaseInstance("mlflow", {
  databaseVersion: "POSTGRES_15",
  deletionProtection: false,
  settings: {
    tier: "db-f1-micro",
    availabilityType: "ZONAL",
    activationPolicy: "ALWAYS",
  },
});

const trackingDb = new gcp.sql.Database("tracking", {
  instance: instance.name,
  name: "mlflow",
});

const authDb = new gcp.sql.Database("auth", {
  instance: instance.name,
  name: "mlflow-auth",
});

const dbPassword = new random.RandomPassword("mlflow", {
  length: 16,
  special: false,
});
const user = new gcp.sql.User("mlflow", {
  instance: instance.name,
  name: "mlflow",
  password: dbPassword.result,
});

export const trackingDbInstanceUrl = pulumi.interpolate`postgresql://${user.name}:${user.password}@/${trackingDb.name}?host=/cloudsql/${instance.connectionName}`;
export const authDbInstanceUrl = pulumi.interpolate`postgresql://${user.name}:${user.password}@/${authDb.name}?host=/cloudsql/${instance.connectionName}`;        

The trackingDbInstanceUrl and authDbInstanceUrl stack outputs contain the PostgreSQL connection strings for the created databases (which will be passed to MLflow later on).
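For reference, the connection string format produced by the pulumi.interpolate expressions above can be sketched as a small Python helper; all values in the example call (user, password, connection name) are made-up placeholders:

```python
def cloud_sql_url(user: str, password: str, db: str, connection_name: str) -> str:
    # Cloud Run mounts the Cloud SQL unix socket at /cloudsql/<connection-name>,
    # so the "host" is passed to the PostgreSQL driver via a query parameter
    # instead of the usual hostname position.
    return f"postgresql://{user}:{password}@/{db}?host=/cloudsql/{connection_name}"


# Placeholder values for illustration only:
print(cloud_sql_url("mlflow", "s3cret", "mlflow", "my-project:europe-west3:mlflow"))
```

The empty hostname before the `/` in `@/{db}` is intentional: with unix-socket connections, no TCP host is specified.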

Note that spinning up a new Cloud SQL database instance is a relatively slow operation and can take up to 10 minutes to complete.

Authentication Configuration in Secret Manager

Setting up authentication for an MLflow tracking server (see the docs) involves creating a configuration file, named basic_auth.ini in the documentation. This file contains credentials for the administrator account and the database connection string for authentication information.

To securely store such sensitive information (since it contains a password), we use Secret Manager. Secrets can be easily mounted into the file system of Cloud Run services (see the docs for details), which will come in handy later when we configure the MLflow container to use the authentication config file.

The following snippet adds a new secret and its version containing the authentication config, and exports the initial credentials as stack outputs (adminUsername/adminPassword):

// Secret Manager
const authSecret = new gcp.secretmanager.Secret("mlflow-basic-auth-conf", {
  secretId: "basic_auth-ini",
  replication: { auto: {} },
});

const adminPw = new random.RandomPassword("mlflow-admin", {
  length: 16,
  special: false,
});
export const adminUsername = "admin";
export const adminPassword = adminPw.result;

const authSecretVersion = new gcp.secretmanager.SecretVersion(
  "mlflow-auth-conf",
  {
    secret: authSecret.id,
    secretData: pulumi.interpolate`[mlflow]
default_permission = READ
database_uri = ${authDbInstanceUrl}
admin_username = ${adminUsername}
admin_password = ${adminPassword}
authorization_function = mlflow.server.auth:authenticate_request_basic_auth
`,
  }
);        

Deploying serverless MLflow in Cloud Run

Cloud Run services have relatively complex definitions, which for example configure the execution environment (e.g., its CPU and RAM resources) and the containerized workload itself (such as its image name and commands used to start the service, as well as the service identity).

First, we create a Google Cloud service account for the MLflow service with appropriate IAM permissions. This will allow the Cloud Run service to access the Cloud SQL database instance and secrets in Secret Manager.

// Service Account and IAM role bindings
const sa = new gcp.serviceaccount.Account("mlflow", {
  accountId: "mlflow",
});
const roles = ["roles/cloudsql.client", "roles/secretmanager.secretAccessor"];
for (const role of roles) {
  new gcp.projects.IAMMember(role, {
    project: project,
    role,
    member: pulumi.concat("serviceAccount:", sa.email),
  });
}

const iam = new gcp.storage.BucketIAMMember("artifacts access", {
  bucket: bucket,
  member: pulumi.concat("serviceAccount:", sa.email),
  role: "roles/storage.objectUser",
});        

With all prerequisites in place, we can now create the actual Cloud Run service for the MLflow tracking server. We'll start by assembling the command-line arguments used to launch the tracking server in a container, where we refer to the previously created services and resources:

const command = [
  "mlflow",
  "server",
  "--host",
  "0.0.0.0",
  "--port",
  "5000",
  "--artifacts-destination",
  artifactBucket.name.apply((name) => `gs://${name}`),
  "--backend-store-uri",
  trackingDbInstanceUrl,
  "--app-name",
  "basic-auth",
];        

Let’s break this command line apart a little bit further:

  • mlflow server is the MLflow CLI command used to launch a tracking server.
  • --host 0.0.0.0 and --port 5000 instruct the server to listen on all network interfaces of the container, bound to TCP port 5000.
  • --artifacts-destination configures Cloud Storage as the storage location for artifacts by providing the name of the previously created bucket.
  • --backend-store-uri passes the connection string for the Cloud SQL database that will be used to store tracking metadata.
  • --app-name basic-auth enables HTTP Basic Auth authentication.

We use this command line to define a Cloud Run service with all necessary metadata:

const service = new gcp.cloudrunv2.Service("mlflow", {
  location,
  template: {
    serviceAccount: sa.email,
    volumes: [
      {
        name: "auth-config",
        secret: {
          secret: authSecret.id,
        },
      },
      {
        name: "cloudsql",
        cloudSqlInstance: {
          instances: [instance.connectionName],
        },
      },
    ],
    containers: [
      {
        image: imageDigest,
        commands: command,
        volumeMounts: [
          {
            name: "auth-config",
            mountPath: "/secrets",
          },
          {
            name: "cloudsql",
            mountPath: "/cloudsql",
          },
        ],
        ports: [{ containerPort: 5000 }],
        envs: [
          {
            name: "MLFLOW_AUTH_CONFIG_PATH",
            value: pulumi.interpolate`/secrets/${authSecret.secretId}`,
          },
        ],
        resources: {
          limits: {
            memory: "1024Mi",
            cpu: "1",
          },
          startupCpuBoost: true,
        },
      },
    ],
  },
});

export const serviceUrl = service.uri;        

Let's examine each component of this Cloud Run configuration in detail:

  • The service uses the service account defined earlier.
  • We configure volumes for access to the Cloud SQL instance and the Secret Manager secret containing the authentication config.
  • The service consists of a single container, which uses the MLflow container image we built at the beginning of this post.
  • The command used to start the MLflow server process comes from the previous step.
  • The authentication config file secret and a Cloud SQL socket are mounted from volumes.
  • The MLflow server listens on port 5000, so it is exposed from the container.
  • We set the MLFLOW_AUTH_CONFIG_PATH environment variable to tell MLflow about the configuration file for authentication, which is located at the mount path defined above.
  • Resource limits control the amount of CPU and memory available to the service instance, as well as allocating additional CPU resources during container startup.

After running pulumi up, the serviceUrl stack output contains the HTTPS URL of the MLflow service.

It’s worth noting that we have not defined any domain names for our service or dealt with SSL certificates: Cloud Run can handle these things automatically, but it also allows using custom domains and certificates.

By default, Cloud Run services are private, meaning they can only be accessed by authenticated Google Cloud IAM users with Project Owner, Project Editor, or Cloud Run Admin / Invoker permissions. However, since we have deployed MLflow with its own authentication mechanism, we want to disable Cloud Run authentication. The following snippet grants the Cloud Run Invoker role to the special allUsers principal, allowing public (unauthenticated) access to the service from the Internet:

// Allow unauthenticated public access to the service endpoint
new gcp.cloudrunv2.ServiceIamBinding("mlflow-public-access", {
  name: service.name,
  project,
  location,
  role: "roles/run.invoker",
  members: ["allUsers"],
});        

After all resources have been provisioned, your stack should look similar to this (from the output of pulumi stack):

$ pulumi stack
[...]

Current stack resources (26):
    TYPE                                                   NAME
    pulumi:pulumi:Stack                                    gcp-mlflow-cloud-run-dev
    ├─ gcp:serviceaccount/account:Account                  mlflow
    ├─ gcp:projects/service:Service                        secretmanager API
    ├─ random:index/randomId:RandomId                      artifact bucket suffix
    ├─ random:index/randomPassword:RandomPassword          mlflow-admin
    ├─ gcp:projects/service:Service                        run API
    ├─ gcp:projects/service:Service                        sqladmin API
    ├─ gcp:projects/service:Service                        servicenetworking API
    ├─ gcp:secretmanager/secret:Secret                     mlflow-basic-auth-conf
    ├─ gcp:artifactregistry/repository:Repository          repository
    ├─ random:index/randomPassword:RandomPassword          mlflow
    ├─ gcp:projects/iAMMember:IAMMember                    roles/cloudsql.client
    ├─ gcp:projects/iAMMember:IAMMember                    roles/secretmanager.secretAccessor
    ├─ gcp:storage/bucket:Bucket                           artifacts
    ├─ docker:index/image:Image                            mlflow
    ├─ gcp:storage/bucketIAMMember:BucketIAMMember         artifacts access
    ├─ gcp:sql/databaseInstance:DatabaseInstance           mlflow
    ├─ gcp:sql/user:User                                   mlflow
    ├─ gcp:sql/database:Database                           auth
    ├─ gcp:sql/database:Database                           tracking
    ├─ gcp:secretmanager/secretVersion:SecretVersion       mlflow-auth-conf
    ├─ gcp:cloudrunv2/service:Service                      mlflow
    ├─ gcp:cloudrunv2/serviceIamBinding:ServiceIamBinding  mlflow-public-access
    ├─ pulumi:providers:gcp                                default_7_11_2
    ├─ pulumi:providers:random                             default_4_16_0
    └─ pulumi:providers:docker                             default_4_5_1

Current stack outputs (7):
    OUTPUT                 VALUE
    adminPassword          [secret]
    adminUsername          admin
    authDbInstanceUrl      [secret]
    bucket                 mlflow-artifacts-fc7c8eed
    imageDigest            europe-west3-docker.pkg.dev/prj-mlo-dev-sandbox1/images/mlflow@sha256:efaca49b6bc83a429921a0a820b37e733947eabf94f25355eba9794d9b78ecdf
    serviceUrl             https://mlflow-215bfd0-fspgz6gyhq-ey.a.run.app
    trackingDbInstanceUrl  [secret]        

Taking it for a spin

Now that we have deployed the MLflow tracking server, we can start logging experiments there. In the following steps, we'll connect to the MLflow tracking server and perform a simple model training experiment.

The example assumes you have access to a Python (virtual) environment with the required packages installed (mlflow, scikit-learn, pandas). The companion repository for this post contains a requirements.txt file and the complete demo.py experiment code.

Since our server has authentication enabled, we have to provide credentials to access it. The MLflow Python client can either read these credentials from environment variables or a configuration file located in your home directory at ~/.mlflow/credentials. Since it is more explicit, especially when working with multiple tracking servers, the following example will use the MLFLOW_TRACKING_ environment variables.
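If you prefer the credentials file instead, it would look roughly like the following sketch (based on the MLflow authentication docs; the password value is a placeholder):

```ini
; ~/.mlflow/credentials
[mlflow]
mlflow_tracking_username = admin
mlflow_tracking_password = <your-admin-password>
```

Keep in mind that this file stores the password in plain text, so restrict its file permissions accordingly.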

The Pulumi stack outputs contain all necessary credentials. You can directly extract them into environment variables and use the mlflow command-line client to validate your setup by listing all experiments on the server:

$ export MLFLOW_TRACKING_URI=$(pulumi stack output serviceUrl)
$ export MLFLOW_TRACKING_USERNAME=$(pulumi stack output adminUsername)
$ export MLFLOW_TRACKING_PASSWORD=$(pulumi stack output --show-secrets adminPassword)
$ mlflow experiments search
Experiment Id    Name     Artifact Location  
---------------  -------  -------------------
0                Default  mlflow-artifacts:/0        

Since we have just deployed the server, the list only shows the default experiment created during the first launch of the tracking server. Nevertheless, this result demonstrates that several things were successful:

  • The HTTPS endpoint has a valid certificate and is publicly accessible
  • Authentication functions correctly and credentials are being validated
  • Communication between the MLflow tracking server and its underlying database is successful

We can now perform a simple machine learning experiment and log it to the tracking server. For simplicity, we will use the scikit-learn library, since it is supported by MLflow’s auto-logging feature.

The following example code trains a basic decision tree classifier for the well-known Iris flower dataset and assesses its performance using the MLflow Model Evaluation API. Optimizing model performance is beyond the scope of this article, so we keep the model architecture and feature engineering as simple as possible.

import mlflow
import pandas as pd
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

mlflow.autolog()
mlflow.start_run()

# Data preprocessing
X, y = datasets.load_iris(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y)

# Model training
model = DecisionTreeClassifier()
model.fit(X_train, y_train)

# Model evaluation - see https://mlflow.org/docs/latest/models.html#model-evaluation
eval_df = pd.concat([X_test, y_test], axis=1)
eval_data = mlflow.data.from_pandas(eval_df, targets=y_test.name, name="test")
result = mlflow.evaluate(
    model=model.predict,
    data=eval_data,
    model_type="classifier",
)
print(result.metrics)        

When you execute this Python script in a shell with the appropriate MLFLOW_TRACKING_ environment variables set, the experiment results will be logged to the tracking server:

$ python demo.py
2024/03/08 10:07:56 INFO mlflow.tracking.fluent: Autologging successfully enabled for sklearn.

[... output omitted ...]

{'example_count': 38, 'accuracy_score': 0.9736842105263158, 'recall_score': 0.9736842105263158, 'precision_score': 0.9760765550239235, 'f1_score': 0.9738570564341891}        

To confirm that the experiment has been logged to the server, you can access the MLflow web UI using the URL and login credentials provided by pulumi stack output --show-secrets. After logging in, you should be able to find the recently executed experiment run in the user interface:

We can see that the scikit-learn autologger has captured the parameters of the trained model as well as the performance metrics calculated through the mlflow.evaluate() API. The input and evaluation datasets are recorded automatically as well. This simple model could now serve as a baseline for further experiments with more advanced model architectures or feature engineering techniques.

This concludes the brief demonstration of using the MLflow Python client to interact with the tracking server deployed in Cloud Run. For more comprehensive guidance on maximizing the potential of your new experiment tracking infrastructure, please check out the official MLflow documentation.

Analyzing the costs

Earlier in this post, we claimed that deploying MLflow in Cloud Run is a cost-effective way to set up experiment tracking infrastructure for small teams. But how does that hold up in practice? Let’s break down the costs by Google Cloud component:

  • Artifact Registry: Up to 0.5 GB of storage is free, and unless you frequently rebuild the MLflow container image without deleting old versions, you will likely remain within the free tier.
  • Cloud Storage: Costs vary depending on the storage location, replication level, size of logged artifacts, and data transfer traffic. Generally, you can expect less than $0.03 per GB per month in storage costs, plus any charges for data transferred out of Google Cloud.
  • Cloud SQL: Cost depends on the region, with a db-f1-micro PostgreSQL instance setting you back around $25-30 per month (including backups with point-in-time recovery).
  • Cloud Run: Estimating the cost is harder due to auto-scaling, which can reduce the instance count to zero depending on request volume and distribution. A pessimistic estimate for a single 1 vCPU / 1 GB RAM instance running 24/7 comes out at around $50 per month.
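As a sanity check on the Cloud Run figure, here is the back-of-the-envelope arithmetic behind the pessimistic always-on estimate. The unit prices are assumptions based on instance-based billing in a typical tier-1 region; check the current Cloud Run pricing page before relying on them:

```python
# Back-of-the-envelope Cloud Run cost estimate for one always-on instance.
# Assumed unit prices (instance-based billing, tier-1 region) -- these may
# differ from current list prices.
PRICE_PER_VCPU_SECOND = 0.000018  # USD, assumed
PRICE_PER_GIB_SECOND = 0.000002   # USD, assumed

seconds_per_month = 30 * 24 * 3600  # 2,592,000 s

cpu_cost = 1 * PRICE_PER_VCPU_SECOND * seconds_per_month    # 1 vCPU
memory_cost = 1 * PRICE_PER_GIB_SECOND * seconds_per_month  # 1 GiB RAM
total = cpu_cost + memory_cost

print(f"CPU: ${cpu_cost:.2f}, RAM: ${memory_cost:.2f}, total: ${total:.2f}/month")
```

Because Cloud Run can scale to zero between requests, actual spending for a lightly used tracking server will typically come in well below this always-on upper bound.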

In summary, with less than $100 per month in spending, you can get a reliable experiment tracking platform that offers authentication, automatic backups, load-based auto-scaling, monitoring, and observability. It can handle multiple users performing collaborative or independent experiments and can easily scale vertically if needed, by changing a few lines of the Infrastructure-as-code definition.

Comparing this to entry-level offerings from commercial providers, such as Weights & Biases (starting at $50 per user per month) or Azure ML (around $45 per month for the smallest possible instance, DS1), it becomes clear that operating your own experiment tracking platform can be a competitive option even for small teams.
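A rough break-even calculation makes the comparison concrete. Treating the self-hosted stack as a flat cost of roughly $80 per month (Cloud SQL plus Cloud Run, rounded up) against a $50 per-seat commercial plan, assumed figures taken from the discussion above:

```python
import math

# Rough break-even: flat self-hosted cost vs. per-seat commercial pricing.
SELF_HOSTED_MONTHLY = 80.0  # USD, approx. Cloud SQL + Cloud Run from above
PER_USER_MONTHLY = 50.0     # USD, e.g. an entry-level per-seat plan

# Smallest team size at which the flat self-hosted cost undercuts
# per-seat pricing.
break_even_users = math.ceil(SELF_HOSTED_MONTHLY / PER_USER_MONTHLY)
print(break_even_users)  # → 2
```

In other words, under these assumptions self-hosting is already cheaper from the second team member on, and the gap widens with every additional user since the self-hosted cost stays flat.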

An alternative architecture for deploying MLflow is to run it on Kubernetes (GKE on Google Cloud). In this scenario, costs consist of two major components: the control plane and the worker node pool. The monthly management fee for a single zonal GKE cluster is $74, and a single e2-standard-2 node (which is relatively small) comes in at around $50 per month. In short, unless your organization is already using GKE (or another managed Kubernetes service), it won’t be cost-effective to adopt it just for your experiment tracking solution. This balance might shift, however, if you leverage Kubernetes for other parts of your ML tool stack.

What’s next?

While this tutorial is already quite lengthy, there are still a few important topics that did not fit in:

  • Network-level access control: While GCP sets up reasonable firewall rules by default, in any serious use case, please make sure to review and adapt the defaults to your needs.
  • Monitoring and Observability: All the components in our architecture expose rich metrics and service health information through Cloud Monitoring. There, you can use pre-defined or custom dashboards to monitor metrics or configure alerts and notifications.
  • Ongoing maintenance: While it’s certainly possible to use Pulumi to keep the infrastructure up to date (in particular, the MLflow version), much of this can be automated. One option is to set up automatic image builds with Cloud Build, which offers a variety of build triggers, combined with automatic deployments to Cloud Run through Cloud Deploy. Together, Cloud Build, Cloud Deploy, and Cloud Run form a serverless Continuous Delivery pipeline.

Conclusion

In this tutorial, you have learned how to create a robust and feature-rich experiment tracking solution using Google Cloud Run and MLflow. Using these building blocks, we’ve built an adaptable and reliable platform that caters to the needs of growing machine learning teams. Cloud Run hosts the containerized tracking server, while Cloud Storage and Cloud SQL ensure dependable long-term storage of metadata and artifacts with automatic backups. Since the infrastructure is defined as code with Pulumi, you can deploy the setup quickly and reliably in your own cloud environment.

Are you ready to embark on your serverless MLflow journey? To help you get started, you can find the full Pulumi definitions and Python example code in the GitHub repository associated with this post at https://github.com/aai-institute/gcp-mlflow-cloud-run . Feel free to explore and modify the code, deploy the tool stack in your own Google Cloud projects, and share feedback in the comments or through the repository.

If you are interested in learning more about machine learning, MLOps, and Trustworthy AI topics, check out the appliedAI Institute for Europe website for additional resources and insights.

Happy experimenting!
