Issue #8: Marvelous MLOps
DBX is a great tool developed by Databricks Labs that simplifies Databricks job deployment by taking care of uploading all dependencies (Python files, whl files). You no longer need Databricks job JSON definitions, which can become huge and hard to read, and DBX supports Jinja, which brings a lot of flexibility. Tagging, passing environment variables, and Python parameters are all possible with DBX as well.
These are the main reasons we chose DBX deployment with a reusable GitHub Actions workflow to standardize ML model deployment on Databricks. When it comes to standardization, we want to keep things modular enough to accommodate more specific use cases. That is why we provide both reusable composite actions and reusable GitHub Actions workflows that work for internal and private repositories within the GitHub organization.
Our Marvelous Actions & Workflow
We created a repository, https://github.com/marvelousmlops/marvelous-workflows, with the following structure:
├── README.md                           <- The top-level README
└── .github
    ├── workflows                       <- Reusable workflows folder
    │   ├── databricks_job_dbx.yml      <- Workflow for dbx deployment
    │   └── databricks_job_dbx.md       <- Workflow documentation
    └── actions                         <- Reusable composite actions
        ├── deploy_dbx                  <- Action for dbx deployment
        └── setup_env_vars              <- Action for setting up env vars
Reusable composite actions
1. setup_env_vars
This action requires databricks_host and databricks_token as inputs and sets up four environment variables, including DATABRICKS_HOST and DATABRICKS_TOKEN.
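A composite action of this kind can be sketched roughly as follows. This is a hypothetical action.yml, not the actual file from the marvelous-workflows repository; the GIT_SHA export is an assumption based on the deployment template shown later in this article:

```yaml
# .github/actions/setup_env_vars/action.yml -- illustrative sketch only
name: "Setup env vars"
description: "Exports Databricks-related environment variables"
inputs:
  databricks_host:
    required: true
  databricks_token:
    required: true
runs:
  using: "composite"
  steps:
    - name: Export environment variables
      shell: bash
      run: |
        # Composite actions expose env vars to later steps via GITHUB_ENV
        echo "DATABRICKS_HOST=${{ inputs.databricks_host }}" >> "$GITHUB_ENV"
        echo "DATABRICKS_TOKEN=${{ inputs.databricks_token }}" >> "$GITHUB_ENV"
        echo "GIT_SHA=${GITHUB_SHA}" >> "$GITHUB_ENV"
```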
This action can be used as:
- name: Setup env vars
  id: setup_env_vars
  uses: marvelousmlops/marvelous-workflows/.github/actions/setup_env_vars@v1
  with:
    databricks_token: ${{ secrets.DATABRICKS_TOKEN }}
    databricks_host: ${{ secrets.DATABRICKS_HOST }}
2. deploy_dbx
This action takes the following inputs: workspace-dir, artifact-location, deployment-file, and run-job-now.
This composite action deploys the job definition with dbx and, depending on the run-job-now input, launches the job right away.
Note: this action requires the DATABRICKS_TOKEN and DATABRICKS_HOST environment variables to be available on the GitHub Actions runner.
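A rough sketch of what such a composite action can look like is shown below. This is illustrative only: the actual action differs, the launched workflow name is taken from the example later in this article, and the exact dbx CLI flags vary between dbx versions:

```yaml
# .github/actions/deploy_dbx/action.yml -- illustrative sketch only
name: "Deploy dbx"
description: "Deploys (and optionally launches) a Databricks job with dbx"
inputs:
  workspace-dir:
    required: true
  artifact-location:
    required: true
  deployment-file:
    required: true
  run-job-now:
    required: false
    default: "no"
runs:
  using: "composite"
  steps:
    - name: Install dbx
      shell: bash
      run: pip install dbx
    # dbx picks up DATABRICKS_HOST / DATABRICKS_TOKEN from the environment;
    # workspace-dir and artifact-location end up in the dbx project config
    # (project setup omitted here; flag names vary between dbx versions)
    - name: Deploy job definition
      shell: bash
      run: dbx deploy --deployment-file "${{ inputs.deployment-file }}"
    - name: Launch job
      if: ${{ inputs.run-job-now == 'yes' }}
      shell: bash
      run: dbx launch train_and_deploy_recommender_model
```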
This action can be used as:
- name: Deploy dbx
  id: deploy_dbx
  uses: marvelousmlops/marvelous-workflows/.github/actions/deploy_dbx@v1
  with:
    workspace-dir: "/Shared/amazon-reviews-databricks"
    artifact-location: "dbfs:/Shared/amazon-reviews-databricks"
    deployment-file: "conf/dbx_deployment.j2"
    run-job-now: "yes"
Reusable GitHub Actions workflow
databricks_job_dbx.yml is a reusable workflow that consists of the following steps:
1. log workflow run metadata
This step is important to be able to find out at a later stage exactly when the workflow ran and which inputs it had.
2. generate token
This step generates a GitHub token from a GitHub App that has read permissions for all repositories within the organization. Find here how to set up an app. Store the GitHub App ID and private key as organization secrets APP_ID and APP_PRIVATE_KEY. We recommend using an organization secret that has all repositories of the organization in its scope in this case.
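One way to mint such a token (an assumption; other token-generating actions exist and the step may look different in the actual workflow) is the tibdex/github-app-token action:

```yaml
# Sketch: mint an installation token from a GitHub App
- name: Generate token
  id: generate_token
  uses: tibdex/github-app-token@v1
  with:
    app_id: ${{ secrets.APP_ID }}
    private_key: ${{ secrets.APP_PRIVATE_KEY }}
```

The resulting token is then available to later steps as ${{ steps.generate_token.outputs.token }}.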
3. setup GIT_TOKEN as an environment variable
4. Checkout marvelousmlops/marvelous-workflows repository using GIT_TOKEN
This step is extremely important. If we want to use composite actions that are defined in the same repository as the GitHub Actions workflow within that workflow, we cannot use a relative path. When the workflow is called from another repository, for example marvelousmlops/amazon-reviews-databricks, the GitHub runner would look for the action in the amazon-reviews-databricks repository and fail.
We could reference our actions as marvelousmlops/deploy_dbx@v1 and marvelousmlops/setup_env_vars@v1, but this approach has a limitation: every time the workflow is updated and a new git tag is created, we would also need to update the action version in the workflow, which leads to a chicken-and-egg situation.
If we just reference them as @master or @develop, it will cause conflicts: if someone chooses to stay on version v1 of the workflow and the actions in the corresponding branch get updated, the workflow may start doing unexpected things. We want all versions to be stable.
That leaves us with the following solution: checking out the workflows repository at a specific reference. That way, we can ensure that version v1 of the workflow will always stay the same.
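Concretely, such a checkout step can look like this (a sketch; the checkout path and the way the ref is passed in are assumptions):

```yaml
# Sketch: check out the workflows repo at a pinned ref so composite
# actions can be referenced by a stable local path
- name: Checkout marvelousmlops/marvelous-workflows
  uses: actions/checkout@v3
  with:
    repository: marvelousmlops/marvelous-workflows
    ref: v1                      # pinned version, e.g. the toolkit-ref input
    token: ${{ env.GIT_TOKEN }}
    path: marvelous-workflows
```

The composite actions can then be used via the local path ./marvelous-workflows/.github/actions/..., which always resolves to the pinned version.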
5. Setup environment variables
This step will execute setup_env_vars action described above.
Values for DATABRICKS_TOKEN and DATABRICKS_HOST are taken from the corresponding secrets. In this article, we explain how to set up a long-lived DATABRICKS_TOKEN for a service principal (SPN) for automation.
6. Deploy dbx. This step will execute the deploy_dbx action described above.
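Putting the steps together, the skeleton of the reusable workflow might look roughly like this. This is a sketch under the assumptions above, not the actual databricks_job_dbx.yml; input names match the caller example later in this article:

```yaml
# Sketch of the reusable workflow skeleton
name: databricks_job_dbx
on:
  workflow_call:
    inputs:
      deployment-file:
        required: true
        type: string
      toolkit-ref:
        required: false
        type: string
        default: "v1"
      run-job-now:
        required: false
        type: string
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3            # check out the calling repo
      - name: Generate token                 # token from the GitHub App
        id: generate_token
        uses: tibdex/github-app-token@v1
        with:
          app_id: ${{ secrets.APP_ID }}
          private_key: ${{ secrets.APP_PRIVATE_KEY }}
      - name: Setup GIT_TOKEN
        run: echo "GIT_TOKEN=${{ steps.generate_token.outputs.token }}" >> "$GITHUB_ENV"
      - name: Checkout marvelous-workflows at a pinned ref
        uses: actions/checkout@v3
        with:
          repository: marvelousmlops/marvelous-workflows
          ref: ${{ inputs.toolkit-ref }}
          token: ${{ env.GIT_TOKEN }}
          path: marvelous-workflows
      - name: Setup env vars                 # local path to pinned action
        uses: ./marvelous-workflows/.github/actions/setup_env_vars
        with:
          databricks_token: ${{ secrets.DATABRICKS_TOKEN }}
          databricks_host: ${{ secrets.DATABRICKS_HOST }}
      - name: Deploy dbx
        uses: ./marvelous-workflows/.github/actions/deploy_dbx
        with:
          deployment-file: ${{ inputs.deployment-file }}
          workspace-dir: "/Shared/${{ github.event.repository.name }}"
          artifact-location: "dbfs:/Shared/${{ github.event.repository.name }}"
          run-job-now: ${{ inputs.run-job-now }}
```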
Deployment example with reusable workflow
To show how to use the reusable workflow, we need another repository. We used https://github.com/marvelousmlops/amazon-reviews-databricks.
name: "Train and deploy amazon models dbx reusable"
on:
  workflow_dispatch:
jobs:
  deploy_job:
    uses: marvelousmlops/marvelous-workflows/.github/workflows/databricks_job_dbx.yml@v1
    with:
      deployment-file: "recommender/dbx_recommender_deployment.yml.j2"
      toolkit-ref: v1
      run-job-now: "True"
    secrets: inherit
This is what the dbx_recommender_deployment.yml.j2 file looks like:
build:
  python: "pip"
environments:
  default:
    workflows:
      - name: "train_and_deploy_recommender_model"
        job_clusters:
          - job_cluster_key: "recommender_cluster"
            new_cluster:
              spark_version: "12.2.x-cpu-ml-scala2.12"
              num_workers: 1
              node_type_id: "Standard_D4s_v5"
              spark_env_vars:
                DATABRICKS_HOST: "{{ env['DATABRICKS_HOST'] }}"
                DATABRICKS_TOKEN: {{ '"{{secrets/keyvault/DatabricksToken}}"' }}
                GIT_SHA: "{{ env['GIT_SHA'] }}"
        tasks:
          - task_key: "train_model"
            job_cluster_key: "recommender_cluster"
            spark_python_task:
              python_file: "file://recommender/train_recommender.py"
              parameters: ["--run_id", "{{parent_run_id}}", "--job_id", "{{job_id}}"]
          - task_key: "deploy_model"
            job_cluster_key: "recommender_cluster"
            spark_python_task:
              python_file: "file://recommender/deploy_recommender.py"
              parameters: ["--run_id", "{{parent_run_id}}", "--job_id", "{{job_id}}"]
            depends_on:
              - task_key: "train_model"
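One subtle detail worth calling out: the DATABRICKS_TOKEN line uses a Jinja escaping trick. dbx renders the template with Jinja first, and the double curly braces that Databricks itself should resolve at runtime (the secret reference) must survive that rendering. Wrapping them in a Jinja string literal achieves exactly that:

```yaml
# Template line (before dbx renders the Jinja template):
DATABRICKS_TOKEN: {{ '"{{secrets/keyvault/DatabricksToken}}"' }}

# After Jinja rendering -- Databricks resolves the secret reference at runtime:
DATABRICKS_TOKEN: "{{secrets/keyvault/DatabricksToken}}"
```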
For more details about the deployment of amazon reviews models, refer to our articles:
In the coming weeks, we will publish an article about using dbx and explain the deployment file in detail.