DataOps: an Automation Journey in Tuidi
Introduction
At Tuidi, as in any growing startup, delivering efficient and scalable solutions to our clients is essential to our growth. However, as we onboarded more clients, one major bottleneck became clear: manually setting up Azure resources for each new customer was slow, error-prone, and costly. This manual process not only hindered our ability to scale but also introduced inconsistencies across deployments.
To address this, we adopted an automated setup process, leveraging an Azure DevOps pipeline to deploy all the necessary Azure resources. This shift has significantly improved our efficiency, reducing setup times and the risk of errors, and allowing us to focus on delivering value to our customers.
In this article, we’ll explore how we moved to an automated deployment pipeline, and the tools that made it possible.
Context: Data Architecture
To help you understand and replicate our approach, this section provides a brief overview of our data architecture.
As mentioned in the Introduction, our cloud architecture is hosted on Microsoft Azure. Each client has an isolated resource group, containing the following resources:
Below is a visual representation of our data architecture:
Firstly, a Data Factory pipeline extracts raw data from the client’s source and loads it, in Parquet format, into a “landing” container on the Azure Storage (StorageV2) account. The data in the Parquet files is then loaded into the bronze-layer Delta tables on Databricks.
Secondly, the pipeline executes a series of ETL jobs on Databricks to load the processed data into Delta tables, managed as EXTERNAL TABLES hosted in the silver storage layer.
The data in the silver layer is then further processed and stored in the gold layer.
The final steps, run in parallel, are the machine learning pipelines on Databricks and the pipelines that transfer the processed data into the MySQL database that feeds our web application.
Problem
Setting up the data architecture shown above and making it operational requires a series of steps that, until now, we had carried out manually through the Azure and Databricks GUIs:
Solution
We automated the whole process by creating an Azure DevOps pipeline to carry out all the required steps, from deploying the Azure resources to generating the notebooks needed for the Medallion Architecture on Databricks.
Defining the necessary steps to be automated was the first obstacle we encountered. After a thorough analysis, we identified the following fundamental tasks:
To run everything as a single automation, we chose an Azure DevOps pipeline.
Each phase listed above was translated into a task defined in YAML notation; together, these tasks make up the DevOps pipeline.
Azure DevOps Repository Structure
The source code defining the DevOps pipeline is stored in a git repository.
We divided the repository into two main directories:
ARM-TEMPLATES/
├── client_template/
├── standard_templates/
│   ├── adf/
│   │   ├── dev/
│   │   │   ├── ADF-parameters.json
│   │   │   └── ADF.json
│   │   ├── integration_runtime/
│   │   │   ├── ADF-parameters.json
│   │   │   └── ADF.json
│   │   └── prod/
│   │       ├── ADF-parameters.json
│   │       └── ADF.json
│   ├── keyvault/
│   │   ├── keyvault-parameters
│   │   └── keyvault.json
│   ├── scripts/
│   │   ├── grant_adf_permission.sh
│   │   ├── grant_secret_to_df.sh
│   │   └── link_adf_dev_to_github.ps1
│   └── storage/
│       ├── storage-parameters.json
│       └── storage.json
├── devops_pipeline_deploy.yaml
└── README.md
Azure DevOps pipeline
The YAML file devops_pipeline_deploy.yaml contained in the Git repository defines the steps to be executed by the Azure DevOps pipeline.
The pipeline takes as input the following user-defined variables, which change for each client to be set up.
To start the pipeline, the values of the variables shown above are passed to the YAML file to make each step parametric:
pool:
  vmImage: 'ubuntu-24.04' # Version used to run the script
variables:
  client: $(client_name)
  HOST: $(DATABRICKS_HOST)
  TOKEN: $(DATABRICKS_TOKEN)
  WS_ID: $(WORKSPACE_ID)
  STOR_PROD: $(STORAGE_NAME_PROD)
  STOR_DEV: $(STORAGE_NAME_DEV)
  GITHUB_TKN: $(GITHUB_TOKEN)
  GITHUB_USR: $(GITHUB_USERNAME)
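For completeness, here is a minimal sketch of how such a run could be queued from the command line instead of the Azure DevOps UI. It assumes the Azure DevOps CLI extension, a hypothetical organization URL, project and pipeline name, and that the variables are marked as settable at queue time; it is not part of our pipeline definition.

# Hypothetical example: queueing the setup pipeline with client-specific variables.
# Organization, project, pipeline name and all values below are placeholders.
az extension add --name azure-devops
az pipelines run \
  --org "https://dev.azure.com/<your-org>" \
  --project "<your-project>" \
  --name "client-setup" \
  --variables client_name=acme \
              DATABRICKS_HOST="https://adb-xxxx.azuredatabricks.net" \
              DATABRICKS_TOKEN="***" \
              WORKSPACE_ID="1234567890" \
              STORAGE_NAME_PROD=acmestorageprod \
              STORAGE_NAME_DEV=acmestoragedev \
              GITHUB_TOKEN="***" \
              GITHUB_USERNAME=tuidi-bot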
1. Create project repositories on GitHub
After the setup, the first task of the pipeline is to create the GitHub repository containing the source code for the client ETL processes.
We used an inline script to automate the creation of the necessary repositories, also assigning permissions to the data engineering GitHub team:
- task: AzurePowerShell@5
  displayName: 'Create GitHub Repo for ETL'
  inputs:
    azureSubscription: DataARM
    ScriptType: InlineScript
    Inline: |
      $repoName = "$(client_name)-ETL" # Replace with your repository name
      $orgName = "Tuidi-Company"
      # GitHub PAT stored as an environment variable
      $pat = [Convert]::ToBase64String([Text.Encoding]::ASCII.GetBytes("$(GITHUB_TKN)"))

      # Check if the repository already exists
      $repoCheckUri = "https://api.github.com/repos/$orgName/$repoName"
      $repoExists = $false
      try {
          Invoke-RestMethod -Uri $repoCheckUri -Headers @{ 'Authorization' = 'Basic ' + $pat }
          $repoExists = $true
      } catch {
          # Repository does not exist; continue to create it
      }

      if (-not $repoExists) {
          # JSON body with the repository name and auto initialization of README
          $body = @{
              name = $repoName
              auto_init = $true # Automatically initializes the repository with a README file
              private = $true   # Make the repository private
          }
          # Parameters for the API call to create the repository
          $params = @{
              'Uri' = "https://api.github.com/orgs/$orgName/repos"
              'Headers' = @{ 'Authorization' = 'Basic ' + $pat }
              'Method' = 'Post'
              'ContentType' = 'application/json'
              'Body' = ($body | ConvertTo-Json)
          }
          # Execute the API call to create the repository
          Invoke-RestMethod @params
      } else {
          Write-Host "Repository '$repoName' already exists. Skipping creation."
      }

      # Now give the team authorizations as an admin for the repository
      $teamName = "data-engineering" # GitHub team name
      # Parameters for adding the team with admin permissions
      $teamParams = @{
          'Uri' = "https://api.github.com/orgs/$orgName/teams/$teamName/repos/$orgName/$repoName"
          'Headers' = @{ 'Authorization' = 'Basic ' + $pat }
          'Method' = 'PUT'
          'ContentType' = 'application/json'
          'Body' = (@{ permission = 'admin' } | ConvertTo-Json)
      }
      # Add the team as an admin
      Invoke-RestMethod @teamParams
  azurePowerShellVersion: LatestVersion
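For readers who prefer plain bash over PowerShell, roughly the same calls can be made with curl against the GitHub REST API. This is a hedged sketch, not the task we run; the repository name and token variable are placeholders.

# Hypothetical bash equivalent of the PowerShell step above (placeholders throughout).
ORG="Tuidi-Company"
REPO="acme-ETL"        # <client_name>-ETL
TOKEN="$GITHUB_TKN"    # GitHub PAT exposed to the step as an environment variable

# Create the private repository with an auto-initialized README
curl -s -X POST "https://api.github.com/orgs/$ORG/repos" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Accept: application/vnd.github+json" \
  -d "{\"name\": \"$REPO\", \"private\": true, \"auto_init\": true}"

# Grant the data-engineering team admin permissions on the repository
curl -s -X PUT "https://api.github.com/orgs/$ORG/teams/data-engineering/repos/$ORG/$REPO" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Accept: application/vnd.github+json" \
  -d '{"permission": "admin"}'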
2. Create Azure Storage Accounts
The second task creates the Azure Storage Accounts for both the Development and Production environments.
We previously defined a standard ARM template for creating the Storage Accounts and added it to the Git repository.
The task uses this template: it takes the “client_name” variable defined during setup as input and creates the storage account following the ARM template and Tuidi’s internal naming convention.
- task: AzureResourceManagerTemplateDeployment@3
  displayName: 'set up storage prod'
  inputs:
    azureResourceManagerConnection: DataARM
    subscriptionId: '###########' # Replace with your subscription_id
    resourceGroupName: '$(client_name)'
    location: 'West Europe'
    csmFile: '$(System.DefaultWorkingDirectory)/standard_templates/storage/storage.json'
    csmParametersFile: '$(System.DefaultWorkingDirectory)/standard_templates/storage/storage-parameters.json'
    overrideParameters: '-storageAccount_name "$(client_name)"'
  enabled: true
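The task is essentially a YAML wrapper around an ARM deployment. If you want to test the template outside the pipeline, a roughly equivalent (hypothetical) Azure CLI call would look like this, with the resource group and parameter values as placeholders:

# Hedged sketch: deploying the same storage ARM template from a local shell.
az deployment group create \
  --resource-group "acme" \
  --template-file standard_templates/storage/storage.json \
  --parameters @standard_templates/storage/storage-parameters.json \
  --parameters storageAccount_name="acme"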
3. Create Azure Key Vault
The third task of the pipeline creates the Azure Key Vault resource.
We previously defined a standard ARM template for creating the Key Vault. The task takes the Key Vault name and the client name as input parameters and creates the secrets according to the standard naming convention.
- task: AzureResourceManagerTemplateDeployment@3
  displayName: 'set up key vault'
  inputs:
    azureResourceManagerConnection: DataARM
    subscriptionId: '###########' # Replace with your subscription_id
    resourceGroupName: '$(client_name)'
    location: 'West Europe'
    csmFile: '$(System.DefaultWorkingDirectory)/standard_templates/keyvault/keyvault.json'
    csmParametersFile: '$(System.DefaultWorkingDirectory)/standard_templates/keyvault/keyvault-parameters'
    overrideParameters: '-keyvaultname "$(client_name)" -client_name "$(client_name)"'
  enabled: true
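Once the Key Vault exists, client-specific secrets can be added or rotated with the Azure CLI. The vault and secret names below are illustrative and simply follow the <client>-KeyVault convention used later in the scripts:

# Hypothetical example: adding a client secret to the newly created Key Vault.
az keyvault secret set \
  --vault-name "acme-KeyVault" \
  --name "acme-source-db-password" \
  --value "<secret-value>"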
4. Create Azure Data Factories
The fourth task creates the three Azure Data Factory instances required for each new client: Client-ADF (which hosts the shared Integration Runtime), Client-ADF-Dev, and Client-ADF-Prod.
This three-data-factory architecture follows Microsoft’s best practices for CI/CD: to automate the deployment from ADF-Dev to ADF-Prod, a third data factory is needed to host the Integration Runtime linked to the client’s environment.
Having defined a standard ARM template for the Data Factory, the pipeline creates the three instances, taking the client name as input.
Below is an example for creating the first ADF instance: Client-ADF.
- task: AzureResourceManagerTemplateDeployment@3
  displayName: 'set up Azure Data Factory (integration runtime)'
  inputs:
    azureResourceManagerConnection: DataARM
    subscriptionId: '###########' # Replace with your subscription_id
    resourceGroupName: '$(client_name)'
    location: 'West Europe'
    csmFile: '$(System.DefaultWorkingDirectory)/standard_templates/adf/integration_runtime/ADF.json'
    csmParametersFile: '$(System.DefaultWorkingDirectory)/standard_templates/adf/integration_runtime/ADF-parameters.json'
    overrideParameters: '-client_name "$(client_name)"'
  enabled: true
5. Grant Key Vault Permissions
After creating the ADF and Key Vault resources, the fifth task grants the ADF instances access to the secrets stored in the Key Vault. This task also takes the client name as input.
- task: AzureCLI@2
  displayName: 'secret permissions to adf'
  inputs:
    azureSubscription: ADFGlobalSP
    scriptType: bash
    scriptPath: '$(System.DefaultWorkingDirectory)/standard_templates/scripts/grant_secret_to_df.sh'
    addSpnToEnvironment: true
    useGlobalConfig: true
  enabled: true
  env:
    ResourceGroupName: $(client_name)
To do so, we must assign Get and List secret permissions to the ADF applications in the Key Vault access policies. This is done by executing the following bash script.
tenantId="your_tenant_id"
servicePrincipalId="your_service_principal_id"
servicePrincipalKey="your_service_principal_key"
ResourceGroupName="${ResourceGroupName}" # The ResourceGroupName is passed in as an environment variable

# Retrieve the principal IDs of the three Data Factories
principalId=$(az ad sp list --display-name "${ResourceGroupName}-ADF" --query "[0].id" -o tsv)
principalIdDev=$(az ad sp list --display-name "${ResourceGroupName}-ADF-Dev" --query "[0].id" -o tsv)
principalIdProd=$(az ad sp list --display-name "${ResourceGroupName}-ADF-Prod" --query "[0].id" -o tsv)

# Grant Get and List secret permissions to each ADF
az keyvault set-policy --name "${ResourceGroupName}-KeyVault" --object-id "$principalId" --secret-permissions get list
az keyvault set-policy --name "${ResourceGroupName}-KeyVault" --object-id "$principalIdDev" --secret-permissions get list
az keyvault set-policy --name "${ResourceGroupName}-KeyVault" --object-id "$principalIdProd" --secret-permissions get list
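To double-check that the policies were applied, the access policies of the vault can be inspected with the Azure CLI (a quick hedged sanity check, not part of the script):

# List the access policies of the client's Key Vault (object IDs and secret permissions).
az keyvault show --name "${ResourceGroupName}-KeyVault" \
  --query "properties.accessPolicies[].{objectId: objectId, secretPermissions: permissions.secrets}" \
  -o json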
6. Grant Integration Runtime permission to ADF-Prod/Dev
We must grant the Production and Development Data Factories access to the Client-ADF Integration Runtime, so that the Integration Runtime can be shared between them.
To do so, we execute the following bash script.
# Define variables
TenantId="##############" # Replace with your Tenant Id
servicePrincipalId="###########" # Replace with your servicePrincipalId
servicePrincipalKey="###########" # Replace with your servicePrincipalKey
ResourceGroupName="${ResourceGroupName}"
DataFactoryName1="${ResourceGroupName}-ADF-Prod"
DataFactoryName2="${ResourceGroupName}-ADF-Dev"
echo "$DataFactoryName1"
echo "$DataFactoryName2"

# Add the datafactory extension to the Azure CLI
az extension add --name datafactory

# Fetch the MSI principal IDs of the two Data Factories
principalId1=$(az ad sp list --display-name "$DataFactoryName1" --query "[0].id" -o tsv)
principalId2=$(az ad sp list --display-name "$DataFactoryName2" --query "[0].id" -o tsv)

# Fetch the Integration Runtime resource ID
integrationRuntime=$(az datafactory integration-runtime show --resource-group "$ResourceGroupName" --factory-name "${ResourceGroupName}-ADF" --name "${ResourceGroupName}IntegrationRuntime" --query "id" -o tsv)

# Assign the Contributor role to each Data Factory's Managed Identity on the Integration Runtime
echo "Assigning Contributor role to $DataFactoryName1 (principalId: $principalId1)"
az role assignment create --assignee "$principalId1" --role "Contributor" --scope "$integrationRuntime"
echo "Assigning Contributor role to $DataFactoryName2 (principalId: $principalId2)"
az role assignment create --assignee "$principalId2" --role "Contributor" --scope "$integrationRuntime"
The script shown above is called by the following task.
- task: AzureCLI@2
  displayName: 'grant IR permission on ADF Prod&Dev'
  inputs:
    azureSubscription: ADFGlobalSP
    scriptType: bash
    scriptPath: '$(System.DefaultWorkingDirectory)/standard_templates/scripts/grant_adf_permission.sh'
    addSpnToEnvironment: true
    useGlobalConfig: true
  enabled: true
  env:
    ResourceGroupName: $(client_name)
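After the task has run, the role assignments on the shared Integration Runtime can be verified with a hedged one-liner, assuming the same variables as in the script above:

# List who holds which role on the shared Integration Runtime resource.
az role assignment list --scope "$integrationRuntime" \
  --query "[].{principal: principalId, role: roleDefinitionName}" -o table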
7. Set up Databricks automation YAML
After setting up the resources in the Azure environment, we move on to creating the Databricks objects in our shared Databricks workspaces.
To do so, we created a dedicated YAML file to be run through a DevOps pipeline. The following snippet shows the variables and parameters used by its tasks.
pool:
  vmImage: 'ubuntu-24.04' # Use a Linux VM image
variables:
  client: $(client_name)
  HOST_DEV: $(DATABRICKS_HOST_DEV)
  HOST_PROD: $(DATABRICKS_HOST_PROD)
  TOKEN_DEV: $(DATABRICKS_TOKEN_DEV)
  TOKEN_PROD: $(DATABRICKS_TOKEN_PROD)
  WS_ID_DEV: $(WORKSPACE_ID_DEV)
  WS_ID_PROD: $(WORKSPACE_ID_PROD)
  STOR_DEV: $(STORAGE_NAME_DEV)   # Storage name for dev
  STOR_PROD: $(STORAGE_NAME_PROD) # Storage name for prod
parameters:
  - name: environments
    displayName: Environments
    type: object
    default:
      - dev
      - prod
  - name: containers
    displayName: Containers
    type: object
    default:
      - landing
      - externallocationuc
      - bronze
      - silver
      - gold
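The Databricks CLI installed in the first step authenticates through the DATABRICKS_HOST and DATABRICKS_TOKEN environment variables that each script exports. A minimal, hypothetical smoke test of that authentication looks like this:

# Hedged sketch: verify that the CLI can reach the target workspace before the loop runs.
export DATABRICKS_HOST="https://adb-xxxx.azuredatabricks.net"   # placeholder workspace URL
export DATABRICKS_TOKEN="***"                                   # placeholder personal access token
databricks current-user me    # returns the identity the token resolves to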
8. Create ETL folder on Databricks Workspace (DEV & PROD)
First, we iterate over the two environments, DEV and PROD, to execute the tasks in both workspaces. Within the loop, we run several Databricks CLI commands to create the client folder in the workspace, register an external location for each storage container, create the client catalog, and create the bronze schema.
jobs:
  - job: Setup
    displayName: 'Setup Databricks Environment'
    pool:
      vmImage: 'ubuntu-24.04'
    steps:
      # Step 0: Install Databricks CLI
      - script: |
          curl -fsSL https://raw.githubusercontent.com/databricks/setup-cli/main/install.sh | sh
        displayName: 'Install Databricks CLI'

      # Steps 2-6: Loop over environments
      - ${{ each env in parameters.environments }}:
          - script: |
              echo "Running for environment: ${{ env }}"
              # Set environment-specific variables
              if [ "${{ env }}" == "dev" ]; then
                export DATABRICKS_HOST=$(HOST_DEV)
                export DATABRICKS_TOKEN=$(TOKEN_DEV)
                export WS_ID=$(WS_ID_DEV)
                export STORAGE_NAME=$(STOR_DEV)   # Use dev storage name
              else
                export DATABRICKS_HOST=$(HOST_PROD)
                export DATABRICKS_TOKEN=$(TOKEN_PROD)
                export WS_ID=$(WS_ID_PROD)
                export STORAGE_NAME=$(STOR_PROD)  # Use prod storage name
              fi
              echo "Databricks host: $DATABRICKS_HOST"
              echo "Workspace ID: $WS_ID"
              echo "Storage Account: $STORAGE_NAME"

              # Create a new folder in the Databricks workspace for each environment
              echo "Creating new folder in Databricks workspace for client: $(client) in ${{ env }}"
              databricks workspace mkdirs /Workspace/$(client)
            displayName: 'Create Workspace Folder for ${{ env }}'

          # Loop through containers for each environment
          - ${{ each container in parameters.containers }}:
              - script: |
                  echo "Creating external location for environment: ${{ env }}"
                  # Set location name based on environment
                  if [ "${{ env }}" == "dev" ]; then
                    export DATABRICKS_HOST=$(HOST_DEV)
                    export DATABRICKS_TOKEN=$(TOKEN_DEV)
                    export WS_ID=$(WS_ID_DEV)
                    export STORAGE_NAME=$(STOR_DEV)
                    location_name="${{ container }}_dev"
                  else
                    export DATABRICKS_HOST=$(HOST_PROD)
                    export DATABRICKS_TOKEN=$(TOKEN_PROD)
                    export WS_ID=$(WS_ID_PROD)
                    export STORAGE_NAME=$(STOR_PROD)
                    location_name="${{ container }}"
                  fi
                  echo "Creating external location: $location_name"

                  # Create external location using the container name in the URL and the environment-specific storage account
                  echo "databricks external-locations create $(client)_${location_name} abfss://${{ container }}@$STORAGE_NAME.dfs.core.windows.net/ tuidistoragecredential"
                  databricks external-locations create $(client)_${location_name} abfss://${{ container }}@$STORAGE_NAME.dfs.core.windows.net/ tuidistoragecredential

                  # Set isolation mode to ISOLATED
                  databricks external-locations update $(client)_${location_name} --isolation-mode ISOLATION_MODE_ISOLATED

                  # Grant access to the current workspace
                  databricks workspace-bindings update-bindings external-location $(client)_${location_name} --json "{\"add\": [{\"workspace_id\": $WS_ID, \"binding_type\": \"BINDING_TYPE_READ_WRITE\"}]}"

                  # Set the dataeng service principal as owner
                  databricks external-locations update $(client)_${location_name} --owner "dataeng"
                displayName: 'Create and Configure External Location for ${{ container }} in ${{ env }}'

      # Step 7: Create a catalog for the environment (only once per environment)
      - ${{ each env in parameters.environments }}:
          - script: |
              echo "Creating catalog for environment: ${{ env }}"
              # Set environment-specific storage root for the catalog
              if [ "${{ env }}" == "dev" ]; then
                export DATABRICKS_HOST=$(HOST_DEV)
                export DATABRICKS_TOKEN=$(TOKEN_DEV)
                export WS_ID=$(WS_ID_DEV)
                export STORAGE_NAME=$(STOR_DEV)
                storage_root="abfss://externallocationuc@$(STOR_DEV).dfs.core.windows.net/"
              else
                export DATABRICKS_HOST=$(HOST_PROD)
                export DATABRICKS_TOKEN=$(TOKEN_PROD)
                export WS_ID=$(WS_ID_PROD)
                export STORAGE_NAME=$(STOR_PROD)
                storage_root="abfss://externallocationuc@$(STOR_PROD).dfs.core.windows.net/"
              fi

              # Create the catalog for the client in this environment
              databricks catalogs create $(client) --storage-root $storage_root
              echo "Catalog $(client) created with storage root: $storage_root"

              # Update the catalog owner to the dataeng service principal
              databricks catalogs update $(client) --owner "dataeng"
              echo "Catalog owner for $(client) set to dataeng in ${{ env }}"

              # Create the bronze schema if it does not already exist
              schema_exists=$(databricks schemas list --catalog-name $(client) --output json | jq -r ".[] | select(.name == \"bronze\") | .name")
              if [ -z "$schema_exists" ]; then
                databricks schemas create bronze $(client) --comment "Schema bronze for $(client)"
                echo "Schema 'bronze' created successfully."
              else
                echo "Schema 'bronze' already exists."
              fi
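When the pipeline finishes, a few hedged checks with the Databricks CLI can confirm that the Unity Catalog objects exist and are bound to the right workspace. The names below are placeholders that follow the $(client) and $(client)_<container> conventions used above:

# Sanity checks after the run (names are placeholders).
databricks catalogs get acme                          # the client catalog with its storage root
databricks schemas list --catalog-name acme           # should include the 'bronze' schema
databricks external-locations get acme_bronze         # one of the external locations created in the loop
databricks workspace-bindings get-bindings external-location acme_bronze   # workspace binding check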
Conclusion
This article illustrated how we automated the deployment of resources in the Azure and Databricks environments using an Azure DevOps pipeline.
We would love to hear your feedback on our approach. Is there anything we could add to the pipeline or anything we could do differently?