DataOps: an Automation Journey in Tuidi

Lorenzo Carta, Sabrina Sforza, Francesco Ranieri, Federico Nardi

Introduction

At Tuidi, just like at any other successful startup, delivering efficient and scalable solutions to clients is essential to our growth. However, as we onboarded more clients, one major bottleneck became clear: manually setting up Azure resources for each new customer was slow, error-prone, and costly. This manual process not only hindered our ability to scale but could also introduce inconsistencies across deployments.

To address this, we adopted an automated setup process, leveraging an Azure DevOps pipeline to deploy all the necessary Azure resources. This shift has significantly improved our efficiency, reducing setup times and the risk of errors, and allowing us to focus on delivering value to our customers.

In this article, we’ll explore how we moved to an automated deployment pipeline, and the tools that made it possible.

Context — Data Architecture

To make our approach easier to understand and replicate, this section provides a brief overview of our data architecture.

As mentioned in the Introduction, our cloud architecture is hosted on Microsoft Azure. Each client has an isolated resource group, containing the following resources:

  • Two Azure Storage accounts (StorageV2, general purpose v2), one for the Development environment and one for Production
  • One Azure Key Vault storing the client’s Storage, Databricks and Database secrets and access keys.
  • Three Azure Data Factories (ADF)

  1. Client-ADF hosting the Integration Runtime used to connect to the client’s data source
  2. Client-ADF-Dev for Develop pipelines
  3. Client-ADF-Prod for Production pipelines

  • Two Databricks workspaces with corresponding Unity Catalogs

  1. Client_dev hosting the Databricks Development Workspace and Data Architecture
  2. Client_prod hosting the Databricks Production Workspace and Data Architecture

Below is a visual representation of our data architecture:

Tuidi Data Architecture

First, a Data Factory pipeline extracts raw data from the client’s source and loads it, in parquet format, into a “landing” container on the Azure StorageV2 account. The data in the parquet files is then loaded into our bronze layer Delta tables on Databricks.

Next, the pipeline executes a series of ETL jobs on Databricks to load processed data into Delta tables, managed as EXTERNAL TABLES hosted in the silver storage layer.

The silver-layer data is then further processed and stored in the gold layer.

The final steps, which run in parallel, are the Machine Learning pipelines on Databricks and the pipelines that transfer processed data into a MySQL database, which feeds our web application.
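
To make the bronze and silver steps more concrete, below is a minimal, hedged sketch of the kind of SQL involved, submitted here through the Databricks CLI for illustration only. All names (warehouse ID, “acme” catalog, table, storage account) are placeholders, not real resources, and our actual ETL runs as notebooks orchestrated by Data Factory.

# Hedged sketch: load landing parquet files into a bronze Delta table via the
# SQL Statement Execution API. The empty target table is assumed to exist already.
cat > payload.json <<'EOF'
{
  "warehouse_id": "<warehouse-id>",
  "statement": "COPY INTO acme.bronze.sales FROM 'abfss://landing@acmestoragedev.dfs.core.windows.net/sales/' FILEFORMAT = PARQUET"
}
EOF
databricks api post /api/2.0/sql/statements --json "$(cat payload.json)"

# Silver tables are declared as external tables over the silver container, e.g.:
# CREATE TABLE acme.silver.sales (...) LOCATION 'abfss://silver@acmestoragedev.dfs.core.windows.net/sales'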

Problem

Setting up the data architecture shown above and making it operational requires a series of steps that, until recently, were carried out manually through the Azure and Databricks GUIs (a rough Azure CLI equivalent is sketched just after the list):

  • Create Git Repositories
  • Create Resources
  • Set up Azure Data Factory instances
  • Grant permissions to our service principals for the Data Factory, Storage, and Key Vault resources
  • Set up Catalogs on Databricks (bronze, silver, gold)
  • Create ETL notebooks for the silver and gold layers
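
To give an idea of what the manual resource-creation steps amount to, here is a rough Azure CLI sketch. “acme” is a hypothetical client name and the commands are illustrative, not our pipeline code; as described below, the pipeline uses ARM templates instead, so the configuration stays versioned and repeatable.

# Hedged sketch of the manual resource-creation steps as Azure CLI commands.
az group create --name acme --location westeurope
az storage account create --name acmestoragedev --resource-group acme --kind StorageV2 --sku Standard_LRS
az storage account create --name acmestorageprod --resource-group acme --kind StorageV2 --sku Standard_LRS
az keyvault create --name acme-KeyVault --resource-group acme --location westeurope
az extension add --name datafactory
az datafactory create --resource-group acme --factory-name acme-ADF --location westeurope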

Solution

We automated the whole process by creating an Azure DevOps pipeline to carry out all the required steps, from deploying the Azure resources to generating the notebooks needed for the Medallion Architecture on Databricks.

Defining the necessary steps to be automated was the first obstacle we encountered. After a thorough analysis, we identified the following fundamental tasks:

  1. Create repositories on GitHub
  2. Create storage accounts and containers
  3. Create Key Vault
  4. Grant Key Vault Permissions
  5. Create Data Factories
  6. Grant Integration Runtime permission to ADF-Prod and ADF-Dev
  7. Databricks automatic setup

To keep everything in a single automation workflow, we implemented it as one Azure DevOps pipeline.

Each phase listed above was turned into a task defined in YAML; together, these tasks make up the DevOps pipeline.

Azure DevOps Repository Structure

The source code defining the DevOps pipeline is stored in a git repository.

We divided the repository into two main directories:

  • client_template: contains customized Azure Resource Manager (ARM) templates used to define resources for client-specific requirements.
  • standard_templates: contains ARM templates defining the standard resources applicable to all clients.

ARM-TEMPLATES/
├── client_template/
├── standard_templates/
│   ├── adf/
│   │   ├── dev/
│   │   │   ├── ADF-parameters.json
│   │   │   └── ADF.json
│   │   ├── integration_runtime/
│   │   │   ├── ADF-parameters.json
│   │   │   └── ADF.json
│   │   └── prod/
│   │       ├── ADF-parameters.json
│   │       └── ADF.json
│   ├── keyvault/
│   │   ├── keyvault-parameters
│   │   └── keyvault.json
│   ├── scripts/
│   │   ├── grant_adf_permission.sh
│   │   ├── grant_secret_to_df.sh
│   │   └── link_adf_dev_to_github.ps1
│   └── storage/
│       ├── storage-parameters.json
│       └── storage.json
├── devops_pipeline_deploy.yaml
└── README.md

Azure DevOps pipeline

The YAML file devops_pipeline_deploy.yaml contained in the git repository defines the steps to be executed by the Azure DevOps pipeline.

The pipeline takes as input the following user-defined variables, which change for each client to be set up:

DevOps pipeline variables


To start the pipeline, the values of the variables shown above are passed to the YAML file to make each step parametric:

pool: 
  vmImage: 'ubuntu-24.04'  # VM image used to run the pipeline tasks
variables: 
  client: $(client_name) 
  HOST: $(DATABRICKS_HOST) 
  TOKEN: $(DATABRICKS_TOKEN) 
  WS_ID: $(WORKSPACE_ID) 
  STOR_PROD: $(STORAGE_NAME_PROD) 
  STOR_DEV: $(STORAGE_NAME_DEV) 
  GITHUB_TKN: $(GITHUB_TOKEN) 
  GITHUB_USR: $(GITHUB_USERNAME)         

1. Create project repositories on GitHub

After the setup, the first task of the pipeline is to create the GitHub repository containing the source code for the client ETL processes.

We used an inline script to automate the creation of the necessary repositories, also assigning permissions to the data engineering GitHub team:

- task: AzurePowerShell@5 

  displayName: 'Create GitHub Repo for ETL' 

  inputs: 
    azureSubscription: DataARM 
    ScriptType: InlineScript 

    Inline: | 
     $repoName = "$(client_name)-ETL" # Replace with your repository name 
     $orgName = "Tuidi-Company" 
     # GitHub PAT stored as an environment variable 
     $pat = [Convert]::ToBase64String([Text.Encoding]::ASCII.GetBytes("$(GITHUB_TKN)")) 
     # Check if the repository already exists 
     $repoCheckUri = "https://api.github.com/repos/$orgName/$repoName" 
     $repoExists = $false 
     try { 
         Invoke-RestMethod -Uri $repoCheckUri -Headers @{ 'Authorization' = 'Basic ' + $pat } 
         $repoExists = $true 
     } catch { 
         # Repository does not exist; continue to create it 
     } 
     if (-not $repoExists) { 
         # JSON body with the repository name and auto initialization of README 
         $body = @{  
             name = $repoName 
             auto_init = $true # Automatically initializes the repository with a README file 
             private = $true    # Make the repository private 
         } 
         # Parameters for the API call to create the repository 
         $params = @{ 
             'Uri'         = "https://api.github.com/orgs/$orgName/repos" 
             'Headers'     = @{ 'Authorization' = 'Basic ' + $pat } 
             'Method'      = 'Post' 
             'ContentType' = 'application/json' 
             'Body'        = ($body | ConvertTo-Json) 
         } 
         # Execute the API call to create the repository 
         Invoke-RestMethod @params 
     } else { 
         Write-Host "Repository '$repoName' already exists. Skipping creation." 
     } 
     # Now give the team authorizations as an admin for the repository 
     $teamName= "data-engineering" # GitHub team name 
     # Parameters for adding the team with admin permissions 
     $teamParams = @{ 
         'Uri'         = "https://api.github.com/orgs/$orgName/teams/$teamName/repos/$orgName/$repoName" 
         'Headers'     = @{ 'Authorization' = 'Basic ' + $pat } 
         'Method'      = 'PUT' 
         'ContentType' = 'application/json' 
         'Body'        = (@{ permission = 'admin' } | ConvertTo-Json) 
     } 
     # Add the team as an admin 
     Invoke-RestMethod @teamParams 
    azurePowerShellVersion: LatestVersion         

2. Create Azure Storage Accounts

The second task creates the Azure Storage Accounts for the Development and Production environments.

We previously defined a standard ARM template for creating the Storage Accounts and added it to the git repository.

The template is then used by the task, which takes the “client_name” variable defined during setup as input and creates the storage accounts following the ARM template and Tuidi’s internal naming convention.

- task: AzureResourceManagerTemplateDeployment@3 
  displayName: 'set up storage prod' 
  inputs: 
    azureResourceManagerConnection: DataARM 
    subscriptionId: '###########' # Replace with your subscription_id 
    resourceGroupName: '$(client_name)' 
    location: 'West Europe' 
    csmFile: '$(System.DefaultWorkingDirectory)/standard_templates/storage/storage.json' 
    csmParametersFile: '$(System.DefaultWorkingDirectory)/standard_templates/storage/storage-parameters.json' 
    overrideParameters: '-storageAccount_name "$(client_name)"' 
  enabled: true         
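
For reference, roughly the same deployment can be reproduced locally with the Azure CLI. This is a hedged sketch: “acme” stands in for the client name, and the override parameter name comes from our storage template.

# Hedged sketch: deploy the storage ARM template outside the pipeline.
az deployment group create \
  --resource-group acme \
  --template-file standard_templates/storage/storage.json \
  --parameters @standard_templates/storage/storage-parameters.json \
  --parameters storageAccount_name=acme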

3. Create Azure Key Vault

The third task of the pipeline creates the Azure Key Vault resource.

We previously defined a standard ARM template for creating the Key Vault. The task takes the Key Vault name and the client name as input parameters and creates the secrets according to the standard naming convention.

- task: AzureResourceManagerTemplateDeployment@3 
  displayName: 'set up key vault' 
  inputs: 
    azureResourceManagerConnection: DataARM 
    subscriptionId: '###########' # Replace with your subscription_id 
    resourceGroupName: '$(client_name)' 
    location: 'West Europe' 
    csmFile: '$(System.DefaultWorkingDirectory)/standard_templates/keyvault/keyvault.json' 
    csmParametersFile: '$(System.DefaultWorkingDirectory)/standard_templates/keyvault/keyvault-parameters' 
    overrideParameters: '-keyvaultname "$(client_name)" -client_name "$(client_name)"' 
  enabled: true         

4. Create Azure Data Factories

The fourth task creates the three Azure Data Factory instances required for each new client:

  1. Client-ADF: contains the Self-Hosted Integration Runtime linked to the customer’s data environment, ensuring secure data exchange between Tuidi and the customer
  2. Client-ADF-Dev: orchestrates data pipelines in the development environment. It shares the Integration Runtime hosted in Client-ADF.
  3. Client-ADF-Prod: orchestrates data pipelines in the production environment. It shares the Integration Runtime hosted in Client-ADF.

This architecture with three data factories follows Microsoft’s best practices for CI/CD. In order to automate the deployment from ADF-Dev to ADF-Prod, a third data factory is necessary to contain the Integration Runtime linked to the client’s environment.

After defining a standard ARM template for the Data Factory, the pipeline creates the three instances taking the client name as input.

Below is an example for creating the first ADF instance: Client-ADF.

- task: AzureResourceManagerTemplateDeployment@3 
  displayName: 'set up Azure Data Factory (integration runtime)' 
  inputs: 
    azureResourceManagerConnection: DataARM 
    subscriptionId: '###########' # Replace with your subscription_id 
    resourceGroupName: '$(client_name)' 
    location: 'West Europe' 
    csmFile: '$(System.DefaultWorkingDirectory)/standard_templates/adf/integration_runtime/ADF.json' 
    csmParametersFile: '$(System.DefaultWorkingDirectory)/standard_templates/adf/integration_runtime/ADF-parameters.json' 
    overrideParameters: '-client_name "$(client_name)"' 
  enabled: true         

5. Grant Key Vault Permissions

After creating the ADF and Key Vault resources, the fifth task grants the ADF instances access to the secrets stored in the Key Vault. This task also takes the client name as input.

- task: AzureCLI@2 
  displayName: 'secret permissions to adf' 
  inputs: 
    azureSubscription: ADFGlobalSP 
    scriptType: bash 
    scriptPath: '$(System.DefaultWorkingDirectory)/standard_templates/scripts/grant_secret_to_df.sh' 
    addSpnToEnvironment: true 
    useGlobalConfig: true 
  enabled: true 
  env: 
    ResourceGroupName: $(client_name)         

To do so, we must assign Get and List secret permissions to the ADF applications in the Key Vault access policies. This is done by executing the following bash script.

tenantId="your_tenant_id" 
servicePrincipalId="your_service_principal_id" 
servicePrincipalKey="your_service_principal_key" 
ResourceGroupName="${ResourceGroupName}"  # This expects the ResourceGroupName to be passed as an env variable 

# Retrieve Principal ID 
principalId=$(az ad sp list --display-name "${ResourceGroupName}-ADF" --query "[0].id" -o tsv) 
principalIdDev=$(az ad sp list --display-name "${ResourceGroupName}-ADF-Dev" --query "[0].id" -o tsv) 

principalIdProd=$(az ad sp list --display-name "${ResourceGroupName}-ADF-Prod" --query "[0].id" -o tsv) 
# Grant permission to ADF 
az keyvault set-policy --name "${ResourceGroupName}-KeyVault" --object-id "$principalId" --secret-permissions get list 
az keyvault set-policy --name "${ResourceGroupName}-KeyVault" --object-id "$principalIdDev" --secret-permissions get list 

az keyvault set-policy --name "${ResourceGroupName}-KeyVault" --object-id "$principalIdProd" --secret-permissions get list         

6. Grant Integration Runtime permission to ADF-Prod/Dev

To share Client-ADF’s Integration Runtime with the Production and Development Data Factories, we must grant those Data Factories access to it.

This is done by executing the following bash script.

# Define variables 
TenantId="##############" # Replace with your Tenant Id
servicePrincipalId="###########" # Replace with your servicePrincipalId
servicePrincipalKey="###########" # Replace with your servicePrincipalKey
ResourceGroupName="${ResourceGroupName}" 
DataFactoryName1="${ResourceGroupName}-ADF-Prod" 
DataFactoryName2="${ResourceGroupName}-ADF-Dev" 
echo "$DataFactoryName1" 
echo "$DataFactoryName2" 
# add datafactory extension 
az extension add --name datafactory 
# Fetch the MSI Principal ID of the Data Factory 
principalId1=$(az ad sp list --display-name "$DataFactoryName1" --query "[0].id" -o tsv) 
principalId2=$(az ad sp list --display-name "$DataFactoryName2" --query "[0].id" -o tsv) 
# Fetch the Integration Runtime ID 
integrationRuntime=$(az datafactory integration-runtime show --resource-group "$ResourceGroupName" --factory-name "${ResourceGroupName}-ADF" --name "${ResourceGroupName}IntegrationRuntime" --query "id" -o tsv) 
# Assign the Contributor role to the Data Factory's Managed Identity for the Integration Runtime 
echo "Assigning Contributor role to $DataFactoryName1 (principalId: $principalId1)" 
az role assignment create --assignee "$principalId1" --role "Contributor" --scope "$integrationRuntime" 
echo "Assigning Contributor role to $DataFactoryName2 (principalId: $principalId2)" 
az role assignment create --assignee "$principalId2" --role "Contributor" --scope "$integrationRuntime"         

The script shown above is called by the following task.

- task: AzureCLI@2 
  displayName: 'grant IR permission on ADF Prod&Dev' 
  inputs: 
    azureSubscription: ADFGlobalSP 
    scriptType: bash 
    scriptPath: '$(System.DefaultWorkingDirectory)/standard_templates/scripts/grant_adf_permission.sh' 
    addSpnToEnvironment: true 
    useGlobalConfig: true 
  enabled: true 
  env: 
    ResourceGroupName: $(client_name)         

7. Set up Databricks automation YAML

After setting up the resources in the Azure environment, we move on to creating the Databricks objects in our shared Databricks workspaces.

To do so, we created a YAML file to be run through a DevOps pipeline. The following snippet shows the variables and parameters used by its tasks.

pool:
  vmImage: 'ubuntu-24.04'  # Use a Linux VM image

variables:
  client: $(client_name)  
  HOST_DEV: $(DATABRICKS_HOST_DEV) 
  HOST_PROD: $(DATABRICKS_HOST_PROD) 
  TOKEN_DEV: $(DATABRICKS_TOKEN_DEV)  
  TOKEN_PROD: $(DATABRICKS_TOKEN_PROD)  
  WS_ID_DEV: $(WORKSPACE_ID_DEV)  
  WS_ID_PROD: $(WORKSPACE_ID_PROD)  
  STOR_DEV: $(STORAGE_NAME_DEV)  # Storage name for dev
  STOR_PROD: $(STORAGE_NAME_PROD)  # Storage name for prod


parameters:
- name: environments
  displayName: Environments
  type: object
  default: 
   - dev
   - prod
- name: containers
  displayName: Containers
  type: object
  default: 
   - landing
   - externallocationuc
   - bronze
   - silver
   - gold        

8. Create ETL folder on Databricks Workspace (DEV & PROD)

First, we iterate over the two environments, DEV and PROD, to execute the tasks in both workspaces. Within the loop, we run several Databricks CLI commands to:

  1. Create a client folder in the Workspace
  2. Loop over the containers parameter to create an external location for each container in the storage account
  3. Assign owner permissions to our team’s Databricks Service Principal
  4. Bind each external location to the corresponding workspace so that access is restricted by environment (Dev or Prod)
  5. Create client catalog
  6. Create medallion architecture schemas within the catalog

jobs: 

  - job: Setup
    displayName: 'Setup Databricks Environment'
    pool:
      vmImage: 'ubuntu-24.04' 

    steps:
      # Step 0: Install Databricks CLI
      - script: |
          curl -fsSL https://raw.githubusercontent.com/databricks/setup-cli/main/install.sh | sh
        displayName: 'Install Databricks CLI'

      # Step 2-6: Loop over environments
      - ${{ each env in parameters.environments }}:
          - script: |
              echo "Running for environment: ${{ env }}"

              # Set environment-specific variables
              if [ "${{ env }}" == "dev" ]; then
                export DATABRICKS_HOST=$(HOST_DEV)
                export DATABRICKS_TOKEN=$(TOKEN_DEV)
                export WS_ID=$(WS_ID_DEV)
                export STORAGE_NAME=$(STOR_DEV)  # Use dev storage name
              else
                export DATABRICKS_HOST=$(HOST_PROD)
                export DATABRICKS_TOKEN=$(TOKEN_PROD)
                export WS_ID=$(WS_ID_PROD)
                export STORAGE_NAME=$(STOR_PROD)  # Use prod storage name
              fi

              echo "Databricks host: $DATABRICKS_HOST"
              echo "Workspace ID: $WS_ID"
              echo "Storage Account: $STORAGE_NAME"
              
              # Create a new folder in the Databricks workspace for each environment
              echo "Creating new folder in Databricks workspace for client: $(client) in ${{ env }}"
              databricks workspace mkdirs /Workspace/$(client)
            displayName: 'Create Workspace Folder for ${{ env }}'

          # Loop through containers for each environment
          - ${{ each container in parameters.containers }}:
              - script: |
                  echo "Creating external location for environment: ${{ env }}"

                  # Set location name based on environment
                  if [ "${{ env }}" == "dev" ]; then
                    export DATABRICKS_HOST=$(HOST_DEV)
                    export DATABRICKS_TOKEN=$(TOKEN_DEV)
                    export WS_ID=$(WS_ID_DEV)
                    export STORAGE_NAME=$(STOR_DEV)
                    location_name="${{ container }}_dev"
                  else
                    export DATABRICKS_HOST=$(HOST_PROD)
                    export DATABRICKS_TOKEN=$(TOKEN_PROD)
                    export WS_ID=$(WS_ID_PROD)
                    export STORAGE_NAME=$(STOR_PROD)
                    location_name="${{ container }}"
                  fi

                  echo "Creating external location: $location_name"

                  # Create external location using container name in the URL and environment-specific storage account
                  echo "databricks external-locations create $(client)_${location_name} abfss://${{ container }}@$STORAGE_NAME.dfs.core.windows.net/ tuidistoragecredential"
                  databricks external-locations create $(client)_${location_name} abfss://${{ container }}@$STORAGE_NAME.dfs.core.windows.net/ tuidistoragecredential

                  # Set isolation mode to ISOLATED
                  databricks external-locations update $(client)_${location_name} --isolation-mode ISOLATION_MODE_ISOLATED

                  # Grant access to the current workspace
                  databricks workspace-bindings update-bindings external-location $(client)_${location_name} --json "{\"add\": [{\"workspace_id\": $WS_ID, \"binding_type\": \"BINDING_TYPE_READ_WRITE\"}]}"
                  
                  # Set dataeng service principal as owner
                  databricks external-locations update $(client)_${location_name} --owner "dataeng"
                displayName: 'Create and Configure External Location for ${{ container }} in ${{ env }}'

      # Step 7: Create a catalog for the environment (only once per environment)
      - ${{ each env in parameters.environments }}:
          - script: |
              echo "Creating catalog for environment: ${{ env }}"

              # Set environment-specific storage root for catalog
              if [ "${{ env }}" == "dev" ]; then
                export DATABRICKS_HOST=$(HOST_DEV)
                export DATABRICKS_TOKEN=$(TOKEN_DEV)
                export WS_ID=$(WS_ID_DEV)
                export STORAGE_NAME=$(STOR_DEV)
                storage_root="abfss://externallocationuc@$(STOR_DEV).dfs.core.windows.net/"
              else
                export DATABRICKS_HOST=$(HOST_PROD)
                export DATABRICKS_TOKEN=$(TOKEN_PROD)
                export WS_ID=$(WS_ID_PROD)
                export STORAGE_NAME=$(STOR_PROD)
                storage_root="abfss://externallocationuc@$(STOR_PROD).dfs.core.windows.net/"
              fi

              # Create catalog for the client in this environment
              databricks catalogs create $(client) --storage-root $storage_root

              echo "Catalog $(client) created in ${{ env }} with storage root: $storage_root"
              
              # Update catalog owner to dataeng service principal
              databricks catalogs update $(client) --owner "dataeng"

              echo "Catalog owner for $(client) set to dataeng in ${{ env }}"

              # create schema bronze

              schema_exists=$(databricks schemas list --catalog-name $(client) --output json | jq -r ".[] | select(.name == \"bronze\") | .name")
              if [ -z "$schema_exists" ]; then
                  databricks schemas create bronze $(client) --comment "Schema bronze for $(client)"
                  echo "Schema 'bronze' created successfully."
              else
                  echo "Schema 'bronze' already exists."
              fi

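The snippet above stops at the bronze schema. The silver and gold schemas are created the same way, and the templated ETL notebooks mentioned in the Solution section can be pushed into the client folder with the Databricks CLI. Below is a hedged sketch of what an additional script step in the same job could look like; templates/silver_etl.py is a hypothetical local template file, and $(client) is the pipeline variable substituted at runtime.

# Hedged sketch: create the remaining medallion schemas and import a templated
# ETL notebook into the client folder created earlier. Paths are illustrative.
for schema in silver gold; do
  databricks schemas create "$schema" $(client) --comment "Schema $schema for $(client)"
done
databricks workspace import "/Workspace/$(client)/silver_etl" \
  --file templates/silver_etl.py --format SOURCE --language PYTHON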

Conclusion

This article illustrated how we automated the deployment of resources in the Azure and Databricks environments using a DevOps pipeline.

We would love to hear your feedback on our approach. Is there anything we could add to the pipeline or anything we could do differently?
