CI/CD in Azure Databricks using Azure DevOps - Part 1

CI/CD in Azure Databricks using Azure DevOps is a big topic, so I have broken this article into two parts:

Part 1: Version-control Databricks notebooks. Here we cover the steps to move a workspace notebook into a repo and push it to GitHub.

Part 2: Use Azure DevOps to build the CI/CD pipeline that deploys the notebooks. In Part 2 we will cover the end-to-end deployment pipeline.

Before I deep-dive into the topic, let's look at what Databricks is: a web-based, managed Apache Spark service hosted on all the popular clouds, including Azure, AWS, and more recently GCP.

Databricks was founded by the original creators of Apache Spark and was developed as a web-based platform for running Spark. It helps you implement complex data analytics code that needs cluster computing.

It provides automated cluster management and IPython-style notebooks. The best thing about Databricks is that cluster management is completely automated, making it a great Platform as a Service: we don't need to manage any aspect of the cluster manually, and on top of that we get IPython-style notebooks.

All of this comes pre-installed with Databricks: we get a cluster (multi-node or single-node), and on that cluster we can launch notebooks written in SQL, PySpark, or Spark Scala. The code in the notebooks uses the full capacity of the cluster. The best things about Databricks are automated cluster management and notebooks that can be written in multiple languages (SQL, PySpark, Spark Scala, Java).
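
For illustration, a minimal PySpark notebook cell might look like the sketch below (the table name is just an example); the pre-configured spark session distributes the work across the cluster:

    # 'spark' is pre-initialized in every Databricks notebook
    df = spark.read.table("samples.nyctaxi.trips")   # example table name
    df.groupBy("pickup_zip").count().show(10)        # executed in parallel on the cluster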

Clusters: a cluster is simply a set of virtual machines that are used to do the work.
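
As a rough sketch of what a cluster definition looks like under the hood, the payload below is the kind of JSON the Databricks Clusters REST API accepts (all values are placeholders or examples):

    # Example cluster definition for the Databricks Clusters API (values are illustrative)
    cluster_spec = {
        "cluster_name": "demo-cluster",
        "spark_version": "13.3.x-scala2.12",   # example Databricks runtime version
        "node_type_id": "Standard_DS3_v2",     # Azure VM size backing each node
        "num_workers": 2,                      # worker VMs; a driver VM is added automatically
    }
    # POSTing this to {workspace-url}/api/2.0/clusters/create provisions the VMs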

Databricks provides:

  1. Flexibility (pick any language of your choice: SQL, Scala, Python, Java)
  2. Workflow optimisation (an intuitive orchestration service)
  3. A collaborative environment for data engineers, data scientists, and data analysts

By now it is clear that Databricks is managed Spark offered as a PaaS for cluster computing, so let's start with the main topic of this article: version control of the notebooks. Notebooks in the workspace are available in the platform for development, testing, execution, and so on, but they are not version controlled. If someone accidentally deletes a notebook, it's gone, and changes are not tracked while the notebook lives only in the workspace. For sound code management and release management, it is therefore very important to move the codebase to a safe, version-controlled location.

Detailed steps for moving notebooks to Repos

Repos is a recently added feature that integrates the workspace with a remote code repository, making it easier to version-control development and follow best practices. Databricks supports integration with GitHub and Azure DevOps Repos.
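
The Add Repo step described below can also be scripted. Here is a minimal sketch using the Databricks Repos REST API; the workspace URL, token, Git URL, and path are placeholders to replace with your own values:

    import requests

    workspace_url = "https://adb-1234567890123456.7.azuredatabricks.net"   # placeholder
    token = "<databricks-personal-access-token>"                           # placeholder

    payload = {
        "url": "https://dev.azure.com/<org>/<project>/_git/<repo>",  # Git clone URL
        "provider": "azureDevOpsServices",                           # use "gitHub" for GitHub
        "path": "/Repos/<user>/<repo>",                              # where the repo appears in the workspace
    }

    resp = requests.post(
        f"{workspace_url}/api/2.0/repos",
        headers={"Authorization": f"Bearer {token}"},
        json=payload,
    )
    print(resp.status_code, resp.json())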

Step 1: Add a repo under Repos

Select New and click Repo to create a new repo. An Add Repo pop-up will appear, where we have to provide the Git/Azure Repos integration URL for the chosen Git provider.

[Image: Databricks landing page with the new "Repos" feature]
[Image: Add Repo dialog]

In the case of Azure DevOps, to get that URL go to Azure DevOps, select the repository you want to add, select Clone, copy the URL, and paste it into the Add Repo dialog in Azure Databricks.
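
For reference, the Azure DevOps clone URL generally follows the pattern below; the organization, project, and repository names are placeholders:

    https://dev.azure.com/<organization>/<project>/_git/<repository>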

[Image: Copy the clone URL from Azure DevOps]

Step 2: Move the notebooks into the newly added repo as follows

Go to the workspace folder containing the notebook you wish to move, click the downward arrow next to the notebook, and select "Move".

[Image: Select the notebooks and folders you want to move from the workspace to the repo]
[Image: Move dialog for the workspace notebook]

Step 3: In the directory picker, select "Repos", then choose the repo you are interested in and click Select.

[Image: Choosing the repo as the new location for the workspace notebook]

Step 4: Once the notebook has been moved, click the downward arrow beside the repo name and click "Git...".

Notebooks can be committed into a Git repository either by linking a Git repository to the notebook in the Databricks workspace or by manually exporting the notebook as a source file.
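
If you ever need the manual export route, the Workspace API can return a notebook as a source file that you then commit to Git yourself. A minimal sketch, with placeholder workspace URL, token, and notebook path:

    import base64
    import requests

    workspace_url = "https://adb-1234567890123456.7.azuredatabricks.net"   # placeholder
    token = "<databricks-personal-access-token>"                           # placeholder

    resp = requests.get(
        f"{workspace_url}/api/2.0/workspace/export",
        headers={"Authorization": f"Bearer {token}"},
        params={"path": "/Users/<user>/my_notebook", "format": "SOURCE"},  # placeholder path
    )
    source = base64.b64decode(resp.json()["content"])   # notebook source, e.g. a .py file
    with open("my_notebook.py", "wb") as f:
        f.write(source)   # commit this file to Git manually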

[Image: Git integration dialog]

Step 5: Review the changes, add a commit message (required), and click "Commit & Push".

[Image: Commit and Push]

Step 6: The notebook should now be visible in GitHub.

[Image: Notebooks committed to the repo]

Before the advent of Databricks Workflows, notebooks were mostly orchestrated or scheduled using Azure Data Factory. To execute or orchestrate the notebook through ADF, follow these steps.

Step 7: Now go back to the Azure portal and open Azure Data Factory.

[Image: Azure portal]

Step 8: Select Launch Studio.

[Image: Launch Studio]
[Image: Open the required Azure Data Factory]

Step 9: Select the linked service that points to the Databricks workspace.

[Image: Open the linked service]

Step 10: Add the access token and link the service.

[Image: Copy and paste the access token]

Step 11: Generate the access token.

Go to Azure Databricks, open User Settings, select Access Tokens, click Generate New Token, and copy the token.
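
The same personal access token can also be generated programmatically through the Token API; the sketch below assumes you already have a token to authenticate with, and the lifetime and comment are just example values:

    import requests

    workspace_url = "https://adb-1234567890123456.7.azuredatabricks.net"   # placeholder
    existing_token = "<an-existing-databricks-token>"                      # placeholder

    resp = requests.post(
        f"{workspace_url}/api/2.0/token/create",
        headers={"Authorization": f"Bearer {existing_token}"},
        json={"lifetime_seconds": 86400, "comment": "ADF linked service"},  # example values
    )
    print(resp.json()["token_value"])   # store this securely, e.g. in Azure Key Vault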

[Image: User Settings]
[Image: Generate the access token]
[Image: Copy the access token]

Step 12: Paste the copied access token, select Apply, and then select Test Connection.
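
Under the hood, the linked service is just a JSON definition, and it can be created through the Data Factory REST API as well. A minimal sketch, assuming placeholder subscription, resource group, factory, workspace URL, token, and cluster id:

    import requests
    from azure.identity import DefaultAzureCredential   # pip install azure-identity

    sub, rg, factory = "<subscription-id>", "<resource-group>", "<data-factory-name>"  # placeholders
    aad_token = DefaultAzureCredential().get_token("https://management.azure.com/.default").token

    linked_service = {
        "properties": {
            "type": "AzureDatabricks",
            "typeProperties": {
                "domain": "https://adb-1234567890123456.7.azuredatabricks.net",        # workspace URL
                "accessToken": {"type": "SecureString", "value": "<databricks-token>"},
                "existingClusterId": "<cluster-id>",   # reuse an existing cluster
            },
        }
    }

    url = (
        f"https://management.azure.com/subscriptions/{sub}/resourceGroups/{rg}"
        f"/providers/Microsoft.DataFactory/factories/{factory}"
        f"/linkedservices/AzureDatabricksLS?api-version=2018-06-01"
    )
    resp = requests.put(url, headers={"Authorization": f"Bearer {aad_token}"}, json=linked_service)
    print(resp.status_code)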

[Image: Paste the access token]

Step 13: The test connection is successful.

[Image: Test connection successful]

Step 14: Under Factory Resources, select Pipelines and then select New Pipeline.

[Image: Pipelines under Factory Resources]

Step 15: In the Activities pane, expand Databricks and select Notebook.

[Image: Select the Databricks Notebook activity]

Step 16: Select the Databricks linked service that we created earlier.

[Image: Browse notebooks]

Step 17: In Settings, select Browse and choose the notebook you want to test.

[Image: Browse the notebooks]
[Image: Select the notebook]

Step 18: After selecting the notebook, test the connection; the connection is successful.

[Image: Test connection successful]

At this point, the Databricks orchestration through ADF is done. The code is now in the repo and has been tested both manually and through ADF orchestration.
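
If you prefer to trigger the pipeline from code instead of the Debug button, the Data Factory REST API exposes a createRun endpoint. A minimal sketch, again with placeholder names:

    import requests
    from azure.identity import DefaultAzureCredential

    sub, rg, factory = "<subscription-id>", "<resource-group>", "<data-factory-name>"  # placeholders
    pipeline = "<pipeline-name>"                                                       # placeholder

    token = DefaultAzureCredential().get_token("https://management.azure.com/.default").token
    url = (
        f"https://management.azure.com/subscriptions/{sub}/resourceGroups/{rg}"
        f"/providers/Microsoft.DataFactory/factories/{factory}"
        f"/pipelines/{pipeline}/createRun?api-version=2018-06-01"
    )
    resp = requests.post(url, headers={"Authorization": f"Bearer {token}"}, json={})
    print(resp.json())   # returns a runId you can poll for status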

To close, I would say: never leave notebooks only in the workspace; move them to a repo and version-control them.


Let me know your thoughts. This article, or rather detailed tutorial, is designed to help you move your codebase into a repo and test it both manually and through orchestration. In the next article we will explore the end-to-end CI/CD deployment process using Azure DevOps.

Cheers

Yogesh
