CI / CD in Azure Databricks using Azure DevOps - Part 1
Yogesh Dipankar
Global Head : IoT Telematics, Maps & AI Data Monetization | 10+ Patents Filed | 3X StartUps
CI/CD in Azure Databricks using Azure DevOps is a big topic, so I have broken this article into two parts:
Part 1: Version control of Databricks notebooks. Here we cover the steps to move a workspace notebook into a repo and push it to GitHub.
Part 2: Azure DevOps build, where we create the CI/CD pipeline that deploys the notebooks. That article will cover the end-to-end deployment pipeline.
Before I deep dive into the topic, let's look at what Databricks is: it is a web-based, managed Apache Spark service that is hosted on all popular clouds, including Azure, AWS and, more recently, GCP.
Databricks was founded by the original creators of Apache Spark and is built as a web-based platform for running Spark. It helps teams implement complex data analytics code that needs cluster computing.
It provides automated cluster management and IPython-style notebooks. The best thing about Databricks is that cluster management is completely automated, which makes it a great Platform as a Service: we don't need to manage any aspect of the cluster manually, and on top of that we get IPython-style notebooks.
All of this comes pre-installed with Databricks: we get a cluster (single node or multi-node), and on that cluster we can launch notebooks written in SQL, PySpark or Spark Scala. The code in the notebooks utilizes the full functionality of the cluster, so the stand-out features are automated cluster management and the ability to code notebooks in multiple languages (SQL, PySpark, Spark Scala, Java).
Clusters: clusters are nothing but a set of virtual machines that are used to do the work.
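To make this concrete, here is a small illustrative PySpark cell of the kind you would run in a Databricks notebook. The pre-configured spark session is provided by the attached cluster, and the sensor readings are made-up example data, not anything from this article.

# Illustrative PySpark notebook cell - Databricks injects the `spark` session automatically,
# so there is no SparkSession boilerplate. The data below is an inline example.
data = [("sensor-1", 21.5), ("sensor-2", 19.8), ("sensor-1", 22.1)]
df = spark.createDataFrame(data, ["device", "temperature"])
df.groupBy("device").avg("temperature").show()  # the aggregation runs distributed on the cluster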
By now it should be clear that Databricks is managed Spark offered as a PaaS for cluster computing, so let's get to the main topic of this article: version control of notebooks. Notebooks in the Databricks workspace are there for development, testing, execution and so on, but they are not version controlled. If someone accidentally deletes a notebook, it is gone, and changes are not tracked as long as the notebook lives only in the workspace. For code management and release management it is therefore very important to move the codebase to a safe, version-controlled location.
Detailed steps for moving notebooks to Repos
The Repos feature is a recently added integration with remote code repositories; it makes it easier to version control development and follow best practices. Databricks supports integration with GitHub and Azure DevOps Repos.
Step 1: Add a repo under Repos
Select New and click Repo to create a new repo. A pop-up titled 'Add Repo' will appear, in which we have to provide the Git / Azure Repos integration URL, depending on the Git provider.
In the case of Azure DevOps, to get that URL go to Azure DevOps, select the repository you want to add, select Clone, copy the URL, and paste it into the Azure Databricks 'Add Repo' dialog.
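If you prefer to script this step instead of clicking through the UI, the same link can be created through the Databricks Repos REST API. This is only a minimal sketch: the workspace URL, token, Git URL and repo path below are placeholders you would replace with your own values.

# Sketch: create a Databricks Repo via the Repos REST API (POST /api/2.0/repos).
# All values are placeholders - adjust workspace URL, token, Git URL and path.
import requests
DATABRICKS_HOST = "https://adb-1234567890123456.7.azuredatabricks.net"  # your workspace URL
TOKEN = "<personal-access-token>"
payload = {
    "url": "https://dev.azure.com/my-org/my-project/_git/my-repo",  # Git clone URL from Azure DevOps
    "provider": "azureDevOpsServices",                              # or "gitHub" for a GitHub repo
    "path": "/Repos/user@example.com/my-repo",                      # where the repo appears in the workspace
}
resp = requests.post(
    f"{DATABRICKS_HOST}/api/2.0/repos",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=payload,
)
resp.raise_for_status()
print(resp.json())  # returns the repo id, path and current branch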
Step 2: Move the notebooks to the newly added repo, as below.
Go to the workspace folder containing the notebook you wish to move, click the downward arrow next to it and select 'Move'.
Step 3: In the directory picker, select 'Repos', then choose the repo you are interested in and click Select.
Step 4: Once the notebook has been moved, click the downward arrow beside the repo name and click 'Git...'.
Notebooks can be committed into a Git repository either by linking a Git repository to the Databricks workspace (as we are doing here) or by manually exporting the notebook as a source file.
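For the manual-export route, the Workspace API can pull the notebook source programmatically. A minimal sketch follows; the workspace URL, token and notebook path are placeholders.

# Sketch: export a notebook as source via GET /api/2.0/workspace/export.
import base64
import requests
DATABRICKS_HOST = "https://adb-1234567890123456.7.azuredatabricks.net"  # placeholder
TOKEN = "<personal-access-token>"                                        # placeholder
resp = requests.get(
    f"{DATABRICKS_HOST}/api/2.0/workspace/export",
    headers={"Authorization": f"Bearer {TOKEN}"},
    params={"path": "/Users/user@example.com/my_notebook", "format": "SOURCE"},
)
resp.raise_for_status()
source = base64.b64decode(resp.json()["content"])  # the content field is base64-encoded
with open("my_notebook.py", "wb") as f:            # a Python notebook exports as .py
    f.write(source)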
Step 5: Review the changes, add a commit message (required) and click 'Commit & Push'.
Step 6: The notebook should now be visible in GitHub.
Before the advent of Databricks Workflows, notebooks were mostly orchestrated or scheduled using Azure Data Factory (ADF). If you want to execute/orchestrate the notebook through ADF, follow these steps.
Step 7: Go back to the Azure portal and open Azure Data Factory.
Step 8: Select Launch Studio.
Step 9: Select (or create) the linked service pointing to that Databricks workspace.
Step 10: Add the access token and link the service.
Step 11: Generate the access token.
Go to Azure Databricks, open User Settings, select Access Tokens, click Generate New Token and copy the generated token.
Step 12: Paste the copied access token, select Apply and then select Test Connection.
Step 13: The test connection should be successful.
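As a quick sanity check outside ADF, you can also verify the token yourself by calling any Databricks REST endpoint with it. A minimal sketch (host and token are placeholders) that lists the workspace clusters:

# Sketch: verify the personal access token by listing clusters (GET /api/2.0/clusters/list).
import requests
DATABRICKS_HOST = "https://adb-1234567890123456.7.azuredatabricks.net"  # placeholder
TOKEN = "<personal-access-token>"                                        # placeholder
resp = requests.get(
    f"{DATABRICKS_HOST}/api/2.0/clusters/list",
    headers={"Authorization": f"Bearer {TOKEN}"},
)
resp.raise_for_status()  # a 200 response means the token is valid
for cluster in resp.json().get("clusters", []):
    print(cluster["cluster_name"], cluster["state"])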
Step 14: Under Factory Resources, select Pipelines and then select New Pipeline.
Step 15: In the pipeline designer, select Databricks and then the Notebook activity.
Step 16: Select the Databricks linked service we created earlier.
Step 17: In Settings, select Browse and pick the notebook we want to test.
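Behind the scenes, Steps 15-17 are stored as JSON in the pipeline definition (the 'Code' view in ADF Studio). A rough sketch of the relevant activity fragment, shown here as a Python dict with placeholder names and paths:

# Rough sketch of the DatabricksNotebook activity in the pipeline JSON. Names are placeholders.
import json
notebook_activity = {
    "name": "RunDatabricksNotebook",
    "type": "DatabricksNotebook",
    "linkedServiceName": {
        "referenceName": "AzureDatabricksLinkedService",  # the linked service from Steps 9-12
        "type": "LinkedServiceReference",
    },
    "typeProperties": {
        "notebookPath": "/Repos/user@example.com/my-repo/my_notebook"  # the notebook picked in Step 17
    },
}
print(json.dumps(notebook_activity, indent=2))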
Step 18: After selecting the notebook, test the connection; it should show as successful.
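If you want to trigger this pipeline from code rather than debugging it in ADF Studio, the Data Factory createRun REST endpoint can do it. A minimal sketch, assuming the azure-identity package is installed and using placeholder subscription, resource group, factory and pipeline names:

# Sketch: trigger an ADF pipeline run via the createRun REST API. All names are placeholders.
import requests
from azure.identity import DefaultAzureCredential
SUBSCRIPTION_ID = "<subscription-id>"
RESOURCE_GROUP = "<resource-group>"
FACTORY_NAME = "<data-factory-name>"
PIPELINE_NAME = "<pipeline-name>"
# Acquire an Azure AD token for the ARM management endpoint
credential = DefaultAzureCredential()
token = credential.get_token("https://management.azure.com/.default").token
url = (
    f"https://management.azure.com/subscriptions/{SUBSCRIPTION_ID}"
    f"/resourceGroups/{RESOURCE_GROUP}/providers/Microsoft.DataFactory"
    f"/factories/{FACTORY_NAME}/pipelines/{PIPELINE_NAME}/createRun"
    "?api-version=2018-06-01"
)
resp = requests.post(url, headers={"Authorization": f"Bearer {token}"}, json={})
resp.raise_for_status()
print("Pipeline run id:", resp.json()["runId"])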
At this point, the Databricks orchestration through ADF is done. The code now lives in a repo and has been tested both manually and through ADF orchestration.
To wrap up, I would say: never leave notebooks only in the workspace; move them to a repo and put them under version control.
Let me know your thoughts. This article, a detailed tutorial, is designed to help you move your codebase into a repo and test it manually and through orchestration. In the next article we will explore the end-to-end CI/CD deployment process using Azure DevOps.
Cheers
Yogesh