CI / CD in Azure Databricks using Azure DevOps - Part 1
Yogesh Dipankar
Global Head : IoT Telematics, Maps & AI Data Monetization | 10+ Patents Filed | 3X StartUps
CI/CD in Azure Databricks using Azure DevOps is a big topic, so I have broken this article into two parts:
Part 1: Version control of Databricks notebooks. Here we cover the steps to move a workspace notebook into a repo and push it to GitHub.
Part 2: Azure DevOps build, where we create the CI/CD pipeline that deploys the notebooks. That article will cover the end-to-end deployment pipeline.
Before I deep dive into the topic, let's look at what Databricks is: it is a web-based, managed Apache Spark service that is hosted on all popular clouds, including Azure, AWS and, more recently, GCP.
Databricks was founded by the original creators of Apache Spark and is built as a web-based platform for running Spark. It helps teams implement complex data analytics code that needs cluster computing.
It provides automated cluster management and IPython-style notebooks. The best thing about Databricks is that cluster management is completely automated, which makes it a great Platform as a Service: we don't need to manage any aspect of the cluster manually, and on top of that we get IPython-style notebooks.
All of this comes pre-installed with Databricks: we get a cluster (single node or multi-node), and on that cluster we can launch notebooks written in SQL, PySpark or Spark Scala. The code in the notebooks utilizes the full functionality of the cluster, so the stand-out features are automated cluster management and the ability to code notebooks in multiple languages (SQL, PySpark, Spark Scala, Java).
Clusters: clusters are nothing but a set of virtual machines that are used to do the work.
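To make this concrete, here is a small illustrative PySpark cell of the kind you would run in a Databricks notebook. The pre-configured spark session is provided by the attached cluster, and the sensor readings are made-up example data, not anything from this article.

# Illustrative PySpark notebook cell - Databricks injects the `spark` session automatically,
# so there is no SparkSession boilerplate. The data below is an inline example.
data = [("sensor-1", 21.5), ("sensor-2", 19.8), ("sensor-1", 22.1)]
df = spark.createDataFrame(data, ["device", "temperature"])
df.groupBy("device").avg("temperature").show()  # the aggregation runs distributed on the cluster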
By now it should be clear that Databricks is managed Spark offered as a PaaS for cluster computing, so let's get to the main topic of this article: version control of notebooks. Notebooks in the Databricks workspace are there for development, testing, execution and so on, but they are not version controlled. If someone accidentally deletes a notebook, it is gone, and changes are not tracked as long as the notebook lives only in the workspace. For code management and release management it is therefore very important to move the codebase to a safe, version-controlled location.
Detailed steps for moving notebooks to Repos
The Repos feature is a recently added integration with remote code repositories; it makes it easier to version control development and follow best practices. Databricks supports integration with GitHub and Azure DevOps Repos.
Step 1: Add a repo under Repos
Select New and click Repo to create a new repo. A pop-up titled 'Add Repo' will appear, in which we have to provide the Git / Azure Repos integration URL, depending on the Git provider.
In the case of Azure DevOps, to get that URL go to Azure DevOps, select the repository you want to add, select Clone, copy the URL, and paste it into the Azure Databricks 'Add Repo' dialog.
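If you prefer to script this step instead of clicking through the UI, the same link can be created through the Databricks Repos REST API. This is only a minimal sketch: the workspace URL, token, Git URL and repo path below are placeholders you would replace with your own values.

# Sketch: create a Databricks Repo via the Repos REST API (POST /api/2.0/repos).
# All values are placeholders - adjust workspace URL, token, Git URL and path.
import requests
DATABRICKS_HOST = "https://adb-1234567890123456.7.azuredatabricks.net"  # your workspace URL
TOKEN = "<personal-access-token>"
payload = {
    "url": "https://dev.azure.com/my-org/my-project/_git/my-repo",  # Git clone URL from Azure DevOps
    "provider": "azureDevOpsServices",                              # or "gitHub" for a GitHub repo
    "path": "/Repos/user@example.com/my-repo",                      # where the repo appears in the workspace
}
resp = requests.post(
    f"{DATABRICKS_HOST}/api/2.0/repos",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=payload,
)
resp.raise_for_status()
print(resp.json())  # returns the repo id, path and current branch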
Step 2: Move the notebooks to the newly added repo, as below.
Go to the workspace folder containing the notebook you wish to move, click the downward arrow next to it and select 'Move'.
Step 3: In the directory picker, select 'Repos', then choose the repo you are interested in and click Select.
Step 4: Once the notebook has been moved, click the downward arrow beside the repo name and click 'Git...'.
Notebooks can be committed into a Git repository either by linking a Git repository to the Databricks workspace (as we are doing here) or by manually exporting the notebook as a source file.
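For the manual-export route, the Workspace API can pull the notebook source programmatically. A minimal sketch follows; the workspace URL, token and notebook path are placeholders.

# Sketch: export a notebook as source via GET /api/2.0/workspace/export.
import base64
import requests
DATABRICKS_HOST = "https://adb-1234567890123456.7.azuredatabricks.net"  # placeholder
TOKEN = "<personal-access-token>"                                        # placeholder
resp = requests.get(
    f"{DATABRICKS_HOST}/api/2.0/workspace/export",
    headers={"Authorization": f"Bearer {TOKEN}"},
    params={"path": "/Users/user@example.com/my_notebook", "format": "SOURCE"},
)
resp.raise_for_status()
source = base64.b64decode(resp.json()["content"])  # the content field is base64-encoded
with open("my_notebook.py", "wb") as f:            # a Python notebook exports as .py
    f.write(source)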
Step 5: Review the changes, add a commit message (required) and click 'Commit & Push'.
Step 6: The notebook should now be visible in GitHub.
Before the advent of Databricks Workflows, notebooks were mostly orchestrated or scheduled using Azure Data Factory (ADF). If you want to execute/orchestrate the notebook through ADF, follow these steps.
Step 7: Go back to the Azure portal and open Azure Data Factory.
Step 8: Select Launch Studio.
Step 9: Select (or create) the linked service pointing to that Databricks workspace.
Step 10: Add the access token and link the service.
Step 11: Generate the access token.
Go to Azure Databricks, open User Settings, select Access Tokens, click Generate New Token and copy the generated token.
Step 12: Paste the copied access token, select Apply and then select Test Connection.
Step 13: The test connection should be successful.
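As a quick sanity check outside ADF, you can also verify the token yourself by calling any Databricks REST endpoint with it. A minimal sketch (host and token are placeholders) that lists the workspace clusters:

# Sketch: verify the personal access token by listing clusters (GET /api/2.0/clusters/list).
import requests
DATABRICKS_HOST = "https://adb-1234567890123456.7.azuredatabricks.net"  # placeholder
TOKEN = "<personal-access-token>"                                        # placeholder
resp = requests.get(
    f"{DATABRICKS_HOST}/api/2.0/clusters/list",
    headers={"Authorization": f"Bearer {TOKEN}"},
)
resp.raise_for_status()  # a 200 response means the token is valid
for cluster in resp.json().get("clusters", []):
    print(cluster["cluster_name"], cluster["state"])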
Step 14: Under Factory Resources, select Pipelines and then select New Pipeline.
Step 15: In the pipeline designer, select Databricks and then the Notebook activity.
Step 16: Select the Databricks linked service we created earlier.
Step 17: In Settings, select Browse and pick the notebook we want to test.
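Behind the scenes, Steps 15-17 are stored as JSON in the pipeline definition (the 'Code' view in ADF Studio). A rough sketch of the relevant activity fragment, shown here as a Python dict with placeholder names and paths:

# Rough sketch of the DatabricksNotebook activity in the pipeline JSON. Names are placeholders.
import json
notebook_activity = {
    "name": "RunDatabricksNotebook",
    "type": "DatabricksNotebook",
    "linkedServiceName": {
        "referenceName": "AzureDatabricksLinkedService",  # the linked service from Steps 9-12
        "type": "LinkedServiceReference",
    },
    "typeProperties": {
        "notebookPath": "/Repos/user@example.com/my-repo/my_notebook"  # the notebook picked in Step 17
    },
}
print(json.dumps(notebook_activity, indent=2))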
Step 18: After selecting the notebook, test the connection; it should show as successful.
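If you want to trigger this pipeline from code rather than debugging it in ADF Studio, the Data Factory createRun REST endpoint can do it. A minimal sketch, assuming the azure-identity package is installed and using placeholder subscription, resource group, factory and pipeline names:

# Sketch: trigger an ADF pipeline run via the createRun REST API. All names are placeholders.
import requests
from azure.identity import DefaultAzureCredential
SUBSCRIPTION_ID = "<subscription-id>"
RESOURCE_GROUP = "<resource-group>"
FACTORY_NAME = "<data-factory-name>"
PIPELINE_NAME = "<pipeline-name>"
# Acquire an Azure AD token for the ARM management endpoint
credential = DefaultAzureCredential()
token = credential.get_token("https://management.azure.com/.default").token
url = (
    f"https://management.azure.com/subscriptions/{SUBSCRIPTION_ID}"
    f"/resourceGroups/{RESOURCE_GROUP}/providers/Microsoft.DataFactory"
    f"/factories/{FACTORY_NAME}/pipelines/{PIPELINE_NAME}/createRun"
    "?api-version=2018-06-01"
)
resp = requests.post(url, headers={"Authorization": f"Bearer {token}"}, json={})
resp.raise_for_status()
print("Pipeline run id:", resp.json()["runId"])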
At this point, the Databricks orchestration through ADF is done. The code now lives in a repo and has been tested both manually and through ADF orchestration.
To wrap up, I would say: never leave notebooks only in the workspace; move them to a repo and put them under version control.
Let me know your thoughts. This article, a detailed tutorial, is designed to help you move your codebase into a repo and test it manually and through orchestration. In the next article we will explore the end-to-end CI/CD deployment process using Azure DevOps.
Cheers
Yogesh