CI / CD in Azure Databricks using Azure DevOps
Deepak Rajak
Data Engineering / Advanced Analytics Technical Delivery Lead at Exusia, Inc.
In my last article, I integrated Azure Databricks with Azure DevOps, so before you read this one any further, please read that article first & follow all of its implementation steps. That is the first prerequisite. Here is the link for my article.
Now, let's get started. First, a little bit of theory.
Who benefits from DevOps?
Everyone. Once properly configured, automated testing and deployment can free up your engineering team and enable your data team to push their changes into production. For example:
- Data engineers can easily deploy changes to generate new tables for BI analysts.
- Data scientists can update models being used in production.
- Data analysts can modify scripts being used to generate dashboards.
In short, changes made to a Databricks notebook can be pushed to production with a simple mouse click (and then any amount of oversight that your DevOps team feels is appropriate).
Before we start, let's review what we have right now.
- Two Azure Databricks workspaces (devdatabricks & proddatabricks)
- An Azure DevOps project & repo - "databricks"
- Our Databricks notebook - "Covid19_SQLServerLoad.py" - already connected to the repo, with a couple of commits made to it already (see below)
Alright, let's move on and implement the CI/CD process step by step.
Step1: Navigate to proddatabricks & note the URL (we need the location portion of it - the part starting with "adb-").
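If you prefer to grab that piece programmatically, here is a minimal Python sketch; the workspace URL below is a made-up placeholder, not a real workspace:

```python
# Extract the "adb-..." location portion from an Azure Databricks workspace URL.
from urllib.parse import urlparse

workspace_url = "https://adb-5946405904802522.2.azuredatabricks.net"  # hypothetical PROD URL
host = urlparse(workspace_url).netloc
location_part = host.split(".azuredatabricks.net")[0]  # e.g. "adb-5946405904802522.2"
print(location_part)
```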
Step2: Generate an access token. Navigate to User Icon --> User Settings --> Access Tokens tab --> Generate New Token. Give it a comment such as "For Azure DevOps", keep the default 90-day lifetime & click Generate. Copy the generated token somewhere safe, such as Notepad.
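Before wiring the token into a pipeline, you can sanity-check it against the Databricks Workspace REST API. A minimal sketch, assuming the requests library; the host and token values are placeholders:

```python
# Verify the new access token by listing the workspace root.
import requests

host = "https://adb-59xxxxxxxxxxxxxxx.x.azuredatabricks.net"  # your PROD workspace URL
token = "dapiXXXXXXXXXXXXXXXXXXXXXXXXXXXX"                    # the token you just generated

resp = requests.get(
    f"{host}/api/2.0/workspace/list",
    headers={"Authorization": f"Bearer {token}"},
    params={"path": "/"},
)
resp.raise_for_status()  # HTTP 200 means the token is valid for this workspace
print(resp.json())
```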
Step3: Navigate to your Azure DevOps organisation & go to the project (in our case the project name is "databricks"). Click on Create Pipeline.
Step4: Click on the link - "use the classic editor" down below.
Step5: Select "Azure Devops Git". You project & repository will start appearing there by default.
Step6: Click on "Continue" & select "Empty Job".
Step7: Click on the "+" sign on the Agent Job 1
Step8: Search for the "Publish Build Artifacts" task and add it to the Agent job. Select the added task, enter notebooks (select the notebook from our workspace) for Path to publish, and enter DEV Project (this is a custom name / you can define whatever you want) for Artifact name.
Step9: Navigate to the Triggers tab and check Enable continuous integration. This will automatically trigger a build whenever you commit your code to the repo.
Step10: Click Save & queue to continue. In the Run pipeline dialog that appears, enter a save comment ("setting up the pipeline"), then select Save and run.
Step11: Verify that your build pipeline was created and successfully run. It will be in progress & then success.
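You can also verify the build from code via the Azure DevOps REST API. A hedged sketch; the organisation name and personal access token below are placeholders you would substitute with your own:

```python
# List the most recent builds for the project, authenticating with a
# personal access token (PAT) via basic auth (empty username, PAT as password).
import requests

org, project = "my-org", "databricks"          # org name is a placeholder
pat = "<azure-devops-personal-access-token>"   # placeholder PAT

resp = requests.get(
    f"https://dev.azure.com/{org}/{project}/_apis/build/builds",
    params={"api-version": "6.0", "$top": 5},
    auth=("", pat),
)
resp.raise_for_status()
for build in resp.json()["value"]:
    print(build["buildNumber"], build["status"], build.get("result"))
```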
At this point, we have successfully created a Continuous Integration pipeline.
Step12: Now let's create the Azure DevOps - release pipeline (CD). Select Releases. Select New pipeline.
Step13: Select the "start with an Empty job" link. Select Add an artifact. Set the source type to Build, then select the build pipeline we created in the previous step (in our case, "databricks-CI") as the Source. Select Add to apply the changes.
Step14: View the tasks for Stage 1 by selecting the 1 job, 0 task link.
Step15: Select the + link on Agent job to add a task. Search "Databricks", then add Databricks Deploy Notebooks.
Note (very important): We first have to install the "Databricks Script Deployment Task by Data Thirst" extension; only then will the displayed Databricks tasks become available. This package is provided by a third party. Just click on "Databricks Script Deployment Task by Data Thirst" & follow along; it will get installed.
Then refresh the page & start creating the Agent job again.
Step16: Fill in the following for the Databricks Deploy Notebooks task.
- Azure Region: Enter the region of your production (PROD) Azure Databricks workspace, which we obtained from the URL in a previous step (i.e. adb-59...).
- Source files path: Browse to and select the subfolder that contains the notebook.
- Target files path: Enter /Shared.
- Databricks bearer token: Paste the Azure Databricks access token we copied in an earlier step (it looks like dapi37742f03cd5765aac927868d5cde0370).
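Under the hood, the fields above essentially feed the Databricks Workspace import API. A rough Python equivalent, purely for illustration - the host, token and file name are placeholders, and this is not the extension's actual code:

```python
# Import a local notebook source file into the PROD workspace under /Shared,
# roughly what the "Databricks Deploy Notebooks" task does per notebook.
import base64
import requests

host = "https://adb-59xxxxxxxxxxxxxxx.x.azuredatabricks.net"  # PROD workspace URL
token = "dapiXXXXXXXXXXXXXXXXXXXXXXXXXXXX"                    # bearer token

with open("Covid19_SQLServerLoad.py", "rb") as f:
    content = base64.b64encode(f.read()).decode("utf-8")

resp = requests.post(
    f"{host}/api/2.0/workspace/import",
    headers={"Authorization": f"Bearer {token}"},
    json={
        "path": "/Shared/Covid19_SQLServerLoad",  # matches the /Shared target path above
        "format": "SOURCE",
        "language": "PYTHON",
        "content": content,
        "overwrite": True,                        # redeploys replace the existing notebook
    },
)
resp.raise_for_status()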
Step17: Navigate to the Pipeline tab. Select the Continuous deployment trigger on the artifact, then Enable the continuous deployment trigger.
This will create a release every time a new build is available.
Step18: Select Save to save your pipeline changes. Finally, create a release by selecting Create release at the top of the pipeline blade. When the Create a new release form displays, select Create.
Step19: Navigate back to Releases under the Pipelines section of the left-hand menu. Select the release you just created. When it opens, you should see that it is either in progress or completed. (I have captured both.)
Navigate back to your production (PROD) Azure Databricks workspace. If it is already open, refresh the page. Navigate to your "Shared" folder under the workspace. You should see your notebook. This was saved to our workspace by our release pipeline.
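If you want to confirm the deployment without opening the UI, listing the /Shared folder over the REST API works too (same placeholder host & token as in the earlier sketches):

```python
# List the contents of /Shared in the PROD workspace to confirm the notebook landed.
import requests

resp = requests.get(
    "https://adb-59xxxxxxxxxxxxxxx.x.azuredatabricks.net/api/2.0/workspace/list",
    headers={"Authorization": "Bearer dapiXXXXXXXXXXXXXXXXXXXXXXXXXXXX"},
    params={"path": "/Shared"},
)
resp.raise_for_status()
for obj in resp.json().get("objects", []):
    print(obj["object_type"], obj["path"])
```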
CI/CD setup is now completed. If we commit our code from the DEV workspace to the repo (main branch), the same notebook should be available in PROD.
Experiment with making changes to your notebook in DEV, then committing those changes. You will be able to see your build and release pipelines execute and the notebook in the PROD workspace automatically update to reflect those changes.
Don't believe me? Let's try it.
I am going to make a change to my notebook in the "devdatabricks" workspace. I am just going to add one print statement: "Today is Friday".
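The whole change is this single line at the end of the notebook:

```python
print("Today is Friday")
```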
(Note the workspace URL so that you can distinguish between DEV & PROD.)
Navigate to the Release pipeline in Azure DevOps. You will see a new release (Release-2) being created automatically.
The release succeeds after a few seconds.
Navigate to "proddatabricks" workspace & check the Notebook for the change.
Amazing, isn't it? :) It's an awesome feeling when the work completes seamlessly.
That's it, we are done. It was quite a task to set up the CI/CD process, but once it is done, it works like a charm. Step by step, we have managed to accomplish quite a lengthy task.
You can also add another workspace in between DEV & PROD, something like a STAGE. From DEV to STAGE you can have an automated process, & from STAGE to PROD a manual process via human approval. Try this on your own & let me know your feedback if the process works for you. Good luck!!
This marks the end of this article. I hope I was able to give you something new to learn. Thanks for reading; please provide your feedback in the comments section, and please like & share if you liked the content.
Thanks!! Happy Weekend, Happy Learning!!