CI / CD in Azure Databricks using Azure DevOps


In my last article, I integrated Azure Databricks with Azure DevOps. Before you read this one any further, please read that article first & follow all of its implementation steps; that is the first prerequisite. Here is the link to my article.

Now, let's get started. First, a little bit of theory.

Who benefits from DevOps?

Everyone. Once properly configured, automated testing and deployment can free up your engineering team and enable your data team to push their changes into production. For example:

  • Data engineers can easily deploy changes to generate new tables for BI analysts.
  • Data scientists can update models being used in production.
  • Data analysts can modify scripts being used to generate dashboards.

In short, changes made to a Databricks notebook can be pushed to production with a simple mouse click (and then any amount of oversight that your DevOps team feels is appropriate).

Before we start, let's review what we have right now.

  1. Two Azure Databricks workspaces ( devdatabricks & proddatabricks )
  2. An Azure DevOps project & repo - "databricks"
  3. Our Databricks notebook - "Covid19_SQLServerLoad.py" - already connected to the repo, with a couple of commits already made to it.

Alright, let's move ahead & implement the CI / CD process step by step.

Step 1: Navigate to proddatabricks & note the URL ( we need the location / instance portion of it ).
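As a small aside ( not part of the original steps ), the "location" portion is simply the host name of the workspace URL. A minimal Python sketch, using a made-up workspace URL as a placeholder:

```python
from urllib.parse import urlparse

# Hypothetical PROD workspace URL - replace it with the one from your address bar.
prod_url = "https://adb-5946405904802522.2.azuredatabricks.net/?o=5946405904802522"

# The piece we need later is just the host name of the URL.
workspace_host = urlparse(prod_url).netloc
print(workspace_host)  # adb-5946405904802522.2.azuredatabricks.net
```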


Step 2: Generate the access token. Navigate to User Icon --> User Settings --> Access Tokens tab --> Generate New Token. Give it a comment like "For Azure DevOps", keep the default lifetime of 90 days & click Generate. Copy the generated token somewhere safe, e.g. into Notepad.
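Optionally ( this is not part of the original walkthrough ), you can sanity-check the new token with a quick call to the Databricks REST API before wiring it into Azure DevOps. A minimal sketch with placeholder host & token values:

```python
import requests

# Placeholders - use your PROD workspace URL and the token you just generated.
DATABRICKS_HOST = "https://adb-5946405904802522.2.azuredatabricks.net"
DATABRICKS_TOKEN = "dapiXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"

# List the /Shared folder; a successful (200) response means the token authenticates.
resp = requests.get(
    f"{DATABRICKS_HOST}/api/2.0/workspace/list",
    headers={"Authorization": f"Bearer {DATABRICKS_TOKEN}"},
    params={"path": "/Shared"},
)
resp.raise_for_status()
print(resp.json())
```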


Step 3: Navigate to your Azure DevOps organisation & go to the project ( in our case the project name is "databricks" ). Click on Create Pipeline.


Step 4: Click on the "Use the classic editor" link at the bottom.


Step 5: Select "Azure Repos Git". Your project & repository will appear there by default.


Step 6: Click on "Continue" & select "Empty job".


Step 7: Click on the "+" sign on Agent job 1.


Step 8: Search for the "Publish build artifacts" task and add it to the agent job. Select the added task, enter notebooks for the Path to publish ( browse & select the notebook folder from our repo ) and enter DEV Project for the Artifact name ( this is a custom name / you can define whatever you want ).


Step 9: Navigate to the Triggers tab and check Enable continuous integration. This will automatically trigger a build whenever you commit your code to the repo.


Step 10: Click Save & queue to continue. In the Run pipeline dialog that appears, enter a save comment ( "setting up the pipeline" ), then select Save and run.


Step 11: Verify that your build pipeline was created and ran successfully. It will show as in progress & then succeeded.


At this point, we have successfully created a Continuous Integration (CI) pipeline.

Step 12: Now let's create the Azure DevOps release pipeline (CD). Select Releases, then select New pipeline.


Step 13: Select the "Empty job" link to start with an empty stage. Select Add an artifact. Set the source type to Build, then select the build pipeline we created in the previous steps ( in our case "databricks-CI" ) as the Source. Select Add to apply the changes.


Step 14: View the tasks for Stage 1 by selecting the "1 job, 0 task" link.


Step 15: Select the + link on the agent job to add a task. Search for "Databricks", then add Databricks Deploy Notebooks.

Note ( very important ): We first have to install the "Databricks Script Deployment Task by Data Thirst" extension; only then will the Databricks tasks shown here become available. This extension is provided by a third party. Just click on "Databricks Script Deployment Task by Data Thirst" & follow the prompts, and it will get installed.

Then refresh the page & add the task to the agent job again.


Step 16: Fill in the following for the Databricks Deploy Notebooks task:

  • Azure Region: Enter the region of your production (PROD) Azure Databricks workspace, which we obtained from the workspace URL in an earlier step ( i.e. adb-59... ).
  • Source files path: Browse the linked build artifact & select the subfolder that contains the notebook.
  • Target files path: Enter /Shared.
  • Databricks bearer token: Paste the Azure Databricks access token that we copied in an earlier step, e.g. dapi37742f03cd5765aac927868d5cde0370.
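The Data Thirst task hides the API details; as a rough sketch of what such a deployment boils down to ( the host, token & file path below are placeholders, not values from this setup ), the same notebook could be pushed into /Shared with a direct Workspace API call:

```python
import base64
import requests

# Placeholders - PROD workspace URL and the access token generated earlier.
DATABRICKS_HOST = "https://adb-5946405904802522.2.azuredatabricks.net"
DATABRICKS_TOKEN = "dapiXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"

# Read the notebook source from the build artifact and base64-encode it,
# as required by the Workspace Import API.
with open("notebooks/Covid19_SQLServerLoad.py", "rb") as f:
    content = base64.b64encode(f.read()).decode("utf-8")

resp = requests.post(
    f"{DATABRICKS_HOST}/api/2.0/workspace/import",
    headers={"Authorization": f"Bearer {DATABRICKS_TOKEN}"},
    json={
        "path": "/Shared/Covid19_SQLServerLoad",  # target files path is /Shared
        "format": "SOURCE",
        "language": "PYTHON",
        "content": content,
        "overwrite": True,  # so re-deployments replace the existing notebook
    },
)
resp.raise_for_status()
```

Conceptually, the deployment task repeats this for each notebook it finds under the Source files path.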


Step 17: Navigate to the Pipeline tab. Select the Continuous deployment trigger on the artifact, then enable the continuous deployment trigger.

This will create a release every time a new build is available.


Step 18: Select Save to save your pipeline changes. Finally, create a release by selecting Create release at the top of the pipeline blade. When the Create a new release form displays, select Create.


Step 19: Navigate back to Releases under the Pipelines section of the left-hand menu and select the release you just created. When it opens, you should see that it is either in progress or completed.


Navigate back to your production (PROD) Azure Databricks workspace. If it is already open, refresh the page. Navigate to the "Shared" folder in the workspace. You should see your notebook there; it was saved to the workspace by our release pipeline.

CI/CD setup is now completed. If we commit our code from the DEV workspace to the repo (main branch), the same notebook should be available in PROD.

Experiment with making changes to your notebook in DEV, then committing those changes. You will be able to see your build and release pipelines execute and the notebook in the PROD workspace automatically update to reflect those changes.

Don't believe me? Let's try it.

I am going to make a change to my notebook in the "devdatabricks" workspace. I am just going to add one print statement, "Today is Friday".
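The change itself is nothing more than a single new cell at the end of the notebook:

```python
print("Today is Friday")
```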

( Note the workspace URL so that you can distinguish between DEV & PROD )


Navigate to the release pipeline in Azure DevOps. You will see a new release ( Release-2 ) being created automatically.


The release succeeds after a few seconds.


Navigate to "proddatabricks" workspace & check the Notebook for the change.


Amazing, isn't it? :) It's an awesome feeling when the work gets completed seamlessly.

That's it, we are done. It was quite a task to set up the CI / CD process, but once it is done, it works like a charm. Step by step, we have managed to accomplish quite a lengthy task.

You can also add another workspace in between DEV & PROD, something like a STAGING environment. From DEV to STAGING you can keep the process automated, & from STAGING to PROD you can make it manual, gated by a human approval. Try this on your own & let me know your feedback on whether the process works for you. Good luck !!

This marks the end of this article. I hope I was able to give you something new to learn. Thanks for reading. Please provide your feedback in the comments section, and please like & share if you liked the content.

Thanks !! Happy Weekend, Happy Learning !!

