CI / CD in Azure Databricks using Azure DevOps
Deepak Rajak
Data Engineering / Advanced Analytics Technical Delivery Lead at Exusia, Inc.
In my last article, I integrated Azure Databricks with Azure DevOps, so before you read this one any further, please read that article first & follow all of its implementation steps. That is the first prerequisite. Here is the link for my article.
Now, let's get started. First, a little bit of theory.
Who benefits from DevOps?
Everyone. Once properly configured, automated testing and deployment can free up your engineering team and enable your data team to push their changes into production. For example:
- Data engineers can easily deploy changes to generate new tables for BI analysts.
- Data scientists can update models being used in production.
- Data analysts can modify scripts being used to generate dashboards.
In short, changes made to a Databricks notebook can be pushed to production with a simple mouse click (and then any amount of oversight that your DevOps team feels is appropriate).
Before we start, let's review what we have right now.
- Two Azure Databricks workspaces (devdatabricks & proddatabricks)
- An Azure DevOps project & repo - "databricks"
- Our Databricks notebook - "Covid19_SQLServerLoad.py" - already connected to the repo, with a couple of commits made to it already (see below)
Alright, let's move on and implement the CI/CD process step by step.
Step1: Navigate to proddatabricks & note the URL (we need the location portion of it - the part starting with "adb-").
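If you prefer to grab that piece programmatically, here is a minimal Python sketch; the workspace URL below is a made-up placeholder, not a real workspace:

```python
# Extract the "adb-..." location portion from an Azure Databricks workspace URL.
from urllib.parse import urlparse

workspace_url = "https://adb-5946405904802522.2.azuredatabricks.net"  # hypothetical PROD URL
host = urlparse(workspace_url).netloc
location_part = host.split(".azuredatabricks.net")[0]  # e.g. "adb-5946405904802522.2"
print(location_part)
```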
Step2: Generate an access token. Navigate to User Icon --> User Settings --> Access Tokens tab --> Generate New Token. Give it a comment such as "For Azure DevOps", keep the default 90-day lifetime & click Generate. Copy the generated token somewhere safe, such as Notepad.
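Before wiring the token into a pipeline, you can sanity-check it against the Databricks Workspace REST API. A minimal sketch, assuming the requests library; the host and token values are placeholders:

```python
# Verify the new access token by listing the workspace root.
import requests

host = "https://adb-59xxxxxxxxxxxxxxx.x.azuredatabricks.net"  # your PROD workspace URL
token = "dapiXXXXXXXXXXXXXXXXXXXXXXXXXXXX"                    # the token you just generated

resp = requests.get(
    f"{host}/api/2.0/workspace/list",
    headers={"Authorization": f"Bearer {token}"},
    params={"path": "/"},
)
resp.raise_for_status()  # HTTP 200 means the token is valid for this workspace
print(resp.json())
```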
Step3: Navigate to your Azure DevOps organisation & go to the project (in our case the project name is "databricks"). Click on Create Pipeline.
Step4: Click on the link - "use the classic editor" down below.
Step5: Select "Azure Devops Git". You project & repository will start appearing there by default.
Step6: Click on "Continue" & select "Empty Job".
Step7: Click on the "+" sign on the Agent Job 1
Step8: Search for the "Publish Build Artifacts" task and add it to the Agent job. Select the added task, enter notebooks (select the notebook from our workspace) for Path to publish, and enter DEV Project (this is a custom name / you can define whatever you want) for Artifact name.
Step9: Navigate to the Triggers tab and check Enable continuous integration. This will automatically trigger a build whenever you commit your code to the repo.
Step10: Click Save & queue to continue. In the Run pipeline dialog that appears, enter a save comment ("setting up the pipeline"), then select Save and run.
Step11: Verify that your build pipeline was created and successfully run. It will be in progress & then success.
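You can also verify the build from code via the Azure DevOps REST API. A hedged sketch; the organisation name and personal access token below are placeholders you would substitute with your own:

```python
# List the most recent builds for the project, authenticating with a
# personal access token (PAT) via basic auth (empty username, PAT as password).
import requests

org, project = "my-org", "databricks"          # org name is a placeholder
pat = "<azure-devops-personal-access-token>"   # placeholder PAT

resp = requests.get(
    f"https://dev.azure.com/{org}/{project}/_apis/build/builds",
    params={"api-version": "6.0", "$top": 5},
    auth=("", pat),
)
resp.raise_for_status()
for build in resp.json()["value"]:
    print(build["buildNumber"], build["status"], build.get("result"))
```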
At this point, we have successfully created a Continuous Integration pipeline.
Step12: Now let's create the Azure DevOps - release pipeline (CD). Select Releases. Select New pipeline.
Step13: Select the "start with an Empty job" link. Select Add an artifact. Set the source type to Build, then select the build pipeline we created in the previous step (in our case, "databricks-CI") as the Source. Select Add to apply the changes.
Step14: View the tasks for Stage 1 by selecting the 1 job, 0 task link.
Step15: Select the + link on Agent job to add a task. Search "Databricks", then add Databricks Deploy Notebooks.
Note (very important): We first have to install the "Databricks Script Deployment Task by Data Thirst" extension; only then will the displayed Databricks tasks become available. This package is provided by a third party. Just click on "Databricks Script Deployment Task by Data Thirst" & follow along; it will get installed.
Then refresh the page & start creating the Agent job again.
Step16: Fill in the following for the Databricks Deploy Notebooks task.
- Azure Region: Enter the region of your production (PROD) Azure Databricks workspace, which we obtained from the URL in a previous step (i.e. adb-59...).
- Source files path: Browse to and select the subfolder that contains the notebook.
- Target files path: Enter /Shared.
- Databricks bearer token: Paste the Azure Databricks access token we copied in an earlier step (it looks like dapi37742f03cd5765aac927868d5cde0370).
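Under the hood, the fields above essentially feed the Databricks Workspace import API. A rough Python equivalent, purely for illustration - the host, token and file name are placeholders, and this is not the extension's actual code:

```python
# Import a local notebook source file into the PROD workspace under /Shared,
# roughly what the "Databricks Deploy Notebooks" task does per notebook.
import base64
import requests

host = "https://adb-59xxxxxxxxxxxxxxx.x.azuredatabricks.net"  # PROD workspace URL
token = "dapiXXXXXXXXXXXXXXXXXXXXXXXXXXXX"                    # bearer token

with open("Covid19_SQLServerLoad.py", "rb") as f:
    content = base64.b64encode(f.read()).decode("utf-8")

resp = requests.post(
    f"{host}/api/2.0/workspace/import",
    headers={"Authorization": f"Bearer {token}"},
    json={
        "path": "/Shared/Covid19_SQLServerLoad",  # matches the /Shared target path above
        "format": "SOURCE",
        "language": "PYTHON",
        "content": content,
        "overwrite": True,                        # redeploys replace the existing notebook
    },
)
resp.raise_for_status()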
Step17: Navigate to the Pipeline tab. Select the Continuous deployment trigger on the artifact, then Enable the continuous deployment trigger.
This will create a release every time a new build is available.
Step18: Select Save to save your pipeline changes. Finally, create a release by selecting Create release at the top of the pipeline blade. When the Create a new release form displays, select Create.
Step19: Navigate back to Releases under the Pipelines section of the left-hand menu. Select the release you just created. When it opens, you should see that it is either in progress or completed. (I have captured both.)
Navigate back to your production (PROD) Azure Databricks workspace. If it is already open, refresh the page. Navigate to your "Shared" folder under the workspace. You should see your notebook. This was saved to our workspace by our release pipeline.
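If you want to confirm the deployment without opening the UI, listing the /Shared folder over the REST API works too (same placeholder host & token as in the earlier sketches):

```python
# List the contents of /Shared in the PROD workspace to confirm the notebook landed.
import requests

resp = requests.get(
    "https://adb-59xxxxxxxxxxxxxxx.x.azuredatabricks.net/api/2.0/workspace/list",
    headers={"Authorization": "Bearer dapiXXXXXXXXXXXXXXXXXXXXXXXXXXXX"},
    params={"path": "/Shared"},
)
resp.raise_for_status()
for obj in resp.json().get("objects", []):
    print(obj["object_type"], obj["path"])
```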
CI/CD setup is now completed. If we commit our code from the DEV workspace to the repo (main branch), the same notebook should be available in PROD.
Experiment with making changes to your notebook in DEV, then committing those changes. You will be able to see your build and release pipelines execute and the notebook in the PROD workspace automatically update to reflect those changes.
Don't believe me? Let's try it.
I am going to make a change to my notebook in the "devdatabricks" workspace. I am just going to add one print statement: "Today is Friday".
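The whole change is this single line at the end of the notebook:

```python
print("Today is Friday")
```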
(Note the workspace URL so that you can distinguish between DEV & PROD.)
Navigate to the Release pipeline in Azure DevOps. You will see a new release (Release-2) being created automatically.
The release succeeds after a few seconds.
Navigate to "proddatabricks" workspace & check the Notebook for the change.
Amazing, isn't it? :) It's an awesome feeling when the work completes seamlessly.
That's it, we are done. It was quite a task to set up the CI/CD process, but once it is done, it works like a charm. Step by step, we have managed to accomplish quite a lengthy task.
You can also add another workspace in between DEV & PROD, something like a STAGE. From DEV to STAGE you can have an automated process, & from STAGE to PROD a manual process via human approval. Try this on your own & let me know your feedback if the process works for you. Good luck!!
This marks the end of this article. I hope I was able to give you something new to learn. Thanks for reading; please provide your feedback in the comments section, and please like & share if you liked the content.
Thanks!! Happy Weekend, Happy Learning!!