Databricks: Enabling safety in utility jobs
I am working on a project where we are using Databricks on the AWS platform. It is a standard data engineering project where we load data into the bronze layer, then into the silver and gold layers. These activities are implemented using well-defined notebooks and jobs. This is Business As Usual (BAU).
We also need to perform ad-hoc activities like creating tables, adding columns, taking backups and more. These activities are performed when deploying new requirements, as part of change requests, or during routine maintenance on the platform.
To make execution of such tasks well-defined, I created empty notebooks and corresponding jobs. When we have to perform an ad-hoc activity, the team edits the notebook and executes the corresponding job. Job done.
You might be wondering: why define a job for this purpose? Why not define notebooks and execute them with suitable permissions? Two reasons. The first is permissions. We have ensured that the underlying schema can be modified only with service principal (SP) permissions, and we have configured the job to run as the SP so that it has the necessary rights. With this approach, we do not need to grant permissions for individual notebooks. The second reason is more important: visibility. By requiring that the job, and not the notebook, is executed, we improve visibility and can track what was executed. How? When Databricks runs a job, we can open the run details, which display the notebook that was executed, effectively preserving the code that was run. When in doubt, we can open the run in question and check what was executed. If direct notebook execution is allowed (as we typically permit on Dev), we cannot go back in time and check what was executed, because Databricks displays information only for the last execution.
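As an illustration, here is a minimal sketch of how such a job could be created through the Jobs API 2.1, configured to run as the service principal. The workspace URL, token, SP application ID, notebook path and cluster ID are placeholders for illustration, not values from the project:

```python
# Sketch: create a utility job that runs as the service principal.
# All identifiers below are placeholders.
import requests

payload = {
    "name": "utility-ad-hoc",
    # 'run_as' makes the job execute with the SP's permissions,
    # so individual notebooks need no extra grants.
    "run_as": {"service_principal_name": "<sp-application-id>"},
    "tasks": [
        {
            "task_key": "run_utility_notebook",
            "notebook_task": {
                "notebook_path": "/Shared/utility/ad_hoc",   # the empty 'do nothing' notebook
                "base_parameters": {"enabled": "no"},        # guard-rail flag, described later
            },
            "existing_cluster_id": "<cluster-id>",
        }
    ],
}

resp = requests.post(
    "https://<workspace-url>/api/2.1/jobs/create",
    headers={"Authorization": "Bearer <token>"},
    json=payload,
)
resp.raise_for_status()
print("Created job:", resp.json()["job_id"])
```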
While we can check what a job executed, how can we ensure that the ad-hoc job executes only once? What I mean is this: in most situations, unless controlled properly by the people in charge, the ad-hoc code will be added to the notebook (or a new notebook will be attached to the job), executed, and then forgotten.
'With great power comes great responsibility'.
When we create structures that allow a team to execute ad-hoc code on the Production environment, the team has to execute them with proper care. This means that after the ad-hoc code has been executed, we have to ensure that it is removed or commented out, so that inadvertent execution of the job does not corrupt the environment. We have to follow the principle of idempotency: running the job again must not change the system any further. By default, the job used for ad-hoc execution points to an empty 'do nothing' notebook. After we edit the notebook (or attach a different notebook), we have to remember to remove the ad-hoc code or re-attach the original notebook. By doing this, we ensure that accidental execution of the job does not run the ad-hoc code all over again, which could corrupt existing structures and data.
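On top of that, the ad-hoc code itself can be written defensively so that a repeat run is harmless. A small sketch, with made-up table and column names for illustration:

```python
# Defensive ad-hoc code: safe to run more than once.
# 'silver.customer' and the backup table name are hypothetical examples.

# Taking a backup: CREATE TABLE IF NOT EXISTS makes a second run a no-op.
spark.sql("""
    CREATE TABLE IF NOT EXISTS silver.customer_backup_2024_06
    AS SELECT * FROM silver.customer
""")

# Adding a column: check the schema first instead of failing on a repeat run.
existing_cols = [f.name for f in spark.table("silver.customer").schema.fields]
if "loyalty_tier" not in existing_cols:
    spark.sql("ALTER TABLE silver.customer ADD COLUMNS (loyalty_tier STRING)")
```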
It is difficult to implement guard-rails when new notebooks are attached to the job, as we cannot control the code that the team will include in them. But if we execute ad-hoc jobs through a standard notebook, we can implement guard-rails.
I implemented such guard-rails in the project by using the widget parameters that can be specified for a job. For each job, I defined a widget named 'enabled' with a default value of 'no'. Whenever we wish to perform an ad-hoc execution, we have to change the value from 'no' to 'yes'. How is this a guard-rail? In the first cell of the notebook, we check the value of the widget and continue execution only if the value is 'yes'; otherwise we throw an exception. Once again: how is this a guard-rail? It is not, unless we add a piece of code in the last cell of the notebook.
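Concretely, the first cell might look like the sketch below (the exact error message is illustrative):

```python
# First cell: define the 'enabled' widget with a default of 'no' and
# abort the run unless it has been explicitly switched to 'yes'.
dbutils.widgets.text("enabled", "no")

if dbutils.widgets.get("enabled").strip().lower() != "yes":
    raise Exception(
        "Ad-hoc execution is not enabled. "
        "Set the 'enabled' job parameter to 'yes' and run the job again."
    )
```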
In the last cell, we make use of the Databricks REST API to update the job definition and change the widget value to 'no'.
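A minimal sketch of that last cell, assuming the job's ID is known up front and an API token is stored in a secret scope (the workspace URL, scope, key and job ID below are placeholders). It fetches the current job settings, flips the flag, and writes the settings back:

```python
# Last cell: reset the 'enabled' flag to 'no' via the Jobs REST API.
# WORKSPACE_URL, the secret scope/key, and JOB_ID are placeholders.
import requests

WORKSPACE_URL = "https://<workspace-url>"
TOKEN = dbutils.secrets.get(scope="utility", key="jobs-api-token")
JOB_ID = 123456  # id of this utility job
HEADERS = {"Authorization": f"Bearer {TOKEN}"}

# Fetch the current job settings.
resp = requests.get(
    f"{WORKSPACE_URL}/api/2.1/jobs/get",
    headers=HEADERS,
    params={"job_id": JOB_ID},
)
resp.raise_for_status()
settings = resp.json()["settings"]

# Flip 'enabled' back to 'no' on every notebook task that carries it.
for task in settings.get("tasks", []):
    params = task.get("notebook_task", {}).get("base_parameters", {})
    if "enabled" in params:
        params["enabled"] = "no"

# jobs/reset overwrites the job definition with the modified settings.
resp = requests.post(
    f"{WORKSPACE_URL}/api/2.1/jobs/reset",
    headers=HEADERS,
    json={"job_id": JOB_ID, "new_settings": settings},
)
resp.raise_for_status()
```

Note that jobs/reset replaces the full settings object, which is why the sketch reads the current settings first instead of sending a partial update.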
How does this work? When a team wants to perform an ad-hoc change, they edit the job definition and change 'enabled' to 'yes', then execute the job. The job runs the underlying notebook; the first cell checks the value of the widget, and since the value is 'yes', the remaining cells execute. In the last cell, the value of the widget is changed back to 'no'.
With this guard-rail, if someone forgets to remove the ad-hoc code after execution and the job is run again, it will throw an error because the 'enabled' flag is 'no'. On getting the error, the team will be forced to take a look and will realize the mistake.
This is not a foolproof solution, but at least we have put in place a mechanism that reduces the chances of mistakes.
#databricks #ad_hoc_changes #production #guard_rails #job #notebook #utility