Change management is crucial (Databricks version)

My last project was a data platform implemented using Databricks. As is standard in a data project, we were ingesting data from sources into the Bronze layer, applying transformations and then storing the data into the Silver layer, followed finally by the Gold layer where we created tables that were consumed by various reports.

Initially, as part of our design and implementation, we built the platform on an all-purpose cluster, as provided by Databricks. We had also implemented a job audit and orchestration framework. For many months, things were going great.

Then the cost monster reared its head. We were informed that we needed to move our computation over to a job cluster instead of the all-purpose cluster. This was because the cost of running a job cluster is about one fifth that of an equivalent all-purpose cluster. This is a very motivating reason - more dollars in the kitty for the customer.
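
For reference, the switch largely boils down to how the compute is declared in the job definition. Below is a minimal sketch in Databricks Jobs API 2.1 style, expressed as Python dicts; the cluster ID, node type and sizes are illustrative, not our actual values.

```python
# Before: the task ran on a long-running all-purpose cluster, referenced
# by its ID (billed at the higher all-purpose rate).
task_on_all_purpose = {
    "task_key": "bronze_to_silver",
    "notebook_task": {"notebook_path": "/pipelines/bronze_to_silver"},
    "existing_cluster_id": "0101-123456-abcdefgh",  # all-purpose cluster
}

# After: the task declares its own cluster, which Databricks provisions
# for the run and terminates afterwards (billed at the cheaper jobs rate).
task_on_job_cluster = {
    "task_key": "bronze_to_silver",
    "notebook_task": {"notebook_path": "/pipelines/bronze_to_silver"},
    "new_cluster": {
        "spark_version": "13.3.x-scala2.12",  # illustrative values
        "node_type_id": "Standard_DS3_v2",
        "num_workers": 2,
    },
}
```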

We moved one of the smaller pipelines over to a job cluster. We faced a few challenges in terms of permissions and some configs, but all else was good. Or so we thought.

When we looked closely at the behaviour of the job, we found that our job-cluster-based pipeline was spending a lot of time in cluster startup.

And this is one of the biggest differences between an all-purpose cluster and a job cluster. An all-purpose cluster keeps running until it is turned off or terminates on a condition (such as an idle timeout). This allows us to execute multiple jobs one after another on the same all-purpose cluster. That is not the case with a job cluster: a new cluster is provisioned for every job run and terminated when the run completes. We observed that this launch time was around seven minutes. Hence, a job that ran for five minutes on the all-purpose cluster took 12 minutes on the job cluster. This startup time may not matter much if the jobs run in parallel, as each pays the cost only once. But if we have many sequential jobs, the startup time adds up and inflates the total execution time.
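
A back-of-the-envelope model makes the effect obvious. The sketch below uses the seven-minute startup and five-minute run time we observed; the job counts are purely illustrative.

```python
# Each sequential job on a job cluster pays the startup cost again; on an
# all-purpose cluster that is already up, only the processing time counts.
STARTUP_MIN = 7   # observed job-cluster launch time (minutes)
RUN_MIN = 5       # actual processing time per job (minutes)

def total_minutes(n_jobs: int, startup_per_job: float, run: float) -> float:
    """Total wall-clock time for n sequential jobs, each paying startup."""
    return n_jobs * (startup_per_job + run)

for n in (1, 3, 6):
    print(n, "sequential jobs:", total_minutes(n, STARTUP_MIN, RUN_MIN),
          "min on job clusters vs", n * RUN_MIN, "min of pure processing")
# 1 job:  12 vs 5 min; 3 jobs: 36 vs 15 min; 6 jobs: 72 vs 30 min
```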

As per our design, staging to Bronze was one job, Bronze to Silver was one job and Silver to Gold was one job. For a certain functional area of the application, we had multiple such jobs orchestrated by a single 'master' / 'controller' job, as they had to be executed in a well-defined order - some in parallel, some strictly one after another. When these jobs were executed on the all-purpose cluster, there was no delay. When we changed over to the job cluster, we found an increase in execution time.

After much debate and experimentation, we made a hard decision. We decided to change the master job. Instead of it executing other jobs, we decided to have it implement the tasks that the underlying jobs were executing. Let us consider an example. Assume that our staging to Bronze job had three tasks, the Bronze to Silver job had two tasks and the Silver to Gold job had one task. The master job had three tasks - one triggering staging to Bronze, one triggering Bronze to Silver and one triggering Silver to Gold. After the change, the master job had six tasks. Because all six tasks now ran inside a single job run, they could share one job cluster, and the startup cost was paid only once.
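
Here is a sketch of what the restructured master job could look like, again in Jobs API 2.1 style. Task keys, notebook paths and cluster sizing are hypothetical; the shape is the point - six tasks sharing one job cluster via job_cluster_key, with depends_on expressing the parallel/sequential order.

```python
master_job = {
    "name": "master_pipeline",
    # One job cluster definition, shared by every task in this run, so the
    # ~7 minute startup is incurred once instead of once per child job.
    "job_clusters": [{
        "job_cluster_key": "shared",
        "new_cluster": {
            "spark_version": "13.3.x-scala2.12",
            "node_type_id": "Standard_DS3_v2",
            "num_workers": 4,
        },
    }],
    "tasks": [
        # The three staging-to-Bronze tasks run in parallel.
        {"task_key": "stg_to_bronze_1", "job_cluster_key": "shared",
         "notebook_task": {"notebook_path": "/pipelines/stg_to_bronze_1"}},
        {"task_key": "stg_to_bronze_2", "job_cluster_key": "shared",
         "notebook_task": {"notebook_path": "/pipelines/stg_to_bronze_2"}},
        {"task_key": "stg_to_bronze_3", "job_cluster_key": "shared",
         "notebook_task": {"notebook_path": "/pipelines/stg_to_bronze_3"}},
        # Bronze-to-Silver tasks wait for all staging tasks, then run in parallel.
        {"task_key": "bronze_to_silver_1", "job_cluster_key": "shared",
         "depends_on": [{"task_key": "stg_to_bronze_1"},
                        {"task_key": "stg_to_bronze_2"},
                        {"task_key": "stg_to_bronze_3"}],
         "notebook_task": {"notebook_path": "/pipelines/bronze_to_silver_1"}},
        {"task_key": "bronze_to_silver_2", "job_cluster_key": "shared",
         "depends_on": [{"task_key": "stg_to_bronze_1"},
                        {"task_key": "stg_to_bronze_2"},
                        {"task_key": "stg_to_bronze_3"}],
         "notebook_task": {"notebook_path": "/pipelines/bronze_to_silver_2"}},
        # Silver-to-Gold runs last, sequentially after Bronze-to-Silver.
        {"task_key": "silver_to_gold", "job_cluster_key": "shared",
         "depends_on": [{"task_key": "bronze_to_silver_1"},
                        {"task_key": "bronze_to_silver_2"}],
         "notebook_task": {"notebook_path": "/pipelines/silver_to_gold"}},
    ],
}
```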

After making the change, we executed the new master job and were happy to note a significant gain in performance. Earlier, the master job was taking 45 minutes for execution. After the change, it took 25 minutes.

Now comes the change management part. For around six months, the support team was accustomed to the master job executing other jobs. After this change, that was no longer the case: the master job did not execute other jobs. As a backup, we decided to retain the 'child' jobs for a few days / weeks. I knew that we had to prevent accidental execution of the 'child' jobs, so I disabled them by changing one of the widget parameters. With the parameter renamed, a triggered job would throw an error in its first step and no harm would be done to the data.
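
For illustration, here is a minimal sketch of how such a guard can work, assuming the child notebook reads a required widget in its very first cell. The widget name is hypothetical, and dbutils is the utility object available inside Databricks notebooks.

```python
# First cell of a child notebook. "pipeline_run_mode" is an illustrative
# widget name, not the project's actual one.
try:
    run_mode = dbutils.widgets.get("pipeline_run_mode")  # old, expected name
except Exception as exc:
    # After renaming the parameter in the job configuration, this lookup
    # fails, so the run aborts here -- before any table is read or written.
    raise RuntimeError(
        "Child job is disabled; use the master job instead"
    ) from exc
```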

Was this even needed? Did we not conduct knowledge sharing sessions with the support team? Yes on both counts. Disabling was necessary.

And it paid off. One fine day, I saw an email stating that the Bronze to Silver job had failed. I logged into the environment and found that the master job had failed and the support team had performed a 'repair run' on it, which this time executed successfully. But someone had also executed the standalone version of the Bronze to Silver job. As that job had been disabled, it failed with an error, and that is when the team realized they had triggered the wrong job. As the job failed in its first step, it did no harm to the data.

As must be evident by now, people get used to a certain way of working with a system. When the system changes, updating their 'muscle memory' takes time. People have to make a conscious effort to remember that the sequence of steps has changed, and this can lead to mistakes - not remembering in time that the operating method has changed.

As system designers, it is our duty to ensure that people are properly educated when a change is implemented. That is why change management is important. Additionally, when we introduce a change, we need to design it so that the system comes to no harm if a user performs the older sequence of steps. Alternatively, we should identify steps that allow us to recover as early and reliably as possible. I also know that this is easier said than done.

#databricks #change_management #job

