Pfff -- why are you spending time to save 16 seconds of execution time?

In my current project, we are implementing a data processing and reporting application using Databricks. All the code was developed in the form of notebooks; even reusable elements and components were developed as notebooks. Where required, we include a notebook (when the functionality inside it is needed in the calling notebook) or execute it (when it has been implemented as a reusable, executable element).

As we were in the development phase and it was important to deliver functionality, I focused on the deliveries. But I was a bit uncomfortable with some of the constructs used, or rather, with the way they were used.

One of these constructs was used to add entries to an audit table. After delivering many modules, I finally decided to take a closer look at this functionality. I found that each time the component was executed, it took around 17 to 18 seconds. After studying the component and building an alternate implementation, I managed to bypass its execution entirely while retaining its behaviour and output. After implementing the change, I found that I had saved 16 seconds of execution time: a step that used to take 20 seconds was now taking 4.
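The article does not show the code, but a common Databricks pattern fits this description: launching a child notebook for every audit entry (for example via `dbutils.notebook.run`) pays a fixed multi-second launch cost on each call, while the same logic inlined as an ordinary function costs almost nothing. A minimal plain-Python sketch of that refactor, with hypothetical names and the launch overhead simulated by a short sleep:

```python
import time

# Hypothetical names; the article does not show its code. This sketch
# contrasts the two call patterns, simulating the fixed overhead that
# launching a child notebook adds on every invocation.

NOTEBOOK_LAUNCH_OVERHEAD_S = 0.2  # stand-in for the ~16 s seen in Databricks

audit_log = []  # stand-in for the Delta audit table

def run_audit_notebook(entry):
    """Old approach: one child-notebook execution per audit entry."""
    time.sleep(NOTEBOOK_LAUNCH_OVERHEAD_S)  # fixed launch cost, paid every call
    audit_log.append(entry)

def write_audit_entry(entry):
    """New approach: the same logic inlined as a plain function call."""
    audit_log.append(entry)  # no launch cost

def timed(fn, entry):
    start = time.perf_counter()
    fn(entry)
    return time.perf_counter() - start

slow = timed(run_audit_notebook, {"step": "load", "status": "OK"})
fast = timed(write_audit_entry, {"step": "transform", "status": "OK"})
print(f"per-entry cost: old={slow:.2f}s  new={fast:.4f}s")
```

In a real notebook the slow path would be the child-notebook call and the fast path a direct append to the audit table; the shape of the saving is the same either way, because a fixed per-call overhead disappears.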

At this point, I can read your mind . . .

I am sure you are thinking, 'You managed to save only 16 seconds?' Yes, I managed to save only 16 seconds. But that is for one step. If a notebook has multiple such steps, we save 16 seconds on each of them. When I applied this change to one job that had three tasks, I reduced its execution time from around 7 minutes to 4 minutes, a saving of roughly 4 minutes per run.

Once again, you may think: only 4 minutes? Yes, 4 minutes, but in one job. And we have two score (40) jobs, each running four times a day. If we believe a 4-minute saving is not much, let us do some math. Saving 4 minutes per run of one job means 16 minutes per day, because the job is executed four times a day. That is 112 minutes per week, just under 2 hours. Over a year, it is 5840 minutes, which is about 97 hours, or roughly 4 full days. And these numbers are for one job.
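The arithmetic above can be checked in a few lines:

```python
# Verify the savings arithmetic for a single job.
saving_per_run_min = 4   # minutes saved per execution
runs_per_day = 4

per_day = saving_per_run_min * runs_per_day   # minutes saved per day
per_week = per_day * 7                        # minutes saved per week
per_year_min = per_day * 365                  # minutes saved per year
per_year_hr = per_year_min / 60               # hours saved per year

print(per_day, per_week, per_year_min, round(per_year_hr, 1))
# → 16 112 5840 97.3
```

Note that 97 hours is about 4 full 24-hour days, which matches the figure in the text.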

In another example, a notebook with 50 steps (of course, not all of them had audit capability) took 18 to 20 minutes to execute using the earlier method. When the updated code was integrated, the execution time came down to 8 to 10 minutes. For this notebook, we save 8 to 10 minutes on each run, and the job is executed four times a day.

Finally, all this 'non-execution' in the Databricks environment means that processes finish earlier and less compute time is used. That translates to hard cash saved for the customer.

Moral of the story - check the performance of your application. Even small changes can pay rich dividends.

#tweaking #performance #tuning #time_saving

Koustav Ghosh

Machine Learning/Data Engineering/AWS

2 months

Bipin Patwardhan that is hard cash for the client when you have job clusters and you are able to finish early; if it is normal clusters, then you are paying anyway. The effort is super, but it will yield when we have a scenario of job clusters or serverless SQL warehouses. I have seen the cost of spinning job clusters up and down become overwhelming compared to maintaining a standby cluster when workloads are big and unpredictable. Sometimes it takes away our ability to give API output on time. The audit table should somehow be written as a bulk batch in Delta Lake, using local S3/ADLS, instead of a costly row-by-row insert.
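The batching the commenter suggests can be sketched without Databricks. The idea is to buffer audit rows in memory and write them as one batch (in practice, a single Delta append) instead of issuing one insert per row. All names below are hypothetical, and the 'table' is a plain list so the sketch runs anywhere:

```python
class AuditBuffer:
    """Collect audit rows and write them to the sink in one batch.

    In a real pipeline, flush() would be a single Delta append; here
    the sink is a plain list so the sketch stays self-contained.
    """

    def __init__(self, sink):
        self.sink = sink   # stand-in for the audit table
        self.rows = []

    def add(self, step, status):
        # Buffering is cheap: no I/O happens here.
        self.rows.append({"step": step, "status": status})

    def flush(self):
        # One bulk write instead of len(self.rows) individual inserts.
        self.sink.extend(self.rows)
        written = len(self.rows)
        self.rows.clear()
        return written

table = []
buf = AuditBuffer(table)
for step in ("ingest", "transform", "publish"):
    buf.add(step, "OK")
print(buf.flush())  # → 3 (three rows written in a single batch)
```

The per-row cost of the old approach is replaced by one fixed write cost per flush, which is exactly why row-by-row audit inserts tend to dominate small jobs.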



