Pfff -- why are you spending time to save 16 seconds of execution time?

In my current project, we are implementing a data processing and reporting application using Databricks. All the code was developed in the form of notebooks; even reusable elements and components were developed as notebooks. Where required, we include a notebook (when the functionality inside it is needed in the calling notebook) or execute it (when it has been implemented as a reusable, executable element).

As we were in the development phase and it was important to deliver functionality, I focused on the deliveries. But I was a bit uncomfortable with some of the constructs used, or rather, with the way they were used.

One of these constructs was used to add entries to an audit table. After delivering many modules, I finally decided to take a closer look at this functionality. I found that each time the component was executed, it took around 17 to 18 seconds. After studying the component and building an alternate implementation, I managed to bypass its execution entirely while retaining its behaviour and output. After implementing the change, I found that I had saved 16 seconds of execution time: a step that used to take 20 seconds was now taking 4.
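The article does not show the code, but a common Databricks pattern fits this description: launching a child notebook for every audit entry (for example via `dbutils.notebook.run`) pays a fixed multi-second launch cost on each call, while the same logic inlined as an ordinary function costs almost nothing. A minimal plain-Python sketch of that refactor, with hypothetical names and the launch overhead simulated by a short sleep:

```python
import time

# Hypothetical names; the article does not show its code. This sketch
# contrasts the two call patterns, simulating the fixed overhead that
# launching a child notebook adds on every invocation.

NOTEBOOK_LAUNCH_OVERHEAD_S = 0.2  # stand-in for the ~16 s seen in Databricks

audit_log = []  # stand-in for the Delta audit table

def run_audit_notebook(entry):
    """Old approach: one child-notebook execution per audit entry."""
    time.sleep(NOTEBOOK_LAUNCH_OVERHEAD_S)  # fixed launch cost, paid every call
    audit_log.append(entry)

def write_audit_entry(entry):
    """New approach: the same logic inlined as a plain function call."""
    audit_log.append(entry)  # no launch cost

def timed(fn, entry):
    start = time.perf_counter()
    fn(entry)
    return time.perf_counter() - start

slow = timed(run_audit_notebook, {"step": "load", "status": "OK"})
fast = timed(write_audit_entry, {"step": "transform", "status": "OK"})
print(f"per-entry cost: old={slow:.2f}s  new={fast:.4f}s")
```

In a real notebook the slow path would be the child-notebook call and the fast path a direct append to the audit table; the shape of the saving is the same either way, because a fixed per-call overhead disappears.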

At this point, I can read your mind . . .

I am sure you are thinking, 'You managed to save only 16 seconds?' Yes, I managed to save only 16 seconds. But that is for one step. If a notebook has multiple such steps, we save 16 seconds on each of them. When I applied this change to one job that had three tasks, I reduced its execution time from around 7 minutes to 4 minutes, a saving of roughly 4 minutes per run.

Once again, you may think: only 4 minutes? Yes, 4 minutes, but in one job. And we have two score (40) jobs, each running four times a day. If we believe a 4-minute saving is not much, let us do some math. Saving 4 minutes per run of one job means 16 minutes per day, because the job is executed four times a day. That is 112 minutes per week, just under 2 hours. Over a year, it is 5840 minutes, which is about 97 hours, or roughly 4 full days. And these numbers are for one job.
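The arithmetic above can be checked in a few lines:

```python
# Verify the savings arithmetic for a single job.
saving_per_run_min = 4   # minutes saved per execution
runs_per_day = 4

per_day = saving_per_run_min * runs_per_day   # minutes saved per day
per_week = per_day * 7                        # minutes saved per week
per_year_min = per_day * 365                  # minutes saved per year
per_year_hr = per_year_min / 60               # hours saved per year

print(per_day, per_week, per_year_min, round(per_year_hr, 1))
# → 16 112 5840 97.3
```

Note that 97 hours is about 4 full 24-hour days, which matches the figure in the text.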

In another example, a notebook with 50 steps (of course, not all of them had audit capability) took 18 to 20 minutes to execute using the earlier method. When the updated code was integrated, the execution time came down to 8 to 10 minutes. For this notebook, we save 8 to 10 minutes on each run, and the job is executed four times a day.

Finally, all this 'non-execution' in the Databricks environment means that processes finish earlier and less compute time is used. That translates to hard cash saved for the customer.

Moral of the story - check the performance of your application. Even small changes can pay rich dividends.

#tweaking #performance #tuning #time_saving

Koustav Ghosh

Machine Learning/Data Engineering/AWS

2 months

Bipin Patwardhan that is hard cash for the client when you have job clusters and you are able to finish early; if it is normal clusters, then you are paying anyway. The effort is super, but it will yield when we have a scenario of job clusters or serverless SQL warehouses. I have seen the cost of spinning job clusters up and down become overwhelming compared to maintaining a standby cluster when workloads are big and unpredictable. Sometimes it takes away our ability to give API output on time. The audit table should somehow be written as a bulk batch in Delta Lake, using local S3/ADLS, instead of a costly row-by-row insert.
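The batching the commenter suggests can be sketched without Databricks. The idea is to buffer audit rows in memory and write them as one batch (in practice, a single Delta append) instead of issuing one insert per row. All names below are hypothetical, and the 'table' is a plain list so the sketch runs anywhere:

```python
class AuditBuffer:
    """Collect audit rows and write them to the sink in one batch.

    In a real pipeline, flush() would be a single Delta append; here
    the sink is a plain list so the sketch stays self-contained.
    """

    def __init__(self, sink):
        self.sink = sink   # stand-in for the audit table
        self.rows = []

    def add(self, step, status):
        # Buffering is cheap: no I/O happens here.
        self.rows.append({"step": step, "status": status})

    def flush(self):
        # One bulk write instead of len(self.rows) individual inserts.
        self.sink.extend(self.rows)
        written = len(self.rows)
        self.rows.clear()
        return written

table = []
buf = AuditBuffer(table)
for step in ("ingest", "transform", "publish"):
    buf.add(step, "OK")
print(buf.flush())  # → 3 (three rows written in a single batch)
```

The per-row cost of the old approach is replaced by one fixed write cost per flush, which is exactly why row-by-row audit inserts tend to dominate small jobs.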



