Pfff -- why are you spending time to save 16sec of execution time?
In my current project, we are implementing a data processing and reporting application on Databricks. All the code was developed as notebooks - even reusable elements and components. Where required, we include a notebook in the calling notebook (when the caller needs the functionality defined inside it) or execute it as a separate run (when the reusable element is a self-contained step).
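For readers unfamiliar with Databricks, the two patterns look roughly like the sketch below - presumably the %run magic for inclusion and dbutils.notebook.run() for execution. The paths and arguments here are hypothetical:

```python
# Pattern 1: "include" - the %run magic executes the target notebook inline,
# so its functions and variables become available in the calling notebook.
# (%run must be the only command in its cell.)
# %run ./utils/audit_lib

# Pattern 2: "execute" - dbutils.notebook.run() launches the target notebook
# as a separate, ephemeral run and returns its exit value as a string.
# (dbutils is available implicitly inside Databricks notebooks.)
result = dbutils.notebook.run(
    "./components/write_audit_entry",  # hypothetical path
    60,                                # timeout in seconds
    {"step": "load_customers", "status": "STARTED"},
)
```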
As we were in the development phase and delivering functionality was the priority, I focused on the deliveries. But I was a bit uncomfortable with some of the constructs used - or rather, with the way they were used.
One of these constructs added entries to an audit table. After delivering many modules, I finally decided to take a closer look at this functionality. I found that the component was executed as a notebook, and each execution took around 17sec to 18sec. After studying the component and building an alternate implementation, I managed to bypass the execution of the component while retaining its behaviour and output. When I implemented and ran the change, I saved 16sec per call: a step that took 20sec to execute was now taking 4sec.
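The post does not show the code, but the shape of the fix is easy to sketch. Assuming the audit component was launched with dbutils.notebook.run() for every entry (paying notebook start-up cost each time), the bypass amounts to defining (or %run-including) the same logic once and calling a plain function in-process, so only the actual table write remains. The function name, table name, and schema below are all hypothetical:

```python
from datetime import datetime, timezone

# Before (per audit entry, hypothetical):
#   dbutils.notebook.run("./components/write_audit_entry", 60,
#                        {"step": step, "status": status})
# Every call paid the overhead of starting a fresh notebook run.

# After: define (or %run-include) the audit logic once, then call it in-process.
# (spark is available implicitly inside Databricks notebooks.)
def log_audit_entry(step: str, status: str, table: str = "ops.audit_log") -> None:
    """Append a single audit row to a Delta table; only the write remains."""
    row = [(datetime.now(timezone.utc), step, status)]
    (spark.createDataFrame(row, "event_ts timestamp, step string, status string")
          .write.mode("append")
          .saveAsTable(table))

log_audit_entry("load_customers", "OK")
```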
At this point, I can read your mind . . .
I am sure you are thinking, 'You managed to save only 16sec?' Yes, only 16sec - but that is for one step. If a notebook has multiple such steps, we save 16sec on each of them. When I applied this change to one job that had three tasks, the execution time of the job dropped from around 7min to around 4min.
Once again, you may think - only 3min or 4min? Yes, but in one job. And we have two score (40) jobs, each running four times a day. If you believe a saving of roughly 4min is not much, let us do some math. 4min saved per run means 16min per day, because the job is executed four times a day. 16min per day translates to 112min - nearly 2hr - per week. Over a year, 16min per day adds up to 5840min, which is about 97hr, or roughly 4 full days. And these numbers are for one job.
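The arithmetic is easy to check; a few lines of Python reproduce the figures above:

```python
# Back-of-the-envelope savings for one job (figures as given in the text).
saving_per_run = 4                            # minutes saved per run
runs_per_day = 4

per_day = saving_per_run * runs_per_day       # 16 min/day
per_week = per_day * 7                        # 112 min ~ 1.9 hr/week
per_year = per_day * 365                      # 5840 min/year
print(per_day, per_week, per_year / 60, per_year / (60 * 24))
# -> 16 112 97.3 4.06: about 97 hr, or 4 full days, per year, per job
```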
In another example, a notebook with 50 steps (of course, not all of them wrote audit entries) took 18min to 20min to execute with the earlier method. Once the updated code was integrated, the execution time came down to 8min to 10min. That is a saving of 8min to 10min per run, and this job too is executed four times a day.
Finally, all this 'non-execution' in the Databricks environment means that processes finish earlier and use less compute time. That translates into hard cash for the customer.
Moral of the story - Check the performance of your application. Even small changes can pay rich dividends.
#tweaking #performance #tuning #time_saving
Bipin Patwardhan, that is hard cash for the client when you have job clusters and are able to finish early; if they are normal (all-purpose) clusters, you are paying anyway. The effort is super, but it pays off in scenarios with job clusters or serverless SQL warehouses. I have seen the cost of spinning job clusters up and down become overwhelming compared to maintaining a standby cluster when workloads are big and unpredictable - sometimes it takes away our ability to deliver API output on time. The audit table should somehow be written as a bulk batch into Delta Lake on S3/ADLS instead of a costly row-by-row insert.
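The commenter's last point is worth illustrating. Here is a minimal sketch of batched auditing, assuming a Databricks notebook where audit rows are buffered on the driver and appended to a Delta table once per run; all names and the schema are hypothetical:

```python
from datetime import datetime, timezone

audit_buffer = []  # accumulate audit rows in memory instead of row-by-row inserts

def audit(step: str, status: str) -> None:
    """Record one audit event in the buffer; no table write yet."""
    audit_buffer.append((datetime.now(timezone.utc), step, status))

def flush_audit(table: str = "ops.audit_log") -> None:
    """Write all buffered events as one bulk append to the Delta table."""
    # (spark is available implicitly inside Databricks notebooks.)
    if audit_buffer:
        (spark.createDataFrame(
            audit_buffer, "event_ts timestamp, step string, status string")
            .write.mode("append")
            .saveAsTable(table))
        audit_buffer.clear()

# Usage: call audit() after each step, and flush_audit() once at the end of the job.
```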