Only 3 min saved?
Bipin Patwardhan
Solution Architect, Solution Creator, Cloud, Big Data, TOGAF 9
In our project, we have a few (around six) aggregated tables that we truncate and load each day. For this discussion, one table is enough. The aggregation for the table in question - call it X - was taking 11 minutes to run, on data going back to 2019.
Looking at the aggregation, I asked the question: do we really need to aggregate the whole table each day? If the historical data was not being updated, aggregating it over and over again would produce the same result every time.
To test this conjecture, I asked a team member to take the existing aggregated data, drop the rows for the current year, and keep the rest handy. We then ran the aggregation only on the current year's data and appended the result to the previously aggregated data. I had high expectations for this approach.
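To make the idea concrete, here is a minimal PySpark sketch of the incremental pattern. The table and column names (raw_events, agg_x, event_date, key, amount) and the yearly window are assumptions for illustration only; the actual tables and aggregation logic in our pipeline are different, and the monthly variant described below is the same pattern with the filter and grouping switched from year to month.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("incremental_aggregation").getOrCreate()

# Hypothetical names for illustration: a raw fact table raw_events(event_date, key, amount)
# and an aggregate table agg_x(year, key, total_amount) built from it.
CURRENT_YEAR = 2024  # in practice derived from the run date

raw = spark.table("raw_events")
existing_agg = spark.table("agg_x")

# Keep the already-aggregated prior years untouched.
prior_years = existing_agg.filter(F.col("year") < CURRENT_YEAR)

# Re-aggregate only the current year's raw data.
current_year = (
    raw.filter(F.year("event_date") == CURRENT_YEAR)
       .groupBy(F.year("event_date").alias("year"), "key")
       .agg(F.sum("amount").alias("total_amount"))
)

# Append the fresh slice to the untouched history and write to a new table,
# so the source aggregate is not read and overwritten in the same job.
prior_years.unionByName(current_year).write.mode("overwrite").saveAsTable("agg_x_new")
```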
When we executed this method, it took eight minutes. So we saved three minutes. Not much!!!
So we decided to narrow the window further, from the current year to the current month. This time, the query took seven minutes!!! What luck.
Before you scoff at the time saved, let me paint the larger picture. Three minutes saved on each daily run adds up to 90 minutes a month, or roughly 18 hours a year. How's that for time saved?
And we have five to six such aggregations in the pipeline. Even at three minutes per query, that is at least 15 minutes saved per pipeline run per day.
So a penny saved is a penny earned. And how's that for sustainability?
#queryoptimization #optimization #aggregation #sustainability
Senior Data Engineer | S3 | Glue | CloudFormation | Python | Spark | AWS | Airflow | EMR | Snowflake | Certified Solutions Architect Associate | Think beyond the code, think about the solution | I use AI regularly
1 yr
Not sure what you're trying to address. Do truncate-and-load tables still exist in this era? It might be that the client does not know; that's a very bad approach. You could have a view and expose only the latest data, and moreover I don't know what type of aggregations you are running. If you design your target table properly, with a few extra columns like year_start, year_end, and cur_aggr_flg, this could improve your overall solution: aggregate monthly and save it. Once you reach 12 months, update the previous year_end to the current year along with the aggregated values, and delete the other rows or move them to a historical table, so that you can reconcile at any point if needed.
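For readers trying to picture the table design suggested in this comment, here is a rough sketch. The table name, column types, and storage format are assumptions taken loosely from the comment; the exact design would depend on the actual aggregation.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("agg_table_design").getOrCreate()

# Hypothetical target table along the lines of the suggestion above:
# monthly aggregates carry period columns and a current-aggregate flag, so
# closed periods stay immutable and only the open period is rewritten.
spark.sql("""
    CREATE TABLE IF NOT EXISTS agg_x_monthly (
        key          STRING,
        year_start   INT,      -- first year covered by this row
        year_end     INT,      -- last year covered (updated on yearly rollup)
        month        INT,      -- month of the aggregate; NULL after rollup to a full year
        total_amount DOUBLE,
        cur_aggr_flg BOOLEAN   -- TRUE only for the period still being aggregated
    )
    USING parquet
""")
```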
Architect (AWS/ML/Data)
1 yr
Add to that: when partitioning on year, in your improved approach the query will still take the same time (assuming each year carries roughly the same volume of data), whereas in the previous case the execution time would keep growing. So your solution guarantees no performance degradation; it is inflation-proof. Kudos. Alas, a lot of people don't understand this kind of impact.