Only 3 min saved?

In our project, we have a handful of aggregate tables (around six) that we truncate and reload each day. For this discussion, one table is enough. The table in question, call it X, took 11 minutes to build, and its data goes back to 2019.

Looking at the aggregation, I asked the question: do we really need to re-aggregate the whole table each day? If the historical data is never updated, aggregating it over and over again just reproduces the same result.

To test this conjecture, I asked a team member to take the existing aggregate, drop the rows for the current year, and keep the rest handy. We then ran the aggregation on the current year's data only and appended the result to the preserved history. I had high expectations for this method.
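
For concreteness, here is a minimal SQL sketch of that pattern, assuming a monthly-grained aggregate and Snowflake/Postgres-style syntax. All names here (sales_agg, sales_raw, txn_date, product_id, amount) are hypothetical stand-ins for illustration, not our actual schema:

```sql
-- Hypothetical schema: sales_agg(agg_month, product_id, total_amount) holds
-- the aggregate; sales_raw(txn_date, product_id, amount) is the source.

-- 1. Drop only the current year's slice from the aggregate,
--    leaving all prior years untouched.
DELETE FROM sales_agg
WHERE agg_month >= DATE_TRUNC('year', CURRENT_DATE);

-- 2. Re-aggregate only the current year's raw rows and append them
--    to the preserved history.
INSERT INTO sales_agg (agg_month, product_id, total_amount)
SELECT DATE_TRUNC('month', txn_date) AS agg_month,
       product_id,
       SUM(amount)                   AS total_amount
FROM   sales_raw
WHERE  txn_date >= DATE_TRUNC('year', CURRENT_DATE)
GROUP  BY DATE_TRUNC('month', txn_date), product_id;
```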

When we executed this method, it took 8 minutes. So we saved 3 minutes. Not much!

So we changed our approach and narrowed the window from the current year to the current month. This time, the query took 7 minutes! What luck.
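
The month-window version is the same sketch with the predicates tightened, again using the hypothetical names from above:

```sql
-- Same pattern, tighter window: rebuild only the current month's slice.
DELETE FROM sales_agg
WHERE agg_month = DATE_TRUNC('month', CURRENT_DATE);

INSERT INTO sales_agg (agg_month, product_id, total_amount)
SELECT DATE_TRUNC('month', txn_date) AS agg_month,
       product_id,
       SUM(amount)                   AS total_amount
FROM   sales_raw
WHERE  txn_date >= DATE_TRUNC('month', CURRENT_DATE)
GROUP  BY DATE_TRUNC('month', txn_date), product_id;
```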

Before you scoff at the time saved, let me paint the larger picture. Three minutes saved for each run. That means 90 minutes saved in one month. That means 18 hours saved in one year. How's that for time saved?

And we have five or six such aggregations in the pipeline. Even at 3 minutes per query, that's at least 15 minutes saved per pipeline per day.

So a penny saved is a penny earned. And how's that for sustainability?

#queryoptimization #optimization #aggregation #sustainability

Mysore Vamsi Krishna

Senior Data Engineer | S3 | Glue | CloudFormation | Python | Spark | AWS | Airflow | EMR | Snowflake | Certified Solutions Architect Associate | Think beyond the code, think about the solution | I use AI regularly

1y

Not sure what you're trying to address. Do truncate-and-load tables still exist in this era? Maybe the client doesn't know any better; it's a very bad approach. You could expose the latest data through a view. I don't know what type of aggregations you have, but if you design your target table with a few extra columns, such as year_start, year_end, and cur_aggr_flg, you could improve the overall solution: aggregate monthly and save the result. Once you reach 12 months, update the previous year's year_end to the current year along with the aggregated values, then delete the other rows or move them to a historical table, so you can reconcile at any point if needed.
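
[Editor's note: if I read the comment right, the proposed design looks roughly like the sketch below. The column names year_start, year_end, and cur_aggr_flg come from the comment; every other name and type is an assumption.]

```sql
-- Rough sketch of the commenter's proposal (one interpretation).
CREATE TABLE sales_agg (
    year_start   DATE,      -- start of the period this row covers
    year_end     DATE,      -- end of the period this row covers
    cur_aggr_flg BOOLEAN,   -- TRUE while the period is still being re-aggregated
    product_id   INT,
    total_amount DECIMAL(18, 2)
);

-- Each month: aggregate the month and save it with cur_aggr_flg = TRUE.
-- After 12 months: close out the year by extending year_end and clearing
-- the flag; superseded monthly rows would move to a history table so they
-- stay available for reconciliation.
UPDATE sales_agg
SET    year_end     = DATE_TRUNC('year', CURRENT_DATE),
       cur_aggr_flg = FALSE
WHERE  cur_aggr_flg = TRUE;
```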

Koustav Ghosh

Architect (AWS/ ML/Data)

1y

Add to that: if you partition on year, your improved approach will keep taking roughly the same time (assuming each year carries about the same volume of data), whereas with the previous approach the execution time would keep growing. So your solution guarantees no performance degradation; it's inflation-proof. Kudos. Alas, a lot of people don't appreciate this kind of impact.
