How to identify SQL candidates for optimization?

Poorly written SQL queries are the bane of any Data Engineering team...

How do you identify candidate SQLs to optimize?

I use four lenses to select SQLs for optimization:

1. Query Runtime: this is the most obvious one. For operational systems, I have seen runtime SLAs of 30, 60, and 90 minutes work pretty well. For some big data use cases this can go higher, depending on the complexity/size of the data and whether we need to crunch it in a time-optimized versus cost-optimized manner. Ad hoc SQLs can have even shorter runtime expectations (5 minutes, anyone?). It totally depends on the complexity, but let's make sure we kill zombie queries sucking up cluster resources (a query-history sketch covering this lens and the next follows the list).

2. Query Cost (in $ terms): I used to keep a weekly report of the top 10 most expensive queries for my team to target for optimization and the resulting cost reduction. This requires a fine balance between business needs, good DE practices, and long-term infrastructure planning and optimization. I have used different metrics over the years, such as fixed dollar amounts per run or per month, or more relative ones like a query's share of weekly or monthly spend (top N). I have also tracked scheduled versus ad hoc costs separately, to stay flexible about short-term needs versus long-term cost.

3. Operational Excellence: this is the "cost of keeping a pipeline running" as a function of team resources: on-call load, fixing broken job runs, the cost of incremental changes, expensive backfills, etc. A badly written SQL powering a critical pipeline can suck the life energy out of a DE team. It's worth investing time and energy in identifying such problem pipelines and fixing them once, rather than dying by a thousand cuts.

4. Business Needs: this one covers a broader spectrum of needs but primarily concerns the use case of the data. Do we need a job to run in under X minutes to support a critical business process? Does this data need to arrive by Y time to power downstream dependencies? I have used metrics like daily/weekly/monthly on-time job completion to establish data arrival/completeness baselines, and then adopted SLAs within the acceptable variance/margin. Queries failing their SLAs are prime candidates for optimization. This is directly tied to runtimes, but the difference is what drives the need for a certain SLA: a business need rather than just good DE housekeeping. (An on-time-completion sketch also follows the list.)
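
For lenses 1 and 2, most warehouses expose a query history you can mine on a schedule. Here is a minimal sketch assuming Snowflake's SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY view; the 30-minute threshold, the 7-day window, and the $3-per-credit rate are placeholder assumptions, and the cost figure is only a ranking heuristic, since it attributes warehouse credits by elapsed time and ignores concurrency.

```sql
-- Weekly triage: longest-running and (roughly) most expensive queries.
-- Thresholds, window, and $/credit below are illustrative placeholders.
SELECT
    query_id,
    user_name,
    warehouse_name,
    total_elapsed_time / 1000 / 60 AS runtime_minutes,
    -- Rough cost: elapsed hours * credits/hour for the warehouse size
    -- * an assumed $3/credit. Ignores concurrency; use for ranking only.
    (total_elapsed_time / 1000 / 3600)
        * DECODE(warehouse_size,
                 'X-Small', 1, 'Small', 2, 'Medium', 4,
                 'Large', 8, 'X-Large', 16, 32)
        * 3.0 AS est_cost_usd,
    LEFT(query_text, 100) AS query_preview
FROM snowflake.account_usage.query_history
WHERE start_time >= DATEADD('day', -7, CURRENT_TIMESTAMP())
  AND execution_status = 'SUCCESS'
  AND total_elapsed_time > 30 * 60 * 1000  -- over a 30-minute runtime SLA
ORDER BY est_cost_usd DESC
LIMIT 10;
```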
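
For lens 4, here is a sketch of the on-time-completion metric. The job_runs table, its columns, and the 95% threshold are all hypothetical; in practice this would come from your scheduler's run metadata, and the date arithmetic may need adjusting for your SQL dialect.

```sql
-- On-time completion rate per job over the last 30 days.
-- job_runs(job_name, run_date, completed_at, deadline) is a made-up
-- audit table; point this at your orchestrator's metadata instead.
SELECT
    job_name,
    COUNT(*) AS total_runs,
    SUM(CASE WHEN completed_at <= deadline THEN 1 ELSE 0 END) AS on_time_runs,
    ROUND(100.0 * SUM(CASE WHEN completed_at <= deadline THEN 1 ELSE 0 END)
          / COUNT(*), 1) AS on_time_pct
FROM job_runs
WHERE run_date >= CURRENT_DATE - INTERVAL '30 days'
GROUP BY job_name
-- Flag jobs that miss their SLA baseline more than 5% of the time.
HAVING 100.0 * SUM(CASE WHEN completed_at <= deadline THEN 1 ELSE 0 END)
       / COUNT(*) < 95
ORDER BY on_time_pct ASC;
```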

Let me know how you identify SQL candidates to optimize...

Elizabeth Stitt

Business Report Developer

2y

I've had queries that were measured in hours. One process took 32 hours, and it was a weekly report. I had to use brute-force analysis to figure out the bottlenecks. The primary culprit was a join to a foreign currency table: I didn't need all of the currency values, I just needed today's range. I probably could have used CTEs, but I wasn't familiar with them then. A temp table holding just today's values gave me the performance that turned the report from taking several days to create to a couple of hours. The original analyst would run this manually and sit and wait until it was done; I changed it into an automated process. Sometimes a bottleneck is a person who is happy with the status quo. Those bottlenecks can be a bit harder to find. It's one of those situations where this is what was given to me, it's always worked this way, and there's no data curiosity from the analyst about how something might be improved.
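
(A minimal sketch of the temp-table pattern described above; currency_rates, transactions, and all column names are invented for illustration, and temp-table syntax varies by database:)

```sql
-- Materialize only today's currency rows once, instead of joining the
-- full history on every pass. All names here are hypothetical.
CREATE TEMPORARY TABLE todays_rates AS
SELECT currency_code, rate
FROM currency_rates
WHERE rate_date = CURRENT_DATE;

-- The weekly report now joins the much smaller temp table.
SELECT t.order_id,
       t.amount * r.rate AS amount_usd
FROM transactions t
JOIN todays_rates r
  ON r.currency_code = t.currency_code;
```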
