SQL Server Joins - 2 SARGs
SQL Server Join Costs, 2 SARG
The previous Join Costs article covered joins with a search argument on one source only at DOP 1. Parallel plan joins were discussed in Parallel Plan Cost at DOP 2. Now it is time to look at joins with search arguments on each source.
With a search argument on only one source, the Loop join can use the join condition as the search argument for the second table as the inner source. Merge and Hash joins process each source separately, so the table without a search argument must be accessed with a scan.
With a search argument on each source, both inputs to a Merge or Hash join can be accessed with a seek. The second search argument can be explicitly specified in the WHERE clause, or it can be implied when one of the join conditions is an equality to the explicit SARG on the other table. In addition to enabling Merge and Hash joins without scanning one source, the two SARG scenario also has implications for the Loop join. In the previous Join article, we saw how the Loop join had a high plan cost from the assumption of random I/O for inner source accesses. Depending on the exact circumstances, the Loop join can have a much lower plan cost if the join condition implies that inner source rows are accessed sequentially.
This is the fifth article in the Plan Cost versus Actual series beginning with Plan Cost Cross-over and Key Lookup, followed by the Join and Parallel Plan articles cited earlier. Additional details are in Loop, Merge and Hash Joins.
Joins with 2 SARGs
The situation in which a search argument can be applied to each source in a join typically arises in parent-child related tables.
For example, Table P is the parent table. Table A is a child of P, having a foreign key to P. Table B is a child of A. The nature of the application is that rows of A and B belonging to a given entity in A are accessed together.
For this reason, the clustered index and primary key of A is the foreign key to P plus a local value to make the pair unique.
Table B then foreign keys to A on the primary key of A, which has two columns. The clustered index and primary key of B is the foreign key of two columns plus one more column for uniqueness.
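The article does not show DDL for this parent-child pattern; a minimal sketch, with hypothetical table and column names, might look like the following.

CREATE TABLE P (
    PID int NOT NULL,
    CONSTRAINT PK_P PRIMARY KEY CLUSTERED (PID)
);

CREATE TABLE A (
    PID  int NOT NULL,  -- foreign key to P, leading the clustered primary key
    ASeq int NOT NULL,  -- local value making the pair unique
    CONSTRAINT PK_A PRIMARY KEY CLUSTERED (PID, ASeq),
    CONSTRAINT FK_A_P FOREIGN KEY (PID) REFERENCES P (PID)
);

CREATE TABLE B (
    PID  int NOT NULL,  -- two-column foreign key to A ...
    ASeq int NOT NULL,
    BSeq int NOT NULL,  -- ... plus one more column for uniqueness
    CONSTRAINT PK_B PRIMARY KEY CLUSTERED (PID, ASeq, BSeq),
    CONSTRAINT FK_B_A FOREIGN KEY (PID, ASeq) REFERENCES A (PID, ASeq)
);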
In the first query below, there are explicit search arguments for each source in addition to the join condition.
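The original query images are not reproduced here; a sketch of the two forms against the hypothetical A and B tables above, with the literal 100 standing in for an arbitrary parent key, might be:

-- Query 1: explicit search argument on each source.
SELECT COUNT(*)
FROM A a
JOIN B b ON b.PID = a.PID AND b.ASeq = a.ASeq
WHERE a.PID = 100
  AND b.PID = 100;

-- Query 2: one explicit search argument; the join condition b.PID = a.PID
-- implies b.PID = 100, so the second source can still be seeked independently.
SELECT COUNT(*)
FROM A a
JOIN B b ON b.PID = a.PID AND b.ASeq = a.ASeq
WHERE a.PID = 100;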
In the second query (above), there is one search argument in the WHERE clause, but the first of the join conditions has the effect of a search argument, allowing the other source to be processed independently.
Test Tables
A brief description of the test tables is as follows. Table A0 is from the previous articles, with a primary key clustered on a sequentially increasing key AID.
The new tables, Ax and Bx, have a unique clustered index on (GID, SID). GID is a grouping column used as the SARG. SID is a sequence within the group. These tables also have an AID column, which is unique and sequentially increasing. We can join the Ax and Bx tables on SID with the same or different GID values.
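The article does not include DDL for these tables; a sketch of the described shape, using the A5/B5 variant referenced later, with filler columns omitted and names beyond GID, SID, AID and BID assumed, might be:

CREATE TABLE A5 (
    GID int NOT NULL,   -- grouping column, the equality SARG
    SID int NOT NULL,   -- sequence within the group, the join column
    AID int NOT NULL,   -- unique, sequentially increasing across the table
    CONSTRAINT PK_A5 PRIMARY KEY CLUSTERED (GID, SID)
);

CREATE TABLE B5 (
    GID int NOT NULL,
    SID int NOT NULL,
    BID int NOT NULL,   -- carries the same values as SID, but no index guarantees its order
    CONSTRAINT PK_B5 PRIMARY KEY CLUSTERED (GID, SID)
);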
Loop Join - SARG on Each Source
Before comparing Loop to Merge and Hash joins without a scan, let's evaluate how Loop joins are affected. We should first compare the two SARG (by which I only mean one SARG on each source) Loop join to the single SARG Loop join previously covered. Both queries below involve 32 rows and have clustered and nonclustered indexes as appropriate. The first query (later referenced as Loop 1S) is the single SARG case we looked at in the previous Join costs article on single SARG joins. The outer source for both queries is A5. This table has a unique clustered index leading with GID, which is the equality SARG, followed by SID, the join condition for the second query (Loop 2S). The inner source table for the second query, B5, has the same clustered index.
The A0 table, the inner source of the first query, has a unique clustered index on AID, the join condition. In this example, the order of the AID values coming from the outer source is not mandated by its clustered index, so SQL Server does not know the access order into A0.
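The two queries themselves are shown as images in the original; under the table definitions assumed above, they might look roughly like this (the LOOP JOIN hint only pins the join type for illustration and may not be needed at 32 rows):

-- Loop 1S: single SARG, inner source A0 joined on AID.
SELECT COUNT(*)
FROM A5 a
JOIN A0 o ON o.AID = a.AID
WHERE a.GID = 100
OPTION (LOOP JOIN);

-- Loop 2S: SARG on each source, inner source B5 joined on GID, SID.
SELECT COUNT(*)
FROM A5 a
JOIN B5 b ON b.GID = a.GID AND b.SID = a.SID
WHERE a.GID = 100
  AND b.GID = 100
OPTION (LOOP JOIN);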
Below is the execution plan for both queries. Note that the plan cost for the first query, 1S, is 90% of the combined pair, while the Loop 2S query is 10%. The difference is all in the inner source access.
Below is the inner source access for the first join (1 SARG). The outer source sends 32 rows, hence the number of executes of the inner source is also 32 (or very close to it).
The component (subtree) cost is:
  32 × 0.0001581 (CPU) + 30.9657 × 0.003125 (I/O) = 0.101827 (Subtree Cost)
In other words, there is no assumption that successive inner source rows reside in the same page; nearly every execute is charged a page I/O.
Below is the inner source access for the second join (2 SARG).
The component (subtree) cost is:
  32 × 0.0001581 (CPU) + 1 × 0.003125 (I/O) = 0.0081842 (Subtree Cost)
The query optimizer knows that each successive inner source row is stored sequentially, and the 32 rows fit within a single leaf level page (101 rows per page), so only one I/O is assessed.
(It may not be necessary for the inner source rows to be stored strictly sequentially, only that they fall within a limited range of pages?)
Merge and Hash Join with SARG on Each Source
The test tables are defined with the SARG (GID) and join condition (SID) together as the unique clustered index (except for the A0 table used in the Loop 1S examples). The conditions for a Merge join, both sources being in sorted order, are met. It also happens to be a one-to-one join. The goal here is a regular Merge join, which requires a one-to-many condition, of which 1-to-1 is a subset.
If the outer source is not unique on the SARG plus join condition, then a many-to-many Merge join could be used, which is not within the scope of this discussion.
If the merge join conditions are met with index enforced order, then the Merge join will always have lower cost than a Hash join (at DOP 1). In order for a Hash join to occur naturally, the SQL Server query optimizer must not know that the inputs are in sorted order. It does not matter whether they are in sorted order or not, just that there are no indexes to guarantee the order.
In the test tables, in addition to the GID and SID columns, on which the unique clustered index is defined, there is also a column BID which happens to have the same value as SID. Any join on either of SID or BID will match on the same rows. However, the query optimizer is not assured of the order when joining on BID.
In the query below, notice that the join condition is b.BID = a.SID. This is pertinent for the join order with Ax as the outer source and Bx as the inner source. For a regular Merge join, the outer source must be known (forced) to be unique.
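The query itself is shown as an image; a sketch under the assumed column names, joining on BID rather than SID so the optimizer cannot rely on index order for B5, might be:

SELECT COUNT(*)
FROM A5 a
JOIN B5 b
  ON b.GID = a.GID
 AND b.BID = a.SID   -- BID carries the same values as SID, but no index guarantees its order
WHERE a.GID = 100
  AND b.GID = 100;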
The query plan below occurs at 512 rows or more (in 1-to-1 joins).
The plan is a Loop join up to 64 rows. At 128 and 256 rows, after the inner source is accessed, a Sort operation puts the rows in order prior to entering the Merge join, as shown in the plan below. The Sort + Merge join is discussed in Merge Joins (in progress).
The Sort plus Merge join plan is slightly less expensive than the Hash join at these two row counts.
At 128 rows, the Sort + Merge plan cost is 0.0262314, while the Hash join plan cost is 0.0270036, a difference of 0.0007722.
Presumably this is just how the plan cost formulas work. Note that the cost formulas have many dependencies, and variation is expected depending on the size of the join columns, among other differences.
At 256 rows, the plan costs are 0.0308149 (Sort + Merge) and 0.0311713 (Hash), a difference of 0.0003564.
Loop, Merge and Hash Join Plan Costs
The plan costs for Loop, Merge and Hash joins are shown below on Log-Log scale. The vertical scale is in plan cost units (allegedly not time) and the horizontal scale is rows.
The Loop join with one SARG (1S) is included to illustrate the difference in cost between inner source accesses that could be anywhere in the table (in theory, not in practice for this example) versus accesses within a limited range of pages specified by the SARG on the inner source (2S). (Loop joins normally have a SARG on the outer source. Not having one could be bad news.)
The points to note are where the Loop (2S) cost crosses the Merge and Hash joins. The Merge join appears to become less expensive than the Loop join at just over 32 rows. The data points are mostly in 2X steps from 1, 2, 4, ... to 262,144, which includes 16 and 32 rows. There are additional data points between 8K and 16K and between 65K and 128K. The cross-over to the Hash join occurs just before 128 rows. However, this does not consider the Sort + Merge join.
Loop, Merge and Hash Join Actual Costs
Our options are 1) Loop, 2) Merge (no sort), 3) Hash and 4) Merge join with Sort. Excluded are the many-to-many Merge joins. Just the first three will be discussed for now (all 1-to-1, DOP 1).
Below are the Loop (1S and 2S), Merge and Hash join actual costs in CPU-milliseconds on the vertical scale versus rows on the horizontal scale. The transaction isolation level is the default Read Committed. Loop 1S actual costs are indeed higher than Loop 2S, but only moderately, and not by the huge disparity seen in the plan cost. The plan cost is driven by the assumption of I/O, and there is no option to override it.
The graph shows the same data points as the plan cost charts. The regular points are the steps of 2X from 1, 2, ... to 262,144 (powers of 2). And there are additional detail points including 10,240 rows, just above the regular point 8,192.
There is a sharp change in slope for Loop 1S from 1 row to 2 rows, with a less dramatic change at the same point for Loop 2S and Merge. The notches for Merge and Hash joins occurring between 8192 and 10240 rows are probably lock level changes, to be explained shortly.
Other data points include 114,688, close to 131,072 rows (more so on a log scale). This shows a notch for the Hash join. This is the transition both from Row mode to Batch mode and from Stream Aggregate to Hash Match Aggregate. Presumably the bigger impact of the two is the Row mode to Batch mode transition.
Below are the same Loop (1S and 2S), Merge and Hash joins at transaction isolation level Read Uncommitted, equivalent to NOLOCK. There are no notches at 10K rows for the Merge and Hash joins.
Further investigation shows that the change in behavior at Read Committed occurs between 8704 and 8960 rows. The Row mode to Batch mode transition was found to occur exactly between 129,999 and 130,000 rows.
Note: Dima Pilugin, in SQL Server 2019: Batch Mode on Rowstore, says the change occurs between 131,071 and 131,072 rows. Not sure why, or whether this is version related.
Edit (20-Dec-22): On further review, the test queries relied on auto-generated statistics, with the default sample being less than a full scan. When the estimated rows in the join exceeds 131,072, Batch mode is eligible. With FULLSCAN statistics, the transition point is indeed 131,071-131,072 rows, per Dima.
Cost per Row, Read Committed
Deciphering the measured values can be somewhat tricky. We would like to translate the measured costs into a model. The model should preferably be simple and not more complicated than necessary. It should not be an exercise in pure curve fitting. There should be some rational basis for the model.
The simplest model is a fixed cost to start the execution plan, followed by a cost per row. To expose this, we would first chart the CPU cost per row as shown below at Read Committed isolation.
If the simple model were valid, the incremental cost per row is the value once the curve flattens on reading from left to right. The fixed base cost is the 1 row cost minus the per row cost.
It is apparent from the Loop 1S CPU per row that this is not valid, because the cost per row at 2 rows is higher than the cost at 1 row. The table below shows the numerical values in CPU-μs at 1, 2 and 4 rows. Notice that there is a significant step up from 1 row to 2 rows, followed by a smaller step.
It is as if the 2 row and higher executions have a different fixed cost than the 1 row query. A possible explanation, just speculation, could be that SQL Server knows the 1 row query only returns 1 row, and skips the aggregation?
Using the join cost per row in the flat intermediate region from 512 to 8704 rows, we can first estimate the cost per row. Then we subtract the per-row contribution from the measured 1 row and 2 row costs to estimate the 1-row and 2-row+ base costs. The base costs and per row costs in CPU-μs are then estimated as follows.
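Symbolically, the extraction amounts to the following (a sketch of the arithmetic only; the measured table of values is not reproduced here):

  CPU(n) ≈ Base + n × (cost per row)
  cost per row ≈ CPU per row in the flat region (512 to 8704 rows)
  Base (1-row) ≈ CPU(1) − 1 × (cost per row)
  Base (2-row+) ≈ CPU(2) − 2 × (cost per row)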
I do not have an explanation as to why the Loop 1S 2-row query appears to have a base cost of 22 CPU-μs while the Loop 2S 2-row query appears to have a base cost of 14.6 CPU-μs.
The first observation from actual costs is that the Merge join has almost no base cost increase over the Loop joins (Loop 2S being more comparable than 1S), while the Hash join has a very high base cost. The plan equivalent of the actual base cost is approximately the two index seeks (0.003125 each) present in all three of Loop, Merge and Hash joins, plus the individual join operator costs: Nested Loops (0.00000418), Merge (0.0056) and Hash (0.01775). The join plan base costs are then:
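Working these out from the constants above (the original table is not reproduced here):

  Loop:  2 × 0.003125 + 0.00000418 = 0.00625418
  Merge: 2 × 0.003125 + 0.0056 = 0.011850
  Hash:  2 × 0.003125 + 0.01775 = 0.024000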
The plan base cost for Merge is almost double (1.89) the Loop and the Hash join is 3.84× the Loop. The actual Hash join base cost is approximately 10 × higher than either Loop or Merge joins.
More than twenty years ago, I had done a similar set of measurements on SQL Server 2000, in Query Costs Part III. The processors then were the Pentium III at 600MHz with 256K L2, and 700MHz with 2M L2. The current system is the Core i9-10850K with a nominal frequency of 3.6GHz and turbo as high as 5.0GHz. The actual operating frequency during tests at the operating system power setting was 4.18GHz. The Comet Lake processor has 256K L2 per core and a combined L3 of 20MB.
The version 2000 results were reported in CPU-cycles. The test methodology then was system saturation with multiple threads, while the current tests are running multiple iterations with a single thread and using worker time reported by dm_exec_query_stats. In the old result, the main impact of the large L2 cache was dramatically lower base cost and very little change in the per row cost. My interpretation was that the larger L2 was able to keep required data in L2 between connections, but once a query started, the smaller L2 was sufficient for incremental rows.
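For reference, a sketch of the kind of query used here to read worker time from dm_exec_query_stats (the text filter pattern is an assumption):

SELECT TOP (20)
       qs.execution_count,
       qs.total_worker_time / qs.execution_count AS avg_worker_time_us,  -- CPU in microseconds
       SUBSTRING(st.text, 1, 100) AS query_text
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
WHERE st.text LIKE '%A5%'   -- filter to the test queries
ORDER BY qs.last_execution_time DESC;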
The chart with 1-row and 2-row+ base costs subtracted out is below, again for Read Committed.
The worst fit is the Hash join. If I had to guess, I would say many things are happening in a Hash join, considering the 185 CPU-μs base cost. It could be that not all base data structures get into the processor L1/L2 cache in the first two rows. It looks like there is an excess cost over the model of 8μs at 4 rows, and 13-20μs between 8 and 256 rows.
This model parameter extraction methodology is very simplistic, amounting to eye-ball level estimations. The sophomore approach is to use powerful statistical techniques. On the other hand, there is the old adage:?if you think you need powerful statistical methods, what you really need is better data.?Eyeball methods it is. The advice is to not draw too much from the base cost numbers here.
Cost per Row, Read Uncommitted
The chart below is the Read Uncommitted CPU cost per row in micro-seconds, with the similarly estimated base costs subtracted.
Using the intermediate range from 512 rows to just before 130K rows for Read Uncommitted, the per row costs are as follows.
Most per row costs are 0.3 CPU-μs lower, except Loop 2S, which is 0.2 CPU-μs lower. The greater percentage impact is on the Merge join, because lowering 0.6 CPU-μs by 0.3 CPU-μs is a 50% reduction in elapsed time, for a 2X speedup. Lowering the Loop 1S per row cost of 1.6 CPU-μs by 0.3 CPU-μs is a 19% reduction in elapsed time, for a 23% speedup.
Note the down step in the Merge and Hash joins at 10,240 rows, and the upward notch for Loop 1S. At Read Uncommitted isolation, these "anomalies" go away. The down step at 128K (131,072) remains and is correlated to the change in execution mode from Row to Batch.
In Read Committed, the lock level for a query changes between 8704 and 8960 rows. Below the set point, the query should be taking shared row locks. Above the set point the query should take a shared table lock. If this does not work for the environment, consider appropriate measures. Measures are not within the scope of this article.
Note on Loop 1S at higher rows
The cost per row for Loop 1S is about 1.6μs per row at the intermediate row counts, then decreases at very high row counts. While the inner source table for the Loop 1S query has 131,072 pages, the outer source query will only touch the first 65,536 pages. Below 64K rows, the query touches a different page for each row. Beyond 64K rows, the query touches more than one row per page. We should say then that the cost is 1.6μs per row when each row is in a different page at Read Committed, and 1.3μs at Read Uncommitted.
Hash Join Batch Mode on Rowstore
Batch Mode on Rowstore is clearly an interesting and welcome new feature from version 2019. There might be some overlap in benefits of Batch mode with the Bitmap operator already available from previous versions of SQL Server, but for parallel plans only.
On activation at 130,000 rows, the actual query CPU is reduced by just over a factor of two. The tests here showed activation at 130,000 rows in a 1-to-1 join, with both sources having data distribution statistics indicating exactly that number of rows. We might ask why Batch mode activates only at 130,000 rows if there might be benefits at a lower number of rows.
It is stated somewhere that Microsoft implemented this feature to benefit large queries with run times on the order of whole seconds. In the test query here, the change from Row mode to Batch mode at 130,000 rows reduced CPU time from 65 milliseconds to 30 milliseconds. Plan cost was reduced from 4.257 to 3.203, both below the default Cost Threshold for Parallelism of 5. It is very plausible that Microsoft did not consider sub-65 millisecond queries having plan cost lower than 5 to be sufficiently important to accelerate. This is below the user perceptible level.
That said, we might be very interested in the performance gains of Batch mode for throughput, not just user perception. Note that the tests here were single thread, multiple iteration runs. A full system saturation test with multiple concurrent connections was not done (this was done for the SQL Server 2000 results).
The test tables here had identical structure. Each pair of tables has a specific data distribution for all GID group values. We can enable or disable Batch Mode on Rowstore, but the SQL Server query optimizer makes its own decision when Batch Mode is enabled and the database compatibility level is 150 or higher. In version 2022, the decision point seems to be an estimated number of rows of 130,000. (I have not evaluated mixed inner/outer source row variations.)
In the test environment, if we want Batch mode at a lower row count, this could be done by grafting the statistics from an appropriate table onto the lower row count table. This can be done using DBCC SHOW_STATISTICS with the STATS_STREAM option. Example below.
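A sketch of the export side, with hypothetical table and statistics names (STATS_STREAM is undocumented and unsupported, so treat this strictly as a test-environment technique):

-- Export the statistics blob from a table having the desired (larger) distribution.
DBCC SHOW_STATISTICS ('dbo.B5_131072', 'PK_B5_131072') WITH STATS_STREAM;
-- The Stats_Stream column of the result is a varbinary value to be copied.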
The statistics can then be applied with UPDATE STATISTICS using the same STATS_STREAM option.
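And the import side, again with hypothetical names and the blob elided:

-- Apply the exported blob to the lower row count table.
UPDATE STATISTICS dbo.B5_16384 (PK_B5_16384)
WITH STATS_STREAM = 0x01000000...;  -- paste the full Stats_Stream value here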
There are some matters concerning the Rowcount and Pagecount. With 100% sample statistics, either the header value or the stats stream value takes precedence. This was a discussion I had with T. Kejser about ten years ago, and the details have been affected by the application of Sazerac 18.
In testing the transferred statistics, Batch mode was found to yield a 50% reduction down to 16,384 rows, the lowest row count tested. CPU was reduced from about 8,000μs to 4,000μs.
In a real production table, we expect a range of distribution values. We could use the OPTIMIZE FOR hint to point at a value having more than 130,000 rows, but there is the memory grant impact to consider. As common sense dictates, test this thoroughly before even thinking about using it in production, if at all. I take no responsibility for any effects based on advice taken from this article. Do your own research.
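As a sketch only, using the assumed test table names, with the literal 999 standing in for a GID group estimated above 130,000 rows:

DECLARE @GID int = 5;   -- runtime value, possibly a small group
SELECT COUNT(*)
FROM A5 a
JOIN B5 b ON b.GID = a.GID AND b.SID = a.SID
WHERE a.GID = @GID AND b.GID = @GID
OPTION (OPTIMIZE FOR (@GID = 999));  -- compile the plan for a group known to exceed 130,000 rows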
Summary
The main topic here was joins with search arguments on each source. This is likely to be applicable between tables having true parent-child relations, where groups of rows are accessed for a common parent entity. When the conditions are valid, it is possible to construct tables with the correct unique clustered indexes. When the tables have been constructed in this manner, queries can be written with search arguments on both sources along with the join conditions. It might be that the query is valid with one search argument and the join condition. Under these circumstances, the query optimizer knows join accesses are to rows in a limited range. In this case, if the plan is a Loop join, the I/O component will not be assigned an excessively high cost value. Furthermore, a Merge or Hash join is possible without a scan of the inner source. Otherwise, when the join has only one search argument, the plan may prematurely shift to a scan having much higher cost than if it had stayed with a Loop join. The consequences can be severely negative. Avoiding the bad effects involves architectural planning.
References
Dima Pilugin, SQL Server 2019: Batch Mode on Rowstore
Microsoft SQL Server Team, Introducing Batch Mode on Rowstore