SQL Server Parallel Plan Cost

SQL Server Parallel Join Costs

Up until the early 2000s, microprocessor vendors focused on improving the performance of the processor core. This was the era before multi-core processors. By then, it had become too difficult to achieve significant performance gains in the core itself with each architecture generation. From that point on, processor development shifted first to dual-core, and then to progressively increasing core counts.

While the modern multi-core processor is an incredible entity, there are important implications. We do have truly astonishing throughput performance in high core-count processors. What we do not have is an infinitely fast single core. The full capability of the modern processor can only be achieved with parallel processing. For SQL Server and other relational databases, parallel execution plan effectiveness is now one of the cornerstones of performance.

Here, we will look at the Loop and Hash joins at DOP 2. Plan and actual costs are compared to the same joins at DOP 1 explored in Join Cost. Details of the Loop Join and Hash Join are discussed separately so this article can focus on cost versus rows.

Parallelism

Before jumping into Parallelism data, it is helpful to discuss expectations. Ideally, a parallel execution plan should divide work uniformly over multiple threads. If we knew how to do this ahead of execution, then this objective could be met. In practice, often we cannot quantify the amount of work required for each subset, nor do we know whether the core running each thread has the same degree of availability.

In addition to uniform distribution of work, we would like the parallel plan to incur as little extra work as possible over the serial (non-parallel) plan. Some extra work is unavoidable, specifically, consolidating the results of individual worker threads into the final result.

Suppose for example, we have a query that consumes 1000 CPU-milliseconds at DOP 1, with elapsed time of 1000ms as well (no I/O, no network transmission delays). In the ideal scenario, at DOP 2, two worker threads run for 500ms each, resulting in total CPU time of 1000ms and elapsed time of 500ms. The query incurs no significant CPU penalty from parallelism and speeds up by nearly 2X.

In one less than ideal example, the parallel plan requires more CPU, say 20% more for a total of 1200 CPU-ms. The work is still divided evenly between two threads. Each thread runs for 600ms, which is the total query time as well. The speed-up is then 1.67X.

A second less than ideal example is one in which there is no extra work, but work is not divided evenly. One thread runs for 600ms and the other for 400ms. Total CPU is still 1000ms, and the query time is 600ms, for the same 1.67X speed-up. In practice, a parallel query may both incur extra CPU and divide work unevenly.
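The three scenarios above can be checked with a small sketch (a hypothetical helper, not from the article; the model is simply that elapsed time is set by the slowest thread):

```python
# Speedup model: the elapsed time of a parallel plan is gated by the
# slowest thread; speedup is serial elapsed time over parallel elapsed time.
def speedup(serial_ms, thread_ms):
    parallel_elapsed = max(thread_ms)  # slowest thread gates the query
    return serial_ms / parallel_elapsed

# Ideal: 1000 CPU-ms split evenly, no overhead -> 2X
print(speedup(1000, [500, 500]))             # 2.0
# 20% extra CPU, even split: 1200 CPU-ms over two threads -> 1.67X
print(round(speedup(1000, [600, 600]), 2))   # 1.67
# No extra CPU, uneven 600/400 split -> also 1.67X
print(round(speedup(1000, [600, 400]), 2))   # 1.67
```

Note that a 20% CPU penalty with an even split and a perfectly efficient plan with a 60/40 skew produce the same speedup; only the CPU totals differ.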

Merge Join Plan Cost

There is no parallel Merge join - mostly. One would think that a parallel Merge join is practical and that reasonably efficient code could be written for it. Of course, there is a difference between what one might think and what was actually done.

However, the parallel Merge join code as implemented in SQL Server has extraordinarily high actual execution cost. SQL Server suppresses it by assigning a very high plan cost to parallel Merge operations. A parallel Merge join may still occur for queries that insert into a table having the same clustered index order as the SELECT query.

Plan Cost - Hash Join, Row and Batch, at DOP 1 & 2

As discussed in Join Costs (LinkedIn version), the Hash join becomes more interesting in parallel execution plans, or rather, the change from serial (DOP 1) to parallel. See Hash Join for details of the Hash join plan in these scenarios. The row mode parallel hash join plan is below.

[Figure: row mode parallel hash join execution plan]

Note the presence of the Parallelism (Repartition Streams) operators to the left of each source access, Index Seek and Clustered Index Scan. This is one example of extra work in a parallel execution plan. Extra work is not always manifested in a plan operator. Also present in the parallel plan but not the serial plan is the Bitmap operator, which is a Bitmap filter, not a Bitmap index.

Below is the batch mode parallel hash join plan. Neither Repartition Streams nor Bitmap operators are present. However, the Clustered Index Scan details do mention a Probe Bitmap predicate.

[Figure: batch mode parallel hash join execution plan]

Below is the Hash join plan cost at DOP 1 and 2 for both Row and Batch modes. The vertical drop at 130K rows represents the transition from Row to Batch mode.

[Chart: Hash join plan cost vs. rows at DOP 1 and 2, Row and Batch mode]

At DOP 1 row mode, just more than half of the plan cost comes from the inner source scan of 129,775 pages (I/O cost 96.11, CPU Cost 14.42). Just under half is from the Hash Match operation at 85.68 for low row counts.

At DOP 2 row mode, the scan I/O remains 96.11 while the scan CPU is reduced to 7.21. The Hash CPU is reduced to 42.85. The inner source Parallelism (Repartition Streams) operator, processing 13M rows, adds cost 25.17, while the other Parallelism operators add only around 0.0286 each, and the same for the Bitmap. In all, the DOP 2 row mode plan cost is 13% lower than the DOP 1 plan.
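As a sanity check, the operator costs cited above can be totaled (the values are read from the plans; the count of minor operators is an approximation):

```python
# Hash join plan cost components from the plans above (row mode).
dop1 = 96.11 + 14.42 + 85.68           # scan I/O + scan CPU + Hash Match
dop2 = 96.11 + 7.21 + 42.85 + 25.17    # scan I/O + scan CPU + Hash + Repartition
dop2 += 3 * 0.0286                     # minor Parallelism/Bitmap operators (approx.)
reduction = 1 - dop2 / dop1
print(f"DOP 1: {dop1:.2f}, DOP 2: {dop2:.2f}, reduction {reduction:.0%}")
```

The totals come to roughly 196.2 versus 171.4, consistent with the 13% reduction stated above.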

For the batch mode plans, the Hash Match operator has very low cost, so the only meaningful reduction comes from the inner source scan CPU component. On the plus side, there are no Repartition Streams operators. Neither is there a Bitmap operator shown in the plan, although the Clustered Index Scan detail does mention a Bitmap.

Actual Cost - Hash Join Row and Batch, DOP 1 & 2

With some understanding of the SQL Server execution plan cost model for parallel joins, it is now time to look at actual costs. Recall that plan cost is an estimate of elapsed time to run a query on some reference system and "is not a unit of time", even though long ago it was. Evidently, that reference configuration has very little memory, such that leaf level pages are not in the buffer pool, and a weak storage system in which a single thread saturates I/O. But we digress.

In examining actual query cost for parallel plans, we are interested in both CPU time and elapsed time. The objective is to reduce run time as much as possible for the degree of parallelism, while not incurring too much additional CPU for extra work that must be done in the parallel plan.

Below is the Hash join actual CPU-ms at DOP 1 and 2 for Row mode and Batch mode. The separate color indicates when Batch mode is active. As in the plan cost chart, the vertical drop represents the transition from Row mode to Batch mode. The continuation in the original color indicates the costs with Batch mode disabled.

[Chart: Hash join actual CPU-ms vs. rows at DOP 1 and 2, Row and Batch mode]

We can see that for Row mode, CPU does not increase from DOP 1 to DOP 2. Rather, it drops, and drops in a huge way. The Bitmap operator is shown to the left of the build (outer) source, but the actual bitmap checks (filtering of rows) are reportedly done within the Parallelism operator on the probe side of the Hash join. It is possible that what was meant is that the filtering is done at the (probe) inner source access, before rows reach the Parallelism operator, as the Clustered Index Scan does mention the Bitmap in its details.

The intent of Bitmap filtering is to reduce cost, but there is no reduction in the plan cost. There is a cost shown for the Bitmap operator, the same as shown for the Parallelism operator that precedes it (to the right). But the Subtree Cost of the Bitmap does add the operator cost to the previous operation.

The effect of the Bitmap filtering in this example is huge. Given that there are 132K rows from the outer source and the table scan involves 13M rows, a significant impact seems reasonable. Recall from Key Lookup & Scan (LinkedIn version) that the (DOP 1) scan actual cost was 650 CPU-ms. So how can the parallel Hash join, which performs the same scan, have lower cost: as low as 528 CPU-ms at lower (outer source) row counts? This is not the only mystery in SQL Server, and we may revisit the matter. At higher outer source row counts, the parallel row mode Hash join does become more expensive.

The Batch mode characteristics show little meaningful difference in CPU time between DOP 1 and DOP 2. This is consistent with the plan not having the Parallelism (Repartition Streams) operator, which is usually the source of significant extra work in a parallel plan. The Batch mode plan does not show an explicit Bitmap operator at either DOP 1 or 2. It does indicate that Bitmap filtering happens in the inner source access for both DOP 1 and 2.

Below is the elapsed time for the Hash joins at DOP 1 and 2, again with a separate color for when Batch mode is active. To a large extent, the elapsed time is one-half the CPU time, indicating that the work is evenly split between the two worker threads and the time spent in single-thread operations is minimal. In some cases, elapsed time might be as high as 60% of the CPU time, which is still good.

[Chart: Hash join elapsed time vs. rows at DOP 1 and 2]

It might seem that row mode Bitmap filtering is too good to be reserved for parallel plans only. But here, we have restricted parallelism to a plan cost over 200. It is not a stretch to believe that a hash join that would benefit from Bitmap filtering should have a parallel plan, so the choice to activate it only for parallel plans is a reasonable one. We note that there are situations in which a parallel plan is inhibited, for example by the presence of a user defined function (UDF). The correct response is to take extraordinary measures to ensure parallel plans are not inhibited. It is also important not to raise the Cost Threshold for Parallelism too high. A CTFP higher than the default of 5 is recommended, just not too high.

The indications so far are that Batch mode is really great. But what is the threshold for activating batch mode? Can it be lowered?

Actual Cost - Loop and Hash Joins, DOP 1 & 2

Now that we have discussed serial and parallel (DOP 2 only) Hash joins for both row and batch mode, we can see how these compare with Loop joins, also at DOP 1 and 2.

Below are CPU-ms for Loop join at DOP 1, and Hash joins at DOP 1 and 2. The Loop join, as mentioned earlier, does not have a parallel plan until there are more than 10K rows, which is at the extreme left of the chart horizontal axis.

[Chart: CPU-ms for Loop join at DOP 1 and Hash joins at DOP 1 and 2]

We can see that there is some overhead in the parallel Loop join actual CPU time at higher row counts, visible as a gap from 400K rows onward. The difference can be 10%.

Below is elapsed time in milliseconds for Loop and Hash (Row and Batch) joins at DOP 1 and 2.

[Chart: elapsed time for Loop and Hash joins at DOP 1 and 2]

As discussed for DOP 1, the Hash join in Batch mode only becomes quicker to execute above about 730K rows. This is in contrast to the plan cost cross-over at around 112K rows. At DOP 2, the Batch mode hash join runs faster than the Loop join above 640K rows, while the plan cost cross-over is about the same as for DOP 1.

Parallel Loop Join

Additional discussion of the parallel Loop join actual cost is warranted. As mentioned earlier, Loop join plan costs have been covered in a separate article, Loop Join. In brief, the Loop join first shows a parallel plan at around 10,241 rows with plan cost 32.46725 for DOP 2. The serial (DOP 1) plan cost is 32.4689, for a difference in plan cost of 0.00169, or 1 part in nineteen thousand. The SQL Server Cost Threshold for Parallelism has not been changed from the default of 5. Apparently, SQL Server will choose a parallel plan if there is any difference in plan cost at all.

The data point in the charts of the previous section with a parallel plan for the Loop join occurs at 16,384 rows. There is no meaningful difference in either CPU or elapsed time between DOP 1 and 2. Below is the actual plan with runtime for 16,385 rows. The query runtime shown is 30ms. The average of multiple runs, without capturing the plan, is 26.987ms.

[Figure: actual Loop join plan with runtime for 16,385 rows]

Below are the actual rows per thread at DOP 2 for 16K rows. Obviously, to get a parallelism gain, we actually have to have work divided among threads.

[Figure: actual rows per thread at DOP 2, 16K rows]

Below are the actual rows per thread at DOP 2 for 32K rows. At 64K rows, the split is 28,288 and 37,249.

[Figure: actual rows per thread at DOP 2, 32K rows]

It is puzzling why SQL Server has difficulty getting a somewhat even split in this case. The Query Optimizer compile process knows, through data distribution statistics, the expected number of rows. It knows how many pages of the nonclustered index must be read for the index seek, shown below.

[Figure: Index Seek detail showing estimated rows and I/O cost]

The index in question has 449 rows per page, so 16,385 rows require 36.49 pages. That is how it knew to calculate the I/O cost as 0.003125 for the first page and 0.00074074 for each subsequent page, totaling 0.0294163.
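This calculation can be reproduced directly from the commonly cited cost constants (0.003125 = 1/320 for the first page, 1/1350 ≈ 0.00074074 per additional page):

```python
# Reproduce the Index Seek I/O cost from rows per page and the cost constants.
rows_per_page = 449
rows = 16385
pages = rows / rows_per_page                    # ~36.49 leaf pages
io_cost = 0.003125 + (pages - 1) * (1 / 1350)   # first page + subsequent pages
print(f"{pages:.2f} pages, I/O cost {io_cost:.7f}")
```

The result agrees with the plan's 0.0294163 to within rounding.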

One reasonable strategy for DOP 2 might then be for the first thread to take the first nineteen pages matching the SARG and the second thread to take the remaining pages.

If SQL Server had a model for the cost of operations with data already in the buffer pool, it might make use of the knowledge that each Loop join inner source access (or key lookup) takes about 2 CPU-μs (1.7μs in our example). A round-robin strategy might divide work into chunks of four pages from the nonclustered index leaf level, corresponding to 1,796 rows for full pages. Each thread grabs one chunk at a time. At 1.7μs per inner source access, a chunk takes 3ms. In this strategy, we want only one thread to access a given leaf level page as much as possible, so we divide work by whole pages instead of by rows, under which the boundary pages might be straddled by two threads.
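The round-robin strategy described above can be sketched as follows (a hypothetical illustration, not SQL Server's actual mechanism; chunk boundaries fall on whole pages so no leaf page is read by two threads):

```python
def page_chunks(total_pages, chunk_pages=4, threads=2):
    """Assign whole leaf pages to threads round-robin in fixed-size chunks."""
    assignment = [[] for _ in range(threads)]
    for i, start in enumerate(range(0, total_pages, chunk_pages)):
        end = min(start + chunk_pages, total_pages)
        assignment[i % threads].append((start, end))  # half-open page range
    return assignment

# 37 leaf pages (16,385 rows at 449 rows/page), chunks of 4 pages
# (~1,796 rows, ~3 ms per chunk at 1.7 us per inner source access)
a = page_chunks(37)
pages_per_thread = [sum(e - s for s, e in t) for t in a]
print(pages_per_thread)  # [20, 17]
```

Even this naive division yields a 20/17 page split, far more balanced than the actual thread distributions shown above.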

To summarize Loop joins and parallelism, there are two main points. One, because of the cost assigned to the expected (modeled) random I/O of the Loop join inner source access, the Loop join gets a parallel plan at a relatively low cost in actual work when data is already in the buffer pool. Two, parallel Loop join effectiveness is degraded by highly uneven distribution of work between threads.

Cost Threshold for Parallelism (CTFP or CTOP)

Most people agree that plan cost 5, perhaps appropriate for the low-hundred-MHz processors of 1998, is too low in the modern era of multi-GHz processor cores. Also, the natures of the processors/cores of these eras are so hugely different that frequency alone is not a good comparison. There is no clear value of CTFP that is good for all cases; some queries really do benefit at 20-30 plan cost units, while others should be deferred until 50-80.
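For reference, CTFP is an instance-level setting; the value 20 below is only an example at the low end of the range discussed, not a recommendation:

```sql
-- Raise Cost Threshold for Parallelism above the 1998-era default of 5.
-- 'show advanced options' must be enabled to change this option.
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'cost threshold for parallelism', 20;
RECONFIGURE;
```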

That said, we can discuss the strategy for division of work between threads. For the Loop join query in question, the outer source Index Seek operation references an index with 449 rows per page. At 16K rows, these are in a span of 36.5 (index) leaf level pages, and 73 pages for 32K rows. It should not be too difficult for SQL Server to figure out an appropriate division of work in this plan.

Parallel Overhead and Speedup

The next topic is evaluating the effectiveness of parallel execution. There are two or three metrics. First is the CPU overhead: the extra CPU necessary to execute the parallel plan over a serial plan. Second is the speed-up: how much faster the parallel plan is relative to the serial plan. A third metric is whether work is evenly distributed among threads. Knowing how to distribute work evenly prior to execution works if all cores (or logical processors) are identical and equally free. There is a potential complexity if some resources are not uniformly available or if cores/threads are unequal, an example being the Intel Alder Lake processor with its P and E cores.

This article is only a basic discussion of parallelism. The more complex discussion would entail how to support parallelism on a NUMA system. Or we could just admit that it is time to give up on NUMA and stick to single socket systems.

Below is the DOP 2 CPU relative to DOP 1 CPU, representing the CPU overhead for Loop and Hash joins. For the Loop join, parallelism begins at 16K rows, although work is only distributed over more than one thread at 32K rows and higher.

[Chart: DOP 2 CPU relative to DOP 1 CPU, Loop join]

The Loop join at DOP 2 actually runs with less CPU between 32K and 64K+ rows. It is uncertain how to attribute this.

At 132K rows and higher, the DOP 2 plan consumes about 10% more CPU. It was already noted in the Loop Join article that the plan cost also exhibited a bump at 132K rows. This was attributed to a difference in the plans, with the DOP 1 plan switching to Hash Match aggregation while the DOP 2 plan stayed with Stream Aggregate.

[Chart: DOP 2 CPU relative to DOP 1 CPU, Hash joins]

The Batch mode hash join exhibits ideal CPU characteristics: the same CPU at each of DOP 1 and 2. This is reflected in the plan having no Parallelism (Repartition Streams) operator. Each thread does its own work with nearly zero messaging between threads.

We have already discussed the Row mode hash join. It has better than "ideal" characteristics in that CPU actually drops dramatically going from DOP 1 to 2. This was attributed to the use of the Bitmap filter, which is not available to DOP 1 hash joins. The reduction in CPU diminishes at higher row counts because fewer rows are filtered out.

Below is the speedup for Loop and Hash joins from DOP 1 to 2 (DOP 1 time divided by DOP 2 time).

[Chart: speedup from DOP 1 to DOP 2 for Loop and Hash joins]

Speedup largely correlates with CPU. If there is no extra CPU for the parallel plan, then the time to execute is halved going from DOP 1 to 2 if work can be evenly distributed.

For the Loop join at 16K and 32K, work was not evenly distributed. At 16K, all rows went to one thread and zero to the other. At 32K, 28.3K went to one thread and 4.5K to the other.
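With that skew, the speedup model from earlier sets a ceiling on the possible gain (row counts approximated from the thread details above; per-row work assumed equal across threads):

```python
# With uneven row distribution, elapsed time follows the busiest thread,
# so the best possible speedup is total rows / max rows on one thread.
def max_speedup(rows_per_thread):
    return sum(rows_per_thread) / max(rows_per_thread)

threads_16k = [16384, 0]        # all rows on one thread: no gain possible
threads_32k = [28300, 4500]     # approximate split from the plan detail

print(round(max_speedup(threads_16k), 2))  # 1.0
print(round(max_speedup(threads_32k), 2))  # 1.16
```

Even with zero parallelism overhead, the 32K-row query could be at most about 1.16X faster than serial given this split.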

Summary

A basic understanding of parallel execution in SQL Server has been presented. We have identified CPU and elapsed time as metrics to compare between serial (non-parallel, DOP 1) and parallel plans, and at varying levels of parallelism. In addition, we should be alert for differences between the serial and parallel plans. The presence of Bitmap filtering, regardless of whether the operator is explicit, is important. The presence of the Parallelism operator may indicate significant additional work for the parallel plan, or it could be a false indicator. It is also possible for the plan to take structurally different forms aside from the parallelism operators, between serial and different levels of parallelism, though this was not discussed here.

It should be apparent that the plan cost model is completely unable to predict parallelism performance cost (extra CPU) and gains (reduced duration). The existing plan model is also somewhat inadequate at controlling when to consider parallelism, due to its treatment of random versus sequential I/O. A topic to be covered in the future is a strategy to assess the effectiveness of parallelism and to decide on the degree of parallelism prior to execution, instead of relying on feedback.
