L2 Cache Size Impact on SQL Server Performance

Edit 2023-04: Update for Raptor Lake

For a long period beginning with the 2008 Nehalem architecture, Intel stayed with a dedicated 256K L2 cache per core, backed by a larger inclusive shared L3 cache, in their mainline processors. This pattern held until 2017, when the Skylake-based Xeon SP increased L2 to 1M, with the L3 implemented as a victim cache. On the client side, L2 increased to 512K in 2019 with the Ice Lake mobile processor. Intel desktop stayed at 256K L2 through the 2019 Comet Lake processor. Then, over a 19-month span in successive generations, L2 increased to 512K with Rocket Lake in Q1-2021, 1.25M with Alder Lake in Q4-2021, and now 2M with Raptor Lake in Q4-2022.

[Figure: Intel L2 cache size by processor generation]

This succession presents an opportunity to investigate the impact of L2 cache size on performance for processors within a moderately narrow window of architectures and manufacturing processes. Our topic of interest is, of course, SQL Server. Note that Intel disables ECC memory capability in client processors. Server systems should have ECC memory, but so should client systems. There are currently Intel Xeon E versions of Comet Lake and Rocket Lake, but not of Alder Lake? Here, we are purely interested in the impact of L2 on performance, not in validating systems without ECC memory. Ideally, we would like to be able to distinguish between L2 cache, core micro-architecture, frequency, and memory, but we must make do with the product variations available.

Background

Long ago, before the era of multi-core processors, I reported performance characteristics for then-contemporary processors, including the Pentium III 733MHz with 256K L2 and the Pentium III Xeon 700MHz with 2M L2. The Xeon had eight times the L2 cache, a 33MHz lower core frequency, and a front-side bus lower by the same amount (133MHz versus 100MHz). We could speculate that the front-side bus difference was not a large effect given the nature of SQL Server memory access, but this is just a guess. A difference between then and now, aside from twenty years, is that L2 was the last cache level for the Pentium III processors, whereas current processors have an L3 behind the L2. One question is: does the large L3 negate the impact of L2 size? Obviously not, or else Intel would not have increased L2 size. Looking over the performance results, we may even wonder why it took so long to move up from 256K.

Test Systems

The three test systems I currently have are:

Comet Lake Core i9-10850K, 3.6GHz base

Rocket Lake Core i7-11700K, 3.6GHz base

Alder Lake Core i9-12900K, 3.2GHz base

All systems were on Windows 11. The Comet Lake and Alder Lake systems were running SQL Server 2022 RTM and the Rocket Lake system was on SQL Server 2019, build 4236. All three systems were set to the Ultimate Performance power option; with a single active thread, frequency appears to be 4.8GHz. The Comet Lake system, as installed, did not show Ultimate Performance as one of the power plan options. The initial setting was High Performance, resulting in a 4.2GHz processor frequency. This was later corrected, and the results in this document now reflect a common setting for all three systems.

All three systems now use Crucial DDR4 3200MT/s CL22 memory. The Comet Lake system memory was 2x16GB, and the Rocket Lake and Alder Lake systems had 2x32GB. SQL Server memory (max server memory) was set to 28GB. On the Alder Lake system, the UEFI detected the memory as 2933MT/s and CL21; this could have been overridden but was not. The belief is that the difference in memory data transfer rate is immaterial. There might be a small advantage for the Comet Lake and Rocket Lake systems, as CL22 at 3200MT/s is 13.75ns while CL21 at 2933MT/s is 14.32ns.
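
For reference, the latency figures above are just the CAS count divided by the memory clock (half the transfer rate):

  CL22 at 3200MT/s: 22 ÷ 1.600GHz ≈ 13.75ns
  CL21 at 2933MT/s: 21 ÷ 1.4665GHz ≈ 14.32ns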

The Rocket Lake system was previously configured with XMP memory rated for 3600MT/s at 16-19-19 latencies. The system was not entirely stable with XMP enabled. When XMP is disabled, the memory defaults to 2133MT/s at CL15.

Also interesting is that Alder Lake has both P and E cores. The P core (Golden Cove) is the continued evolution of the mainline core. There are two modules of four E cores, with each module sharing a 2M L2 cache. The E core (Gracemont) is the latest efficiency-oriented architecture in the Atom product line. Performance results for Alder Lake P and E cores are reported separately, using the SQL Server process affinity setting.
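
A rough sketch of how the affinity split might be set is below. The logical CPU numbering is an assumption: on an i9-12900K, the P cores with Hyper-Threading typically enumerate as logical processors 0-15 and the E cores as 16-23, which should be verified on the actual system.

```sql
-- Restrict SQL Server to the P cores only (assumed logical CPUs 0-15).
ALTER SERVER CONFIGURATION SET PROCESS AFFINITY CPU = 0 TO 15;

-- Or restrict to the E cores only (assumed logical CPUs 16-23).
-- ALTER SERVER CONFIGURATION SET PROCESS AFFINITY CPU = 16 TO 23;

-- Revert to the default when finished.
-- ALTER SERVER CONFIGURATION SET PROCESS AFFINITY CPU = AUTO;
```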

Comment: in the dual-core period, the Intel Conroe 65nm architecture had two cores sharing a large 4M L2 cache, increased to 6M in Penryn 45nm. The next architecture, Nehalem, was quad-core on the client side; each core had a dedicated 256K L2, and all cores shared an 8M L3. At the time, the stated reason for the change from a large shared L2 to a small dedicated L2 plus shared L3 was that a multi-ported cache had latency implications beyond two cores. True, 13 years have passed. But why is it now a good idea for the quad-core E module to share L2?

Each processor/core is a different micro-architecture. Intel claims Rocket Lake has a double-digit percent IPC increase over Comet Lake. The claim for the Golden Cove (P core) of the Alder Lake microarchitecture is a 19% IPC gain at the same frequency compared to the Rocket Lake Cypress Cove core.

Test Tables, Queries and Methodology

Test tables were populated as described in the prior articles. Each pair of tables has an identical distribution (cardinality) on the unique cluster key columns. The set of tables spans a broad range of distributions. Any query joining a matched pair of tables touches the same number of rows regardless of the input parameter value. Each table is 100MB or larger, at least three times the total L3 (30MB on Alder Lake).

Comment: it would be interesting to evaluate 30-100MB tables when AMD Zen 4 based Ryzen processors with V-cache become available. The Zen 4 Core Complex Die (CCD) has eight cores, each with a 1M dedicated L2 cache, and a 32MB shared L3. The V-cache version stacks a 64MB SRAM die on the CCD for 96MB of L3. In the 2-CCD product there is a combined 192MB of L3, though with higher access latency when crossing the IOD to the remote CCD.

The test queries force Loop, Hash, or Merge joins between two tables having matched data distribution (cardinality). Queries are restricted to degree of parallelism (DOP) 1. The output is an aggregate (MAX) of a single column, the last column in the row, so that the impact of the output is minor and the same for all test cases.
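
A minimal sketch of the query shape is below. The table and column names (TestA, TestB, GID, SID, LastCol) are placeholders, not the actual test schema; the join type is forced with a join hint and the plan is held to DOP 1.

```sql
DECLARE @GID int = 100;          -- hypothetical equality search argument value

SELECT MAX(b.LastCol)            -- aggregate the last column to keep output cost small
FROM dbo.TestA a
INNER LOOP JOIN dbo.TestB b      -- or INNER HASH JOIN / INNER MERGE JOIN
    ON b.GID = a.GID AND b.SID = a.SID
WHERE a.GID = @GID
OPTION (MAXDOP 1);               -- restrict to degree of parallelism 1
```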

The test methodology was to run a single thread for multiple iterations, with a target total worker time of several seconds as reported by sys.dm_exec_query_stats. The more rigorous methodology would be to run multiple threads, driving all cores to CPU saturation; this was the approach for the Pentium III tests twenty years ago, but it is a time-consuming process. The single-thread tests, on the other hand, are simple and quick to run with much better automation.
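
A query along these lines, using standard sys.dm_exec_query_stats columns (worker time is reported in microseconds), is one way to collect those totals:

```sql
SELECT  qs.execution_count,
        qs.total_worker_time,    -- CPU microseconds summed over all executions
        qs.total_worker_time / qs.execution_count AS avg_worker_time_us,
        SUBSTRING(st.text, qs.statement_start_offset / 2 + 1,
            (CASE qs.statement_end_offset
                  WHEN -1 THEN DATALENGTH(st.text)
                  ELSE qs.statement_end_offset END
             - qs.statement_start_offset) / 2 + 1) AS stmt_text
FROM sys.dm_exec_query_stats qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) st
ORDER BY qs.total_worker_time DESC;
```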

Note: this article builds on the prior work Join Costs, Join 2-SARG and Plan Cost Cross-over, published on LinkedIn and QDPMA. Some details are in the prior works and not repeated here.

Caveats with the single-thread test include the following:

1) A single core may be able to run at a higher frequency than all cores together, but given that SQL Server does not use floating point, this type of workload may not be thermally limited.

2) There is no assurance that, with all cores active, some form of contention would not appear. In the Pentium III testing methodology, thread contention was observed at the table lock escalation point.

3) The benefit, positive or negative, of Hyper-Threading is not known.

Performance Model

Query performance is presented using a model of a base cost plus the cost of incremental rows. The base is the cost to start the query execution plan; each subsequent row incurs additional cost. Obviously, this is a highly over-simplified model that cannot represent the true complexity of the underlying operations, but it is sufficient to display the measured query CPU time in a more digestible manner.

As described, the base cost is the measured cost of the query at 1 row minus the cost per row, with the per-row cost estimated from queries touching more rows.

  Query Cost = Base Cost + Rows × Cost per Row

The first problem with this model is that the step from the query touching a single row in each of the two joined tables to the query touching two rows is larger than the step for subsequent rows.

If I had to guess, I would say the SQL Server query optimizer sees that there can be only one row in the first query and so there is no need to aggregate. Regardless of the actual reason, for the test query form, it is better to treat the 2-row query as the baseline.

From there, we estimate the per-row cost, then subtract twice the estimated per-row cost from the two-row query worker time to arrive at the base cost.
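
Written out, with a larger n-row query supplying the slope and the 2-row query as the baseline:

  Cost per Row ≈ (Query Cost at n rows - Query Cost at 2 rows) ÷ (n - 2)
  Base Cost = Query Cost at 2 rows - 2 × Cost per Row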

Join Base Cost

Proceeding with the overly simplified cost model, using the 2-row query as the baseline, the Loop Join 1S and 2S base costs are shown below. Units are CPU-microseconds (μs).

[Chart: Loop join 1S and 2S base costs, CPU-μs]

In the Loop 1S query, the rows from the outer source are grouped together based on an equality search argument value. The outer source joins to an inner source table having a clustered index on a single sequentially increasing key column, functionally an identity value except that it was explicitly populated. The data in these two tables is populated such that successive accesses to the inner source go to different pages until there are more than 64K rows from the outer source.

For the Loop 2S query, both tables are clustered on the same grouping column (GID), used as the equality SARG. The second cluster key column (SID) is the join condition, resulting in accesses to successive rows. The database engine still navigates the index B-tree, but the expectation is that the needed data is more likely to be in the processor (core) L2 cache.

The Merge join query uses the same two tables as the Loop 2S query with base costs shown below.

[Chart: Merge join base costs, CPU-μs]

The unique clustered index on each table has two columns. First is the grouping column, for which an equality SARG is specified. The second is a sequential value forming a unique pair when combined with the first column. The join condition is on the second column. The first column is also in the join condition, providing the equality SARG for the second table.

There is another column, BID, having the same value as SID. When the join condition for the second table is on this column, then specifying a MERGE join requires the results to be sorted prior to the join, shown as Sort + M (sort before merge).

[Chart: Sort + Merge join base costs, CPU-μs]
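
A sketch of the two query shapes is below, again with placeholder names. The BID column carries the same values as SID but is not part of the clustered key, so the optimizer cannot know the rows already arrive in BID order.

```sql
DECLARE @GID int = 100;   -- hypothetical equality SARG value

-- Plain Merge: join on the second cluster key column (SID), already in order.
SELECT MAX(b.LastCol)
FROM dbo.TestA a
INNER MERGE JOIN dbo.TestB b
    ON b.GID = a.GID AND b.SID = a.SID
WHERE a.GID = @GID
OPTION (MAXDOP 1);

-- Sort + Merge: join on BID instead, forcing a Sort ahead of the Merge join.
SELECT MAX(b.LastCol)
FROM dbo.TestA a
INNER MERGE JOIN dbo.TestB b
    ON b.GID = a.GID AND b.BID = a.SID
WHERE a.GID = @GID
OPTION (MAXDOP 1);
```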

Below are the Hash join base costs also using the 2-row query as baseline.

[Chart: Hash join base costs, CPU-μs]

We can clearly see a significant reduction in base cost from Comet Lake with 256K L2 to Rocket Lake with 512K L2, and again to the Alder Lake P core with 1.25M L2. This is attributed to the change in L2 cache size rather than to differences in architecture. The reasoning is that on starting a new operation, architectural improvements do not help much until the L1 and L2 caches hold the right data; until then it is largely an exercise in memory access, which is not very different between these systems.

On completing one query, the same query is executed again in a separate RPC (LPC), with the parameter/variable set to a generated random (in-range) value. The expectation is that some necessary data (pages) resides in the processor cache, while other data must be fetched from memory.
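
Each iteration from a client test harness might look like the following sketch, arriving at the server as its own RPC with a fresh random parameter. The names and the 1 to 1000 range are made-up examples.

```sql
DECLARE @GID int = 1 + ABS(CHECKSUM(NEWID())) % 1000;   -- random in-range value

EXEC sp_executesql
    N'SELECT MAX(b.LastCol)
      FROM dbo.TestA a
      INNER LOOP JOIN dbo.TestB b ON b.GID = a.GID AND b.SID = a.SID
      WHERE a.GID = @GID
      OPTION (MAXDOP 1);',
    N'@GID int', @GID = @GID;
```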

Technically, we should use a tool that can track the L2 cache hit rate to validate this assertion. If anyone has experience with Intel Performance Counter Monitor, please advise.

If it is true that the difference in L2 cache is the main source of the difference in base cost, then what is the contribution of the large L3? In the Pentium III tests twenty years ago, between 256K and 2M L2, there was no L3 cache behind the L2.

The Alder Lake E cores have 2M L2 shared among four cores. In the single-thread test, the one active core should have access to the entire 2M L2, yet the base cost is only slightly lower than Comet Lake. Are core architecture or dedicated-versus-shared L2 factors here?

Another guess would be the type of SRAM cell used for the L2 cache. Intel has three different SRAM cell types per WikiChip: High Performance (0.0441μm²), High Density (0.0312μm²) and Low Voltage (0.0367μm²). We expect Alder Lake uses the high-performance SRAM for the P-core L2. Which SRAM does the Gracemont E core use for its L2? From the die images available, the 2M E-core L2 almost seems smaller than the 1.25M P-core L2?

Loop Join Cost Per Row

If our model of base cost plus per-row cost is good, then the complete query cost minus the base cost, divided by rows, should be a flat line.

  Per Row Cost = (Query Cost - Base Cost) ÷ Number of Rows

Below is the Loop 1S cost per row (vertical axis) versus rows (horizontal axis) for Comet Lake (256K L2), Rocket Lake (512K L2), and the Alder Lake P cores (1.25M L2) and E cores.

[Chart: Loop 1S CPU per row versus rows]

While this is not a pretty line, it is sufficiently flat to support the fixed-plus-incremental cost model. Above 64K (65,536) rows, the cost per row drops as successive inner source accesses become more likely to go to the same leaf-level page as the preceding row.

Rocket Lake is a substantial improvement over Comet Lake. Alder Lake is comparable to Rocket Lake; the difference is within run-to-run variation. The eyeball estimates for Loop Join 1S (random access to the inner source) are 1.4μs per row for Comet Lake, 1.2 for Rocket Lake and the Alder Lake P core, and 1.7 for the Alder Lake E core.

In the initial tests, with Comet Lake on the High Performance setting (4.2GHz), cost was comparable to the Alder Lake E core. It is unclear whether the difference on Comet Lake between High and Ultimate Performance is due to frequency or to variations from power management activity. It would be helpful if the test could run at a lower frequency but otherwise with power management disabled.

The Loop 1S join is expected to be an exercise in memory roundtrips, touching memory locations spread over a range far larger than the processor cache, even the combined L3. The memory round-trip latency differences between the three systems are not large, even for Rocket Lake at the 2133MT/s data rate. The access pattern is predictable to us because we know how the data was populated; for example, at 4096 rows, successive rows are separated by exactly sixteen pages at the leaf level. But SQL Server must navigate the B-tree indexes without this knowledge. The difference of 0.4μs per row is about five memory roundtrips (at 75-80 nanoseconds each), or 1,600 CPU-cycles for a 4GHz core.
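
Working the arithmetic: 0.4μs ÷ 80ns ≈ 5 roundtrips, and 0.4μs × 4GHz = 1,600 cycles.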

Below are the Loop 2S costs per row. Now we have as flat a line as we can get from measured data, with some discrepancy at lower row counts.

[Chart: Loop 2S CPU per row versus rows]

The Loop 2S join shows separation between all the processor core architectures. Whereas Loop 1S was an exercise in memory roundtrips, Loop 2S touches memory locations in sequence, even though B-tree index navigation is required for each inner source access. Here we could be seeing the benefit of L2 cache size as well as any IPC improvements between processor architectures.

There is a minor glitch near 8890 and 8891 rows. This is the row-lock to table-lock escalation point, which does not seem to have a significant effect on the Loop join.

There is a drop in CPU per row between 131,071 and 131,072 rows. This is due to the aggregation operation changing from a Stream Aggregate in Row mode to a Hash Match aggregate in Batch mode.

[Execution plan: Stream Aggregate in Row mode versus Hash Match aggregate in Batch mode]

Notice the Stream Aggregate Estimated Execution Mode is Row while the Hash Match Estimated Execution Mode is Batch. Only the aggregation becomes Batch mode, not the Loop join.

[Execution plan operator details showing Estimated Execution Mode]

The Loop 2S query should run largely in processor cache (distinct from the SQL Server buffer cache) and is expected to illustrate differences in both frequency and processor architecture IPC. The performance gain from Comet Lake to Rocket Lake is about 21%, with a further gain of 24% for the Alder Lake P core. Alder Lake E-core performance is about 5% lower than Comet Lake.

It might be reasonable for the Alder Lake P core to be more than 50% faster than the Gracemont E core, but did anyone expect the Alder Lake E core to be close to Comet Lake? It is true that the SQL Server engine does not come close to using the full superscalar width of Intel's mainline cores. The E cores could be a good choice for throughput-oriented workloads.

Merge Join Cost Per Row

Below is the Merge join CPU per row versus rows. We can see that at the lower row counts, 4-16 rows, the base cost plus cost-per-row model is not entirely valid; the interpretation is that there are additional costs at low row counts beyond the base cost.

[Chart: Merge join CPU per row versus rows]

At 8891 rows, there is a sudden drop in cost per row compared to 8890 rows and below. If we were to chart the Merge join costs at read committed together with read uncommitted transaction isolation, we would see that the costs at 8891 rows and above line up. At read uncommitted, the lower cost of the upper range occurs at lower row counts as well.

The interpretation is that by default (read committed), row locks are employed up to 8890 rows; at 8891 rows and above, a table lock is taken. The cost of the row lock is just over 0.3μs per row for Comet Lake and the Alder Lake E cores, and just over 0.2μs per row for Rocket Lake and the Alder Lake P cores.
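
The read-uncommitted variant of the comparison can be sketched with table hints (or with SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED), which removes the per-row shared-lock work below the escalation point; names remain placeholders.

```sql
DECLARE @GID int = 100;

SELECT MAX(b.LastCol)
FROM dbo.TestA a WITH (NOLOCK)
INNER MERGE JOIN dbo.TestB b WITH (NOLOCK)
    ON b.GID = a.GID AND b.SID = a.SID
WHERE a.GID = @GID
OPTION (MAXDOP 1);
```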

As in the Loop join, at 131,072 rows, the Aggregation operator changes to Batch mode. The Merge join itself remains Row mode, and there is a small performance gain from just the Aggregation changing to Batch mode.

Hash Join Cost Per Row

Below is Hash join CPU per row versus rows. This has the worst fit to the base plus per additional row cost model of the joins evaluated.

[Chart: Hash join CPU per row versus rows]

Below is the Hash join complete query CPU in microseconds versus rows at the lower row counts. As mentioned earlier, there is a cost for the single-row query, and the step up from one to two rows is larger than the incremental cost per row at higher row counts.

[Chart: Hash join query CPU-μs versus rows, low row counts]

Hence, the base cost was set at the 2-row query, which is also reasonable for the other join types.

For the Hash join, the incremental cost per row between two and eight rows is erratic; there can be a jump, or almost no added cost. Above 8 rows, the incremental cost model is consistent, as shown below.

[Chart: Hash join CPU per row versus rows, above 8 rows]

Below is the Hash join CPU per row versus rows at higher row counts, from 1K to 256K (binary). There are two items of interest: first, the lock-level transition between 8890 and 8891 rows; second, the Batch mode transition at 128K (binary, 131,072 decimal) rows.

[Chart: Hash join CPU per row versus rows, 1K to 256K]

As in the Merge join, the lock transition indicates the row lock cost is about 0.3 CPU-μs for Comet Lake and the Alder Lake E cores, and about 0.2 CPU-μs for Rocket Lake and the Alder Lake P cores.

The Loop and Merge joins exhibit a small performance gain with Batch mode at 131,072 rows because only the aggregation operator runs in Batch mode. In the Hash join, the Hash Match join operator itself also runs in Batch mode, and the performance gain is a factor of two or more.

It is possible to induce Batch mode with false statistics. Batch mode shows excellent performance gains over Row mode down to 16K rows; lower cardinality was not tested. Here, Batch mode is induced by grafting the statistics from an appropriate table, because in each test table all group (GID) values have the same cardinality. In a "real" table, we would simply use OPTIMIZE FOR and pick an appropriate value that has the desired cardinality, as sketched below. The tradeoff, of course, is a higher memory grant.
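
A sketch of the OPTIMIZE FOR approach; the literal 500 is a made-up stand-in for whatever parameter value is known to have the desired (high) cardinality, and the names remain placeholders.

```sql
DECLARE @GID int = 3;   -- actual runtime value, possibly a low-cardinality group

SELECT MAX(b.LastCol)
FROM dbo.TestA a
INNER HASH JOIN dbo.TestB b
    ON b.GID = a.GID AND b.SID = a.SID
WHERE a.GID = @GID
OPTION (MAXDOP 1, OPTIMIZE FOR (@GID = 500));   -- cost the plan as if a high-row-count group
```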

What is surprising is that Rocket Lake has better performance than Alder Lake in the Hash join when Batch mode is active. Is there a reason for this? Did SQL Server correctly detect the L2 cache size for Alder Lake P core?

Sort, Merge Join Cost Per Row

Below is the Sort then Merge join CPU per row versus rows. There are two points to note. First, the lock transition at 8890-8891 rows does not exist.

[Chart: Sort + Merge join CPU per row versus rows]

The belief here is that the Sort plus Merge simply takes a table lock from the start. This is why the cost of the Sort plus Merge join is less than the plain Merge join at read committed up to 8890 rows, while the plain Merge is less expensive, as expected, above the lock transition at 8891 rows, when both queries are on an equal lock-level footing.

Second, the Batch mode transition at 128K rows has a larger impact than for the plain Merge join, but less than for the Hash join. This is because the Sort operation is done in Batch mode while the Merge join remains in Row mode.

Summary

There are many different topics we could have covered with this data set: the differences attributed to L2 cache size; the differences in processor architecture from Comet Lake to Rocket Lake (Cypress Cove) to Alder Lake, both the Golden Cove P core and the Gracemont E core. Additional tests on the same systems could have included frequency variations (turbo disabled). Within SQL Server, the topic could have been Row versus Batch mode operation.

However, the focus was on L2 cache size. There are strong indicators that L2 cache size impacts the startup (base) cost of a query, as was true twenty years ago. Architecture and frequency should not matter much when the key data is not yet in cache (though the E-core results say otherwise?). That the same L2 effect occurs in modern processors, with a core-dedicated L2 backed by a processor-shared L3, indicates the large L3 does not mitigate a small L2. Intel may have known this for some time but only recently felt pressure to increase L2 size?

A secondary topic worth discussing is processor core architecture. Rocket Lake has better performance than Comet Lake in all tests, for both base cost and successive rows. It is possible the performance gain on successive rows is attributable to IPC gains rather than cache size. Alder Lake is comparable to Rocket Lake in Loop 1S, and comparable to some degree in the Merge, Hash and Sort + Merge joins. Alder Lake is better than Rocket Lake in Loop 2S, in the Merge join above the lock escalation point, and in the Sort + Merge up to the Batch mode transition point. Rocket Lake is better than Alder Lake in the Hash and Sort + Merge joins above the Batch mode transition point.

The surprise here (or not?) was that the Alder Lake E core is not far behind Comet Lake. We must be cautious with this result because these are single-active-thread tests only; a full system saturation test should be run before drawing firm conclusions. The comparison should then be both logical processors of one P (mainline) core against the four E cores of one module.

An explanation from Microsoft on the Alder Lake P core Hash join Batch mode result would be appreciated. Of course, a higher priority would be for SQL Server to correctly identify a mixed-core system, in which P cores have Hyper-Threading while E cores do not. A proper strategy for using each type of core is necessary, and for parallel execution plans to use mixed cores correctly, a different method of allocating work between threads is needed; this should be done for other reasons as well.

References

The previous articles (on LinkedIn): Join Costs covered joins with a search argument on one source only, at DOP 1. Parallel plan joins were discussed in Parallel Plan Cost at DOP 2. Joins with search arguments on each source were covered in Join 2-SARG.

This is the fifth article in the Plan Cost versus Actual series, beginning with Plan Cost Cross-over and Key Lookup, followed by the Join and Parallel Plan articles above. Additional details are on QDPMA: Loop, Merge and Hash Joins.

Dima Pilugin: SQL Server 2019: Batch Mode on Rowstore

Microsoft SQL Server Team: Introducing Batch Mode on Rowstore. Microsoft documentation (learn.microsoft.com): Query processing architecture guide, the Joins topic, and the cost threshold for parallelism configuration option.
