Filtered Statistics Tricks in SQL Server
Data distribution statistics is one of the foundational elements of cost-based query optimization in modern relational database engines. A statistical method reduces the complete data set to a compact structure while still retaining some ability to estimate the distribution. It is not exact; if it were, it would not be statistics. The emphasis here is that the primary intended mode of operation is to cache a compiled plan for later reuse, including execution with different parameter values. For that purpose, there is no point in arguing for improved accuracy of the statistical method with a brute force approach: the plan will execute with different parameter values, in turn having different numbers of rows, anyway.
The question of actual importance is whether the plan compile parameters are representative of normal or all possible values. This is one area where the data professional should be expected to contribute value. Situations in which certain compile parameter values lead to different or undesirable execution plans can be handled any number of ways. Note that the new parameter sensitive plan (PSP) optimization feature of Intelligent Query Processing in SQL Server 2022 can cache up to three different versions of the execution plan. Also see the Microsoft Cloud Blogs on Intelligent Query Processing.
SQL Server employs a histogram in its statistical method; there is a description of the method in Microsoft Learn under Statistics. With some knowledge of how the SQL Server histogram works, we can deliberately construct a table with a column data distribution that is difficult for this method to model. The intent is not to demonstrate a weakness, or to assert that this contrived distribution represents an actual distribution. In fact, if an actual distribution were to have this characteristic without a work-around, then the person responsible should consider a career change to an area where they will do less damage.
The outcome of this demonstration illustrates the value of in-depth knowledge of the internal workings of SQL Server or other database engines. There may be a belief or expectation that the software architect only needs to build to the business logic, perhaps following the latest fashion (Agile, Scrum, etc.). If so, prepare to be disappointed. Whatever weakness or pitfall exists in an underlying product can be worked around. Of course, it helps to know in advance what needs to be worked around. Also, consider that you are not the first to encounter a particular problem. Someone else probably has encountered the same or similar problem and may have described a solution.
Histogram
The main data structure in SQL Server statistics is a histogram. Per Microsoft documentation, "To create the histogram, the Query Optimizer sorts the column values, computes the number of values that match each distinct column value, and then aggregates the column values into a maximum of 200 contiguous histogram steps ..." The document further describes how the histogram is built in three steps: 1) Histogram initialization, 2) Scan with bucket merge and 3) Histogram consolidation.
It is clearly stated that the histogram has a maximum of 200 steps. This may cause confusion, as it is possible for DBCC SHOW_STATISTICS to show up to 201 rows. I believe the reason is that the histogram only accumulates non-NULL values. I did not find in the binary a separate, secret storage location for the number of NULL values. See Decoding Stats Stream.
The number of NULL values can be computed from the total number of rows by subtracting the sum of the Range Rows and Equal Rows. This is shown as the first line in the histogram when NULL values exist. Technically, the NULL value is not part of the histogram data structure, but it is shown in the histogram portion of DBCC SHOW_STATISTICS.
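As a cross-check, the same computation can be run against the statistics DMVs. This is a minimal sketch, assuming SQL Server 2016 SP1 CU2 or later for sys.dm_db_stats_histogram, and it references the test table dbo.S and index IXG built later in this article.

-- NULL count = total rows minus the sum of range rows and equal rows over the non-NULL steps
SELECT p.rows
 , SUM(h.range_rows + h.equal_rows) histogram_rows
 , p.rows - SUM(h.range_rows + h.equal_rows) null_rows
FROM sys.stats s
CROSS APPLY sys.dm_db_stats_properties(s.object_id, s.stats_id) p
CROSS APPLY sys.dm_db_stats_histogram(s.object_id, s.stats_id) h
WHERE s.object_id = OBJECT_ID('dbo.S') AND s.name = 'IXG'
 AND h.range_high_key IS NOT NULL -- exclude the displayed NULL step, if any
GROUP BY p.rows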
I have observed that the histogram shows the most frequent values as a Range high key when they are not part of a contiguous block of comparably frequent values. It is not explicitly stated in the documentation that SQL Server statistics tries to do this. If this can be inferred from the description of Scan with bucket merge, please advise.
When a value is a Range high key, its specific distribution is stored in the Equal Rows column. Values in between Range high key values have a distribution characterized by the Range rows and Distinct range rows, from which the Average range rows of a particular value can be computed.
From the description of the histogram creation process, the lowest value is the first histogram step (except for the artificial NULL step, if present) and the highest value is the last histogram step.
Test Table
With this understanding of how SQL Server creates the histogram, we shall deliberately construct a data distribution that is difficult for this histogram methodology to model. Note that regardless of the histogram method, it should be possible to construct a distribution that the method has difficulty modeling.
In our table, the intent is to have 210 values with cardinality 20,000, 1,895 values with cardinality 700, and 18,950 values with cardinality 40, representing high, middle and low cardinality values. The total number of values is 210 + 1,895 + 18,950 = 21,055.
The cardinality 20,000 values will be multiples of 100 from 100 up to 21,000. The cardinality 700 values will be multiples of 10 from 10 up to 21,050, excluding multiples of 100. The cardinality 40 values run from 1 up to 21,055, excluding multiples of 10.
There are more high cardinality values (210) than can each be assigned a Range high key within the 200-step histogram limit. The high cardinality values are not contiguous, so it is also impossible to have a block of only high cardinality values; any block between Range high keys must contain a mix of cardinality values.
Furthermore, both the lowest and highest values are low cardinality, meaning two histogram steps are taken for these values.
The 210 values with cardinality 20,000 comprise 4,200,000 rows. There are 1,326,500 rows for the 1,895 values with cardinality 700, and 758,000 rows for the values with cardinality 40. The total number of rows is 6,284,500.
The table definition is below.
CREATE TABLE dbo.S (
 ID bigint NOT NULL, G bigint NOT NULL, H bigint NOT NULL, R bigint NOT NULL,
 A int NOT NULL, B int NOT NULL, M bigint NOT NULL, N bigint NOT NULL,
 CreateD datetime2(3) NOT NULL, ModifyD datetime2(3) NOT NULL,
 CONSTRAINT PK_S PRIMARY KEY CLUSTERED (ID))
The column ID is sequential from 1, incrementing by 1, an identity in effect except that it is explicitly populated. The column G has the exact distribution described, with a degree of randomness in the location of its rows. The column H also has the exact distribution described, but with the rows for a common value populated sequentially.
The column A indicates whether column G belongs to the high, middle, or low cardinality group, with values 1, 2, and 3 respectively. Column B provides the same for column H.
Population Script
The method for achieving the desired distribution is as follows. The total number of rows at each cardinality is calculated (4,200,000, 1,326,500, and 758,000 for high, middle and low respectively). The block of numbers from 1 to 4,200,000 is assigned to the high cardinality values. The block from 4,200,001 to 5,526,500 (1,326,500 rows) is assigned to the middle, and the block from 5,526,501 to 6,284,500 (758,000 rows) is assigned to the low cardinality values.
In the high cardinality block, the first 20,000 rows are mapped to the first high cardinality value of 100, the next 20,000 rows to the second (200), and so on. In the middle cardinality block, the first 700 rows are assigned to the first middle cardinality value 10, the next 700 rows to the second value, 20, and so on. The same method applies to the low cardinality block. The concept is similar to the Monte Carlo method of mapping uniform random numbers to an arbitrary distribution, simplified (it has been forty plus years since I did scientific computing, and the memory is hazy).
We could generate the value range 1 to 6,284,500 directly from the SQL RAND function, which returns a float, converted to an integer after multiplying by the desired range. This is simplest but does not generate distinct values; there will be duplicates. A script for this is in the Appendix. The end distribution will still be good.
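A minimal sketch of that idea (an assumption as to the exact form; it cross joins dbo.Nums with itself only to supply enough rows):

SELECT TOP (6284500)
 CONVERT(int, RAND(CHECKSUM(NEWID()))*6284500) + 1 R -- random integer in 1..6,284,500, duplicates allowed
FROM dbo.Nums a CROSS JOIN dbo.Nums b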
The method used to produce all integer values from 1 to 6,284,500 in random order is as follows (see the Appendix for dbo.Nums).
CREATE TABLE #T1(ID bigint NOT NULL PRIMARY KEY, R float NOT NULL)
CREATE TABLE #T2(I bigint IDENTITY NOT NULL PRIMARY KEY
, ID bigint NOT NULL, R float NOT NULL)
SET NOCOUNT ON
DECLARE @N int = 10574, @I int = 0, @tb bigint = 6284500
WHILE @I*@N < @tb
BEGIN
 INSERT #T1(ID, R)
 SELECT ID, R FROM (
  SELECT @I*@N+I ID, RAND(CHECKSUM(NEWID())) R FROM dbo.Nums
 ) a WHERE ID <= @tb
 SELECT @I = @I+1
END
GO
Above, the table dbo.Nums is populated with contiguous integers from 1 to 10574.
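One step is implied but not shown above: #T2 is presumably filled from #T1 ordered by the random value R, so that the IDENTITY column I assigns each ID a random position. A sketch of that assumed step:

-- Assumed intermediate step: identity values are assigned in ORDER BY sequence,
-- giving a random permutation of the IDs in #T2
INSERT #T2(ID, R)
SELECT ID, R FROM #T1 ORDER BY R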
The final population script is below.
SET NOCOUNT ON
DECLARE @d1 int = 20000, @d2 int = 700, @d3 int = 40
, @r1 int = 210, @r2 int = 1895, @r3 int = 18950, @ra int, @rb int
, @t1 int, @t2 int, @t3 int, @ta int, @tb int
, @Dt1 datetime2(3) = GETDATE()
SELECT @t1 = @d1*@r1, @t2 = @d2*@r2, @t3 = @d3*@r3
SELECT @ra = @r1+@r2, @rb = @ra+@r3
SELECT @ta = @t1+@t2, @tb = @ta+@t3
INSERT dbo.S(ID, G, H, R, A, B, M, N, CreateD, ModifyD)
SELECT ID
 , CASE A WHEN 1 THEN 100*M WHEN 2 THEN 10*(M+(M-1)/9) WHEN 3 THEN M+(M-1)/9 END G
 , CASE B WHEN 1 THEN 100*N WHEN 2 THEN 10*(N+(N-1)/9) WHEN 3 THEN N+(N-1)/9 END H
 , R, A, B, M, N, @Dt1, DATEADD(ms, ID%1000, @Dt1)
FROM (
  SELECT S+1 ID, R+1 R
  , CASE WHEN R<@t1 THEN 1 WHEN R<@ta THEN 2 ELSE 3 END A
  , CASE WHEN S<@t1 THEN 1 WHEN S<@ta THEN 2 ELSE 3 END B
  , CASE WHEN R<@t1 THEN R/@d1+1 WHEN R<@ta THEN (R-@t1)/@d2+1 ELSE (R-@ta)/@d3+1 END M
  , CASE WHEN S<@t1 THEN S/@d1+1 WHEN S<@ta THEN (S-@t1)/@d2+1 ELSE (S-@ta)/@d3+1 END N
  FROM (SELECT I-1 S, ID-1 R FROM #T2) a
) c WHERE ID <= @tb
GO
The above script has an exact distribution for each value. Some variation can be introduced by substituting the expression below for the subquery inside alias c above.
  SELECT ID+1 ID, R+1 R
  , CASE WHEN R<@t1 THEN 1 WHEN R<@ta THEN 2 ELSE 3 END A
  , CASE WHEN S<@t1 THEN 1 WHEN S<@ta THEN 2 ELSE 3 END B
  , CASE WHEN R<@t1 THEN R/@d1+1 WHEN R<@ta THEN (R-@t1)/@d2+1 ELSE (R-@ta)/@d3+1 END M
  , CASE WHEN S<@t1 THEN S/@d1+1 WHEN S<@ta THEN (S-@t1)/@d2+1 ELSE (S-@ta)/@d3+1 END N
  FROM (
   SELECT S ID
   , S+CONVERT(int, (CASE WHEN S<@t1 THEN @d1 WHEN S<@ta THEN @d2 ELSE @d3 END)*R2) S
   , R+CONVERT(int, (CASE WHEN R<@t1 THEN @d1 WHEN R<@ta THEN @d2 ELSE @d3 END)*R2) R
   FROM (SELECT I-1 S, ID-1 R, 0.25*RAND(CHECKSUM(NEWID()))-0.125 R2 FROM #T2) a
  ) b
Indexes and Statistics
After data population, rebuild the clustered index (primary key), which regenerates its index statistics with a full scan, then create nonclustered indexes on the columns having the high, middle, low distribution values.
ALTER INDEX ALL ON dbo.S REBUILD WITH(SORT_IN_TEMPDB=ON, MAXDOP=1, FILLFACTOR=100)
CREATE INDEX IXG ON dbo.S(G) WITH(SORT_IN_TEMPDB=ON, MAXDOP=1, FILLFACTOR=100)
CREATE INDEX IXH ON dbo.S(H) WITH(SORT_IN_TEMPDB=ON, MAXDOP=1, FILLFACTOR=100)
Queried Distribution
The query below verifies distributions.
SELECT A, COUNT(*) ct, MIN(G) MiG, MAX(G) MxG, MIN(c) MiC, AVG(c) AvC, MAX(c) MxC
FROM (SELECT A, G, COUNT(*) c FROM dbo.S GROUP BY A, G) a
GROUP BY A ORDER BY A
A similar query is used with columns B and H substituting for A and G, as sketched below.
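For reference, that variant is:

SELECT B, COUNT(*) ct, MIN(H) MiH, MAX(H) MxH, MIN(c) MiC, AVG(c) AvC, MAX(c) MxC
FROM (SELECT B, H, COUNT(*) c FROM dbo.S GROUP BY B, H) a
GROUP BY B ORDER BY B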
Below are the results for a previous data distribution of 214 high, 1,930 middle and 19,300 low cardinality values at 10,000, 1,000 and 100 rows per value respectively, with some degree of randomness.
The number of rows for each value falls in a narrow band: between 9,746 and 10,327 for high, 888 to 1,106 for middle, and 59 to 138 for low.
Check the distribution statistics of the index IXG (per-value cardinality close to, but not exactly, the target) as follows.
DBCC SHOW_STATISTICS('dbo.S',IXG)
Below are the first nine rows of the histogram. Note this is FullScan statistics.
Below are the last nine rows of the histogram.
As expected, the SQL Server data distribution statistics captures the low and high values as Equal rows. Here, we have deliberately made the low and high values low cardinality, which leaves only 198 steps for values in between. Even then, SQL Server statistics consolidated the available 200 down to 192 steps. Furthermore, the second-to-top step is just one less than the top value, with zero range rows between the second-to-top and top steps.
Of the interior steps, almost all are high cardinality values; a few are middle cardinality values. When successive steps are high cardinality values, for example 100 and 200, the number of range rows at exact distribution would be 9*1000 + 90*100 = 18,000, the number of distinct range rows is 99, and the average range rows is about 180.8.
For steps 200 and 310, there are 109 distinct range rows and 28,900 range rows at exact distribution, for 265 average range rows.
For the current distribution of 210, 1,895 and 18,950 high, middle and low cardinality values at 20,000, 700 and 40 rows per value respectively, with some degree of randomness, the queried distribution is below.
DBCC SHOW_STATISTICS for full scan is below.
As before, the first and last steps represent the low and high values respectively. Also as before, the second-to-top step is one less than the top value, and all three (low, second from top, and top) are low cardinality values by deliberate decision.
For steps between successive high cardinality values, 200 and 300 for example, there are 99 distinct range rows, 9*700 + 90*40 = 9,900 range rows at exact distribution, and 100 average range rows. When successive steps span two high cardinality values, at exact distribution there are 20,000 + 18*700 + 180*40 = 39,800 range rows for 200 average range rows.
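These figures can be spot checked directly against the table. For the interval between the high cardinality values 200 and 300, the query below should return roughly 9,900 rows over 99 distinct values (exactly so with the exact-distribution population script):

-- interior of one histogram step range: values strictly between 200 and 300
SELECT COUNT(*) range_rows, COUNT(DISTINCT G) distinct_range_rows
FROM dbo.S WHERE G > 200 AND G < 300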
Test Queries
Below are test queries using knowledge of distribution statistics for the index in question.
SELECT * FROM dbo.S WHERE G = 1 AND 1=1
SELECT * FROM dbo.S WHERE G = 900 AND 1=1 OPTION(MAXDOP 1)
SELECT * FROM dbo.S WHERE G = 1000 AND 1=1
The meaningful part of each query ends with G = value; the trailing 1=1 is there to suppress auto-parameterization and can be ignored.
The first query has search argument (SARG) value 1, which is the first step's Range high key, so the query optimizer can use the Equal rows value of 64.
The second query, with SARG 900, is also a Range high key, with cardinality 20,030. Notice the operation is a Clustered Index Scan with a Subtree Cost of 53.0066.
If we force the query to seek, the Key Lookup plan has a Subtree Cost of 56.68, as shown below.
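For reference, a sketch of the hinted form of the second query (assuming the FORCESEEK table hint is how the seek is forced):

SELECT * FROM dbo.S WITH(FORCESEEK) WHERE G = 900 AND 1=1 OPTION(MAXDOP 1)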
The third query has a SARG value of 1000, which is between two Range high keys; hence the plan estimates 199.56 rows, when we happen to know it is one of the high cardinality values with 20,000 rows.
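The actual count behind that estimate can be confirmed directly; with the exact-distribution script this returns exactly 20,000, and close to that with the randomized variant:

SELECT COUNT(*) actual_rows FROM dbo.S WHERE G = 1000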
Even though the statistics estimate is significantly off in this case, just under 200 rows instead of 20,000, we know from Key Lookup that the crossover from Index Seek plus Key Lookup to table scan occurs far too soon when data is in the buffer cache, so we actually want the plan generated with the lower estimated number of rows.
Application Architecture
The general understanding is that the application architect need only focus on the business logic. With a properly normalized database and good SQL, the belief is that everything should work, perhaps with some tuning effort.
Yet in this statistics example, we encounter a problem that will not be solved even with a statement recompile. Whether it should be solved is a different matter. This is because the parameter value could fall in the range between Range high key values yet have a distribution very different from what the statistics imply.
Accurate row estimates can be critically important in the formation of a good execution plan. No statistical method is perfect; otherwise, it would not be a statistical method. The method employed by SQL Server uses a histogram data structure as described, with certain capabilities and weaknesses.
With this knowledge, we can devise a strategy that plays to its strengths and avoids its weaknesses. Consider the example of a Customers table and an Orders table. In a database designed strictly to software and database principles, the Customers table may have CustomerID as an identity column, and the Orders table has CustomerID as a foreign key. Some customers place orders frequently and others infrequently. If there are more than 200 high volume customers, then the histogram method may not be able to capture every top customer as a Range high key.
Suppose we knew at the time of account creation the approximate order volume a customer is expected to generate. Instead of CustomerID being an identity column, it is assigned from one of three sequences: the first sequence starts at 10001, the second at 100001, and the third at 10000001. The statistics for this are shown below.
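A minimal sketch of that assignment scheme (the sequence names here are illustrative assumptions):

CREATE SEQUENCE dbo.SeqCustomerHigh AS int START WITH 10001 INCREMENT BY 1
CREATE SEQUENCE dbo.SeqCustomerMiddle AS int START WITH 100001 INCREMENT BY 1
CREATE SEQUENCE dbo.SeqCustomerLow AS int START WITH 10000001 INCREMENT BY 1
GO
-- At account creation, draw from the sequence matching the expected order volume
DECLARE @CustomerID int
SET @CustomerID = NEXT VALUE FOR dbo.SeqCustomerHigh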
It would appear that this histogram does not know the exact boundary between the high and middle groups, or between the middle and low groups. But this is not a serious problem, because a high cardinality value is never mixed into a range of low cardinality values.
The second part of implementing this strategy is for the plan to compile with a parameter value from the same group as the execution values. This can be done in a number of ways, and there is effort involved. If no effort were involved, then the compensation for such a low-effort job would presumably reflect that.
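One possible pattern (a sketch only; the procedure names and Orders columns are assumptions): a thin dispatcher routes each CustomerID range to its own inner procedure, so each group caches a plan compiled with a parameter from that group.

CREATE PROCEDURE dbo.GetOrders_High @CustomerID int AS
 SELECT * FROM dbo.Orders WHERE CustomerID = @CustomerID
GO
CREATE PROCEDURE dbo.GetOrders_Middle @CustomerID int AS
 SELECT * FROM dbo.Orders WHERE CustomerID = @CustomerID
GO
CREATE PROCEDURE dbo.GetOrders_Low @CustomerID int AS
 SELECT * FROM dbo.Orders WHERE CustomerID = @CustomerID
GO
CREATE PROCEDURE dbo.GetOrders @CustomerID int AS
BEGIN
 -- route by the CustomerID ranges assigned at account creation
 IF @CustomerID < 100001 EXEC dbo.GetOrders_High @CustomerID
 ELSE IF @CustomerID < 10000001 EXEC dbo.GetOrders_Middle @CustomerID
 ELSE EXEC dbo.GetOrders_Low @CustomerID
END
GO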
Workarounds
In the event that the database has already been built and populated before this aspect of statistics was considered, we can look for a workaround. The workaround here involves three elements.
In the parent table of the foreign key, Customers, we add a code column A (the name is for example only), which may have allowed values 1, 2 or 3. Value 1 is assigned to customers with high order frequency, 2 for middle and 3 for low. The Orders table has the same code column A. The pair CustomerID and code column A must match between Customers and Orders. This is best done by creating a unique index on Customers(CustomerID, A), with the Orders table foreign keyed to Customers on (CustomerID, A) instead of just CustomerID (which is unique by itself). If it is necessary to update the code column, then consider options including but not limited to ON UPDATE CASCADE. A sketch of this arrangement is shown after this paragraph.
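A sketch of this arrangement, using assumed Customers and Orders table definitions:

-- assumed example tables; in the test table dbo.S the analogous column is A
ALTER TABLE dbo.Customers ADD A int NOT NULL DEFAULT 3  -- 1 = high, 2 = middle, 3 = low
CREATE UNIQUE INDEX UX_Customers_CustomerID_A ON dbo.Customers(CustomerID, A)
GO
ALTER TABLE dbo.Orders ADD A int NOT NULL DEFAULT 3
-- populate A to match each customer's group before adding the foreign key
ALTER TABLE dbo.Orders ADD CONSTRAINT FK_Orders_Customers_A
 FOREIGN KEY (CustomerID, A) REFERENCES dbo.Customers(CustomerID, A)
 ON UPDATE CASCADE -- one option if the code column must change later
GO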
The second element of this strategy is to create filtered statistics as below.
CREATE STATISTICS ST_G1 ON dbo.S(G) WHERE A = 1 WITH FULLSCAN
CREATE STATISTICS ST_G2 ON dbo.S(G) WHERE A = 2 WITH FULLSCAN
CREATE STATISTICS ST_G3 ON dbo.S(G) WHERE A = 3 WITH FULLSCAN
Below are the filtered statistics histograms for the high, middle and low cardinality groups.
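They can be inspected with the same command used earlier:

DBCC SHOW_STATISTICS('dbo.S', ST_G1)
DBCC SHOW_STATISTICS('dbo.S', ST_G2)
DBCC SHOW_STATISTICS('dbo.S', ST_G3)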
The third element is rewriting the SQL as below. These examples deliberately choose values that are not Range high keys of the (unfiltered) index IXG, so the estimated rows must come from average range rows, either of the index statistics or of the filtered statistics.
SELECT * FROM dbo.S WHERE G = 1000 AND A=1 OPTION(MAXDOP 1)
SELECT * FROM dbo.S WHERE G = 910 AND A=2
SELECT * FROM dbo.S WHERE G = 901 AND A=3
Below is the Index Seek operation for the first query above, G=1000 AND A=1.
The estimated number of rows, 19,918, comes from the Equal rows value of the filtered statistics.
Below is the Index Seek operation for the second query above, G=910 AND A=2.
The estimated number of rows, 700.864, also comes from the filtered statistics, this time from the average range rows.
Below is the Index Seek operation for the third query above, G=901 AND A=3.
Here the optimizer uses the statistics on the index, reading the average range rows value of 199.563. The execution plan XML does show that both the index statistics and the filtered statistics were used, as shown below.
I am presuming the query optimizer expects 199 rows based on the predicate on G, and the Key Lookup then applies the predicate A=3, which reduces the estimate to 40 rows.
In the previous cases, even though the index statistics estimate is 199 rows, the query optimizer knows from the filtered statistics that there are more rows.
Summary
Yes, it is true there are limitations in the predictive capability of the histogram data structure employed by SQL Server for data distribution statistics. A brute force approach of increasing the number of histogram steps simply defers the point at which Equal rows cannot be captured for high cardinality values. Pushing the size of the histogram data structure further eventually reaches the point at which it is no longer a statistical approach.
In any case, we have some workarounds. One is best done at the start of a project and involves a strategy of assigning surrogate keys from separate ranges based on expected cardinality. If it is already too late for that, then we can always do tricks with filtered statistics and write SQL in a manner that lets SQL Server use the appropriate filtered statistics. This may require additional columns in one or more tables. Neither is an effortless implementation. Effortless solutions do not require a skilled database professional, with the accompanying implications.
There are also other issues in SQL Server statistics regarding sample size. See Statistics that need special attention.
References
The primary reference on SQL Server statistics from Microsoft is Statistics Used by the Query Optimizer in Microsoft SQL Server 2008, by Eric N. Hanson and Yavor Angelov, with contributions from Lubor Kollar.
See Changes to automatic update statistics in SQL Server – traceflag 2371, and also Controlling Autostat (AUTO_UPDATE_STATISTICS) behavior in SQL Server.
Microsoft Learn: Statistics
Appendix
Below is my script for creating and populating a number sequence table, in turn used in generating test data.
CREATE TABLE?dbo.Nums(I?int?NOT NULL)
GO
SET NOCOUNT ON
DECLARE?@I?int?=?1
BEGIN TRAN
WHILE?@I?<=?10574
BEGIN
?INSERT?dbo.Nums?VALUES(@I)
?SET?@I?=?@I?+?1
END
COMMIT TRAN
GO