Introduction
Incorrect cardinality and Cost Estimation may lead query optimizer to choose inefficient plans which can have a negative impact on the database performance. In this article, we are going to look deep inside Statistics - how statistics are obtained by the query, and manually calculate the same with examples and scenarios and how optimizer thinks. I am going to focus on Density in this article.
Queries that I am demonstrating in this article are prepared against the “AdventureWorks2012” database and I am using “Sales.SalesOrderDetail” table.
Cardinality Estimate
Estimated number of records that will be returned by filtering, joining predicates, and using group by clause.
Note - Auto Update and Auto Create statistics give you better performance.
Below are the outputs from sys.stats table.
object_id | Name | stats_id | auto_created | user_created | no_recompute | has_filter | filter_definition | is_temporary |
1509580416 | PK_Person_BusinessEntityID | 1 | 0 | 0 | 0 | 0 | NULL | 0 |
1509580416 | IX_Person_LastName_FirstName_MiddleName | 2 | 0 | 0 | 0 | 0 | NULL | 0 |
1509580416 | AK_Person_rowguid | 3 | 0 | 0 | 0 | 0 | NULL | 0 |
1509580416 | _WA_Sys_00000002_59FA5E80 | 4 | 1 | 0 | 0 | 0 | NULL | 0 |
Creating and updating statistics
Statistics are created/udated in various ways automatically by the Query Optimizer.
- If Statistics is not created on particular column, then it will be automatically created by the Optimizer.
If you see in the above table “_WA_Sys_00000002_59FA5E80”, this statistics is created automatically by Optimizer.
_WA_Sys_00000002_59FA5E80 - WA means Washington (SQL Server developers sit here)
- 00000002 - Column ID
- 59FA5E80 - is the hexadecimal equivalent of the object_id
If Plan already exists in the Plan Cache and Statistics used by the Optimizer is outdated, then Optimizer discards the plan from plan cache, updates the statistics, and creates new plan.
- When we create an Index, it will automatically create statistics against it.
- Explicitly create statistics on particular column.
Different ways statistics get updated,
- Synchronous Update Statistics - Optimizer has to wait until the update statistics to complete before optimizing and executing query.
- Asynchronous Update Statistics - Optimizer need not wait to update statistics it will use existing statistics.
How to view Statistics?
Syntax of Statistics
- dbcc SHOW_STATISTICS({
- 'table_name' | 'view_name'
- }, target)[WITH {
- [NO_INFOMSGS] < option > [, n]
- }] < option > :: = STAT_HEADER | DENSITY_VECTOR | HISTOGRAM
Example of Statistics
Query
DBCC SHOW_STATISTICS('Sales.SalesOrderDetail','IX_SalesOrderDetail_ProductID')
Output
You can get all these 3 parts separately by executing below queries along with “with” syntax as below,
- DBCC SHOW_STATISTICS('Sales.SalesOrderDetail','IX_SalesOrderDetail_ProductID') with STAT_HEADER
- DBCC SHOW_STATISTICS('Sales.SalesOrderDetail','IX_SalesOrderDetail_ProductID') with HISTOGRAM
- DBCC SHOW_STATISTICS('Sales.SalesOrderDetail','IX_SalesOrderDetail_ProductID') with DENSITY_VECTOR
When you look into show_statistics result you will get below as output,
- Stats Header
- Density
- Histogram
Density
Density is used when query having Group by Clause or Equality Predicate with UNKNOWN values.
Scenario ID | All density | Average Length | Columns |
#1 | 0.003759399 | 4 | ProductID |
#2 | 8.2429E-06 | 8 | ProductID, SalesOrderID |
#3 | 8.2429E-06 | 12 | ProductID, SalesOrderID, SalesOrderDetailID |
Density is calculated based on the 1/Number of Estimated Rows (i.e. count).
Count | Density Based on Manual Calculation | Reciprocal |
266 | 0.003759398 | 265.99996 |
121317 | 8.24287E-06 | 121316.53 |
121317 | 8.24287E-06 | 121316.53 |
Below are the different scenarios and how statistics is getting calculated to qualify the number of rows required by the particular query.
Scenario 1 Group By
Query
SELECT ProductID FROM Sales.SalesOrderDetail GROUP BY ProductID
How statistics is getting used to retrieve Number of rows got selected for above query?
Formula used
1/Density value of column ProductID
I.e. 1/0.003759398 = 265.99996
| Number of rows |
Group By ProductID | 265.9999644 |
Group By ProductID, SalesOrderID | 121316.9979 |
Scenario 2 Equality Operation
In Equality to get rows from statistics it will use the below formula,
Formula Used
Density Value of column * Count of rows
I.e. 8.24287E-06 * 121317 = 456.0790085
Equality Operation | Number of rows |
ProductID = 970 | 456.0790085 |
Scenario 3 Inequality Operation
Inequality Operation will use below formula,
Formula Used
30% * Count of rows
I.e. 30% * 121317 = 36395.1
Inequality Operation | Number of rows |
ProductID < | 36395.1 |
Summary
In this article, I explained how to look into statistics, what are the different outputs that statistics have, how Query Optimizer uses statistics for Group By, Equality, and Inequality operation. Also, I explained in depth what are the formulas being used by the Optimizer internally to identify number of rows that qualify for particular queries, based on the different selectivity.