Approximate COUNT DISTINCT
We have all written queries that use COUNT(DISTINCT ...) to get the number of unique non-NULL values in a column. This can cause a noticeable performance hit, especially on larger tables with millions of rows, and many times there is no way around it. To help mitigate this overhead, SQL Server 2019 introduces the new APPROX_COUNT_DISTINCT function, which approximates the distinct count. According to Microsoft's documentation, the result is guaranteed to have up to a 2% error rate within a 97% probability, and it arrives in a fraction of the time.
Let’s see this in action.
In this example, I am using the AdventureworksDW2016CTP3 sample database which you can download here.
- SET STATISTICS IO ON
- SET STATISTICS TIME ON
- SELECT COUNT(DISTINCT([SalesOrderNumber])) as DISTINCTCOUNT
- FROM [dbo].[FactResellerSalesXL_PageCompressed]
SQL Server Execution Times
CPU time = 3828 ms, elapsed time = 14281 ms.
- SELECT APPROX_COUNT_DISTINCT ( [SalesOrderNumber]) as APPROX_DISTINCTCOUNT
- FROM [dbo].[FactResellerSalesXL_PageCompressed]
SQL Server Execution Times
CPU time = 7390 ms, elapsed time = 4071 ms.
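You can see the elapsed time is significantly lower: 4,071 ms versus 14,281 ms, roughly 3.5 times faster. Note that CPU time actually went up, from 3,828 ms to 7,390 ms, so the gain here is in how long you wait, not in total CPU work. Still, a great improvement from this new function.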
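Speed is only half the story, so it is worth checking how far off the approximate answer is. Here is a minimal Python sketch of that arithmetic. The elapsed times are the ones reported above; the two counts are hypothetical placeholders, not results from the sample database, and `relative_error_pct` is a helper name I made up for illustration.

```python
def relative_error_pct(exact: int, approx: int) -> float:
    """Absolute difference as a percentage of the exact count."""
    return abs(approx - exact) / exact * 100


# Elapsed times (ms) reported by SET STATISTICS TIME in this post.
exact_elapsed_ms = 14281
approx_elapsed_ms = 4071
speedup = exact_elapsed_ms / approx_elapsed_ms
print(f"{speedup:.1f}x faster")  # 3.5x faster

# Hypothetical counts for illustration only; substitute your own results.
exact_count = 1_000_000
approx_count = 1_012_000
err = relative_error_pct(exact_count, approx_count)
print(f"{err:.2f}% off")  # 1.20% off
```

Plug in the two numbers your own queries return; if the error is comfortably under 2%, the approximation is behaving as advertised.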
The first time I tried this, I got it wrong: a silly typo that made a major difference in the results. So take a moment and learn from my mistake.
Note that I use COUNT(DISTINCT(SalesOrderNumber)), not DISTINCT COUNT(SalesOrderNumber). This makes all the difference. If you write it the wrong way, the numbers will be way off, as you can see from the below result set. You'll also find that the APPROX_COUNT_DISTINCT query returns much more slowly than the DISTINCT COUNT query, which is not what you would expect.
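Remember, COUNT(DISTINCT expression) evaluates the expression for each row in a group and returns the number of unique, non-NULL values, which is what APPROX_COUNT_DISTINCT estimates. SELECT DISTINCT COUNT(expression), on the other hand, counts every non-NULL value of the expression, duplicates included. The DISTINCT is then applied to the single-row result, where it has nothing left to remove, so there is nothing DISTINCT about the count.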
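The difference is easy to reproduce on a tiny table, even without SQL Server. This is a minimal sketch using Python's built-in sqlite3 module as a stand-in, on the assumption that SQLite follows the same standard semantics for these two forms; the table and its four rows are hypothetical toy data, not the AdventureWorks table.

```python
import sqlite3

# Hypothetical toy data: four rows, two distinct order numbers, one NULL.
rows = [("SO1",), ("SO1",), ("SO2",), (None,)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Sales (SalesOrderNumber TEXT)")
conn.executemany("INSERT INTO Sales VALUES (?)", rows)

# Correct form: DISTINCT inside COUNT drops duplicates before counting.
count_distinct = conn.execute(
    "SELECT COUNT(DISTINCT SalesOrderNumber) FROM Sales"
).fetchone()[0]

# Typo form: COUNT runs first over every non-NULL value, then DISTINCT
# is applied to the one-row result and changes nothing.
distinct_count = conn.execute(
    "SELECT DISTINCT COUNT(SalesOrderNumber) FROM Sales"
).fetchone()[0]

print(count_distinct)  # 2
print(distinct_count)  # 3
conn.close()
```

With duplicates in the column, the typo form always reports the larger number, which is exactly the "way off" result I ran into.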
Always fun tinkering with something new!