SQL Performance (2), Optimized SELECT Query (B)

This series of articles discusses SQL Server performance. I use "SQL Performance" in the title because the articles cover SQL statement (query) optimization as well as SQL Server performance issues, and may also cover stored procedure performance issues.

A: Introduction

 
According to the Pareto principle's 80/20 rule, in article 1, SQL Performance (1), Optimized SELECT Query (A), we put in 20% of the effort to discuss the minimum actions needed for performance tuning SELECT statements, hopefully getting 80% of the performance improvement; in this article, we will discuss the 80% of the effort that yields the remaining 20% improvement.
 
We will discuss the 12 tips below.
 

B: Tips for Query Tuning

  1. Create joins with INNER JOIN (not WHERE --- Cross Join)
  2. Avoid Multiple Joins in a Single Query
  3. Avoid using table variables in joins
    • Using Temp Table wisely
    • Using a Subquery
    • Using a CTE
  4. Avoid Correlated SQL Subqueries
  5. Avoid Multi-Statement Table-Valued Functions (TVFs); use inline TVFs instead
  6. Avoid using wildcard characters at the beginning of the LIKE pattern
  7. Avoid using COUNT(); use EXISTS() instead
  8. Avoid Using GROUP BY, ORDER BY, and DISTINCT
  9. Avoid Different Datatypes on JOIN and WHERE Conditions
  10. Avoid using HAVING to define a filter; use WHERE instead
  11. Avoid OR conditions
  12. Avoid sorting with a mixed order
1. Create joins with INNER JOIN (not WHERE --- Cross Join)[ref]
 
Some SQL developers, myself included (having started with Oracle), are used to making joins with WHERE clauses, such as the following,
SELECT Customers.CustomerID, Customers.Name, Sales.LastSaleDate  
FROM Customers, Sales  
WHERE Customers.CustomerID = Sales.CustomerID 

This type of join creates a Cartesian Join, also called a Cartesian Product or CROSS JOIN.

In a Cartesian Join, all possible combinations of rows are created. In this example, if we had 1,000 customers and 1,000 total sales, the query would first generate 1,000,000 results, then filter for the 1,000 records where CustomerID is correctly joined. This is an inefficient use of database resources, as the database has done 1,000x more work than required. Cartesian Joins are especially problematic in large-scale databases, because a Cartesian Join of two large tables could create billions or trillions of results.
 
To prevent creating a Cartesian Join, use INNER JOIN instead,
SELECT Customers.CustomerID, Customers.Name, Sales.LastSaleDate  
FROM Customers INNER JOIN Sales  
ON Customers.CustomerID = Sales.CustomerID 

The database would only generate the 1,000 desired records where CustomerID is equal.

Note
Some DBMS systems are able to recognize WHERE joins and automatically run them as INNER JOINs instead. In those DBMS systems, there will be no difference in performance between a WHERE join and INNER JOIN. However, INNER JOIN is recognized by all DBMS systems.
 
I checked, and SQL Server 2019 belongs to this category.
 
2. Avoid Multiple Joins in a Single Query[ref]
 
Make joins less complicated.
 
Try to avoid writing a query using multiple joins that include outer joins, cross apply, outer apply, and other complex subqueries.
 
Reason
 
When the joins are complex, the number of possible execution plans increases exponentially. This, in turn, reduces the choices the query optimizer has when deciding the join order and join type. Sometimes, the query optimizer is forced to use nested loop joins, irrespective of the performance consequences, for queries with excessively complex cross applies or subqueries. Also, due to time and performance constraints, the SQL engine will try to generate a "good enough" execution plan rather than the best possible one.
 
The technique that addresses this problem is join decomposition.
 
You can decompose a join by running multiple single-table queries instead of a multi-table join, and then performing the join in the application. For example, instead of this single query,
SELECT * FROM tag  
JOIN tag_post ON tag_post.tag_id = tag.id  
JOIN post ON tag_post.post_id = post.id  
WHERE tag.tag = 'mysql'; 

You might run these queries,

SELECT * FROM tag WHERE tag = 'mysql';  
SELECT * FROM tag_post WHERE tag_id=1234;  
SELECT * FROM post WHERE post.id IN (123,456,567,9098,8904); 

Why do this? It looks wasteful at first glance because you've increased the number of queries without getting anything in return. However, such restructuring can actually give significant performance advantages,

  • Caching can be more efficient. Many applications cache "objects" that map directly to tables. In this example, if the object with the tag mysql is already cached, the application will skip the first query. If you find posts with an ID of 123, 567, or 9098 in the cache, you can remove them from the IN() list. The query cache might also benefit from this strategy. If only one of the tables changes frequently, decomposing a join can reduce the number of cache invalidations.
  • Executing the queries individually can sometimes reduce lock contention
  • The queries themselves can be more efficient. In this example, using an IN() list instead of a join lets MySQL sort row IDs and retrieve rows more optimally than might be possible with a join.
  • You can reduce redundant row accesses. Doing a join in the application means retrieving each row only once, whereas a join in the query is essentially a denormalization that might repeatedly access the same data. For the same reason, such restructuring might also reduce the total network traffic and memory usage.
  • To some extent, you can view this technique as manually implementing a hash join instead of the nested loops algorithm MySQL uses to execute a join. A hash join might be more efficient.
As a result, doing joins in the application can be more efficient when you cache and reuse a lot of data from earlier queries, when you distribute data across multiple servers, when you replace joins with IN() lists, or when a join refers to the same table multiple times.
 
Note
Another example can be seen here.
 
3. Avoid using table variables in joins[ref]
 
Use temporary tables, CTEs (Common Table Expressions), or derived tables in joins instead.
 
Reason
 
Even though table variables are very fast and efficient in a lot of situations, the SQL Server engine treats a table variable as if it contains a single row (SQL Server keeps no statistics on table variables). For this reason, SQL Server will produce an execution plan that performs horribly when a table variable is used in joins.
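 
A minimal sketch of the difference (the Customer table and RegionID column reuse the illustrative names from the temp table example below),
-- Illustrative names: joining on a table variable, which the optimizer assumes holds one row
DECLARE @Regions TABLE (RegionID INT PRIMARY KEY)
INSERT INTO @Regions (RegionID) VALUES (5)
SELECT c.Name FROM Customer c INNER JOIN @Regions r ON c.RegionID = r.RegionID

-- A temp table gets statistics, so the optimizer can pick a better join plan
CREATE TABLE #Regions (RegionID INT PRIMARY KEY)
INSERT INTO #Regions (RegionID) VALUES (5)
SELECT c.Name FROM Customer c INNER JOIN #Regions r ON c.RegionID = r.RegionID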
 
a. Using Temp Table wisely[ref]
 
Temporary tables usually increase a query's complexity, so if your code can be written in a simple, straightforward manner, avoid them. However, when you have to join a large table and there are conditions on that table, you can increase database performance by transferring your data into a temp table and then making the join on that. The temp table will have fewer rows than the original (large) table, so the join will finish faster!
 
The decision isn’t always straightforward, but this example will give you a sense of situations in which you might want to use temp tables,
 
Imagine a customer table with millions of records. You have to make a join on a specific region. You can achieve this by using a SELECT INTO statement and then joining with the temp table,
SELECT * INTO #Temp FROM Customer WHERE RegionID = 5  
SELECT r.RegionName, t.Name FROM Region r JOIN #Temp t ON t.RegionID = r.RegionID 

(Note: some SQL developers also avoid using SELECT INTO to create temp tables, saying that this command locks the tempdb database, disallowing other users from creating temp tables. Fortunately, this was fixed in SQL Server 7.0 and later.)

b. Using a Subquery[ref]
 
As an alternative to temp tables, you might consider using a subquery as a table,
SELECT r.RegionName, t.Name FROM Region r   
JOIN (SELECT Name, RegionID FROM Customer WHERE RegionID = 5) AS t   
ON t.RegionID = r.RegionID 

All of these SQL snippets will return the same data. But with temp tables, we could, for example, create an index on the temp table to improve performance. There’s some good discussion here on the differences between temporary tables and subqueries.
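 
For instance, continuing the #Temp example above (the index name is illustrative),
CREATE INDEX IX_Temp_RegionID ON #Temp (RegionID)   -- illustrative index name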

c. Using a CTE[ref]
 
As another option, you might consider using a CTE as a table,
WITH Employee_CTE (EmployeeNumber, Title)  
AS  
(SELECT NationalIDNumber,  
        JobTitle  
 FROM   HumanResources.Employee)  
SELECT EmployeeNumber,  
       Title  
FROM   Employee_CTE 

(Figure: Common Table Expression - Parts, highlighting the CTE query definition within the full query)
 
The WITH portion, the CTE query definition, is the CTE. Notice it contains a query that can be run on its own in SQL.
 
You can define more than one CTE within a WITH statement. This can help you simplify some very complicated queries which are ultimately joined together. Each complicated piece can be encapsulated in its own CTE, which is then referred to and joined outside the WITH clause.
 
Here is an example using TWO CTEs. It's a simple example, but it shows how two CTEs are defined and then used in an INNER JOIN.
 
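A minimal sketch of what such a two-CTE query could look like (the tables and columns are illustrative, not from the original example),
-- Illustrative AdventureWorks-style schema
WITH Totals_CTE (SalesPersonID, TotalSales)
AS
(SELECT SalesPersonID,
        SUM(TotalDue)
 FROM   Sales.SalesOrderHeader
 GROUP BY SalesPersonID),
Names_CTE (PersonID, FullName)
AS
(SELECT BusinessEntityID,
        FirstName + ' ' + LastName
 FROM   Person.Person)
SELECT n.FullName,
       t.TotalSales
FROM   Totals_CTE t
       INNER JOIN Names_CTE n
       ON n.PersonID = t.SalesPersonID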
 
As you can see from the final SELECT statement, the CTEs are joined as if they were tables. Hopefully, you can see that as your queries become more complicated, CTEs can become a really useful way to separate operations and thereby simplify your final query.
 
4. Avoid Correlated SQL Subqueries[ref]
 
A correlated subquery is one that uses values from the parent query. This kind of SQL query tends to run row-by-row, once for each row returned by the outer query, and thus decreases SQL query performance. New SQL developers are often caught structuring their queries in this way—because it’s usually the easy route.
 
Here’s an example of a correlated subquery,
SELECT c.Name,   
       c.City,  
       (SELECT CompanyName FROM Company WHERE ID = c.CompanyID) AS CompanyName   
FROM Customer c 

In particular, the problem is that the inner query (SELECT CompanyName…) is run for each row returned by the outer query (SELECT c.Name…). But why go over the Company table again and again for every row processed by the outer query?

A more efficient SQL performance tuning technique would be to refactor the correlated subquery as a join,
SELECT c.Name,   
       c.City,   
       co.CompanyName   
FROM Customer c   
    LEFT JOIN Company co  
        ON c.CompanyID = co.CompanyID 

In this case, we go over the Company table just once, at the start, and JOIN it with the Customer table. From then on, we can select the values we need (co.CompanyName) more efficiently.

5. Avoid Multi-Statement Table-Valued Functions (TVFs)[ref]
 
Multi-statement TVFs are more costly than inline TVFs.
 
Reason
 
SQL Server expands inline TVFs into the main query the same way it expands views, but it evaluates multi-statement TVFs in a separate context from the main query and materializes their results into temporary work tables. The separate context and the work tables are what make multi-statement TVFs costly.
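 
A minimal sketch of the same logic written both ways (the function, table, and column names are illustrative),
-- Multi-statement TVF: evaluated separately, results spooled into a work table
CREATE FUNCTION dbo.GetCustomersByRegion_Multi (@RegionID INT)
RETURNS @Result TABLE (CustomerID INT, Name NVARCHAR(100))
AS
BEGIN
    INSERT INTO @Result (CustomerID, Name)
    SELECT CustomerID, Name FROM Customer WHERE RegionID = @RegionID
    RETURN
END
GO
-- Inline TVF: a single RETURN (SELECT ...), expanded into the calling query like a view
CREATE FUNCTION dbo.GetCustomersByRegion_Inline (@RegionID INT)
RETURNS TABLE
AS
RETURN (SELECT CustomerID, Name FROM Customer WHERE RegionID = @RegionID)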
 
6. Avoid using wildcard characters at the beginning of the LIKE pattern[ref]
 
In order to fine-tune your SQL queries, you must avoid using the LIKE pattern in the following manner,
SELECT * FROM Customers WHERE address LIKE '%bar%'; 

Here, the database will not be able to use a suitable index, if one exists, because of the leading % wildcard. The system has to perform a full table scan, and this takes a toll on its speed.

Thus, the better way to write this query is,
SELECT * FROM Customers WHERE address LIKE 'bar%'; 

7. Avoid using COUNT(); use EXISTS() instead[ref]

This SQL optimization technique concerns the use of EXISTS(). If you want to check if a record exists, use EXISTS(),
IF EXISTS(SELECT FIRSTNAME FROM EMPLOYEES WHERE FIRSTNAME LIKE '%JOHN%')  
    PRINT 'YES' 

instead of COUNT(),

IF (SELECT COUNT(1) FROM EMPLOYEES WHERE FIRSTNAME LIKE '%JOHN%') > 0  
    PRINT 'YES' 

While COUNT() scans the entire table, counting up all entries matching your condition, EXISTS() will exit as soon as it sees the result it needs.

8. Avoid Using GROUP BY, ORDER BY, and DISTINCT[ref]
 
Avoid using GROUP BY, ORDER BY, and DISTINCT whenever possible.
 
Reason
 
When using GROUP BY, ORDER BY, or DISTINCT, the SQL Server engine creates a work table and puts the data into it. It then organizes the data in the work table as requested by the query, and returns the final result.
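 
A small illustrative sketch, assuming CustomerID is the primary key of the Customer table,
SELECT DISTINCT CustomerID, Name FROM Customer   -- DISTINCT forces a work table
SELECT CustomerID, Name FROM Customer            -- same rows, since CustomerID is already unique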
 
9. Avoid Different Datatypes on JOIN and WHERE Conditions[ref][ref]
 
This is easier said than done depending on your permissions to make changes to the schema.
 
Reason
 
When joining or comparing two fields with different datatypes, SQL must do an on-the-fly conversion of the field before it can do a comparison, even if the fields are indexed. If mismatched datatypes are unavoidable, try to cast the larger datatype to the smaller datatype whenever possible.
 
Consider this query, which selects the number of red fruits in a forest.
SELECT  
    COUNT(*)  
FROM  
    forest  
WHERE  
    fruit_color = 5;   /* 5 = red */ 

Assuming the column fruit_color's type is VARCHAR, or just anything non-numeric, indexing that column won't be very helpful, as the required implicit cast will prevent the database from using the index for the filtering process.
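 
A minimal fix sketch, comparing against a literal of the column's own type so the index stays usable,
SELECT  
    COUNT(*)  
FROM  
    forest  
WHERE  
    fruit_color = '5';   /* the literal now matches the VARCHAR column, no implicit cast */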

10. Avoid using HAVING to define a filter; use WHERE instead[ref]
 
 
The goal of an efficient query is to pull only the required records from the database. Per the SQL Order of Operations, HAVING statements are calculated after WHERE statements. If the intent is to filter a query based on conditions, a WHERE statement is more efficient.
 
For example, let’s assume the Sales table contains 1,000 records in total, of which 200 sales were made in the year 2016, and we want to query for the number of sales per customer in 2016.
SELECT Customers.CustomerID, Customers.Name, Count(Sales.SalesID)  
FROM Customers  
   INNER JOIN Sales  
   ON Customers.CustomerID = Sales.CustomerID  
GROUP BY Customers.CustomerID, Customers.Name  
HAVING Sales.LastSaleDate BETWEEN '2016-01-01' AND '2016-12-31' 

This query would pull 1,000 sales records from the Sales table, then filter for the 200 records generated in the year 2016, and finally count the records in the dataset.

In comparison, WHERE clauses limit the number of records pulled,
SELECT Customers.CustomerID, Customers.Name, Count(Sales.SalesID)  
FROM Customers  
  INNER JOIN Sales  
  ON Customers.CustomerID = Sales.CustomerID  
WHERE Sales.LastSaleDate BETWEEN '2016-01-01' AND '2016-12-31'  
GROUP BY Customers.CustomerID, Customers.Name 

This query would pull the 200 records from the year 2016, and then count the records in the dataset. The first step in the HAVING clause has been completely eliminated.

HAVING should only be used when filtering on an aggregated field. In the query above, we could additionally filter for customers with greater than 5 sales using a HAVING statement.
SELECT Customers.CustomerID, Customers.Name, Count(Sales.SalesID)  
FROM Customers  
   INNER JOIN Sales  
   ON Customers.CustomerID = Sales.CustomerID  
WHERE Sales.LastSaleDate BETWEEN '2016-01-01' AND '2016-12-31'  
GROUP BY Customers.CustomerID, Customers.Name  
HAVING Count(Sales.SalesID) > 5 

11. Avoid OR conditions[ref]

Consider this query, which selects the number of posts on Facebook posted after New Year's Eve or posted by a user named Mark.
SELECT  
    COUNT(*)  
FROM  
    fb_posts  
WHERE  
    username = 'Mark'  
        OR post_time > '2018-01-01' 

Having an index on both the username and post_time columns might sound helpful, but in most cases the database won't use it, at least not in full. The reason is the connection between the two conditions: the OR operator, which makes the database fetch the results of each part of the condition separately.

An alternative way to look at this query can be to 'split' the OR condition and 'combine' it using a UNION clause. This alternative will allow you to index each of the conditions separately, so the database will use the indexes to search for the results and then combine the results with the UNION clause.
SELECT …  
FROM …  
WHERE username = 'Mark'  
    UNION  
SELECT …  
FROM …  
WHERE post_time > '2018-01-01' 

Please note that if you don't mind duplicate records in your result set, you can also use UNION ALL (which will perform better than the default UNION DISTINCT).
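 
A concrete sketch of that rewrite for the counting query above (assuming fb_posts has an id primary key column),
SELECT COUNT(*)  
FROM  
    (SELECT id FROM fb_posts WHERE username = 'Mark'   -- id is an assumed primary key
     UNION  
     SELECT id FROM fb_posts WHERE post_time > '2018-01-01') AS combined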

12. Avoid sorting with a mixed order[ref]
 
Consider this query, which selects posts from Facebook and sorts them by the username in ascending order and then by the post type in descending order.
SELECT  
    username, post_type  
FROM  
    fb_posts  
ORDER BY username ASC , post_type DESC 

MySQL (like many other relational databases) cannot use indexes when sorting with a mixed order (both ASC and DESC in the same ORDER BY clause). This changed with the release of descending (reverse) index functionality in MySQL 8.x.
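 
In MySQL 8.x, a sketch of a descending index matching the mixed sort order (the index name is illustrative),
CREATE INDEX idx_username_posttype ON fb_posts (username ASC, post_type DESC);   -- requires MySQL 8.x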

 

Summary

 
These tips cover the most common SELECT tuning issues.