Figure 5.
Let's rewrite the query to sort the transaction history by date (see Listing 3).
select th.TransactionID, th.TransactionDate, th.Quantity, pr.[Name]
from Production.TransactionHistory th
inner join Production.Product pr on th.ProductID = pr.ProductID
where th.TransactionID between 100000 and 100060
order by th.TransactionDate;
Listing 3
The new execution plan now contains a sort operation (see Figure 6).
Figure 6.
Now let me describe a simple two-step process for tuning performance when executing several SQL statements against the database system. The first step is to identify the most costly statement in the sequence. The second step is to focus on the most costly operation within that statement.
Now let's see how to get detailed information about the execution plan using the T-SQL statements SET SHOWPLAN_ALL and SET SHOWPLAN_TEXT. SET SHOWPLAN_ALL returns a report of the query execution plan in tabular form, with multiple rows and columns. SET SHOWPLAN_TEXT returns a report in a single column, with one or more rows for each operation. Let's execute the query in Listing 3 with SET SHOWPLAN_ALL enabled to see the execution plan (see Listing 4).
set showplan_all on
go
select th.TransactionID, th.TransactionDate, th.Quantity, pr.[Name]
from Production.TransactionHistory th
inner join Production.Product pr on th.ProductID = pr.ProductID
where th.TransactionID between 100000 and 100060
order by th.TransactionDate;
go
set showplan_all off
go
Listing 4
The result resembles Figure 7.
Figure 7.
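SET SHOWPLAN_TEXT works the same way; as a sketch following the same pattern as Listing 4, you can run:

```sql
set showplan_text on
go
-- the statement is not executed; only its estimated plan is returned,
-- as indented text in a single column
select th.TransactionID, th.TransactionDate, th.Quantity, pr.[Name]
from Production.TransactionHistory th
inner join Production.Product pr on th.ProductID = pr.ProductID
where th.TransactionID between 100000 and 100060
order by th.TransactionDate;
go
set showplan_text off
go
```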
In addition to the SHOWPLAN commands, SQL Server 2005 provides two important commands for gathering query and batch execution statistics: SET STATISTICS IO and SET STATISTICS TIME. SET STATISTICS IO reports the disk activity generated while returning a result set. SET STATISTICS TIME reports the number of milliseconds required to parse, compile, and execute SQL statements. Let's test these two commands using the query in Listing 3. In the first example, SET STATISTICS IO is enabled to report where I/O hot spots occur (see Listing 5).
set statistics io on
go
select th.TransactionID, th.TransactionDate, th.Quantity, pr.[Name]
from Production.TransactionHistory th
inner join Production.Product pr on th.ProductID = pr.ProductID
where th.TransactionID between 100000 and 100060
order by th.TransactionDate;
go
set statistics io off
go
Listing 5
And the result is shown in Listing 6.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0,
  lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Product'. Scan count 1, logical reads 5, physical reads 0, read-ahead reads 0,
  lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'TransactionHistory'. Scan count 1, logical reads 3, physical reads 0, read-ahead reads 0,
  lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Listing 6
You can see the physical read values (reads from disk) and the logical read values (reads from cache). Notice the pseudo-table named Worktable in the report, which is associated with GROUP BY, ORDER BY, hash join, and UNION operations. This table is created in the tempdb database for the duration of the query and is automatically removed when the SQL statement finishes.
Now let's see an example with the SET STATISTICS TIME command (see Listing 7).
set statistics time on
go
select th.TransactionID, th.TransactionDate, th.Quantity, pr.[Name]
from Production.TransactionHistory th
inner join Production.Product pr on th.ProductID = pr.ProductID
where th.TransactionID between 100000 and 100060
order by th.TransactionDate;
go
set statistics time off
go
Listing 7
The resulting report is shown in Listing 8.
SQL Server Execution Times:
   CPU time = 10 ms, elapsed time = 150 ms.
SQL Server parse and compile time:
   CPU time = 0 ms, elapsed time = 1 ms.
Listing 8
In this case, the parse and compile time is 0 because the plan was already stored in the plan cache, but you can still measure the amount of time the query takes to execute. To force the query to be parsed and compiled again, you need to run the DBCC FREEPROCCACHE command first (see Listing 9).
dbcc freeproccache
go
set statistics time on
go
select th.TransactionID, th.TransactionDate, th.Quantity, pr.[Name]
from Production.TransactionHistory th
inner join Production.Product pr on th.ProductID = pr.ProductID
where th.TransactionID between 100000 and 100060
order by th.TransactionDate;
go
set statistics time off
go
Listing 9
Now we're able to see the parse and compile time (see Listing 10).
DBCC execution completed. If DBCC printed error messages, contact your system administrator.
SQL Server parse and compile time:
   CPU time = 10 ms, elapsed time = 14 ms.

(61 row(s) affected)

(1 row(s) affected)

SQL Server Execution Times:
   CPU time = 0 ms, elapsed time = 123 ms.
SQL Server parse and compile time:
   CPU time = 0 ms, elapsed time = 1 ms.
Listing 10
Now let's discuss how statistics are used for efficient query processing, allowing the database system to correctly choose the physical operations when generating an execution plan. These statistics on tables and views are generated both manually and automatically, and they are used to create efficient execution plans. Statistics enable the database system to choose among auxiliary structures such as indexes, taking into account the actual content of the database objects involved in the execution plan.
I'll show how to work directly with statistics in SQL Server, but it's worth noting that SQL Server automatically manages the creation and update of statistics whenever possible. So you should reserve these T-SQL statements for special scenarios, for example, after you change a significant amount of data.
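For reference, the automatic behavior is controlled by two per-database options. This sketch (assuming the AdventureWorks sample database) shows how to check and set them:

```sql
-- check whether automatic creation and update of statistics are enabled
select databasepropertyex('AdventureWorks', 'IsAutoCreateStatistics') as AutoCreate,
       databasepropertyex('AdventureWorks', 'IsAutoUpdateStatistics') as AutoUpdate;

-- both options are on by default; they can be set explicitly if needed
alter database AdventureWorks set auto_create_statistics on;
alter database AdventureWorks set auto_update_statistics on;
```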
To manually create statistics, you need to use the CREATE STATISTICS statement (Listing 11).
CREATE STATISTICS statistics_name
ON {table | view} (column [,...n])
[WITH
    [FULLSCAN | SAMPLE number {PERCENT | ROWS} | STATS_STREAM]
    [NORECOMPUTE]
]
Listing 11
Let's see an example by creating the statistics on the Color attribute of the Production.Product table (see Listing 12).
create statistics stats_product_color
on Production.Product(Color)
with fullscan;
Listing 12
Whenever you want to refresh existing statistics, you use the UPDATE STATISTICS statement. For example, if you have inserted many new rows and updated others and you want to refresh the stats_product_color statistics, you run the following statement (see Listing 13).
update statistics Production.Product stats_product_color
with fullscan;
Listing 13
If you want to generate statistics across all the tables in a database (rather than running the CREATE STATISTICS and UPDATE STATISTICS statements for each table), you can use the sp_createstats system stored procedure.
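As a sketch, the procedure can be run with no arguments, or with its documented 'indexonly' and 'fullscan' options:

```sql
-- create single-column statistics for all eligible columns in the current database
exec sp_createstats;

-- or only for columns that already participate in an index, using a full scan
exec sp_createstats @indexonly = 'indexonly', @fullscan = 'fullscan';
```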
Once you have generated the statistics, you may want to inspect them. You can do so with the DBCC SHOW_STATISTICS command (see Listing 14).
DBCC SHOW_STATISTICS ('table_name' | 'view_name', statistics_name)
Listing 14
For example, to see the statistics stats_product_color, you have to run the following command (see Listing 15).
dbcc show_statistics('Production.Product', stats_product_color);
Listing 15
The results of this operation are shown in Listing 16.

Name                 Updated              Rows  Rows Sampled  Steps  Density  Average key length  String Index
-------------------  -------------------  ----  ------------  -----  -------  ------------------  ------------
stats_product_color  Oct 10 2008 10:09PM  504   504           9      0        5.178571            YES

(1 row(s) affected)

All density  Average Length  Columns
-----------  --------------  -------
0.1          5.178571        Color

(1 row(s) affected)

RANGE_HI_KEY  RANGE_ROWS  EQ_ROWS  DISTINCT_RANGE_ROWS  AVG_RANGE_ROWS
------------  ----------  -------  -------------------  --------------
Black         0           93       0                    1
Blue          0           26       0                    1
Grey          0           1        0                    1
Multi         0           8        0                    1
Red           0           38       0                    1
Silver        0           43       0                    1
Silver/Black  0           7        0                    1
White         0           4        0                    1
Yellow        0           36       0                    1
Listing 16
Finally, if you want to remove your own statistics, you have to use the DROP STATISTICS command (see Listing 17).
drop statistics Production.Product.stats_product_color;
Listing 17
Oracle database systems
Now let's discuss the principles and techniques of performance tuning in Oracle Database. Initially, Oracle used a rule-based optimizer, but this type of optimizer has some drawbacks, so in version 7 Oracle introduced the cost-based optimizer. The Oracle optimizer has two basic modes of operation: FIRST_ROWS and ALL_ROWS. The FIRST_ROWS mode's goal is to get information back to the application as fast as possible. The ALL_ROWS mode's goal is to complete all the work as fast as possible. You can set the optimizer mode for a session using the OPTIMIZER_MODE initialization parameter (see Listing 18).
ALTER SESSION SET OPTIMIZER_MODE=mode
Listing 18
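For example, you might switch a session between the two goals like this (FIRST_ROWS_n variants such as FIRST_ROWS_10 are also supported):

```sql
-- favor returning the first rows quickly
alter session set optimizer_mode = first_rows;

-- favor total throughput (the default)
alter session set optimizer_mode = all_rows;
```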
As in SQL Server, Oracle defines three basic types of operations that can be part of an execution plan: data access, join operations, and other operations such as sorting, aggregation, and windowing.
Data access can be achieved in two ways: by accessing the data blocks of a table directly, or by using an index to retrieve data through the pointers stored in the index. There are also various ways to access clustered and nonclustered tables. The main physical operations in data access are:
- Full table scan. This operation scans the entire table.
- Partition. This operation performs one or more access operations for each partition needed in the query.
- ROWID. This operation is the fastest way to access data, because ROWID is an identifier which includes information to go directly to the row inside a particular data block. Indexes include a ROWID for each row represented by the index entry, thus this operation follows the use of indexes.
- Cluster. This operation enables access to rows in a cluster table.
In the case of accessing data through an index, you can have several operations as well:
- Unique scan. Uses the index to retrieve a single row.
- Range scan. Uses the index to retrieve the ROWID for a range of index values.
- Full scan. This operation performs a full scan on the index.
- Fast full scan. This operation performs a full scan with multi block reads.
- Skip scan. This operation is used when the query references the later columns of an index without using the leading columns; Oracle can skip index blocks whose leading columns contain the same value. For example, if an index has a search key on columns x, y, z of a table, and a query uses a selection criterion of z={some value}, the index skip scan looks for the first combination of values in columns x and y, and then checks whether there is an entry with column z equal to {some value}. If no entry is found, Oracle can skip the remaining index blocks where the values of columns x and y are identical.
- Bitmap. This operation uses a bitmap index.
- Partition. It's similar to the partition operation on tables.
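To make the skip scan described above concrete, here is a sketch; the table, index, and column names are assumptions for illustration only:

```sql
-- composite index whose leading column (gender) has very few distinct values
create index ix_staff on staff(gender, staffno);

-- this query omits the leading column; rather than scanning the whole table,
-- Oracle may choose an INDEX SKIP SCAN, probing the index once for each
-- distinct value of gender and searching for staffno within each
select * from staff where staffno = 7788;
```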
Finally, one important category is the join operations. Joins work by taking a row from one table, called the driver table, and checking the value of the join key against another table. The Oracle optimizer tries to pick the driver table that will produce the smallest number of rows in order to reduce overhead. There are several physical join operations, such as the hash join.
Now let's talk about statistics in Oracle. The Oracle database gathers various statistics on database objects. For columns, the statistics include the number of distinct values and the average column length. For indexes, they include the number of rows, the number of distinct values, the height of the index, and the number of leaf blocks. For tables, they include the number of rows, the average row length, and the total number of blocks.
To generate statistics on database objects in Oracle, you have to use the built-in package DBMS_STATS. For example, to gather statistics on objects in a schema, you have to use the GATHER_SCHEMA_STATS procedure. If you are logged into the system as SCOTT, you can gather statistics on the database objects on the schema SCOTT as shown in Listing 19.
exec dbms_stats.gather_schema_stats(ownname=>'scott');
Listing 19
If you want to collect statistics for a particular table, you must use the dbms_stats.gather_table_stats procedure. For example, Listing 20 gathers the statistics associated with the emp table in the SCOTT schema.
begin
  dbms_stats.gather_table_stats(user, 'emp');
end;
/
Listing 20
Collecting statistics is a task for database administrators, and typically statistics are collected at off-peak times so as not to overload the system. Oracle 10g automatically collects statistics during a maintenance window, which by default runs between 10 P.M. and 6 A.M. every day.
To illustrate with examples, we're going to use database objects within the SCOTT schema in the default ORCL database. To get information concerning the execution plan, you can use the autotrace capability, which turns on automatic tracing of SQL statements and displays the execution plan information (see Listing 21).
set autotrace on
Listing 21
The EXPLAIN PLAN functionality is the basis of AUTOTRACE; it instructs the Oracle optimizer to persist the results of its execution plan in a table called PLAN_TABLE in the user's schema. The syntax is shown in Listing 22.
EXPLAIN PLAN FOR sql_statement;
Listing 22
Now let's see some examples. To test them, you must be logged in as a user with privileges to gather statistics and display execution plans. We're going to base our examples on tables within the SCOTT schema. The first example creates a table based on the SCOTT.emp table, then creates an associated index and executes a query that uses this index, so we can see the execution plan (see Listing 23).
create table tb_emp as select * from scott.emp;

create unique index tb_emp_uidx on tb_emp(empno, ename);

begin
  dbms_stats.gather_table_stats(user, 'tb_emp');
end;
/

set autotrace on;

select * from tb_emp where empno=7788;
Listing 23
Sometimes it is tricky to read the execution plan by selecting all the columns from the PLAN_TABLE table directly. In that case, we can use the built-in DBMS_XPLAN package to format the output (see Listing 24).
delete from plan_table;

explain plan for select * from scott.emp;

select * from table(dbms_xplan.display);
Listing 24
Conclusion
In this article, we've covered the principles and techniques you can use to troubleshoot and evaluate query performance, illustrated with examples in Microsoft SQL Server and Oracle Database. Now you can apply these techniques to your own scenarios.