In continuation of our previous article on Azure Synapse Analytics, we will take a deep dive into the sharding patterns (distributions) used in the Dedicated SQL Pool. Behind the scenes, the Dedicated SQL Pool divides the work of each query into 60 smaller queries that run in parallel across the compute nodes. You define the distribution method while creating the table; if you do not specify one, ROUND-ROBIN distribution is selected by default.
There are three types of distribution:
- HASH
- ROUND-ROBIN
- REPLICATED TABLES
HASH Distribution
In Hash distribution, whenever rows from a table are stored on the compute nodes, a hash function decides which row is stored on which node; it determines the placement pattern for all of the table's rows. Hash distribution is the common, go-to method when you want the highest query performance for joins and aggregations on large tables. In the background, the hash function uses the values of the declared distribution column to assign each row to a compute node.
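As a minimal sketch of this, the statement below creates a hash-distributed table. The table name FactSales and its columns are illustrative assumptions, not from the original article; CustomerId is used as the distribution column.

CREATE TABLE dbo.FactSales
(
    SaleId      BIGINT        NOT NULL,
    CustomerId  INT           NOT NULL,  -- distribution column: rows with the same CustomerId land in the same distribution
    SaleAmount  DECIMAL(18,2),
    SaleDate    DATE
)
WITH
(
    DISTRIBUTION = HASH (CustomerId),
    CLUSTERED COLUMNSTORE INDEX          -- default index type in a Dedicated SQL Pool
);

Choosing a distribution column with many distinct values and few NULLs helps spread rows evenly and avoid data skew.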
ROUND-ROBIN Distribution
Round-robin distribution is a very simple type that is commonly used for staging tables during loads. It works in a circular fashion: table rows are placed on each node in a sequential pattern. Loading data into a round-robin table is very quick, but query performance is usually better with hash-distributed tables, because joins against a round-robin table require reshuffling of the data, which adds time before results are returned. The general CREATE TABLE syntax for choosing a distribution is shown below, followed by a short staging-table sketch.
CREATE TABLE { database_name.schema_name.table_name | schema_name.table_name | table_name }
(
    { column_name <data_type> [ <column_options> ] } [ ,...n ]
)
[ WITH ( <table_option> [ ,...n ] ) ]

<table_option> ::=
      DISTRIBUTION = HASH ( distribution_column_name )
    | DISTRIBUTION = ROUND_ROBIN -- default for Azure Synapse Analytics
    | DISTRIBUTION = REPLICATE -- default for Parallel Data Warehouse
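As an illustrative sketch (the staging schema, table name, and columns are assumptions, not from the article), a round-robin staging table for fast loads could look like this:

CREATE TABLE stg.SalesLoad
(
    SaleId      BIGINT,
    CustomerId  INT,
    SaleAmount  DECIMAL(18,2),
    SaleDate    DATE
)
WITH
(
    DISTRIBUTION = ROUND_ROBIN,  -- rows are spread evenly across distributions, fastest to load
    HEAP                         -- a heap is a common choice for staging tables
);

After the load completes, the data is typically inserted into a hash-distributed table to get better join and aggregation performance.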
REPLICATED TABLES Distribution
You may be wondering whether there is a method that helps deal with small or medium-sized tables, since practically not every table we store needs to be humongous. REPLICATED tables provide the quickest and best query performance when working with smaller tables. They do this by caching a full copy of the table on each compute node, which avoids the need to transfer data between nodes before a join or aggregation. This approach is best suited to smaller tables, because it requires extra storage and incurs additional overhead when writing data, which is why it is not advised for larger tables.
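As a hedged example (the DimProduct table name and columns are assumptions for illustration), a small dimension table could be created as a replicated table like this:

CREATE TABLE dbo.DimProduct
(
    ProductId    INT           NOT NULL,
    ProductName  NVARCHAR(100),
    Category     NVARCHAR(50)
)
WITH
(
    DISTRIBUTION = REPLICATE,        -- a full copy of the table is cached on each compute node
    CLUSTERED COLUMNSTORE INDEX
);

Joining this replicated dimension to a hash-distributed fact table avoids data movement between nodes at query time.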
Summary
This is a continuation of my previous article, Azure Synapse Analytics architecture. Both articles explain how row data from tables is stored in the underlying storage and how you can control that placement to get better performance.
Reference:
Microsoft official documentation