In the previous article, we learned how to load data into a SQL pool data warehouse in Azure Synapse Analytics. In this article, we’ll go into the details of loading data into the SQL warehouse in the best possible way, with different optimizations for better performance. We’ll learn about the various aspects to take into account when loading data, and the recommended practices are described in detail with proper examples. Moreover, we’ll use Azure Data Studio to query and work with the data warehouse we created in Azure.
This article is a part of the Azure Synapse Analytics Articles Series. You can check out other articles in the series from the following links.
- Azure Synapse Analytics
- Azure Synapse Analytics - Create Dedicated SQL Pool
- Azure Synapse Analytics - Creating Firewall at Server-level
- Azure Synapse Analytics - Connect, Query and Delete Data Warehouse SQL Pool
- Azure Synapse Analytics – Load Dataset to Warehouse from Azure Blob Storage
- Azure Synapse Analytics - Best Practices to Load Data into SQL Pool Data Warehouse
Data Preparation
Before we even start loading data from Azure Storage, one of the best things we can do is data preparation. It is vital to know the limitations of the various export data formats, such as the ORC file format. To reduce latency, first of all collocate the storage layer and the dedicated SQL pool. Next, learn to work around the limitations of the file format. One of the finest ways to do that is to export only a subset of columns from the data. PolyBase cannot load rows larger than 1 million bytes, so every row in the text files stored in Azure Data Lake Store or Azure Blob Storage must be smaller than 1 million bytes, regardless of the table schema. We can also split large compressed files into smaller compressed files to make the load work more efficiently.
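For example, a subset of columns can be exported straight from the dedicated SQL pool with CETAS (CREATE EXTERNAL TABLE AS SELECT). The sketch below is only illustrative: the storage account, container, data source, file format, export table, and column list are placeholder assumptions, and it presumes a database scoped credential for the storage account already exists.
-- All names below (data source, file format, table, storage path) are illustrative.
-- Assumes a database scoped credential named StorageCredential already exists.
CREATE EXTERNAL DATA SOURCE StagingStorage
WITH (
    TYPE = HADOOP,
    LOCATION = 'wasbs://staging@mystorageaccount.blob.core.windows.net',
    CREDENTIAL = StorageCredential
);
-- Compressed, delimited text keeps the exported files small.
CREATE EXTERNAL FILE FORMAT CompressedTextFormat
WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (FIELD_TERMINATOR = '|'),
    DATA_COMPRESSION = 'org.apache.hadoop.io.compress.GzipCodec'
);
-- Export only the columns we need, keeping every row well under 1 million bytes.
CREATE EXTERNAL TABLE [dbo].[ProspectiveBuyerExport]
WITH (
    LOCATION = '/prospective-buyers/',
    DATA_SOURCE = StagingStorage,
    FILE_FORMAT = CompressedTextFormat
)
AS
SELECT [ProspectiveBuyerKey], [FirstName], [LastName], [EmailAddress]
FROM [dbo].[ProspectiveBuyer];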
Manage Appropriate Compute to run Load
Running only one load at a time will provide the fastest loading speed. When loading doesn’t seem to be running at optimum speed, make sure that only a minimal number of loads are running concurrently. Furthermore, when you have a large loading job, scaling up the dedicated SQL pool before the load is a better idea.
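As a rough sketch, the pool can also be scaled from T-SQL by changing its service level objective; the target shown below (DW300c) is only an example and should be chosen to match your workload.
-- Run against the master database; DW300c is an example service level objective.
ALTER DATABASE [ojash-pool-warehouse]
MODIFY (SERVICE_OBJECTIVE = 'DW300c');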
We can also create loading users specifically dedicated to running loads. To run a load with the appropriate compute resources, we sign in as the loading user and run the load. This ensures that the load runs with that user’s resource class, which is a far simpler process than altering a user’s resource class to fit the needs of the current load.
Creating Loading User
For this, we need to create a loading user. Let us learn how to do it.
Step 1
Visit the Azure Portal and copy the Server name of the Dedicated SQL Pool Data Warehouse.
Step 2
Next, in Azure Data Studio, Click on New Connection.
Step 3
Now, we fill in the Connection Details.
The Server name is the one we copied earlier, and the Authentication Type must be SQL Login. Fill in the user name and password you set up while creating the dedicated SQL Pool in Azure. Once done, click on Connect.
Step 4
Now, we can see the Dedicated SQL Pool has been connected to Azure Data Studio as adminojash. We can also see the master and our main data warehouse, ojash-pool-warehouse.
Step 5
Now, Right Click on SQL Pool and Click on Create New Query. A new empty query page will open.
Step 6
Under the master database, write the following query.
-- Connect to master
CREATE LOGIN loader WITH PASSWORD = 'XnwQwe@21rpui()*&';
Step 7
Now, run the query. A message will appear confirming that the command ran successfully.
With this, a loading user named loader has been created.
Loading New User
Step 8
Write the following query under your data warehouse name (here, ojash-pool-warehouse) and hit Run. It creates the user loader in the warehouse, grants it the required permissions, and assigns it to a workload group.
-- Connect to the dedicated SQL pool
CREATE USER loader FOR LOGIN loader;
GRANT ADMINISTER DATABASE BULK OPERATIONS TO loader;
GRANT INSERT ON [dbo].[DimDate] TO loader;
GRANT SELECT ON [dbo].[ProspectiveBuyer] TO loader;
GRANT CREATE TABLE TO loader;
GRANT ALTER ON SCHEMA::dbo TO loader;
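-- The workload group below controls how much compute each load request receives,
-- and the classifier routes requests from the loader login to that group.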
CREATE WORKLOAD GROUP DataLoads
WITH (
MIN_PERCENTAGE_RESOURCE = 0
,CAP_PERCENTAGE_RESOURCE = 100
,REQUEST_MIN_RESOURCE_GRANT_PERCENT = 100
);
CREATE WORKLOAD CLASSIFIER [wgcELTLogin]
WITH (
WORKLOAD_GROUP = 'DataLoads'
,MEMBERNAME = 'loader'
);
Step 9
Now, Right Click on SQL Pool and Click on New Connection.
Step 10
Fill in the Connection Details. The Connection Type, Server, and Authentication Type are the same as before.
For the user name, enter ‘loader’, the user we just created, along with the password we set in the query. Make sure to enter the data warehouse name under Database; here it’s ojash-pool-warehouse.
Once filled, Click on Connect.
Step 11
Now, we can see that the data warehouse has been connected with the user ‘loader’. Previously it was connected as adminojash.
Managing Access Control for multiple users
Quite often we’ll set up multiple users to load data into the data warehouse. Since CREATE TABLE requires CONTROL permission on the database, granting it gives away control of all schemas, which might not be an ideal approach. It is essential to limit these permissions using the DENY CONTROL statement. We can do this with SQL commands as follows.
DENY CONTROL ON SCHEMA :: schema_A TO loader_B;
DENY CONTROL ON SCHEMA :: schema_B TO loader_A;
Staging Table
A staging table is simply a temporary table that holds the incoming business data; its primary purpose is to stage incremental data before it is merged into the data warehouse table. To achieve the fastest possible loading speed, load the data into a staging table first and then move it into the data warehouse table. For optimization, define the staging table as a heap and use round-robin as the distribution option. It is fundamental to understand that this is a two-step process: we first load the data into the staging table and then insert it into the production data warehouse table. If the production table uses hash distribution, define the staging table with the same hash distribution. Following these steps, you’ll observe that loading the staging table takes somewhat longer, but no data movement is incurred in the distribution when inserting into the production table, making that step rapid.
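A minimal sketch of this two-step pattern is shown below; the simplified DimDate column list, the storage path, and the COPY options are assumptions for illustration, and authentication to the storage account is omitted.
-- Hypothetical staging table: a heap with round-robin distribution loads fastest.
CREATE TABLE [dbo].[DimDate_Stage]
(
    [DateKey]              INT      NOT NULL,
    [FullDateAlternateKey] DATE     NOT NULL,
    [CalendarYear]         SMALLINT NOT NULL
)
WITH (HEAP, DISTRIBUTION = ROUND_ROBIN);
-- Step 1: bulk load into the staging table (the path and CSV options are placeholders).
COPY INTO [dbo].[DimDate_Stage]
FROM 'https://mystorageaccount.blob.core.windows.net/staging/dimdate/'
WITH (FILE_TYPE = 'CSV', FIELDTERMINATOR = ',');
-- Step 2: insert into the production table; if it is hash distributed, matching
-- the staging table's distribution avoids data movement during this step.
INSERT INTO [dbo].[DimDate] ([DateKey], [FullDateAlternateKey], [CalendarYear])
SELECT [DateKey], [FullDateAlternateKey], [CalendarYear]
FROM [dbo].[DimDate_Stage];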
Columnstore Index
When we talk about the standard for storing and querying huge data warehousing fact tables, the columnstore index is the one. This index can help us achieve over 10 times the query performance compared to traditional row-oriented storage.
The columnstore index requires a large amount of memory to compress data into high-quality rowgroups; to achieve the best possible compression and index efficiency, it needs to compress 1,048,576 rows into each rowgroup. This demand might not be met under memory pressure, which affects query performance. To make sure such situations don’t arise, we can take a couple of steps: confirm that the loading user has enough memory to achieve the maximum compression rate by using a loading user that is a member of a medium or large resource class, and load enough rows to fill new rowgroups completely, which prevents rows from ending up in the deltastore.
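As a rough diagnostic sketch, rowgroup health can be checked after a load with the following DMV query; compressed rowgroups close to 1,048,576 rows suggest well-sized loads, while many open rowgroups mean rows are still sitting in the deltastore.
-- Summarize columnstore rowgroup states across the pool after a load.
SELECT state_desc,
       COUNT(*)        AS rowgroup_count,
       AVG(total_rows) AS avg_rows_per_rowgroup
FROM sys.dm_pdw_nodes_db_column_store_row_group_physical_stats
GROUP BY state_desc;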
Batch Size
While using the SqlBulkCopy API or bcp, we should always consider ways to attain better throughput. This can often be achieved simply by increasing the batch size. A batch size of around 100 K to 1 million rows is recommended for optimal performance.
Statistics Creation
After the data is loaded, it is highly recommended to create statistics to improve query performance. We can do this manually or enable the auto-creation of statistics. The following query manually creates statistics on four columns of the DimSalesTerritory table.
CREATE STATISTICS [SalesTerritoryKey] ON [DimSalesTerritory] ([SalesTerritoryKey]);
CREATE STATISTICS [SalesTerritoryRegion] ON [DimSalesTerritory] ([SalesTerritoryRegion]);
CREATE STATISTICS [SalesTerritoryCountry] ON [DimSalesTerritory] ([SalesTerritoryCountry]);
CREATE STATISTICS [SalesTerritoryGroup] ON [DimSalesTerritory] ([SalesTerritoryGroup]);
Here we can see, in Azure Data Studio, that there is only a clustered index under the DimSalesTerritory table.
Once the query has run, we can see the four new statistics: SalesTerritoryCountry, SalesTerritoryGroup, SalesTerritoryKey, and SalesTerritoryRegion.
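Alternatively, single-column statistics can be created automatically by the pool. A minimal sketch of enabling this would look like the following.
-- Enable automatic creation of single-column statistics on the pool.
ALTER DATABASE [ojash-pool-warehouse]
SET AUTO_CREATE_STATISTICS ON;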
Conclusion
Thus, in this article, we learned various practices to improve performance while loading data into our data warehouse tables. We covered data preparation, managing compute for loading, creating a loading user, connecting to the data warehouse with the newly created user, managing access control and limiting control for different users, staging tables, and the columnstore index. Moreover, we learned about increasing the batch size when using the SqlBulkCopy API and about the need to create statistics for better performance.