Introduction
A data warehouse acts as a centralized data repository, integrating data from multiple diverse sources. It stores both current and historical data, facilitating data reporting and analysis. Data is periodically transferred from various essential business information sources into a data warehouse, undergoing processes like formatting, cleaning, validation, summarization, and reorganization. Regardless of the method used, the data warehouse serves as a permanent data store for reporting, analysis, and business intelligence (BI).
Example of a BI architecture
ETL vs ELT: which end-to-end data warehouse solution should you opt for?
The end-to-end data warehouse architectures on Azure include:
- Enterprise BI in Azure with Azure Synapse Analytics: This architecture implements an extract, load, and transform (ELT) pipeline that moves data from an on-premises SQL Server database into Azure Synapse.
- Automated enterprise BI with Azure Synapse and Azure Data Factory: This reference architecture demonstrates an ELT pipeline with incremental loading, automated using Azure Data Factory.
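The practical difference between the two patterns is where the transformation step runs. The following is a minimal, illustrative sketch in Python; the function names and hard-coded sample rows are hypothetical stand-ins for real source systems and warehouse tables.

```python
# Toy sketch contrasting ETL and ELT. All names here are illustrative;
# in practice the "warehouse" would be a service such as Azure Synapse.

def extract():
    """Pull raw rows from a source system (here, a hard-coded sample)."""
    return [{"amount": " 12.50 ", "region": "emea"},
            {"amount": "7.25", "region": "amer"}]

def transform(rows):
    """Clean and normalize rows: trim strings, cast types, uppercase codes."""
    return [{"amount": float(r["amount"].strip()), "region": r["region"].upper()}
            for r in rows]

def load(rows, store):
    """Append rows to the target store (a list standing in for a table)."""
    store.extend(rows)

# ETL: transform happens *before* the warehouse sees the data.
warehouse_etl = []
load(transform(extract()), warehouse_etl)

# ELT: raw data lands first; the warehouse's own compute transforms it later.
staging = []
load(extract(), staging)            # land raw data in a staging table
warehouse_elt = transform(staging)  # transform inside the warehouse

print(warehouse_etl == warehouse_elt)
```

Both paths end with the same cleaned rows; ELT simply defers the transformation to the warehouse, which is why MPP engines such as Azure Synapse suit it well.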
Data warehouses simplify complex data from operational systems, facilitating easier comprehension. They provide flexibility in using meaningful column names, restructuring schema to simplify relationships, and consolidating several tables into one. This facilitates BI system users to create reports and analyze data without requiring DBA or data developer assistance.
Data warehouses separate historical data from source transaction systems to optimize performance. They allow easy access to historical data from various locations by providing a centralized location with common formats, keys, and data models. As data warehouses optimize for read access, they speed up report generation compared to using source transaction systems. They can store historical data from multiple sources, clean up data during import, and ensure the transactional system focuses on handling writes while the data warehouse manages most read requests. Data warehouses also enhance security, provide tools for discovering hidden patterns in data, and facilitate the creation of business intelligence solutions such as OLAP cubes.
Overcoming Challenges in Configuring a Business-Oriented Data Warehouse
Setting up a data warehouse presents challenges such as modeling business concepts, setting up data orchestration, and maintaining or improving data quality. The data can be stored by the data warehouse itself or in a relational database such as Azure SQL Database. The analytical data store layer's purpose is to handle queries from analytics and reporting tools; in Azure, Azure Synapse or Azure HDInsight (using Hive or Interactive Query) can meet this requirement. Orchestration to move or copy data from data storage into the data warehouse can be achieved with Azure Data Factory or Oozie on Azure HDInsight.
Data warehouse implementations in Azure fall into two categories: symmetric multiprocessing (SMP) and massively parallel processing (MPP). The choice between them depends on data size, workload patterns, and your organization's existing infrastructure.
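The distinction can be pictured with a simple aggregation. This is a conceptual sketch only: the "nodes" below are plain Python lists, whereas a real MPP engine such as Azure Synapse distributes partitions across physical compute nodes.

```python
# Minimal sketch of the SMP vs. MPP distinction using a sum aggregation.

data = list(range(1, 101))  # 100 rows to aggregate

# SMP: a single engine (multiple cores sharing memory) scans all rows.
smp_total = sum(data)

# MPP: rows are partitioned across nodes; each node aggregates its own
# partition independently, and a control node combines the partial results.
num_nodes = 4
partitions = [data[i::num_nodes] for i in range(num_nodes)]
partials = [sum(p) for p in partitions]  # runs independently on each node
mpp_total = sum(partials)                # control node merges partials

print(smp_total, mpp_total)  # both 5050
```

Because each partition is processed independently, MPP throughput scales by adding nodes, which is why it suits large analytical scans; the coordination overhead is why SMP remains better for many small transactional operations.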
The choice of SMP or MPP solution requires considering factors like managed service requirements, data set size and complexity, the structured or unstructured nature of data, separation of historical data from operational data, integration of multiple data sources, multitenancy requirement, relational data store preference, real-time reporting requirements, and user concurrency needs.
What do the experts say?
To help narrow down your choices, the experts recommend that you answer the following questions:
- Are you interested in a managed service, or do you prefer to manage your own servers?
- Are you dealing with large data sets or complex, long-running queries? If so, consider using an MPP option.
- For a large data set, is the data source structured or unstructured? If it's unstructured, consider using a big data environment like Spark on HDInsight, Azure Databricks, Hive LLAP on HDInsight, or Azure Data Lake Analytics. These can function as ELT and ETL engines, transforming the data into a structured format, which can then be loaded into Azure Synapse or other options. For structured data with compute-intensive workloads, consider using Azure Synapse's "Optimized for Compute" performance tier.
- Do you need to separate your historical data from current operational data? If yes, choose an option that requires orchestration, as these standalone warehouses are optimized for heavy read access and serve as separate historical data stores.
- Will you integrate data from multiple sources beyond your OLTP data store? If so, look for options that easily support multiple data sources.
- Do you require a multitenancy setup? If yes, note that Azure Synapse might not be suitable, and you should explore other options. Refer to Azure Synapse Patterns and Anti-Patterns for more details.
- Are you specifically looking for a relational data store? If yes, select an option with a relational data store. However, remember that you can use PolyBase to query non-relational data stores if necessary but perform performance tests on unstructured data sets for your workload.
- Do you have real-time reporting needs? If you need rapid query response times for high volumes of singleton inserts, choose an option that supports real-time reporting.
- Are you expecting to support many concurrent users and connections? The ability to handle concurrent users/connections depends on various factors, and you may need to consider resource limits based on the service tier for Azure SQL Database or VM size for SQL Server. Azure Synapse also limits concurrent queries and connections, so consider using complementary services like Azure Analysis Services to overcome these limits if needed.
- What type of workload do you have? MPP-based warehouse solutions are well-suited for analytical, batch-oriented workloads. For transactional workloads with many small read/write operations or row-by-row operations, consider using one of the SMP options. An exception to this guideline is when using stream processing on an HDInsight cluster (e.g., Spark Streaming) and storing the data within a Hive table.
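The questions above can be condensed into a coarse decision helper. The sketch below encodes only four of them, and its rules are illustrative simplifications of the guidance, not official thresholds.

```python
# Hedged sketch: a few of the questions above as a simple decision helper.
# The rules are illustrative, not official Azure guidance.

def recommend(managed_service: bool,
              large_or_complex: bool,
              structured: bool,
              multitenant: bool) -> str:
    """Return a coarse platform suggestion from four of the questions above."""
    if not managed_service:
        return "SQL Server (VM)"
    if large_or_complex and not structured:
        # Unstructured big data: transform with a big data engine first (ELT).
        return "Spark/Hive on HDInsight or Azure Databricks (ELT into a warehouse)"
    if large_or_complex and not multitenant:
        return "Azure Synapse (MPP)"
    # Small/moderate workloads, or multitenancy needs: prefer an SMP option.
    return "Azure SQL Database (SMP)"

print(recommend(managed_service=True, large_or_complex=True,
                structured=True, multitenant=False))  # Azure Synapse (MPP)
```

A real evaluation would also weigh the remaining questions (real-time reporting, concurrency limits, relational store preference) before committing to a platform.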
Comparing Data Processing Technologies for Data Warehousing in Azure
In the ever-evolving landscape of data management and processing, various technologies offer distinct capabilities to cater to diverse business needs. Among the prominent players in this domain are Azure SQL Database, SQL Server (VM), Azure Synapse, Apache Hive on HDInsight, and Hive LLAP on HDInsight. Each of these technologies possesses unique features and functionalities, making them suitable for different use cases.
In this comparison, we will explore the key capabilities of these platforms and understand how they differ in terms of manageability, data orchestration, data integration, compute pausing, data store type, real-time reporting, backup restore options, and processing paradigms (SMP or MPP). By gaining insights into their strengths and weaknesses, organizations can make informed decisions about selecting the most appropriate solution for their specific data processing and storage requirements. Let's delve into the details of each technology to better comprehend their individual offerings and benefits.
Comparison of capabilities among Azure SQL Database, SQL Server (VM), Azure Synapse, Apache Hive on HDInsight, and Hive LLAP on HDInsight:

General capabilities

| Capability | Azure SQL Database | SQL Server (VM) | Azure Synapse | Apache Hive on HDInsight | Hive LLAP on HDInsight |
| --- | --- | --- | --- | --- | --- |
| Managed service | Yes | No | Yes | Yes [1] | Yes [1] |
| Data orchestration and historical data | No | No | Yes | Yes | Yes |
| Multiple data source integration | No | No | Yes | Yes | Yes |
| Supports pausing compute | No | No | Yes | No [2] | No [2] |
| Relational data store | Yes | Yes | Yes | No | No |
| Real-time reporting | Yes | Yes | No | No | Yes |
| Flexible backup restore points | Yes | Yes | No [3] | Yes [4] | Yes [4] |
| SMP/MPP (symmetric multiprocessing / massively parallel processing) | SMP | SMP | MPP | MPP | MPP |
Note
- [1] Manual configuration and scaling are required for Apache Hive and Hive LLAP on HDInsight.
- [2] HDInsight clusters do not pause compute directly, but you can approximate this by deleting the cluster when it is not needed and re-creating it later, attaching an external data store so that your data is retained.
- [3] Azure Synapse provides restore points within the last seven days, with snapshots taken every four to eight hours.
- [4] For greater backup flexibility, consider using an external Hive metastore or a third-party HDInsight backup solution such as Imanis Data.
Scalability capabilities
Azure SQL Database, SQL Server (VM), and Azure Synapse all provide redundant regional servers for high availability, ensuring data is accessible even in the event of a failure. However, when it comes to query scale-out, Azure Synapse, Apache Hive on HDInsight, and Hive LLAP on HDInsight have an advantage, as they support distributed queries, enabling the processing of large datasets across multiple nodes for improved performance.
Regarding dynamic scalability, Azure SQL Database and Azure Synapse stand out. Both solutions offer the ability to adjust their computing power according to demand. Azure Synapse, in particular, allows users to scale up or down by adjusting the number of data warehouse units (DWUs), providing significant flexibility in managing resources. In-memory caching of data is a common feature across all solutions, enhancing query performance by storing frequently accessed data in memory for faster access.
Overall, the choice of a data management solution should consider the application's specific requirements and workload, as each option has its strengths and might be better suited for different scenarios.
Security capabilities
Azure SQL Database, SQL Server in a virtual machine, and Azure Synapse all support authentication through SQL authentication and Azure Active Directory (Azure AD). SQL Server in a virtual machine also offers authentication through Active Directory, providing additional flexibility.
When it comes to authorization and auditing, all solutions offer these capabilities, allowing users to control access to data and track data access activities for security and compliance purposes.
Data encryption at rest is a feature supported by all solutions. It requires using Transparent Data Encryption (TDE) to encrypt and decrypt data while it is stored on disk, ensuring data remains secure even when not actively in use.
Row-level security is available in Azure SQL Database, SQL Server in a virtual machine, and Azure Synapse, enabling users to define access controls at the row level, restricting access to specific data rows based on certain conditions. However, Apache Hive on HDInsight lacks this feature.
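What row-level security provides, conceptually, is a per-user predicate that the engine applies transparently to every query. The sketch below imitates that behavior in plain Python; the table, users, and region policy are hypothetical examples, not a real security implementation.

```python
# Illustrative sketch of row-level security: a per-user predicate applied
# before any rows are returned. All names and data here are hypothetical.

orders = [
    {"order_id": 1, "region": "EMEA", "amount": 100},
    {"order_id": 2, "region": "AMER", "amount": 250},
    {"order_id": 3, "region": "EMEA", "amount": 75},
]

# Security policy: each analyst may only see rows for their own region.
user_region = {"alice": "EMEA", "bob": "AMER"}

def query_orders(user: str):
    """Apply the row-level predicate before returning any rows."""
    allowed = user_region[user]
    return [row for row in orders if row["region"] == allowed]

print([r["order_id"] for r in query_orders("alice")])  # [1, 3]
```

In SQL Server and Azure SQL Database this predicate would be defined once as a security policy on the table, so every query issued by any tool is filtered automatically.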
All solutions support firewall rules, enabling administrators to control which IP addresses can access the data management services, and all of them can also be deployed within an Azure Virtual Network for network-level isolation.
Dynamic data masking, a feature that allows sensitive data to be masked or hidden from certain users, is available in Azure SQL Database, SQL Server in a virtual machine, and Hive LLAP on HDInsight. However, it is not supported in Azure Synapse and Apache Hive on HDInsight.
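The effect of dynamic data masking can be sketched as follows: data is stored in full, but masked at read time for users without unmask permission. The masking rule and column names below are illustrative assumptions, loosely modeled on the built-in email masking pattern.

```python
# Sketch of dynamic data masking: values are stored unmodified, but masked
# at query time for users without unmask permission. Rules are illustrative.

def mask_email(value: str) -> str:
    """Expose only the first character and the domain, e.g. 'aXXX@contoso.com'."""
    local, _, domain = value.partition("@")
    return local[:1] + "XXX@" + domain

def read_customer(row: dict, can_unmask: bool) -> dict:
    """Return the row as stored, or with sensitive columns masked."""
    if can_unmask:
        return row
    return {**row, "email": mask_email(row["email"])}

customer = {"id": 7, "email": "alice@contoso.com"}
print(read_customer(customer, can_unmask=False)["email"])  # aXXX@contoso.com
print(read_customer(customer, can_unmask=True)["email"])   # alice@contoso.com
```

Unlike encryption, masking changes only what a query returns, not what is stored, which is why it complements rather than replaces features such as TDE.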
While all solutions share many security features like authentication, authorization, auditing, and data encryption at rest, the availability of specific security features such as row-level security and dynamic data masking may vary, and users should consider their specific security requirements when selecting a data management solution.
Selecting the right data management and processing technology for a data warehouse is critical for efficiently managing and analyzing large volumes of data. The comparison of capabilities among Azure SQL Database, SQL Server (VM), Azure Synapse, Apache Hive on HDInsight, and Hive LLAP on HDInsight reveals unique strengths and functionalities for each platform.
In terms of general capabilities, Azure SQL Database, Azure Synapse, Apache Hive on HDInsight, and Hive LLAP on HDInsight stand out as managed services, simplifying administration and maintenance. They also support data orchestration, historical data storage, and integration of multiple data sources and offer various backup and restore points options. However, SQL Server (VM) provides greater flexibility for organizations that prefer to manage their own servers.
Scalability capabilities highlight Azure Synapse as a standout with support for dynamic scalability through adjusting data warehouse units (DWUs), empowering organizations to efficiently manage resources based on demand. It also supports query scale-out, allowing distributed queries for improved performance.
Regarding security capabilities, all solutions offer authentication, authorization, and data encryption at rest. Azure SQL Database, SQL Server in a virtual machine, and Azure Synapse provide row-level security, and the first two also support dynamic data masking. However, Apache Hive on HDInsight lacks both row-level security and dynamic data masking, making it less suitable for handling sensitive data.
Conclusion
The choice of the data management and processing technology for a data warehouse should be based on specific business requirements, workload patterns, data sizes, security considerations, and the level of managed service desired. Organizations must carefully assess their needs and align them with the strengths and weaknesses of each solution to make informed decisions that will optimize data processing, storage, and analytics for their unique use cases.