Data is often called the new oil, but it is more than that: the projections and insights generated from it can make or break a company's prospects. Every organization faces challenges, in some form, with one or more of the following activities.
- Acquiring data (procurement)
- Storing and archiving the data (warehousing)
- Transforming it into insights (ETL)
These three are the basic, essential responsibilities of any database/BI team in a company. The data arrives from disparate sources, so the team must ensure it is integrated and meaningfully transformed. The visual insights obtained after transformation help management decide on strategies and set achievable goals.
How is this useful in a real-world scenario?
For example, suppose HSK Ltd is one of the largest grocery retail chains. The company collects terabytes of purchase data produced in its stores and wants to analyze it to gain insights into customer preferences, demographics, and behavior. This helps it offer products suited to its target audience, which drives business and keeps customers happy. To do this, the company needs to cross-reference data: customer details stored in an on-premises data store must be combined with the log data collected in a cloud data store. Then, to gain insights, it must process the joined data and publish the transformed output using a few other Azure services, such as Azure Synapse and HDInsight, and finally build a report on top of it. The whole process can also be scheduled to run daily.
All of this can be handled easily by Azure Data Factory. Best of all, it can be achieved without writing any code, as code-free ETL as a service, and it is serverless!
Image source: Microsoft docs
ADF Functions
Collect
The primary step in building such a system is acquiring data from the different sources so it can be processed.
Transform
After the data is available in a cloud data store, transformation can begin. You can either write your own code or use Azure tools to build and maintain data flows.
Monitor
Scheduled activities and pipelines need to be monitored. Built-in support is available via Azure Monitor, PowerShell, Azure Monitor logs, etc.
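As a small illustration, recent pipeline runs can be listed from PowerShell using the Az.DataFactory module; the resource group and factory names below are placeholders, not values from this article:

```powershell
# List pipeline runs updated within the last day
# (resource group and factory names are placeholders)
Get-AzDataFactoryV2PipelineRun `
    -ResourceGroupName "my-resource-group" `
    -DataFactoryName "my-data-factory" `
    -LastUpdatedAfter (Get-Date).AddDays(-1) `
    -LastUpdatedBefore (Get-Date)
```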
Top Level Concepts
Pipelines
A pipeline is a logical grouping of activities that together perform a unit of work.
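Pipelines are defined as JSON documents. A minimal sketch might look like the following; all names here are illustrative, not taken from the article:

```json
{
  "name": "DailySalesPipeline",
  "properties": {
    "description": "Copies raw sales logs into a staging store (illustrative names)",
    "activities": [
      {
        "name": "CopySalesLogs",
        "type": "Copy",
        "inputs": [ { "referenceName": "RawSalesLogs", "type": "DatasetReference" } ],
        "outputs": [ { "referenceName": "StagedSalesData", "type": "DatasetReference" } ],
        "typeProperties": {
          "source": { "type": "BlobSource" },
          "sink": { "type": "BlobSink" }
        }
      }
    ]
  }
}
```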
Activities
In simple terms, an activity is a processing step configured to complete a task; activities are the actual actions we expect to run. An activity can take input datasets and produce our desired outcome as output datasets.
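For instance, a copy activity takes an input dataset and writes to an output dataset. A hedged sketch, with hypothetical dataset names:

```json
{
  "name": "CopyCustomerDetails",
  "type": "Copy",
  "inputs": [ { "referenceName": "OnPremCustomerTable", "type": "DatasetReference" } ],
  "outputs": [ { "referenceName": "CloudCustomerTable", "type": "DatasetReference" } ],
  "typeProperties": {
    "source": { "type": "SqlSource" },
    "sink": { "type": "SqlSink" }
  }
}
```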
Datasets
A dataset points to the data we want an activity to use as its input or output.
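A dataset definition names the data and the linked service it lives behind. A sketch with hypothetical names and paths:

```json
{
  "name": "RawSalesLogs",
  "properties": {
    "type": "AzureBlob",
    "linkedServiceName": {
      "referenceName": "StorageLinkedService",
      "type": "LinkedServiceReference"
    },
    "typeProperties": {
      "folderPath": "sales/logs/",
      "format": { "type": "TextFormat", "columnDelimiter": "," }
    }
  }
}
```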
Linked services
Linked services are similar to connection strings: they hold the connection information Azure Data Factory needs to connect to external resources.
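A linked service for a storage account could be sketched as below; the name is hypothetical and the placeholders in the connection string must be filled in (never hard-code real keys):

```json
{
  "name": "StorageLinkedService",
  "properties": {
    "type": "AzureStorage",
    "typeProperties": {
      "connectionString": {
        "type": "SecureString",
        "value": "DefaultEndpointsProtocol=https;AccountName=<account>;AccountKey=<key>"
      }
    }
  }
}
```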
Triggers
Triggers act like schedulers: they determine when a pipeline execution should kick off.
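The daily run mentioned in the retail scenario could be expressed with a schedule trigger roughly like this; the trigger and pipeline names are hypothetical:

```json
{
  "name": "DailyMidnightTrigger",
  "properties": {
    "type": "ScheduleTrigger",
    "typeProperties": {
      "recurrence": {
        "frequency": "Day",
        "interval": 1,
        "startTime": "2021-01-01T00:00:00Z",
        "timeZone": "UTC"
      }
    },
    "pipelines": [
      {
        "pipelineReference": {
          "referenceName": "DailySalesPipeline",
          "type": "PipelineReference"
        }
      }
    ]
  }
}
```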
Conclusion
This was an introduction to Azure Data Factory. In the upcoming articles, we will look at more practical, real-world activities.