Introduction
In this article, we will take a close look at Microsoft Fabric and provide a step-by-step guide on using Dataflow Gen 2 to ingest data and create insights. If you are a data professional, developer, or IT leader, this article will equip you with practical skills to apply Microsoft Fabric in your data projects.
What is Microsoft Fabric, and how is it different?
Prior to Microsoft Fabric, a project team had to stitch together many Azure services to manage the data lifecycle. Your solution might have looked similar to this.
There are two main drawbacks to this approach.
- Integration challenges: Integrating multiple services so they work together seamlessly is often complex. Data must be moved between services, leading to delays and potential data loss or inconsistencies.
- Collaboration challenges: Team members are required to specialize in their own toolset, which limits the team's ability to collaborate and share knowledge.
To address these challenges, Microsoft Fabric provides a unified service that reduces complexity and streamlines data operations. This service ensures that business, data, and AI professionals can collaborate effectively to deliver data products quickly and efficiently.
Let’s have a quick look at each component.
- Data Factory: A set of services for data integration and workflow automation. It enables users to create, schedule, and orchestrate workflows to move and transform data. The main services are Dataflow Gen 2 and Data Pipelines.
- Data Engineering: A set of services focused on ingesting, processing, and storing data at scale. The main services are Lakehouse, Dataflow Gen 2, Spark Notebook, and Data Pipelines.
- Data Warehouse: A SQL-centric database designed to support business decision-making. In Fabric, the data is stored using the open Delta Lake format to improve performance and scalability.
- Data Science: A set of services to facilitate the development and deployment of ML and AI models. This includes Spark Notebook, MLflow, SynapseML (previously MMLSpark), and prebuilt AI models.
- Real-time Analytics: A set of services to process and analyze event or streaming data. This includes leveraging the KQL database, KQL query set, and low-code event processing.
- Power BI: Provides interactive visualizations to create dashboards and reports to share insight across the organization.
- Data Activator: A service that automates responses to changing data. This includes sending alert emails, posting Microsoft Teams messages, and triggering custom actions such as Power Automate flows.
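To make the Data Activator pattern concrete, here is a minimal conceptual sketch in Python. It is purely an analogy: the function name, the readings, and the threshold are all illustrative assumptions, not Fabric APIs.

```python
# Conceptual sketch of a Data Activator-style rule: watch a series of
# measurements and raise an alert when a condition is met.
# Names and thresholds are illustrative, not Fabric APIs.

def alert_on_threshold(readings, threshold):
    """Yield an alert message for each reading that crosses the threshold."""
    for name, value in readings:
        if value > threshold:
            yield f"ALERT: {name} reported {value}, above threshold {threshold}"

# Example: daily sales readings; alert when units sold exceed 1000.
readings = [("store_a", 800), ("store_b", 1250), ("store_c", 400)]
alerts = list(alert_on_threshold(readings, threshold=1000))
print(alerts)
```

In Fabric, the "action" side of such a rule would be an email, a Teams message, or a Power Automate flow rather than a printed string.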
Having explored the core components of Microsoft Fabric, let's dive into the hands-on tutorial.
Tutorial
- Let's verify that you have Fabric enabled. To do this, log in to https://app.powerbi.com/.
- On the bottom left, click on the 'Power BI' logo and you should see a list of Fabric components we discussed above. Click on 'Data Engineering'.
- If you do not see the Fabric components, this means Fabric is not enabled yet. You can follow the link: Start a Fabric trial to enable Fabric.
- Next, we will create our first Lakehouse. A Lakehouse is used to store our data in the workspace.
- Let's name our new lakehouse: 'My_Lakehouse' and click 'Create'. In the screenshot, you can see 'My_Lakehouse' at the top and bottom right. Additionally, we can see this lakehouse belongs to 'My workspace'.
- In the Explorer panel, we can see two folders.
- Tables: Designed to store structured data like a database table. The data is stored in the open Delta Lake format. You can query the data using standard SQL.
- Files: This folder stores all other types of data, including documents, images, and JSON files.
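As a local analogy for these two folders (this is not Fabric code, just Python stdlib used for illustration): "Tables" hold structured rows you can query with standard SQL, while "Files" hold arbitrary content, such as a JSON document, that is only useful once you parse it.

```python
import json
import sqlite3

# Local analogy for the Lakehouse's two folders (not Fabric code).

# "Tables": structured data in a SQL-queryable store.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("East", 120.0), ("West", 340.5)])
total = conn.execute("SELECT SUM(amount) FROM sales").fetchone()[0]
print(total)  # 460.5

# "Files": semi-structured content stored as-is until parsed.
doc = json.dumps({"invoice": 42, "notes": "free-form content"})
parsed = json.loads(doc)
print(parsed["invoice"])  # 42
```

In the Lakehouse, the Tables folder additionally stores its data in the open Delta Lake format rather than a local database file.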
- Now, we can ingest our data using Dataflow Gen 2 to create a table. Click on 'New Dataflow Gen2'. This will take a moment to open.
- A new data flow is created named 'Dataflow 1'. Click on 'Get data' and select 'Text/CSV'.
- Note: We can rename the dataflow by clicking on 'Dataflow 1'.
- Select 'Upload file (Preview)' and upload the sample file (Download). This file will be uploaded to your OneDrive.
- Now, you will see a preview of your data. Click 'Create' to continue.
- You are back on the Dataflow Gen 2 screen, where you can see the applied steps, the query, and the data.
- In the screenshot, you can see statistics generated for each column. You can enable this by going to View > Data view and checking 'Enable column profile'.
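The column profile surfaces per-column statistics such as counts, distinct values, and value distributions. As a rough local equivalent (an illustration only, not what Fabric runs under the hood), you can compute similar statistics for a numeric column with Python's standard library:

```python
import statistics

# Rough local equivalent of a column profile (illustrative only):
# row count, distinct count, min, max, and mean for one numeric column.
def profile_column(values):
    return {
        "count": len(values),
        "distinct": len(set(values)),
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
    }

units_sold = [3, 7, 7, 12, 1]
print(profile_column(units_sold))
```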
- Click 'Publish' to save and run the dataflow. This will take a minute to run. After it completes, you will receive an alert that 'Dataflow 1' published successfully.
- We are back in our workspace, where we can see 'Dataflow 1' and 'My_Lakehouse'. To view our new sales data, click on 'My_Lakehouse' with the type 'SQL analytics endpoint'.
- Note: It is important to select the 'SQL analytics endpoint' type rather than 'Lakehouse' or 'Semantic model'.
- With 'My_Lakehouse' open, we can see the 'sample_sales_data' table created under the 'Tables' folder. If you do not see it, click the refresh icon.
- In addition to viewing the data, you can interact with your data as well.
- New SQL query: Similar to SSMS or Data Studio, you can write custom SQL queries on the data.
- New visual query: Provides a no-code way to work with the data.
- New report: Create visual reports using the data.
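To give a flavor of the 'New SQL query' option, here is the kind of aggregation you might run against the `sample_sales_data` table. The column names (`region`, `units`, `unit_price`) are hypothetical, since the sample file's actual schema may differ, and the snippet uses an in-memory sqlite3 database purely as a local stand-in for the SQL analytics endpoint:

```python
import sqlite3

# The kind of SQL you might run via 'New SQL query'. sqlite3 stands in
# for the SQL analytics endpoint; column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sample_sales_data (region TEXT, units INTEGER, unit_price REAL)"
)
conn.executemany(
    "INSERT INTO sample_sales_data VALUES (?, ?, ?)",
    [("East", 10, 2.5), ("East", 4, 2.5), ("West", 6, 3.0)],
)

query = """
    SELECT region, SUM(units * unit_price) AS revenue
    FROM sample_sales_data
    GROUP BY region
    ORDER BY revenue DESC
"""
for region, revenue in conn.execute(query):
    print(region, revenue)
```

Against the real endpoint you would paste only the SQL itself into the query editor; the surrounding Python exists just so the example runs locally.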
Summary
In this article, we have learned:
- Microsoft Fabric is a unified service that enables all stakeholders to collaborate. It supports batch processing, streaming and event processing, reporting, and the ability to respond to data events.
- We set up a Lakehouse, ingested data using Dataflow Gen 2, and viewed the resulting table.
- The Lakehouse has built-in capabilities for interacting with the data through SQL queries, visual queries, and report creation.
Happy Learning!