Cleansing Files of Duplicate records via Azure Data Factory / Synapse

Problem Statement

Data Cleaning is an important part of ETL/ELT processes as it ensures that only high-quality data is loaded into the system. This helps to improve the accuracy of the data analytics.

So how to remove the duplicate records from the file as a part of the clean-up process via ADF / Synapse Pipeline.

Prerequisites

  • Azure Data Factory / Synapse

Solution

In the case of SQL, for the below scenario assuming the Uniqueness of the rows is based on Column C1 values; the rows highlighted indicate the duplicate rows.

solution

To remove the duplicate records in SQL, we can make use of the ROW_NUMBER and PARTITION / OVER BY function as seen below

removeDuplicates

To Cleanse the file of duplicate records via Azure Data Factory / Synapse, we can leverage the similar concept of ROW_NUMBER and PARTITION / OVER BY.

We would leverage Dataflow activity to achieve the use case.

Below is the flow for the same.

flow

Sample File

sampleFile

Create a Source flow and the corresponding dataset mapping to the Azure Blob storage (in this scenario) location wherein the duplicate data file is present.

Source location

SourceLocation

Dataset

Dataset

where Dataset Parameter

DatasetParams

SourceSettings

Data Preview

DataPreview

  1. Select the Window function to derive the ROW_NUMBER functionality.
    windowSetting
  2. Select the column via which we need to Partition.
    Note. In the case of Composite keys, we can select multiple columns.
    Incoming Stream
  3. In Sort, select the column via which we need to sort in case of duplicate rows.
    IncomingRangeBy
    IncomingWindowColumn
  4. Add a Column with the name, RowNbr, with the expression as rowNumber().

Data Preview

DataPreview

Filter the records with RowNbr equal to one as those represent unique rows via the Filter function.

FilterSettings

//Filter On Expression :
RowNbr==1

Data Preview

DataPreview1

Finally, add a Sink section to generate the cleansed file.

SinkSection

where Dataset

atasetConnection

With Sink Settings as below

SinkSettings

And in Mapping, delete the additional column RowNbr mapping.

Mapping

Output

Create a Pipeline, call the Dataflow via Dataflow activity, and trigger the pipeline.

Output

Result

Result