Change Data Capture (CDC)
For any ETL workload involving a large volume of data, much of the problem is solved by eliminating repeated or redundant copies in your data movement. You should not re-copy or re-move data that already exists in your destination datastore. The best option, therefore, is to copy only the data that has changed in the source since the last time the pipeline ran. The recently released Change Data Capture (CDC) feature in Azure is handy for exactly this. In this article, we will not cover the basics; instead, we will walk through a step-by-step implementation in Azure Data Factory.
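To make the idea concrete, here is a minimal sketch (Python with pyodbc; the connection string, table, and column names are hypothetical) of the incremental-copy pattern that CDC automates for you: remember a watermark from the previous run and pull only the rows modified after it.

```python
import pyodbc

# Hypothetical connection string and table; only the pattern matters here.
CONN_STR = (
    "DRIVER={ODBC Driver 18 for SQL Server};"
    "SERVER=<server>;DATABASE=<db>;UID=<user>;PWD=<password>"
)

def fetch_changes(last_watermark):
    """Return only the rows modified since the previous pipeline run."""
    with pyodbc.connect(CONN_STR) as conn:
        cursor = conn.cursor()
        cursor.execute(
            "SELECT * FROM dbo.Orders WHERE modify_date > ?",
            last_watermark,
        )
        return cursor.fetchall()
```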
Demo
Go to the Data Factory Studio. Under Factory Resources, just below the Pipelines option, you will find the Change Data Capture feature.
After selecting it, you will be asked to choose the source database, its linked service, and the tables and columns for which you want to enable CDC.
Keep in mind that the table must contain a date column for CDC to be enabled; the source selection screen will not list tables that lack one. I selected a table whose modify_date column holds the modification date.
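For reference, a table that satisfies this requirement might look like the sketch below (hypothetical table name and columns; only modify_date mirrors the demo), created here through pyodbc:

```python
import pyodbc

CONN_STR = "<connection string as above>"  # hypothetical placeholder

# The modify_date column is what makes the table eligible for CDC source selection.
DDL = """
CREATE TABLE dbo.Orders (
    order_id    INT IDENTITY PRIMARY KEY,
    customer    NVARCHAR(100),
    amount      DECIMAL(10, 2),
    modify_date DATETIME2 DEFAULT SYSUTCDATETIME()
)
"""

with pyodbc.connect(CONN_STR) as conn:
    conn.execute(DDL)
    conn.commit()
```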
The destination I want the incremental changes copied to is a file-type sink; I chose this because I want the changes saved as separate CSV files. Many other sink types are available, including SQL Server database tables, which you can select instead.
When you click Next, you will see the CDC canvas with the new CDC instance, and the source and destination you selected listed below it.
CDC is billed as a separate instance running on a 4-vCore compute, and like any other compute, it incurs cost for as long as it is running.
- Select the checkbox for each table you want CDC enabled on. Depending on the selections you made in the previous steps, multiple tables may be listed, and you can enable or disable them as needed.
- Set the latency interval at which CDC runs. The default is currently 15 minutes, but I will set it to Real-time, which means changes are copied as soon as they are made in the source.
With all the settings in place, let's publish the instance first and then test it. To begin, check the number of records in the source table and in the destination ADLS container where the changes will be copied as CSV files.
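A quick way to check both sides is a short script like the sketch below (connection string, account URL, and container name are placeholders; the folder name mirrors the listing shown later), using pyodbc for the source count and the azure-storage-file-datalake SDK for the destination listing:

```python
import pyodbc
from azure.storage.filedatalake import DataLakeServiceClient

CONN_STR = "<connection string as above>"          # hypothetical placeholder
ACCOUNT_URL = "https://<account>.dfs.core.windows.net"
CREDENTIAL = "<account-key>"                       # or an AAD credential

# Count the records currently in the source table.
with pyodbc.connect(CONN_STR) as conn:
    count = conn.execute("SELECT COUNT(*) FROM dbo.Orders").fetchone()[0]
    print(f"Source rows: {count}")

# List whatever files CDC has written to the destination folder so far.
service = DataLakeServiceClient(account_url=ACCOUNT_URL, credential=CREDENTIAL)
fs = service.get_file_system_client("<container>")
for path in fs.get_paths(path="incremental_files_dest"):
    print(path.name)
```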
Note that you cannot publish while the CDC instance is running; stop it first, publish, and then start the instance again.
Now let me insert one record into the source table.
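In my case this was a plain INSERT against the source table; sketched with pyodbc (hypothetical table and values), it looks like the following. Stamping modify_date with the current time is what lets CDC spot the change:

```python
import pyodbc

CONN_STR = "<connection string as above>"  # hypothetical placeholder

with pyodbc.connect(CONN_STR) as conn:
    # Stamp the row with the current time so CDC can pick it up.
    conn.execute(
        "INSERT INTO dbo.Orders (customer, amount, modify_date) "
        "VALUES (?, ?, SYSUTCDATETIME())",
        "Contoso", 125.50,
    )
    conn.commit()
```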
Now check the destination folder. You will see two files: one contains the changed rows, and the other is a marker file indicating the result, success or failure.
```
incremental_files_dest/_SUCCESS
incremental_files_dest/part-00000-31fabe54-e9ec-4aac-b090-7cf3d24e551d-c000.csv
```
Let me add 3 more records to the table.
The last record was saved as a separate file because I left a gap of a few minutes between the two inserts. Since the latency is set to the bare minimum, the process picked up each record as soon as it was written to the table.
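Reproduced as a sketch (hypothetical rows; the pause length is arbitrary), the sequence I ran looked roughly like this: two inserts back to back, then a few minutes' pause before the third, which is why the third row landed in its own file.

```python
import time
import pyodbc

CONN_STR = "<connection string as above>"  # hypothetical placeholder

rows = [("Fabrikam", 80.00), ("Northwind", 42.99), ("Tailspin", 310.25)]

with pyodbc.connect(CONN_STR) as conn:
    for i, (customer, amount) in enumerate(rows):
        conn.execute(
            "INSERT INTO dbo.Orders (customer, amount, modify_date) "
            "VALUES (?, ?, SYSUTCDATETIME())",
            customer, amount,
        )
        conn.commit()
        if i == 1:
            time.sleep(180)  # a few minutes' gap before the last insert
```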
Summary
We have seen a real-time implementation of the Change Data Capture feature in Azure Data Factory. Be aware, though, that you may run into new challenges and errors to troubleshoot when applying it to massive data sets, as this is still a relatively new feature in ADF.