Select "Create a resource" and choose Analytics -> Data Factory.
Give a valid name to the Azure Data Factory and choose a resource group. If you don’t have an existing resource group, please create a new one.
Azure Data Factory will be created shortly.
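If you prefer to script this step instead of clicking through the portal, the sketch below shows one possible way to create a Data Factory with the azure-mgmt-datafactory Python package. The subscription ID, resource group, factory name, and region are placeholders, not values from this walkthrough.

```python
# A minimal sketch (not the portal flow above): creating a Data Factory
# with the azure-mgmt-datafactory package. All names are placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import Factory

subscription_id = "<your-subscription-id>"   # placeholder
resource_group = "<your-resource-group>"     # placeholder
factory_name = "<your-data-factory-name>"    # placeholder

client = DataFactoryManagementClient(DefaultAzureCredential(), subscription_id)

# Create (or update) the Data Factory in the chosen region.
factory = client.factories.create_or_update(
    resource_group,
    factory_name,
    Factory(location="eastus"),
)
print(factory.provisioning_state)
```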
Step 2 - Store Data in Blob Storage
We can now upload a sample CSV file to Blob Storage.
Go to the Storage Account and click "Storage Explorer" (currently, it is in preview mode).
Right-click "Blob Containers" and you will see the "Create Blob Container" option in the context menu. Just click it.
Please choose a valid container name (container names must be lowercase) and set the Public access level to "Container" so that we can access this container later from our Azure Data Factory.
Open the container and upload a sample CSV file to it. I will upload an employee data CSV file that contains only 3 records. Please click the "Upload" button to proceed.
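If you would rather script the container creation and upload, here is a minimal sketch using the azure-storage-blob Python package. The connection string, container name, and the three employee rows are hypothetical placeholders for illustration.

```python
# A minimal sketch using the azure-storage-blob package (v12). The
# connection string, container name, and sample rows are placeholders.
from azure.storage.blob import BlobServiceClient

connection_string = "<your-storage-account-connection-string>"  # placeholder
container_name = "employeedata"                                 # placeholder (must be lowercase)

service = BlobServiceClient.from_connection_string(connection_string)

# Create the container with "Container" public access, matching the
# setting chosen in the portal above.
container = service.create_container(container_name, public_access="container")

# Hypothetical employee CSV with a header row and 3 records, mirroring
# the sample file used in this walkthrough.
csv_content = (
    "EmployeeId,Name,Department\n"
    "1,John,Finance\n"
    "2,Jane,HR\n"
    "3,Peter,IT\n"
)
container.upload_blob("employees.csv", csv_content, overwrite=True)
```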
Step 3 - Create a new Database and Collection in Azure Cosmos DB Account
Open the Cosmos DB account and click "Data Explorer".
Click the "New Database" button, give a database name, and choose a Throughput value (it is not mandatory; you can simply ignore it).
You can add a new collection to this database by right-clicking the database and choosing the "New Collection" option.
Give the collection a name and specify a partition key as well. The partition key determines how documents are distributed across partitions, and it plays a role loosely comparable to a primary key in a SQL Server database.
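To make the partition key idea concrete, here is a minimal sketch that creates the same kind of database and collection with the azure-cosmos Python package. The endpoint, key, database name, collection name, and partition key path are placeholders.

```python
# A minimal sketch using the azure-cosmos package (v4). The endpoint,
# key, and all names below are placeholders -- adjust them to your account.
from azure.cosmos import CosmosClient, PartitionKey

endpoint = "https://<your-account>.documents.azure.com:443/"  # placeholder
key = "<your-cosmos-db-primary-key>"                          # placeholder

client = CosmosClient(endpoint, credential=key)

# Create the database; throughput is optional, as noted above.
database = client.create_database_if_not_exists("EmployeeDB")

# Create the collection (container) with a partition key. Every document
# carries this property, and Cosmos DB uses it to distribute documents
# across partitions.
container = database.create_container_if_not_exists(
    id="Employees",
    partition_key=PartitionKey(path="/EmployeeId"),
)
```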
Step 4 - Create a Pipeline in Azure Data Factory
We have already created the Azure Data Factory. We can create a "Copy Data" pipeline now. Please open the Azure Data Factory and click the "Author & Monitor" button.
It will open the ADF dashboard. Choose the "Create Pipeline" option.
We can create a new activity now. In the filter box, please type "Copy"; it will show the "Copy Data" option under the Move & Transform tab. You can drag this activity to the work area as I did.
We can rename the activity in the "General" tab. I have also given a short description.
In the Source tab, you can select the source dataset. Please click the "New" button. It will list all the data sources available in ADF. Currently, Microsoft supports more than 70 data sources.
As our source dataset is Blob storage, please choose it and click the "Finish" button.
Choose the "Connection" tab and click the "New" button to create a new linked service for the source dataset.
Select your Azure subscription, choose the storage account we created earlier, and click the "Finish" button.
In the Connection settings, browse the blob storage and choose the container we created earlier.
You can ignore the file name. It will automatically pick the file from the container.
In our CSV, the first row contains the column names, so select the "Column names in the first row" option.
Select the Copy activity and choose the Sink tab, which is for the destination dataset.
Please click the "New" button to choose the destination dataset, select Azure Cosmos DB as the destination data source, and click the "Finish" button.
Choose the Connection tab and click the "New" button to create a new linked service for Cosmos DB.
Choose your Azure subscription from the dropdown list, select the Cosmos DB account name and database name, and click the "Finish" button.
You can choose the collection name from the list. (We already created this collection in the Cosmos DB account.)
We have now created the source and sink datasets in our pipeline. We can validate the pipeline and datasets before publishing them.
Any validation errors will be shown here.
Our validation was successful. We can now publish the changes to ADF.
It will take some time to publish all the changes.
After a successful publish, we can Trigger the pipeline.
Click the "Trigger" button and choose "Trigger Now". It will open a window; choose the "Finish" button.
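If you prefer to trigger the pipeline from code rather than the portal, the sketch below uses the azure-mgmt-datafactory Python package to start a run and poll its status. All names are placeholders and must match whatever you actually published.

```python
# A minimal sketch: triggering a published pipeline run with
# azure-mgmt-datafactory. All names below are placeholders.
import time

from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

subscription_id = "<your-subscription-id>"   # placeholder
resource_group = "<your-resource-group>"     # placeholder
factory_name = "<your-data-factory-name>"    # placeholder
pipeline_name = "<your-pipeline-name>"       # placeholder

client = DataFactoryManagementClient(DefaultAzureCredential(), subscription_id)

# Start a pipeline run -- the programmatic equivalent of "Trigger Now".
run = client.pipelines.create_run(resource_group, factory_name, pipeline_name)

# Poll until the run leaves the Queued/InProgress states, then report.
while True:
    pipeline_run = client.pipeline_runs.get(resource_group, factory_name, run.run_id)
    if pipeline_run.status not in ("Queued", "InProgress"):
        break
    time.sleep(10)
print(pipeline_run.status)  # e.g. "Succeeded"
```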
We will be notified with a message that the pipeline succeeded.
Our data integration is now complete. We can open Cosmos DB to check the data copied from Blob Storage. You can see that there are three records (documents) available in Cosmos DB; as I mentioned earlier, my CSV file contains 3 records.
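To double-check the result from code, here is a minimal sketch that reads the documents back with the azure-cosmos Python package, reusing the placeholder names from the earlier sketch.

```python
# A minimal sketch: reading the copied documents back from Cosmos DB.
# Endpoint, key, and names are the same placeholders used earlier.
from azure.cosmos import CosmosClient

endpoint = "https://<your-account>.documents.azure.com:443/"  # placeholder
key = "<your-cosmos-db-primary-key>"                          # placeholder

client = CosmosClient(endpoint, credential=key)
container = client.get_database_client("EmployeeDB").get_container_client("Employees")

# Each CSV row should now be a separate JSON document in the collection.
items = list(container.read_all_items())
print(f"Found {len(items)} documents")  # expect 3 for the sample file
for item in items:
    print(item)
```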
You can download the ARM (Azure Resource Manager) template for this ADF for future use. The ARM template contains the pipeline and dataset details.
Normally, there are two ARM template files available for each ADF: "arm_template.json" and "arm_template_parameters.json".
arm_template.json