In this article, we will walk through all the steps for creating an Azure Databricks Spark cluster and querying data from an Azure SQL database using the JDBC driver. Later, we will save the data from one SQL table to a CSV file.
Step 1. Create an Azure Databricks workspace
Microsoft Azure Databricks offers an intelligent, end-to-end solution for all your data and analytics challenges. Azure Databricks is a fast, easy, and collaborative Apache Spark-based analytics platform optimized for Azure. Designed in collaboration with the original founders of Apache Spark (Matei Zaharia, who created Apache Spark, is the co-founder and Chief Technologist of Databricks), Azure Databricks combines the best of Databricks and Azure to help customers accelerate innovation with one-click setup, streamlined workflows, and an interactive workspace that enables collaboration between data scientists, data engineers, and business analysts.
I have already worked with Azure HDInsight, which also offers a Spark cluster provided by Hortonworks, but I am really impressed with the features of Databricks. It has a very polished UI that gives users a great experience. Both HDInsight and Databricks have many pros and cons that I will cover in a separate article later.
For now, log in to the Azure Portal.
Choose "Create a new resource" select Analytics from Azure marketplace and choose Azure Databricks.
Please give a valid name to your Databricks service and then choose a resource group. If you don’t have a resource group, select the "Create New" option. You must also choose a geographic location and a pricing tier. Please note that Databricks currently provides a 14-day trial period for users to evaluate the service. We can convert this plan to Premium later.
After some time, your Databricks workspace will be created, and we can launch it. Please note that this is only the workspace; we still have to create the Spark cluster from inside this workspace.
Our workspace will load shortly after signing in through the Single Sign-On mechanism.
We can click the "Clusters" button on the left menu of the dashboard and then click the "Create Cluster" button.
We must give a valid cluster name and choose the number of worker systems we need.
Currently, Databricks provides various VM configurations. I chose the default Standard_DS3_v2 with one worker node for testing purposes. This VM type has 4 cores and 14 GB of RAM per node. I set the driver type to be the same as the worker, so the driver will have the same configuration as the worker nodes. Please note that a Spark cluster has one driver and may have multiple worker nodes; the driver coordinates all parallel operations.
Databricks also provides an option to give some initial configurations to our cluster.
I have not given any additional configuration, so just click the "Create Cluster" button. After some time, our cluster will be ready for use.
Step 2. Create an Azure SQL Database
We are going to query the SQL database in our cluster. We can create our Azure SQL DB now.
We will create this database from an existing data source; I chose Sample (AdventureWorksLT). We must select a database server for this database. If you don’t have a server, please create a new one.
After choosing the server, you can configure the required pricing tier for the database.
I opted for the Basic plan; it is minimal but sufficient for our testing purposes. After some time, our database will be created, and we can use it from our Spark cluster.
Please note that Microsoft now provides a Query Editor (currently in preview) for viewing and updating our data, similar to traditional SSMS.
We are going to read the SalesLT.Customer table data in our Spark cluster.
Step 3. Querying SQL data in Databricks Spark cluster
We can connect to the SQL database using JDBC. We will use a Scala notebook to query the database. A notebook is an editor where we can enter our Spark commands. Currently, Databricks notebooks support the Scala, Python, SQL, and R languages.
Please click the "Create Notebook" link in the dashboard and choose Scala as our scripting language.
We can check that the JDBC driver is available with the following command.
Class.forName("com.microsoft.sqlserver.jdbc.SQLServerDriver");
After entering the above command, press Shift + Enter to execute it.
By default, the Databricks Spark cluster comes with the SQL Server JDBC driver, so the above command will execute successfully.
We can assign the server name and database name to the variables below and execute the commands.
val jdbcHostname = "{sql server host name}"
val jdbcPort = 1433
val jdbcDatabase = "{sql db name}"
// Create the JDBC URL without passing in the user and password parameters.
val jdbcUrl = s"jdbc:sqlserver://${jdbcHostname}:${jdbcPort};database=${jdbcDatabase}"
We can put the credentials of our database server into the connectionProperties variable below; this variable will be used later when we execute our queries.
import java.util.Properties
val connectionProperties = new Properties()
connectionProperties.put("user", "{user name}")
connectionProperties.put("password", "{password}")
Register the JDBC driver class in the connection properties using the following commands.
val driverClass = "com.microsoft.sqlserver.jdbc.SQLServerDriver"
connectionProperties.setProperty("Driver", driverClass)
Now, we are going to query the data from the SQL database. For that, we create a DataFrame variable named customer using the below command.
val customer = spark.read.jdbc(jdbcUrl, "SalesLT.Customer", connectionProperties)
Please note that our customer variable now represents all the data from the SalesLT.Customer table.
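We are not limited to reading whole tables, either. Spark's JDBC reader accepts a subquery wrapped in a derived-table alias in place of the table name, so the filtering happens on the SQL Server side. The column list and alias in the sketch below are only illustrative.
// Push the filtering down to SQL Server by passing a subquery instead of a table name.
val pushdownQuery = "(SELECT CustomerID, FirstName, LastName, CompanyName FROM SalesLT.Customer WHERE CompanyName IS NOT NULL) AS filteredCustomer"
val filteredCustomer = spark.read.jdbc(jdbcUrl, pushdownQuery, connectionProperties)
filteredCustomer.show(5)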
We can list the table structure using the below command.
customer.printSchema()
We can also list the table values using the below command.
customer.select("CustomerID", "Title", "FirstName", "LastName", "CompanyName").show()
Please note that in the above command, we supplied 5 column names explicitly and we got values for those columns only. You can use all the Spark commands to query the data.
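For example, the usual DataFrame operations such as filter, groupBy, and orderBy work on this data as well. The two commands below are just small illustrations.
import org.apache.spark.sql.functions.{col, desc}

// Count the customers that have a company name.
customer.filter(col("CompanyName").isNotNull).count()

// Show the ten company names with the most customers.
customer.groupBy("CompanyName").count().orderBy(desc("count")).show(10)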
Step 4. Saving the queried data as a CSV file
We can easily save the above-queried data as a CSV file. For that, we need to mount an Azure Storage account to DBFS (the Databricks File System, a file system layer Databricks provides on top of cloud storage to handle file operations). We can use the below command to mount an existing storage account in DBFS. (You must have an Azure Blob storage account; otherwise, please create a new one.)
dbutils.fs.mount(
source = "wasbs://{container name}@{storage account name}.blob.core.windows.net/",
mountPoint = "/mnt/{mount point name}",
extraConfigs = Map("fs.azure.account.key.{storage account name}.blob.core.windows.net" -> "{storage account access key}")
)
The above command will create a DBFS mount point ("/mnt/sarath" in my case). We can use this mount point for our file system operations.
Please note that our Spark job returned a true value, which means our command was executed successfully. If there is any error during execution, it will show the exception with details.
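We can also double-check the mount by listing its contents from the notebook. The path below assumes the mount point name you used in the mount command; a freshly mounted, empty container simply returns no entries.
// List whatever currently exists under the new mount point.
dbutils.fs.ls("/mnt/{mount point name}").foreach(file => println(file.path))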
We can save our data frame using the below command.
customer.coalesce(1)
  .write.format("com.databricks.spark.csv")
  .option("header", "true")
  .save("{your mount point location}/customer.csv")
The above command will save all our data from the customer data frame to the customer.csv file. We can see the new file in Azure Storage Explorer.
Please note that Spark writes this CSV as a folder with the given name rather than as a single flat file. The folder contains the part file along with some metadata files, as shown below.
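If you prefer to inspect the output from the notebook instead of Azure Storage Explorer, you can list the output folder as well. The path below assumes the /mnt/sarath mount point and the customer.csv name used in this article.
// List the part file and metadata files Spark wrote inside the customer.csv folder.
dbutils.fs.ls("/mnt/sarath/customer.csv").foreach(file => println(file.name))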
We can verify the CSV file by reading the data back with a DataFrame reader. There are various options available for reading the data; we used the below method.
val customerFromCSV = spark.read.format("csv").option("header", "true").load("/mnt/sarath/customer.csv")
customerFromCSV.select("CustomerID", "Title", "FirstName", "LastName", "CompanyName").show()
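As a final sanity check, we can compare the row count of the DataFrame we read back from the CSV with the row count of the original customer DataFrame; the two numbers should match.
// Both counts should be identical if the round trip through CSV worked correctly.
println(s"Rows read from SQL: ${customer.count()}")
println(s"Rows read back from CSV: ${customerFromCSV.count()}")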
In this article, we created a new Azure Databricks workspace and then configured a Spark cluster. After that, we created a new Azure SQL database, read the data from the SQL database in the Spark cluster using the JDBC driver, and later saved the data as a CSV file. We then read the data back from the CSV file, and everything worked fine.
Hope you enjoyed this article. This is my third article on Azure Databricks. The other two are listed below.