In this article, we will cover the steps for creating an Azure Databricks workspace and configuring a Spark cluster. Later, we will create a simple data table from an existing file and query the data using a SQL notebook.
Azure Databricks is an Apache Spark-based analytics platform optimized for Microsoft Azure. With a high-performance processing engine, you can improve and scale your analytics globally, saving valuable time and money while driving new insights and innovation for your organization. Azure Databricks offers an intelligent, end-to-end solution for your data and analytics challenges.
For those who are new to Apache Spark, it is a general framework for distributed computing that offers high performance for both batch and interactive processing. It exposes APIs for Java, Python, and Scala, and consists of Spark Core and several related projects, including Spark SQL, a module for working with structured data.
Apart from Databricks, Hortonworks also provides a Spark platform in Azure as part of HDInsight. Cloudera is another vendor that provides a Spark platform independently.
Step 1. Create the Azure Databricks workspace
Log in to the Azure portal.
Choose Create a Resource and select Azure Databricks from the filter box. Choose the workspace name, resource group, and location. If you don't have a resource group, you can create a new one. Please note that Azure Databricks currently provides a 14-day trial of the premium tier.
It will take some time to create the workspace. After creation, you can go to the resource and launch the workspace. You will be logged in using single sign-on.
Now, we are going to create a new Spark cluster. You can choose the driver type, worker type, and the number of worker nodes. Please note that every Spark cluster must have a driver node, and you can add as many worker nodes as your subscription and free available cores allow. Here, I chose the Standard_DS3_v2 worker type and the same driver type. Each node has 4 cores, and I chose only 1 worker node for testing purposes.
It will take some time to set up the cluster. If you check the cluster, you can see that its creation status is Pending.
After some time, our cluster will be ready, and the status will show as Running.
If you check the resource list, you can see there are 17 new resources created for this cluster.
These are 1 virtual network, 2 virtual machines (1 each for the worker and the driver), 1 storage account (for meta information; this is a locked resource), 2 public IPs, 1 network security group, 4 network interfaces, and 6 disks (3 each for the worker and the driver).
Step 2. Create a table using an existing CSV file
We can go to the Data tab and click the Add Data button.
It will open a screen where we can browse for an existing CSV file on the local system. This is a very simple CSV file containing three columns (id, name, and age). I added 5 rows to this file.
The uploaded file will be stored in the FileStore/tables folder. Databricks uses its own file system, DBFS (Databricks File System), to store these files.
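Once a notebook is attached to the running cluster (we will create one in Step 3), you can even peek at the uploaded file directly with Spark SQL's file query syntax. Here is a minimal sketch; the path /FileStore/tables/sample_data.csv is a hypothetical example, so substitute the file name shown after your own upload.

-- Preview the raw CSV straight from DBFS (path is a hypothetical example)
SELECT * FROM csv.`/FileStore/tables/sample_data.csv`;

Because no options are passed here, the columns come back with generic names (_c0, _c1, _c2) and the header row appears as ordinary data; the table we create next handles both properly.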
Once our CSV is uploaded, we can click the Create Table with UI button. You can also choose the Create Table in Notebook option. A notebook is an editor where you can run your SQL, Python, R, and Scala code.
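If you prefer the Create Table in Notebook route, the same table can also be defined with a single SQL statement. The sketch below is only an illustration of that idea; the table name sample_data_csv and the file path are hypothetical, and the options mirror the header and schema-inference choices discussed in the next step.

-- Create a table over the uploaded CSV (table name and path are hypothetical examples)
CREATE TABLE sample_data_csv
USING CSV
OPTIONS (
  path '/FileStore/tables/sample_data.csv',
  header 'true',      -- treat the first row as column names
  inferSchema 'true'  -- let Spark detect the column types
);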
After clicking the Create Table with UI button, a table schema editor screen will appear.
Here, you can change the table name, column names, and even the column types. I chose the first row as the header and enabled the infer schema option. Since we are creating this table from our existing CSV file, I prefer these options. Please note that the UI also shows some sample data.
You can click the Create Table button.
Our table is created successfully, and it shows the sample data too.
Step 3. Query the data using a SQL notebook
We can query our data using a SQL notebook. Currently, there are 4 types of notebooks available in Azure Databricks: Python, Scala, R, and SQL. In this article, we use a SQL notebook.
Please select Workspace -> Create -> Notebook and choose SQL as the type.
You can give a valid name to your notebook and click the Create button.
We can execute all valid SQL statements in this notebook to query the data.
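For example, assuming the table was created as sample_data_csv (substitute the name you chose in Step 2), simple statements like the following can be run directly in a cell:

-- Return every row from the uploaded table (table name is a hypothetical example)
SELECT * FROM sample_data_csv;

-- Filter and sort using the columns from the CSV (id, name, age)
SELECT name, age
FROM sample_data_csv
WHERE age > 30
ORDER BY age DESC;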
By default, data is shown in raw table format, and if needed, you can view the data visually too. There are various plot options available in the notebook. The report below shows the same data as a bar chart.
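Aggregated results usually chart better than raw rows. Here is a small sketch, again assuming the hypothetical table name sample_data_csv, that produces one bar per age value when rendered as a bar chart:

-- Count how many people share each age; plot age on the x-axis and person_count on the y-axis
SELECT age, COUNT(*) AS person_count
FROM sample_data_csv
GROUP BY age
ORDER BY age;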
Other plot options are shown below.
I hope you got a basic overview of Azure Databricks workspace creation, cluster configuration, table creation, and querying the data using a SQL notebook. Please note that Spark is not meant for such simple queries; the aim of a multi-node cluster is to process heavy data volumes with high performance. We will cover more features of Azure Databricks in upcoming tutorials.
Please feel free to give your valuable feedback and queries in the comments.