Please use this URL to sign up for a free account.
Enter your personal details and click the “Sign Up” button.
After a few seconds, your account will be created.
You will receive a notification mail shortly; please check your inbox.
You must verify your account by clicking the link provided in the mail. It will redirect you to a password reset screen. Please set a password for your account.
Now, our Databricks account is ready to use.
Step 2 - Cluster Creation
To create a new cluster, you can use the “Clusters” menu in the left pane of the dashboard or the “New Cluster” option under “Common Tasks” on the dashboard.
Please note that we are using the free Community Edition of Databricks, which provides only a single driver node with no worker nodes, and the driver memory is limited to 6.0 GB. For our testing and educational purposes, this is more than enough. Later, we can upgrade this account to the enterprise edition.
Cluster creation will take some time to finish; for now, the cluster shows a “Pending” state.
After some time, our cluster will be ready.
Step 3 - Table Creation
We can create a table now. We are going to create the table from an existing CSV file. First, we upload the CSV file from our local system to DBFS (Databricks File System). I am using a sample CITY_LIST.CSV file, which contains sample data for some cities in India.
Please click the “Data” menu and then the “Add Data” button.
You can browse to the CSV file on your local system and upload it.
Now, our file is saved in the “/FileStore/tables” folder of the DBFS file system.
We can use either the “Create Table with UI” option or the “Create Table in Notebook” option to create the new table. Both are easy to use; I am going with the “Create Table with UI” option. If you want to familiarize yourself with Python/Scala commands, go with the second option.
Please choose the cluster name from the drop-down, click the “Preview Table” button, and choose the table attributes. I selected the “First row is header” option.
Please click the “Create Table” button. Our table will be created shortly.
Step 4 - Read data using a SQL notebook
We can read the data from the table using a SQL notebook.
To create a notebook, please click the “Workspace” menu, then click “Create” and choose “Notebook”.
Currently, Databricks supports four types of notebooks (Python, Scala, SQL, and R). In this article, we will use a SQL notebook.
Please give a valid name to your notebook and click the “Create” button.
Now our notebook is created, and we can write a query to get the data from the table.
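A minimal read query looks like the one below; it assumes the table was created with the name “city_list”, the name used later in this article.

SELECT * FROM city_list;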
The above SQL statement is a simple read statement and will return all the records from the table. You can press “Shift+Enter” to run the cell and get the result.
You can also get the result in graph format, and it is easy to customize the plot options.
We have created a table from an existing CSV file and read the records using a SQL notebook. Please note that you can’t update any record in this table; if you try to update it, you will get an error message.
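For example, an update statement like the one below will fail on this table (the “city” column name is an assumption based on the sample city data):

-- fails on a plain CSV-backed table; column name is assumed
UPDATE city_list SET city = 'Kollam' WHERE city = 'Quilon';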
We created this table from a CSV file, so it does not support ACID transactions. To avoid this issue, we can use the Databricks Delta feature.
Databricks Delta delivers a powerful transactional storage layer by harnessing the power of Apache Spark and Databricks DBFS. It supports ACID transactions.
Step 5 - Create a table with the Delta feature
Now, we can create a table with the same structure as our previous “city_list” table.
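A sketch of the create statement is given below. The new table name “city_list_delta” and the column names are assumptions for illustration; in practice, the columns should match those of the CSV file.

-- table name and columns are assumed for illustration
CREATE TABLE city_list_delta (
  city STRING,
  state STRING
)
USING DELTA;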
We can insert data into this table from our existing table using the below statement.
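Assuming the Delta table is named “city_list_delta” as in the sketch above, the insert looks like this:

-- copy all rows from the CSV-backed table into the Delta table
INSERT INTO city_list_delta SELECT * FROM city_list;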
We can verify the record count in the new table.
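A simple count query serves the purpose (table name assumed as above):

SELECT COUNT(*) FROM city_list_delta;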
We got a total of 564 records from the new table, the same as the previous table. This confirms that all the existing data were successfully inserted into the new table.
We can now try the “UPDATE” statement again, this time on the new table.
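The statement below sketches such an update; the “city” column name and the table name are assumed as before:

-- change the city name 'Quilon' to 'Kollam' (column name assumed)
UPDATE city_list_delta SET city = 'Kollam' WHERE city = 'Quilon';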
This time, we did not get any error message, which means our record was successfully updated. We can verify it with the below statement.
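For example (table and column names assumed as before):

SELECT * FROM city_list_delta WHERE city = 'Kollam';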
Previously, the value was “Quilon”; now it is changed to “Kollam”. Our update query worked successfully.
Now, we can check the “DELETE” statement as well.
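A delete statement along these lines removes the record we just updated (table and column names assumed as before):

DELETE FROM city_list_delta WHERE city = 'Kollam';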
We can verify the deletion with the below statement.
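Running the same count query again shows the new total:

SELECT COUNT(*) FROM city_list_delta;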
Now, we get only 563 records; previously, there were 564. One record was successfully deleted from our table.
We have now seen all the CRUD operations on this Delta table. Please note that the free Databricks Community Edition has some limitations; we can’t use all the features of Databricks.
In this article, we have seen the steps for creating a free Databricks Community Edition account. We created a normal table from an existing CSV file and later created a table with Delta support. The Delta table supports all CRUD operations and ACID features.