Give a name to your Spark cluster and then configure the cluster type. Please select "Spark" as the cluster type and choose the default version; here, the version is Spark 2.2.0.
Now, we can choose the default storage account. Every HDInsight cluster requires a storage account. If you do not have a storage account, please choose the "Create New" option and a new storage account will be created along with the cluster. I already have a storage account, so I selected it. You can add multiple storage accounts to this cluster, and Spark can process data only from the storage accounts associated with it. Please note that a default blob container will be created automatically in the default storage account.
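To see what this association means in code, here is a minimal sketch of the wasb paths Spark uses to address blobs. It assumes a SparkSession named sparkSession like the one we create in Step 3; the container name mycontainer and the account name mystorageaccount are placeholders, not names from this walkthrough.

// File in the default container of the default storage account (shorthand path)
val fromDefault = sparkSession.read.csv("wasb:///sparksample/all_india_PO.csv")

// File in any container of another storage account associated with the cluster
// (fully qualified path; container and account names are placeholders)
val fromOther = sparkSession.read.csv("wasb://mycontainer@mystorageaccount.blob.core.windows.net/sparksample/all_india_PO.csv")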
Next, we can choose the cluster size. By default, it is 4 worker nodes. A Spark cluster must have a driver node and can have multiple worker nodes. Since we are using this cluster only for testing purposes, I chose just 1 worker node. The cost varies depending on the number of worker nodes we use.
We can choose the node size now. I am opting for the D12 v2 node size, which has 4 cores and 28 GB of RAM per node, with a 200 GB local SSD. This is enough for our testing purposes.
We now get the cluster summary. We have 2 driver (head) nodes and 1 worker node.
Please click the "Create" button. It will take at least 15 to 20 minutes to set up the cluster, depending on network traffic. Once the cluster is created successfully, please go to the cluster dashboard to see all the details about it.
Step 2 - Upload CSV file to default container in default storage account
In this article, we will process the data from a postal code CSV file, which we have already downloaded. We can upload this CSV file to the storage account that is already associated with our Spark cluster. Please use the Storage Explorer feature (currently in preview) to upload the CSV file.
Please choose the default container associated with our Spark cluster and create a new virtual directory.
We will upload the CSV file to this directory. Please use the SAS (Shared Access Signature) authentication type to upload the file.
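Once the file is uploaded, a quick way to confirm that Spark can read it is to load it back and print the schema from a spark-shell session on the cluster. This is just an optional sanity check, a minimal sketch that assumes the virtual directory is named sparksample (the same path used in the Step 3 code); spark is the SparkSession that spark-shell provides automatically.

// Optional check from spark-shell on the cluster:
// read the uploaded CSV back from the default container and inspect it
val check = spark.read.option("header", true).csv("wasb:///sparksample/all_india_PO.csv")
check.printSchema()
println(s"Row count: ${check.count()}")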
Step 3 - Run Spark application in HDInsight Spark cluster using IntelliJ IDEA
We can open the IntelliJ project which we created earlier. Please refer to my previous C# Corner article for creating an IntelliJ project with Spark and Scala. There, we already saw how to get data from a CSV file in a local environment. Please click the "Tools" menu and choose "Azure" to sign in. This will open a popup screen; please sign in with your Azure credentials.
We can make a small change to our Scala object file "indianpincode.scala". Please replace the existing code with the code below.
package sample

import org.apache.spark.sql.SparkSession

object indianpincode {
  def main(arg: Array[String]): Unit = {
    // No .master() is set here: on the HDInsight cluster, getOrCreate()
    // picks up the existing Spark configuration
    val sparkSession: SparkSession = SparkSession.builder
      .appName("Scala Spark Example")
      .getOrCreate()

    // Read the postal code CSV from the default blob container (wasb:/// shorthand)
    val csvPO = sparkSession.read
      .option("inferSchema", true)
      .option("header", true)
      .csv("wasb:///sparksample/all_india_PO.csv")

    // Register a temporary view so the data can be queried with Spark SQL
    csvPO.createOrReplaceTempView("tabPO")

    // Count post offices per state and show the top 50 states
    sparkSession.sql("select statename as StateName, count(*) as TotalPOs from tabPO group by statename order by count(*) desc").show(50)
  }
}
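For readers who prefer the DataFrame API over Spark SQL, the same aggregation can also be written without registering a temporary view. This is an equivalent sketch, not part of the original sample:

// Equivalent aggregation using the DataFrame API instead of Spark SQL
import org.apache.spark.sql.functions.desc

csvPO.groupBy("statename")
  .count()
  .withColumnRenamed("count", "TotalPOs")
  .orderBy(desc("TotalPOs"))
  .show(50)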
Step 4 - Delete Spark Cluster after usage
We have successfully executed our Scala code in the Spark cluster and processed data from the CSV file located in blob storage. Now we can delete the cluster. It is very important to delete the cluster after usage; otherwise, we will keep paying for it. Please click the "Delete" button.
You must confirm the deletion by typing the Spark cluster name in the given box.
In this article, we saw how to create an HDInsight Spark cluster in the Azure portal and uploaded a postal code CSV file to the blob storage account already associated with the Spark cluster. Later, we connected Azure with IntelliJ IDEA and executed a Spark job from IntelliJ. We also saw the job result on the Ambari portal.