To see how to create an HDInsight Spark cluster in the Microsoft Azure Portal, please refer to part 1 of my article. To understand the HDInsight Spark Linux cluster, Apache Ambari, and notebooks like Jupyter and Zeppelin, please refer to my article about them. In this article, we will learn to run interactive Spark SQL queries on an Apache Spark HDInsight Linux cluster.
To open Jupyter Notebook, use Google Chrome (suggested by Jupyter), go to https://CLUSTERNAME.azurehdinsight.net/jupyter/tree, and enter the cluster credentials. Jupyter Notebook offers two kernels: 1. PySpark and 2. Spark. In this article, we use the PySpark kernel to run Spark SQL queries on Big Data.
In this kernel, there is no need to set the Spark and Hive contexts, as they are set automatically. We can use the %%sql cell magic to run SQL and Hive queries directly on the cluster. When you create a new notebook, an untitled notebook opens; you need to rename it, and it will be auto-saved. To start, you need to import the SQL types. Write the following command and click the Play (Run the Job) button.
from pyspark.sql.types import *
When you run the job, the kernel will be busy for a few seconds.
Now, it's time to load the data into a temporary table named hvac, as seen in the above screenshot. When we created the cluster, the sample data was made available in the storage account at the path:
/HdiSamples/HdiSamples/SensorSampleData/hvac. The command is shown in the snapshot. We need to load the data into hvacText, create the schema, parse the data, create a data frame, and register the data frame as a table so that we can run queries against the hvac table.
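For reference, here is a minimal sketch of that loading code, modeled on the standard HDInsight HVAC sample; the file name (HVAC.csv), the column positions, and the preconfigured sc and sqlContext objects are assumptions based on that sample.

# Load the raw CSV from the cluster's default storage account (path assumed from the sample data)
hvacText = sc.textFile("wasbs:///HdiSamples/HdiSamples/SensorSampleData/hvac/HVAC.csv")

# Create the schema for the hvac table
hvacSchema = StructType([
    StructField("date", StringType(), False),
    StructField("time", StringType(), False),
    StructField("targettemp", IntegerType(), False),
    StructField("actualtemp", IntegerType(), False),
    StructField("buildingID", StringType(), False)
])

# Parse the CSV lines, skip the header row, and keep only the columns we need
hvac = hvacText.map(lambda s: s.split(",")) \
               .filter(lambda s: s[0] != "Date") \
               .map(lambda s: (str(s[0]), str(s[1]), int(s[2]), int(s[3]), str(s[6])))

# Create a data frame and register it as a temporary table named hvac
hvacdf = sqlContext.createDataFrame(hvac, hvacSchema)
hvacdf.registerTempTable("hvac")

Note that registerTempTable is the Spark 1.x API; on a Spark 2.x cluster, createOrReplaceTempView is the equivalent call.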
After running the job, the kernel will be busy for a few seconds. In Spark, a DataFrame is a distributed collection of data organized into named columns, like a table in an RDBMS. An RDD (Resilient Distributed Dataset) is the fundamental data structure of Spark: an immutable, distributed collection of objects, where each dataset is divided into logical partitions that can be computed on different nodes of the cluster.
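To see the difference in practice, here is a small sketch, assuming the hvacdf data frame created above:

# A data frame exposes named columns and a schema, like an RDBMS table
hvacdf.printSchema()
hvacdf.select("buildingID", "targettemp", "actualtemp").show(5)

# The underlying RDD is an immutable, partitioned collection of rows;
# each partition can be processed on a different worker node
print(hvacdf.rdd.getNumPartitions())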
Now, you can run the queries directly on the table. Sample query:
%%sql
SELECT buildingID, (targettemp - actualtemp) AS temp_diff, date FROM hvac WHERE date = "6/1/13"
Once the job is successfully completed, the output will be visible in the form of a table.
Cell by cell, you can run queries to fetch meaningful output from millions of records of big data within a few seconds, because Apache Spark processes the data in memory, which makes it much faster than disk-based processing. You will also find ready-made chart buttons, such as Pie Chart, Scatter, Line Chart, Area Chart, and Bar Chart. You can set the X and Y axes to generate the charts as shown below. After finishing the job, you can shut down the notebook to release the resources.
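For example, an aggregate query like the following (an illustrative query, not part of the original walkthrough) produces per-building averages that plot nicely with the Bar or Pie chart buttons:

%%sql
SELECT buildingID, AVG(targettemp - actualtemp) AS avg_temp_diff
FROM hvac
GROUP BY buildingID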
This scenario assumes that you have installed sensors which keep collecting temperature information in a building. The sensors are attached to an IoT kit such as a Raspberry Pi and, through UWP apps, the Raspberry Pi keeps collecting temperature readings via its GPIO pins and sends the data to Azure Storage.
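As a rough, purely hypothetical illustration of that ingestion side (using the azure-storage-blob Python package; the connection string, container, and blob names are placeholders), each reading could be appended to a blob in the same storage account:

from datetime import datetime
from azure.storage.blob import BlobClient

# Placeholder connection details for the storage account used by the cluster
blob = BlobClient.from_connection_string(
    conn_str="<storage-account-connection-string>",
    container_name="sensordata",
    blob_name="hvac/readings.csv")

if not blob.exists():
    blob.create_append_blob()

# One CSV line per reading: timestamp, target temp, actual temp, building ID
reading = "{0},{1},{2},{3}\n".format(datetime.utcnow().isoformat(), 67, 72, "HQ")
blob.append_block(reading)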
We have millions of records collected in Azure Storage (Big Data), and we can use that storage account with HDInsight Spark to run queries on the data for meaningful output. We can also integrate the result of the Hive query into Azure Machine Learning Studio to predict the temperature of buildings (see the export sketch below). In the next article, we will learn how to use HDInsight Spark on website logs.
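One way to hand the query result to Azure Machine Learning Studio is to write it back to the storage account as CSV and import that file there. A minimal sketch, assuming a Spark 2.x cluster and a hypothetical output path:

# Run the aggregate query and persist the result to the default storage account
result = sqlContext.sql(
    "SELECT buildingID, AVG(targettemp - actualtemp) AS avg_temp_diff "
    "FROM hvac GROUP BY buildingID")

result.coalesce(1).write.mode("overwrite") \
      .option("header", "true") \
      .csv("wasbs:///example/output/hvac_temp_diff")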