To learn how to create an HDInsight Spark cluster in the Microsoft Azure portal, please refer to part one of my article. After creating the Spark cluster named suketucluster.azurehdinsight.net, I have highlighted the URL of my cluster.
A total of 4 nodes are created, 2 head nodes and 2 worker nodes, for a total of 16 cores. The available capacity is 60 cores; 16 of them are in use and 44 remain for scaling up. From here you can also open the Cluster Dashboard and Ambari Views, and scale the size of the cluster.
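The core figures above can be checked with a few lines of Python. This is only a sketch of the arithmetic; the class name `CoreUsage` is made up for illustration, and the even split of cores across nodes is an assumption, not something the portal reports here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CoreUsage:
    """Hypothetical helper holding the core figures quoted in the text."""
    quota: int   # total cores available to the subscription (60 in the text)
    used: int    # cores consumed by the current cluster (16 in the text)
    nodes: int   # number of nodes in the cluster (4 in the text)

    @property
    def remaining(self) -> int:
        # Cores still free for scaling up.
        return self.quota - self.used

    @property
    def cores_per_node(self) -> float:
        # Assumes every node has the same size, which may not hold in practice.
        return self.used / self.nodes


usage = CoreUsage(quota=60, used=16, nodes=4)
print(usage.remaining)       # 44
print(usage.cores_per_node)  # 4.0
```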
Apache Ambari provides management and monitoring of Hadoop clusters through a web UI and REST services. It is used to provision, monitor and manage clusters, and to change their configuration, in an easier way. With Ambari you can manage the security setup centrally and get full visibility into cluster health. The Ambari dashboard looks like this:
Using the Ambari dashboard you can manage and configure services, hosts, alerts for critical conditions, and so on. Many services are also integrated into the Ambari web UI. Below is the Hive query editor in Ambari:
You can write, run and process Hive queries in the Ambari web UI, convert the results into charts, save queries, and manage your query history.
The snapshot above lists the services available in Ambari, and below is the HDInsight SuketuSpark clients list.
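Besides the web UI, Ambari's REST services can be called from a script. The sketch below only builds the request for listing a cluster's services and does not send it. It assumes the usual HDInsight layout in which the Ambari API lives under `/api/v1/clusters/CLUSTERNAME` and uses HTTP basic authentication with the cluster login; the function name and the credentials are placeholders.

```python
import base64
import urllib.request


def ambari_services_request(cluster_name: str, user: str, password: str) -> urllib.request.Request:
    """Build (but do not send) a GET request for the cluster's service list."""
    # Assumed URL layout for Ambari on HDInsight; adjust if your cluster differs.
    url = (
        f"https://{cluster_name}.azurehdinsight.net"
        f"/api/v1/clusters/{cluster_name}/services"
    )
    # HTTP basic authentication: base64 of "user:password".
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    request = urllib.request.Request(url, method="GET")
    request.add_header("Authorization", f"Basic {token}")
    return request


# Placeholder credentials; replace them with the cluster login you chose.
req = ambari_services_request("suketucluster", "admin", "secret")
print(req.get_method(), req.full_url)
```

To actually call the service you would pass `req` to `urllib.request.urlopen` and parse the JSON body of the response.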
In a new browser tab you can type https://CLUSTERNAME.azurehdinsight.net/jupyter/tree, or you can click the Jupyter logo in the Azure portal, to open the Jupyter notebook. The Jupyter Notebook is a web application that allows you to create and share documents that contain live code, equations, visualizations and explanatory text. Uses include data cleaning and transformation, numerical simulation, statistical modeling, machine learning and much more. Jupyter and Zeppelin are the two notebooks integrated with HDInsight.
You can use Jupyter notebook to run Spark SQL queries against the Spark cluster. HDInsight Spark clusters provide two kernels that you can use with the Jupyter notebook.
- PySpark (for applications written in Python)
- Spark (for applications written in Scala)
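As a rough illustration of running a Spark SQL query from the PySpark kernel, the sketch below builds a simple query and hands it to a session object. It assumes the kernel already provides a session (`spark` on newer cluster versions, `sqlContext` on older ones) and that a table such as `hivesampletable` exists on the cluster; both are assumptions to check against your own cluster, and the helper names are made up.

```python
def top_rows_query(table: str, limit: int = 10) -> str:
    """Build a simple Spark SQL statement returning the first `limit` rows."""
    # Accept only plain identifiers so arbitrary SQL cannot be smuggled in.
    if not table.isidentifier():
        raise ValueError(f"unexpected table name: {table!r}")
    if limit <= 0:
        raise ValueError("limit must be positive")
    return f"SELECT * FROM {table} LIMIT {limit}"


def run_query(session, table: str, limit: int = 10):
    """Run the query through any object exposing .sql(), e.g. a SparkSession."""
    return session.sql(top_rows_query(table, limit))


print(top_rows_query("hivesampletable", 5))
```

In a notebook cell this would be used as `run_query(spark, "hivesampletable").show()`, where `spark` is the session supplied by the kernel.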
PySpark is the Python binding for the Spark platform and API, and is not much different from the Java/Scala versions. Learning Scala is arguably a better choice than Python: Scala is a functional language, which makes it easier to parallelize code, a great feature when working with big data.
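The link between functional style and parallelism can be seen even in a small example: a pure function with no shared state can be mapped over the data by several workers and the partial results combined afterwards. The sketch below uses Python only to keep the examples in one language; the sample lines are invented, and Python threads here illustrate the structure rather than real CPU parallelism.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def count_words(line: str) -> int:
    """Pure function: the result depends only on the input line."""
    return len(line.split())


# Immutable sample data, invented for illustration.
lines = (
    "spark runs on hdinsight",
    "scala and python kernels",
    "ambari monitors the cluster",
)

# Map step: each line is processed independently, so the work can be
# handed to separate workers without any locking.
with ThreadPoolExecutor(max_workers=2) as pool:
    per_line = list(pool.map(count_words, lines))

# Reduce step: combine the partial results into a single total.
total = reduce(lambda a, b: a + b, per_line, 0)
print(per_line, total)
```

Spark applies the same map-then-reduce idea, but distributes the work across the nodes of the cluster instead of local threads.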
Like Java, Scala is object-oriented, and uses a curly-brace syntax reminiscent of the C programming language. Unlike Java, Scala has many features of functional programming languages like Scheme, Standard ML and Haskell, including currying, type inference, immutability, lazy evaluation, and pattern matching.
When you type https://CLUSTERNAME.azurehdinsight.net/zeppelin, or click the Zeppelin icon in the Azure portal, the Zeppelin notebook opens in a new browser tab. Below is a snapshot of that.
Zeppelin is a web-based notebook that enables interactive data analytics. You can make beautiful data-driven, interactive and collaborative documents with SQL, Scala and more.
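Both notebook addresses mentioned in this article follow the same pattern, so they can be derived from the cluster name. A minimal sketch, with a made-up helper name:

```python
def notebook_urls(cluster_name: str) -> dict:
    """Return the Jupyter and Zeppelin addresses for an HDInsight Spark cluster."""
    base = f"https://{cluster_name}.azurehdinsight.net"
    return {
        "jupyter": f"{base}/jupyter/tree",
        "zeppelin": f"{base}/zeppelin",
    }


urls = notebook_urls("suketucluster")
print(urls["jupyter"])
print(urls["zeppelin"])
```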