HDInsight makes Apache Hadoop available as a service in the cloud. It makes the MapReduce software framework available in a simpler, more scalable, and more cost-efficient Azure environment. HDInsight also provides a cost-efficient approach to managing and storing data, using Azure Blob storage. In this article, you will provision a Hadoop cluster in HDInsight using the Azure Management Portal, submit a Hadoop MapReduce job using PowerShell, and then import the MapReduce job output data into Excel for examination.
The HDInsight provisioning process requires an Azure Storage account to be used as the default file system. The storage account must be located in the same data center as the HDInsight compute resources. Currently, you can provision HDInsight clusters only in the following data centers:
- Southeast Asia
- North Europe
- West Europe
- East US
- West US
You must choose one of these five data centers for your Azure Storage account.
To create an Azure Storage account:
- Sign in to the Azure Management Portal.
- Click "NEW" on the lower left corner, point to DATA SERVICES, point to STORAGE, and then click QUICK CREATE.
- Enter URL, LOCATION, and REPLICATION, and then click CREATE STORAGE ACCOUNT. Affinity groups are not supported. You will see the new storage account in the storage list.
- Wait until the STATUS of the new storage account changes to Online.
- Click the new storage account in the list to select it.
- Click MANAGE ACCESS KEYS at the bottom of the page.
- Make a note of the STORAGE ACCOUNT NAME and the PRIMARY ACCESS KEY. You will need them later in the tutorial.
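If you prefer scripting to the portal, the storage account can also be created from Azure PowerShell. The following is a minimal sketch using the same Service Management cmdlets as the rest of this tutorial; the subscription name, account name, and location are placeholders you must replace:
- # Create the storage account in one of the supported data centers
- Select-AzureSubscription "<SubscriptionName>"
- New-AzureStorageAccount -StorageAccountName "<StorageAccountName>" -Location "East US"
- # Retrieve the primary access key; you will need it later in the tutorial
- Get-AzureStorageKey "<StorageAccountName>" | %{ $_.Primary }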
Provisioning an HDInsight 3.0 cluster is currently supported only through the custom create option.
To provision an HDInsight cluster:
- Sign in to the Azure Management Portal.
- Click HDINSIGHT on the left to list the HDInsight clusters under your account.
- Click NEW on the lower left side, click DATA SERVICES, click HDINSIGHT, and then click CUSTOM CREATE.
- From the Cluster Details tab, enter or select the following values.
- Click the right arrow in the bottom right corner to configure cluster user.
- From the Configure Cluster user tab, enter User Name and Password for the HDInsight cluster user account. In addition to this account, you can create an RDP user account after the cluster is provisioned, so you can remote desktop into the cluster.
- Click the right arrow in the bottom right corner to configure the storage account.
- From the Storage Account tab, enter or select the following values:
- Click the check icon in the bottom right corner to create the cluster. When the provisioning process completes, the status column will show Running.
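As an alternative to the portal, the cluster can also be provisioned from Azure PowerShell. The following is a minimal sketch, assuming the New-AzureHDInsightCluster cmdlet and parameters from the same Service Management module used later in this tutorial; the names, key, location, and cluster size are placeholders:
- # Provision the cluster, using the storage account as its default file system
- $creds = Get-Credential    # the HDInsight cluster user name and password
- New-AzureHDInsightCluster -Name "<HDInsightClusterName>" -Location "East US" -Version "3.0" -DefaultStorageAccountName "<StorageAccountName>.blob.core.windows.net" -DefaultStorageAccountKey "<PrimaryAccessKey>" -DefaultStorageContainerName "<ContainerName>" -ClusterSizeInNodes 2 -Credential $creds
The location must match the data center of the storage account.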
Run a WordCount MapReduce job
Now that you have provisioned an HDInsight cluster, the next step is to run a MapReduce job to count the words in a text file.
Running a MapReduce job requires the following elements:
1. A MapReduce program
In this tutorial, you will use the WordCount sample that comes with the HDInsight cluster distribution, so you don't need to write your own. It is located at /example/jars/hadoop-mapreduce-examples.jar.
2. An input file
You will use /example/data/gutenberg/davinci.txt as the input file.
3. An output file folder
You will use /example/data/WordCountOutput as the output folder. The folder is created at run time if it doesn't exist. Note that a Hadoop MapReduce job fails if its output folder already exists, so remove the folder before rerunning the job.
The URI scheme for accessing files in Blob storage is wasb://. A file in the cluster's default container can be addressed with a relative URI of the form wasb:///<path>, while the fully qualified form is wasb://<containername>@<accountname>.blob.core.windows.net/<path>.
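For example, both of the following URIs refer to this tutorial's input file; the first is relative to the default container, and the second is fully qualified (the container and account names are placeholders):
- wasb:///example/data/gutenberg/davinci.txt
- wasb://<ContainerName>@<StorageAccountName>.blob.core.windows.net/example/data/gutenberg/davinci.txt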
To run the WordCount sample:
- Open an Azure PowerShell console window, and run the following commands to set the variables.
- $subscriptionName = "<SubscriptionName>"
- $clusterName = "<HDInsightClusterName>"
- Run the following commands to create a MapReduce job definition.
- # Define the MapReduce job
- $wordCountJobDefinition = New-AzureHDInsightMapReduceJobDefinition -JarFile "wasb:///example/jars/hadoop-mapreduce-examples.jar" -ClassName "wordcount" -Arguments "wasb:///example/data/gutenberg/davinci.txt", "wasb:///example/data/WordCountOutput"
The hadoop-mapreduce-examples.jar file comes with the HDInsight cluster distribution. The MapReduce job takes two arguments: the source file name and the output file path. The source file comes with the HDInsight cluster distribution, and the output folder is created at run time.
Run the following commands to submit the MapReduce job.
- # Submit the job
- Select-AzureSubscription $subscriptionName
- $wordCountJob = Start-AzureHDInsightJob -Cluster $clusterName -JobDefinition $wordCountJobDefinition
In addition to the MapReduce job definition, you must also provide the name of the HDInsight cluster where you want to run the job.
Start-AzureHDInsightJob is an asynchronous call. To wait for the job to finish, use the Wait-AzureHDInsightJob cmdlet.
Run the following command to wait for completion of the MapReduce job.
- Wait-AzureHDInsightJob -Job $wordCountJob -WaitTimeoutInSeconds 3600
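Wait-AzureHDInsightJob returns the job object, so if you capture the return value you can inspect the outcome directly. A small sketch, assuming the State and ExitCode properties on the returned job object:
- # Capture the returned job object and inspect its outcome
- $wordCountJob = Wait-AzureHDInsightJob -Job $wordCountJob -WaitTimeoutInSeconds 3600
- $wordCountJob.State      # Completed on success
- $wordCountJob.ExitCode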
Run the following commands to check for any errors in running the MapReduce job.
- # Get the job output
- Get-AzureHDInsightJobOutput -Cluster $clusterName -JobId $wordCountJob.JobId -StandardError
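The same cmdlet can also fetch the job's standard output rather than its standard error; swapping the switch is enough:
- # Get the job's standard output instead
- Get-AzureHDInsightJobOutput -Cluster $clusterName -JobId $wordCountJob.JobId -StandardOutput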
To retrieve the results of the MapReduce job, open Azure PowerShell.
Run the following commands to create a C:\Tutorials folder, and change directory to the folder.
- mkdir \Tutorials
- cd \Tutorials
The default Azure PowerShell directory is C:\Windows\System32\WindowsPowerShell\v1.0. By default, you don't have write permission to this folder. You must change to a folder where you have write permission.
Set the three variables in the following commands, and then run them.
- $subscriptionName = "<SubscriptionName>"
- $storageAccountName = "<StorageAccountName>"
- $containerName = "<ContainerName>"
The Azure Storage account is the one you created earlier in the tutorial. The storage account is used to host the Blob container that is used as the default HDInsight cluster file system. The Blob storage container name usually shares the same name as the HDInsight cluster unless you specify a different name when you provision the cluster.
Run the following commands to create an Azure storage context object.
- # Create the storage account context object
- Select-AzureSubscription $subscriptionName
- $storageAccountKey = Get-AzureStorageKey $storageAccountName | %{ $_.Primary }
- $storageContext = New-AzureStorageContext -StorageAccountName $storageAccountName -StorageAccountKey $storageAccountKey
The Select-AzureSubscription cmdlet sets the current subscription, in case you have multiple subscriptions and the default subscription is not the one you want to use.
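Before downloading, you can verify that the job actually wrote its output by listing the blobs under the output folder. A minimal sketch using Get-AzureStorageBlob with a prefix filter:
- # List the blobs under the MapReduce output folder
- Get-AzureStorageBlob -Container $containerName -Context $storageContext -Prefix "example/data/WordCountOutput"
You should see the part-r-00000 blob in the listing.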
Run the following commands to download the MapReduce job output from the Blob container to the workstation.
- # Download the job output to the workstation
- Get-AzureStorageBlobContent -Container $containerName -Blob example/data/WordCountOutput/part-r-00000 -Context $storageContext -Force
The example/data/WordCountOutput folder is the output folder you specified when you ran the MapReduce job, and part-r-00000 is the default file name for MapReduce job output. The file is downloaded into the same folder structure under the local folder. For example, because the current folder is C:\Tutorials, the file is downloaded to C:\Tutorials\example\data\WordCountOutput\.
Run the following command to display the MapReduce job output file, filtered to lines that contain "there".
- cat ./example/data/WordCountOutput/part-r-00000 | findstr "there"
The MapReduce job produces a file named part-r-00000 containing the words and their counts. The script uses the findstr command to list all of the words that contain "there".
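Because each line of part-r-00000 is a word and its count separated by a tab (the default Hadoop text output layout), you can also process the file with native PowerShell instead of findstr. A small sketch that lists the ten most frequent words:
- # Parse the tab-separated word/count pairs and show the ten most frequent words
- Get-Content .\example\data\WordCountOutput\part-r-00000 |
-     ForEach-Object { $word, $count = $_ -split "`t"; [pscustomobject]@{ Word = $word; Count = [int]$count } } |
-     Sort-Object Count -Descending |
-     Select-Object -First 10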
That's it. I hope this article helps you get started with running MapReduce jobs on a Hadoop cluster in Azure HDInsight.