Why Hadoop 2.7.1 Cluster?
- Apache released Hadoop 3.0.0 in December 2017. Instead of going with the latest version, we are using 2.7.1 because Hadoop 2.7.1 & 2.7.2 are the stable versions.
- Most of the Hadoop development is still happening on these versions.
- For further details about this Hadoop release, refer to the release notes - http://hadoop.apache.org/docs/r2.7.1/hadoop-project-dist/hadoop-common/releasenotes.html
Un-Tar the File
Go to the location where the Hadoop package was downloaded in the terminal, then execute the command below.
- tar xzvf hadoop-2.7.1.tar.gz
This command extracts all the files from the archive into a hadoop-2.7.1 folder.
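If the extraction succeeded, a hadoop-2.7.1 directory should now exist in the download location; a quick listing (assuming you are still in that directory) confirms it.
- ls hadoop-2.7.1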
Moving The Hadoop Folder to Our Location & Providing Ownership
- sudo mv hadoop-2.7.1 /usr/local/hadoop
- sudo chown -R raghav:hadoop /usr/local/hadoop - Gives the user ownership so it can access the contents.
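To confirm that the move and the ownership change took effect, the directory can be listed (the raghav:hadoop owner and group are the ones used in the chown command above; substitute your own user if it differs).
- ls -ld /usr/local/hadoop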
Setting Java Environment Variable
- echo 'export JAVA_HOME=/usr/lib/jvm/jdk1.8.0_71' >> /usr/local/hadoop/etc/hadoop/hadoop-env.sh
This appends JAVA_HOME to the Hadoop environment script (hadoop-env.sh) so that Hadoop uses the correct Java installation; the command above makes the edit in place. (I have already installed Java at that path.)
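To double-check the change, the appended line and the Java installation path can be verified as below (the jdk1.8.0_71 path is the one from my machine; adjust it to wherever your JDK is installed).
- grep JAVA_HOME /usr/local/hadoop/etc/hadoop/hadoop-env.sh
- ls /usr/lib/jvm/jdk1.8.0_71/bin/java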
Creating Name Node, Data Node Directories
- sudo mkdir -p /usr/local/hadoop_store/tmp
- sudo mkdir -p /usr/local/hadoop_store/hdfs/namenode
- sudo mkdir -p /usr/local/hadoop_store/hdfs/datanode
- sudo mkdir -p /usr/local/hadoop_store/hdfs/secondarynamenode
- sudo chown -R hduser:hadoop /usr/local/hadoop_store
We have created the directories for Hadoop temporary files, NameNode metadata, DataNode data and Secondary NameNode metadata.
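The resulting directory tree and its ownership can be verified before continuing.
- ls -lR /usr/local/hadoop_store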
Configurations
In order to get a single-node Hadoop cluster working properly, we have to modify the four XML configuration files listed below.
- mapred-site.xml
- core-site.xml
- hdfs-site.xml
- yarn-site.xml
Modifying mapred-site.xml
The mapred-site.xml file contains the configuration settings for the MapReduce daemons running on YARN.
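Note that the Hadoop 2.7.1 package ships only a template for this file (mapred-site.xml.template); if mapred-site.xml is not already present, it can be created from that template first.
- cp /usr/local/hadoop/etc/hadoop/mapred-site.xml.template /usr/local/hadoop/etc/hadoop/mapred-site.xml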
- sudo vi /usr/local/hadoop/etc/hadoop/mapred-site.xml - I have moved the extracted Hadoop directory to that path, so I am providing the absolute path of the mapred-site.xml file.
- Press the (I) key to enter insert mode, add the highlighted contents to the file, and then press :wq to save and quit.
- <configuration>
- <property>
- <name>mapreduce.framework.name</name>
- <value>yarn</value>
- </property>
- </configuration>
Screenshot
Modifying core-site.xml
- sudo vi /usr/local/hadoop/etc/hadoop/core-site.xml
- core-site.xml informs the Hadoop daemons where the NameNode runs in the cluster.
- It contains the configuration settings for Hadoop Core, such as I/O settings that are common to HDFS & MapReduce.
- <property>
- <name>hadoop.tmp.dir</name>
- <value>/usr/local/hadoop_store/tmp</value>
- <description>A base for other temporary directories.</description>
- </property>
- <property>
- <name>fs.default.name</name>
- <value>hdfs://localhost:9000</value>
- <description>
The name of the default file system. A URI whose scheme and authority determine the FileSystem implementation.
The uri's scheme determines the config property (fs.SCHEME.impl) naming the FileSystem implementation class.
The uri's authority is used to determine the host, port, etc. for a filesystem.
- </description>
- </property>
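After saving the file, the values that Hadoop actually picks up can be cross-checked with the getconf tool (this assumes the Hadoop bin directory is on your PATH, as it must be for the start scripts used later).
- hdfs getconf -confKey fs.default.name
- hdfs getconf -confKey hadoop.tmp.dir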
Modifying hdfs-site.xml
The hdfs-site.xml file contains the configuration settings for HDFS daemons.
- NameNode
- Secondary NameNode
- DataNodes.
- CheckPoint - the Secondary NameNode periodically merges the NameNode's edit log into the fsimage; this merged image is a checkpoint, and the dfs.namenode.checkpoint.dir and dfs.namenode.checkpoint.period properties below control where it is stored and how often it is taken.
In hdfs-site.xml we need to specify the default block replication (the number of replicas kept for each block across the DataNodes).
The actual replication factor can also be specified when a file is created; the default is used only if no replication is specified at create time (a per-file example is shown after the configuration block below).
In my cluster I have set the replication to 1; it can be changed based on your needs or requirements. Normally the replication should be greater than 1 in order to avoid data loss.
- sudo vi /usr/local/hadoop/etc/hadoop/hdfs-site.xml
- Modify hdfs-site.xml using the vi editor and add only the contents that appear between the <configuration> </configuration> tags.
- <configuration>
- <property>
- <name>dfs.replication</name>
- <value>1</value>
- <description>Default block replication. The actual number of replications can be specified when the file is created. The default is used if replication is not specified in create time.</description>
- </property>
- <property>
- <name>dfs.namenode.name.dir</name>
- <value>file:/usr/local/hadoop_store/hdfs/namenode</value>
- </property>
- <property>
- <name>dfs.datanode.data.dir</name>
- <value>file:/usr/local/hadoop_store/hdfs/datanode</value>
- </property>
- <property>
- <name>dfs.namenode.checkpoint.dir</name>
- <value>file:/usr/local/hadoop_store/hdfs/secondarynamenode</value>
- </property>
- <property>
- <name>dfs.namenode.checkpoint.period</name>
- <value>3600</value>
- </property>
- </configuration>
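As noted above, the dfs.replication value is only a default; once the cluster is running, the replication factor can be overridden per file at write time or changed afterwards. The file paths below are just placeholders for illustration.
- hdfs dfs -D dfs.replication=3 -put localfile.txt /user/raghav/localfile.txt
- hdfs dfs -setrep 3 /user/raghav/localfile.txt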
Modifying yarn-site.xml
- sudo vi /usr/local/hadoop/etc/hadoop/yarn-site.xml
- The yarn-site.xml file contains configuration information that overrides the default values for YARN parameters.
- <configuration>
- <!-- Site specific YARN configuration properties -->
- <property>
- <name>yarn.nodemanager.aux-services</name>
- <value>mapreduce_shuffle</value>
- </property>
- <property>
- <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
- <value>org.apache.hadoop.mapred.ShuffleHandler</value>
- </property>
- </configuration>
Name Node Format
- When we format the NameNode, it re-initializes the metadata that the NameNode keeps about the file system and the DataNodes. By doing that, all references to the data on the DataNodes are lost, and the DataNodes can be reused for new data.
- Normally the NameNode format is done only once, when the Hadoop cluster is first set up.
- hadoop namenode -format - Command to execute the NameNode format.
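The same operation is also available through the hdfs command, which is the non-deprecated form in Hadoop 2.x.
- hdfs namenode -format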
Running All The Processes
The commands differ depending on which type of node cluster we have configured and installed.
Single Node Cluster
To Run HDFS and YARN separately
- start-yarn.sh (ResourceManager and NodeManager)
- start-dfs.sh (NameNode, DataNode and SecondaryNameNode)
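The matching stop scripts bring the daemons down again when needed.
- stop-dfs.sh
- stop-yarn.sh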
MultiNode Cluster
- hadoop-daemons.sh start secondarynamenode
- hadoop-daemons.sh start namenode
- hadoop-daemons.sh start datanode
- yarn-daemon.sh start nodemanager
- yarn-daemon.sh start resourcemanager
- mr-jobhistory-daemon.sh start historyserver
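Each of these daemon scripts also accepts stop, so the corresponding daemons can be shut down in the same way, for example:
- hadoop-daemons.sh stop datanode
- yarn-daemon.sh stop nodemanager
- mr-jobhistory-daemon.sh stop historyserver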
How to Check All Daemons are Running?
jps - the Java Virtual Machine Process Status Tool; it lists the Java daemons running on the machine, so we can check that all of the following are up:
- NameNode
- Secondary NameNode
- DataNode
- ResourceManager
- NodeManager
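On a healthy single-node setup the jps output looks roughly like the sample below (the process IDs are only illustrative and will differ on your machine).
- 4866 NameNode
- 5024 DataNode
- 5235 SecondaryNameNode
- 5399 ResourceManager
- 5517 NodeManager
- 5688 Jps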
Screenshot
How to Browse through the Namenode web UI to fetch information about NameNode & DataNode?