Installation of Apache Hadoop 3.2
1. Prerequisites
2. Preparation
Check that the prerequisites are met by executing the steps below for verification.
$ lsb_release -a
Make sure the version is Ubuntu 16.04 or later.
Next, check the Java version and install Java if required.
$ java -version
If the installed version is older than 1.8, install a newer one.
$ sudo apt install openjdk-8-jdk -y
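Hadoop also needs to know where Java is installed (JAVA_HOME). As a small sketch, one common way to locate the JDK path on Ubuntu, assuming the OpenJDK 8 package installed above, is:
$ readlink -f /usr/bin/java | sed "s:/bin/java::"
The resulting path can be used wherever JAVA_HOME is required, for example in $HADOOP_HOME/etc/hadoop/hadoop-env.sh.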
Now, let's check the hardware and make sure it meets the prerequisites.
$ free -g
$ df -h
Make sure there is enough memory and storage space for installing Apache Hadoop 3.2.
3. Installing Hadoop
Hadoop can be installed either by unpacking the distribution on every machine in the cluster or by installing the packages from a repository. In this setup, we will configure one machine as the NameNode (master) and the other two servers in the cluster as DataNodes (workers).
To download and unpack the relevant version of Hadoop from the Apache archive, use the commands below.
$ wget https://archive.apache.org/dist/hadoop/common/hadoop-3.2.1/hadoop-3.2.1.tar.gz
$ tar xzf hadoop-3.2.1.tar.gz
Alternatively, we can install the packages from the repository.
$ sudo apt-get install hadoop hadoop-hdfs libhdfs0 hadoop-yarn hadoop-client openssl
Before configuring Hadoop, it is recommended to set up connectivity between the machines in the cluster.
4. Cluster Configuration
All of the steps below need to be executed on all the servers (both master and workers) in the cluster. They create a dedicated hadoopuser account, give it ownership of the Hadoop installation directory, and generate an SSH key pair.
$ sudo adduser hadoopuser
$ sudo usermod -aG hadoopuser hadoopuser
$ sudo chown hadoopuser:root -R /usr/local/hadoop/
$ sudo chmod g+rwx -R /usr/local/hadoop/
$ sudo adduser hadoopuser sudo
$ ssh-keygen -t rsa -P "<yourkeypassword>" -f ~/.ssh/id_rsa
Now append the public key to the authorized keys, set the correct permissions, and verify that SSH works locally.
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
$ chmod 0600 ~/.ssh/authorized_keys
$ ssh localhost
Next, the public key has to be copied to the other servers in the cluster. There are multiple ways to copy the file.
Using SCP
$ scp ~/.ssh/id_rsa.pub remoteuser@remoteserver:/remote/folder/
Using RSYNC
$ rsync -av --delete -e "ssh" /path/to/source remoteuser@remoteserver:/remote/folder/
Using ssh-copy-id
$ ssh-copy-id -i ~/.ssh/mykey user@host
Finally, add the hostnames and IP addresses of all the nodes in the cluster to the /etc/hosts file on each server.
$ sudo vi /etc/hosts
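A minimal sketch of the /etc/hosts entries, assuming a master named hadoop-master, the hadoop-slave1/hadoop-slave2 worker names used later in the workers file, and placeholder IP addresses:
# example addresses - replace with your own network's IPs and hostnames
192.168.1.10   hadoop-master
192.168.1.11   hadoop-slave1
192.168.1.12   hadoop-slave2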
Once the cluster configuration is completed and the keys are exchanged, you will be able to connect from the master server to the worker servers and vice versa using the SSH keys. Hadoop makes use of this connectivity to perform actions across the servers.
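To confirm that the passwordless connectivity works before moving on, a quick check from the master (using the hostnames assumed above) might look like:
$ ssh hadoopuser@hadoop-slave1 hostname
$ ssh hadoopuser@hadoop-slave2 hostname
Each command should print the worker's hostname without prompting for a password.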
5. Hadoop Configuration
First, let's configure HADOOP_HOME. For this, create a file under /etc/profile.d (for example, hadoop.sh) and add the following lines.
$ sudo vi /etc/profile.d/hadoop.sh
HADOOP_HOME=/path/to/hadoop
export HADOOP_HOME
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
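To load the new environment in the current shell and confirm that the Hadoop binaries are on the PATH, something like the following should work (assuming the file was saved as /etc/profile.d/hadoop.sh as above):
$ source /etc/profile.d/hadoop.sh
$ hadoop version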
Next, we have to make changes in the configuration files below before starting the Hadoop services.
Let's start with hdfs-site.xml.
To create the master-worker configuration, we need directories to store the NameNode data and the DataNode data. Execute the commands below to create them.
$ sudo mkdir -p $HADOOP_HOME/etc/hadoop/data/dfs/namenode
$ sudo mkdir -p $HADOOP_HOME/etc/hadoop/data/dfs/datanode
Now, let's edit hdfs-site.xml and add the details to the XML file. Replace namenode_data_location and datanode_data_location with the directories created above.
$ sudo vi $HADOOP_HOME/etc/hadoop/hdfs-site.xml
<property>
  <name>dfs.namenode.name.dir</name>
  <value>namenode_data_location</value>
</property>
<property>
  <name>dfs.datanode.data.dir</name>
  <value>datanode_data_location</value>
</property>
The full list of supported hdfs-site.xml properties is available in the Apache Hadoop documentation.
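As an optional illustration of another commonly used hdfs-site.xml property (not part of the original setup), dfs.replication controls how many copies of each block HDFS keeps; on a two-DataNode cluster it could be lowered like this:
<property>
  <!-- optional example: 2 replicas for a 2-DataNode cluster -->
  <name>dfs.replication</name>
  <value>2</value>
</property>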
Next, edit core-site.xml and set the default file system URI. On a multi-node cluster, replace localhost with the hostname of the NameNode.
$ sudo vi $HADOOP_HOME/etc/hadoop/core-site.xml
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://localhost:9820</value>
</property>
The full list of supported core-site.xml properties is available in the Apache Hadoop documentation.
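As an optional illustration (not part of the original configuration), hadoop.tmp.dir is another frequently set core-site.xml property; it defines the base directory for Hadoop's temporary files, and the path below is only a placeholder:
<property>
  <!-- optional example; the path is a placeholder -->
  <name>hadoop.tmp.dir</name>
  <value>/usr/local/hadoop/tmp</value>
</property>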
The yarn-site.xml file describes settings related to YARN, including the NodeManager, ResourceManager, containers, and the Application Master. To make the changes:
$ sudo vi $HADOOP_HOME/etc/hadoop/yarn-site.xml
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
  <value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
  <name>yarn.resourcemanager.hostname</name>
  <value>127.0.0.1</value>
</property>
<property>
  <name>yarn.acl.enable</name>
  <value>0</value>
</property>
<property>
  <name>yarn.nodemanager.env-whitelist</name>
  <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
</property>
On a multi-node cluster, set yarn.resourcemanager.hostname to the hostname of the ResourceManager node instead of 127.0.0.1.
The full list of supported yarn-site.xml properties is available in the Apache Hadoop documentation.
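As an optional illustration (not part of the original setup), yarn.nodemanager.resource.memory-mb caps the amount of memory, in MB, that a NodeManager may hand out to containers; the value below is only an example and should match your hardware:
<property>
  <!-- optional example value; adjust to your hardware -->
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>4096</value>
</property>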
Finally, configure mapred-site.xml so that MapReduce jobs run on YARN.
$ sudo vi $HADOOP_HOME/etc/hadoop/mapred-site.xml
<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
  <description>MapReduce framework name</description>
</property>
The full list of supported mapred-site.xml properties is available in the Apache Hadoop documentation.
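As an optional illustration (not part of the original setup), mapreduce.map.memory.mb is one of the commonly tuned mapred-site.xml properties; it sets the memory requested for each map task container, and the value below is only an example:
<property>
  <!-- optional example value -->
  <name>mapreduce.map.memory.mb</name>
  <value>1024</value>
</property>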
List all DataNode hostnames or IP addresses in the etc/hadoop/workers file, one per line. The helper scripts use this file to run commands on many hosts at once; it is not used for any of the Java-based Hadoop configuration.
$ sudo vi $HADOOP_HOME/etc/hadoop/workers
hadoop-slave1
hadoop-slave2
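On a multi-node cluster the same configuration must be present on every node. A minimal sketch of one way to push the configuration directory from the master to the workers with rsync, assuming Hadoop is installed at the same path on all nodes and using the hostnames from the workers file above:
$ rsync -av $HADOOP_HOME/etc/hadoop/ hadoopuser@hadoop-slave1:$HADOOP_HOME/etc/hadoop/
$ rsync -av $HADOOP_HOME/etc/hadoop/ hadoopuser@hadoop-slave2:$HADOOP_HOME/etc/hadoop/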
6. Start Hadoop Cluster
To start the Hadoop cluster, you need to start both the HDFS and YARN services. The first time you bring up HDFS, it must be formatted.
$ hdfs namenode -format
Then, start the HDFS NameNode with the following command on the designated node as the hdfs user.
$ hdfs --daemon start namenode
Start an HDFS DataNode with the following command on each designated node as the hdfs user.
$ hdfs --daemon start datanode
Alternatively, if SSH access and the workers file are configured, all of the HDFS daemons can be started at once with the helper script from $HADOOP_HOME/sbin.
$ ./start-dfs.sh
Next, start the Hadoop YARN services.
$ ./start-yarn.sh
To verify that all the Hadoop daemons have started, we can use the following command.
$ jps
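The output should list the running Hadoop daemons; a rough sketch of what it might look like on the master node (the process IDs will differ) is:
12045 NameNode
12351 SecondaryNameNode
12710 ResourceManager
13125 Jps
On the worker nodes, jps should instead show DataNode and NodeManager processes.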
7. Access Hadoop Cluster
You can now access the Hadoop cluster and use the master and worker servers for your data needs.
The web UIs are enabled by default for the NameNode, ResourceManager, and MapReduce JobHistory server.
NameNode – http://masterhost:9870
ResourceManager – http://resourcemanagerhost:8088
MapReduce JobHistory – http://jsserver:19888
DataNode – http://datanodehost:9864
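The cluster state can also be checked from the command line; for example, these standard commands report the HDFS capacity per DataNode and the YARN nodes registered with the ResourceManager:
$ hdfs dfsadmin -report
$ yarn node -list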
Make sure you stop all the Hadoop services before shutting down the cluster. You can use the command below (from $HADOOP_HOME/sbin) for this.
$ ./stop-all.sh
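Note that stop-all.sh is considered deprecated in recent Hadoop releases; an equivalent approach is to stop YARN and HDFS separately with the dedicated scripts in $HADOOP_HOME/sbin:
$ ./stop-yarn.sh
$ ./stop-dfs.sh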
Reference - Apache Hadoop Documentation (https://hadoop.apache.org/docs/r3.2.1/)