Installation of Apache Hadoop 3.2


1. Prerequisites

  • OS – Ubuntu 16.04 or later (used for this tutorial)
  • Java 8 Runtime Environment
  • Java 8 Development Kit (JDK)
  • RAM – 8 GB or more

2. Preparation

Check that the prerequisites are met. Execute the steps below for verification.

$ lsb_release -a

Make sure the version is Ubuntu 16.04 or later.

Next, check the Java version and install Java if required.


$ java -version



If the version is older than 1.8, install Java 8 or a newer release.


$ sudo apt install openjdk-8-jdk -y

Now, let's check the hardware and make sure it meets the prerequisites.


$ free -g




$ df -h

Make sure there is enough memory and storage space for installing Apache Hadoop 3.2.

3. Installing Hadoop

Hadoop can be installed either by unpacking the release tarball on every machine in the cluster or by installing packages from a repository. In this case, we will configure one machine as the NameNode (master) and the other two servers in the cluster as DataNodes (workers).

To download and unpack the relevant Hadoop release from the Apache archive, use the commands below.


$ wget https://archive.apache.org/dist/hadoop/common/hadoop-3.2.1/hadoop-3.2.1.tar.gz
$ tar xzf hadoop-3.2.1.tar.gz
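The permission commands later in this tutorial assume Hadoop lives under /usr/local/hadoop. One way to place the unpacked release there (an assumed layout, not spelled out in the original) is:

$ sudo mv hadoop-3.2.1 /usr/local/hadoop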

Alternatively, we can install the packages from a repository:


$ sudo apt-get install hadoop hadoop-hdfs libhdfs0 hadoop-yarn hadoop-client openssl

Before configuring Hadoop, make sure the machines in the cluster can reach one another over the network.

4. Cluster Configuration

All the steps below need to be executed on all the servers (both master and workers) in the cluster.

  • Create a Hadoop user


$ sudo adduser hadoopuser






  • Provide sudo access to the user


$ sudo usermod -aG hadoopuser hadoopuser
$ sudo chown hadoopuser:root -R /usr/local/hadoop/
$ sudo chmod g+rwx -R /usr/local/hadoop/
$ sudo adduser hadoopuser sudo



  • Hadoop needs SSH to connect to localhost and to the other nodes in the cluster. We will accomplish this by generating a public/private key pair.


$ ssh-keygen -t rsa -P "<yourkeypassword>" -f ~/.ssh/id_rsa



Now append the public key to the authorized keys and restrict the file's permissions.


$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
$ chmod 0600 ~/.ssh/authorized_keys

  • Verify the above step


$ ssh localhost



  • Copy the keys from the master to the workers and from the workers to the master. You will have to perform this on all data nodes and copy the files from the data nodes to the master.

There are multiple ways to copy the key file:

Using scp


$ scp ~/.ssh/id_rsa.pub remoteuser@remoteserver:/remote/folder/

Using rsync


$ rsync -av --delete -e "ssh" /path/to/source remoteuser@remoteserver:/remote/folder/

Using ssh-copy-id


$ ssh-copy-id -i ~/.ssh/mykey user@host

  • The last step is to add all servers in the cluster (both master and workers) to the /etc/hosts file. This needs to be done on all the servers. Execute the command below and then add entries for the other servers in the cluster (including the localhost); a sample set of entries is shown after the command.


$ sudo vi /etc/hosts
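For example, assuming hypothetical private IP addresses and the worker hostnames used later in this tutorial, the entries could look like this:

192.168.1.10  hadoop-master
192.168.1.11  hadoop-slave1
192.168.1.12  hadoop-slave2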

Once the cluster configuration is completed and the keys are exchanged, you will be able to connect from the master server to the worker servers and vice versa using SSH keys. Hadoop will make use of this connectivity for performing actions between servers.

5. Hadoop Configuration

First, let's configure HADOOP_HOME and the related paths. For this, create a script under /etc/profile.d (for example, hadoop.sh) and add the following lines.


$ sudo vi /etc/profile.d/hadoop.sh


HADOOP_HOME=/path/to/hadoop
export HADOOP_HOME
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
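To apply the new variables in the current shell without logging out (assuming the file name used above):

$ source /etc/profile.d/hadoop.sh
$ echo $HADOOP_HOME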

Next, we have to edit the configuration files below before starting the Hadoop services.


  • $HADOOP_HOME/etc/hadoop/hdfs-site.xml
  • $HADOOP_HOME/etc/hadoop/core-site.xml
  • $HADOOP_HOME/etc/hadoop/mapred-site.xml
  • $HADOOP_HOME/etc/hadoop/yarn-site.xml


Let's start with hdfs-site.xml.

  • hdfs-site.xml config

To create the master-worker configuration, we need directories to store the NameNode and DataNode data. Execute the commands below to create them.


$ sudo mkdir -p $HADOOP_HOME/data/dfs/namenode
$ sudo mkdir -p $HADOOP_HOME/data/dfs/datanode

Now, let's edit hdfs-site.xml and add the following properties to the file.


$ sudo vi $HADOOP_HOME/etc/hadoop/hdfs-site.xml

<property>
  <name>dfs.namenode.name.dir</name>
  <value>namenode_data_location</value>
</property>

<property>
  <name>dfs.datanode.data.dir</name>
  <value>datanode_data_location</value>
</property>
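As a concrete sketch, assuming HADOOP_HOME is /usr/local/hadoop and the directories created above (the dfs.replication value of 2 for the two DataNodes is also an assumption, not part of the original):

<property>
  <name>dfs.namenode.name.dir</name>
  <value>file:///usr/local/hadoop/data/dfs/namenode</value>
</property>

<property>
  <name>dfs.datanode.data.dir</name>
  <value>file:///usr/local/hadoop/data/dfs/datanode</value>
</property>

<property>
  <name>dfs.replication</name>
  <value>2</value>
</property>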

The full list of supported hdfs-site.xml properties is available in the Apache Hadoop documentation (see References).

  • core-site.xml config


$ sudo vi $HADOOP_HOME/etc/hadoop/core-site.xml

<property>
  <name>fs.defaultFS</name>
  <value>hdfs://localhost:9820</value>
</property>
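On a multi-node cluster, the filesystem URI should point at the NameNode host rather than localhost. A sketch, using the hypothetical hadoop-master hostname from the /etc/hosts example above:

<property>
  <name>fs.defaultFS</name>
  <value>hdfs://hadoop-master:9820</value>
</property>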

The full list of supported core-site.xml properties is available in the Apache Hadoop documentation (see References).

  • yarn-site.xml config

The yarn-site.xml file describes YARN settings, including those for the NodeManager, ResourceManager, containers, and the ApplicationMaster. To make the config changes:


$ sudo vi $HADOOP_HOME/etc/hadoop/yarn-site.xml

<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
</property>

<property>
  <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
  <value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>

<property>
  <name>yarn.resourcemanager.hostname</name>
  <value>127.0.0.1</value>
</property>

<property>
  <name>yarn.acl.enable</name>
  <value>0</value>
</property>

<property>
  <name>yarn.nodemanager.env-whitelist</name>
  <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
</property>

The full list of supported yarn-site.xml properties is available in the Apache Hadoop documentation (see References).


  • mapred-site.xml


$ sudo vi $HADOOP_HOME/etc/hadoop/mapred-site.xml

<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
  <description>MapReduce framework name</description>
</property>
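On Hadoop 3.x, MapReduce jobs usually also need the application classpath configured in mapred-site.xml. A commonly used addition (an assumption, not part of the original article):

<property>
  <name>mapreduce.application.classpath</name>
  <value>$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*:$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*</value>
</property>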

The full list of supported mapred-site.xml properties is available in the Apache Hadoop documentation (see References).

  • Update the workers file

List all data node hostnames or IP addresses in your etc/hadoop/workers file, one per line. Helper scripts will use the etc/hadoop/workers file to run commands on many hosts at once. It is not used for any of the Java-based Hadoop configurations.


$ sudo vi $HADOOP_HOME/etc/hadoop/workers

hadoop-slave1
hadoop-slave2        

6. Start the Hadoop Cluster

To start a Hadoop cluster you will need to start both the HDFS and YARN cluster. The first time you bring up HDFS, it must be formatted.


$ hdfs namenode -format

Then, start the HDFS NameNode with the following command on the designated node as hdfs.


$ hdfs --daemon start namenode        

Start the HDFS DataNode with the following command on each designated node as hdfs.


$ hdfs --daemon start datanode        

Alternatively, if the workers file and SSH trusted access are configured, all of the HDFS daemons can be started at once with the helper script:


$ ./start-dfs.sh        

Next, start the Hadoop YARN services:


$ ./start-yarn.sh



To verify that all the Hadoop daemons have started, we can use the following command.


$ jps
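On the master node, the output should list the running daemons; a rough illustration with hypothetical process IDs (worker nodes would show DataNode and NodeManager instead):

12001 NameNode
12205 SecondaryNameNode
12410 ResourceManager
12633 Jps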

7. Access the Hadoop Cluster

You can now access the Hadoop cluster and utilise the master and worker servers for your data needs.

The web UI is enabled by default for the NameNode, ResourceManager, and MapReduce JobHistory server.


NameNode – http://masterhost:9870

ResourceManager – http://resourcemanagerhost:8088

MapReduce JobHistory – http://jsserver:19888

DataNode – http://datanodehost:9864
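A quick way to confirm the UIs are reachable from the command line (the hostnames here are the same placeholders as above); each command should print an HTTP success or redirect status code:

$ curl -s -o /dev/null -w "%{http_code}\n" http://masterhost:9870
$ curl -s -o /dev/null -w "%{http_code}\n" http://resourcemanagerhost:8088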


Make sure you stop all Hadoop services before shutting down the cluster. You can use the below command for the same.


$ ./stop-all.sh        

References

  1. https://www.dhirubhai.net/pulse/installation-apache-hadoop-321-ubuntu-dr-virendra-kumar-shrivastava/
  2. https://hadoop.apache.org/docs/r3.2.1/
