Installation of Apache Hadoop 3.2
1. Prerequisites
2. Preparation
Check that the prerequisites are met by executing the steps below for verification.
$ lsb_release -a
Make sure the version is Ubuntu 16.04 or later.
Next, check the Java version and install Java if required.
$ java -version
If the installed version is older than 1.8, install a newer one.
$ sudo apt install openjdk-8-jdk -y
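Hadoop also needs to know where Java is installed (JAVA_HOME). As a small sketch, one common way to locate the JDK path on Ubuntu, assuming the OpenJDK 8 package installed above, is:
$ readlink -f /usr/bin/java | sed "s:/bin/java::"
The resulting path can be used wherever JAVA_HOME is required, for example in $HADOOP_HOME/etc/hadoop/hadoop-env.sh.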
Now, let's check the hardware and make sure it meets the prerequisites.
$ free -g
$ df -h
Make sure there is enough memory and storage space for installing Apache Hadoop 3.2.
3. Installing Hadoop
Hadoop can be installed either by unpacking the distribution on every machine in the cluster or by installing the packages from a repository. In this setup, we will configure one machine as the NameNode (master) and the other two servers in the cluster as DataNodes (workers).
To download and unpack the relevant version of Hadoop from the Apache archive, use the commands below.
$ wget https://archive.apache.org/dist/hadoop/common/hadoop-3.2.1/hadoop-3.2.1.tar.gz
$ tar xzf hadoop-3.2.1.tar.gz
Alternatively, we can install the packages from the repository.
$ sudo apt-get install hadoop hadoop-hdfs libhdfs0 hadoop-yarn hadoop-client openssl
Before configuring Hadoop, it is recommended to set up connectivity between the machines in the cluster.
4. Cluster Configuration
All of the steps below need to be executed on all the servers (both master and workers) in the cluster. They create a dedicated hadoopuser account, give it ownership of the Hadoop installation directory, and generate an SSH key pair.
$ sudo adduser hadoopuser
$ sudo usermod -aG hadoopuser hadoopuser
$ sudo chown hadoopuser:root -R /usr/local/hadoop/
$ sudo chmod g+rwx -R /usr/local/hadoop/
$ sudo adduser hadoopuser sudo
$ ssh-keygen -t rsa -P "<yourkeypassword>" -f ~/.ssh/id_rsa
Now append the public key to the authorized keys, set the correct permissions, and verify that SSH works locally.
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
$ chmod 0600 ~/.ssh/authorized_keys
$ ssh localhost
Next, the public key has to be copied to the other servers in the cluster. There are multiple ways to copy the file.
Using SCP
$ scp ~/.ssh/id_rsa.pub remoteuser@remoteserver:/remote/folder/
Using RSYNC
$ rsync -av --delete -e "ssh" /path/to/source remoteuser@remoteserver:/remote/folder/
Using ssh-copy-id
$ ssh-copy-id -i ~/.ssh/mykey user@host
Finally, add the hostnames and IP addresses of all the nodes in the cluster to the /etc/hosts file on each server.
$ sudo vi /etc/hosts
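A minimal sketch of the /etc/hosts entries, assuming a master named hadoop-master, the hadoop-slave1/hadoop-slave2 worker names used later in the workers file, and placeholder IP addresses:
# example addresses - replace with your own network's IPs and hostnames
192.168.1.10   hadoop-master
192.168.1.11   hadoop-slave1
192.168.1.12   hadoop-slave2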
Once the cluster configuration is completed and the keys are exchanged, you will be able to connect from the master server to the worker servers and vice versa using the SSH keys. Hadoop makes use of this connectivity to perform actions across the servers.
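To confirm that the passwordless connectivity works before moving on, a quick check from the master (using the hostnames assumed above) might look like:
$ ssh hadoopuser@hadoop-slave1 hostname
$ ssh hadoopuser@hadoop-slave2 hostname
Each command should print the worker's hostname without prompting for a password.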
5. Hadoop Configuration
First, let's configure HADOOP_HOME. For this, create a file under /etc/profile.d (for example, hadoop.sh) and add the following lines.
$ sudo vi /etc/profile.d/hadoop.sh
HADOOP_HOME=/path/to/hadoop
export HADOOP_HOME
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
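To load the new environment in the current shell and confirm that the Hadoop binaries are on the PATH, something like the following should work (assuming the file was saved as /etc/profile.d/hadoop.sh as above):
$ source /etc/profile.d/hadoop.sh
$ hadoop version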
Next, we have to make changes in the configuration files below before starting the Hadoop services.
Let's start with hdfs-site.xml.
To create the master-worker configuration, we need directories to store the NameNode data and the DataNode data. Execute the commands below to create them.
$ sudo mkdir -p $HADOOP_HOME/etc/hadoop/data/dfs/namenode
$ sudo mkdir -p $HADOOP_HOME/etc/hadoop/data/dfs/datanode
Now, let's edit hdfs-site.xml and add the details to the XML file. Replace namenode_data_location and datanode_data_location with the directories created above.
$ sudo vi $HADOOP_HOME/etc/hadoop/hdfs-site.xml
<property>
  <name>dfs.namenode.name.dir</name>
  <value>namenode_data_location</value>
</property>
<property>
  <name>dfs.datanode.data.dir</name>
  <value>datanode_data_location</value>
</property>
The full list of supported hdfs-site.xml properties is available in the Apache Hadoop documentation.
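As an optional illustration of another commonly used hdfs-site.xml property (not part of the original setup), dfs.replication controls how many copies of each block HDFS keeps; on a two-DataNode cluster it could be lowered like this:
<property>
  <!-- optional example: 2 replicas for a 2-DataNode cluster -->
  <name>dfs.replication</name>
  <value>2</value>
</property>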
Next, edit core-site.xml and set the default file system URI. On a multi-node cluster, replace localhost with the hostname of the NameNode.
$ sudo vi $HADOOP_HOME/etc/hadoop/core-site.xml
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://localhost:9820</value>
</property>
The full list of supported core-site.xml properties is available in the Apache Hadoop documentation.
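As an optional illustration (not part of the original configuration), hadoop.tmp.dir is another frequently set core-site.xml property; it defines the base directory for Hadoop's temporary files, and the path below is only a placeholder:
<property>
  <!-- optional example; the path is a placeholder -->
  <name>hadoop.tmp.dir</name>
  <value>/usr/local/hadoop/tmp</value>
</property>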
The yarn-site.xml file describes settings related to YARN, including the NodeManager, ResourceManager, containers, and the Application Master. To make the changes:
$ sudo vi $HADOOP_HOME/etc/hadoop/yarn-site.xml
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
  <value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
  <name>yarn.resourcemanager.hostname</name>
  <value>127.0.0.1</value>
</property>
<property>
  <name>yarn.acl.enable</name>
  <value>0</value>
</property>
<property>
  <name>yarn.nodemanager.env-whitelist</name>
  <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
</property>
On a multi-node cluster, set yarn.resourcemanager.hostname to the hostname of the ResourceManager node instead of 127.0.0.1.
The full list of supported yarn-site.xml properties is available in the Apache Hadoop documentation.
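As an optional illustration (not part of the original setup), yarn.nodemanager.resource.memory-mb caps the amount of memory, in MB, that a NodeManager may hand out to containers; the value below is only an example and should match your hardware:
<property>
  <!-- optional example value; adjust to your hardware -->
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>4096</value>
</property>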
Finally, configure mapred-site.xml so that MapReduce jobs run on YARN.
$ sudo vi $HADOOP_HOME/etc/hadoop/mapred-site.xml
<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
  <description>MapReduce framework name</description>
</property>
The full list of supported mapred-site.xml properties is available in the Apache Hadoop documentation.
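As an optional illustration (not part of the original setup), mapreduce.map.memory.mb is one of the commonly tuned mapred-site.xml properties; it sets the memory requested for each map task container, and the value below is only an example:
<property>
  <!-- optional example value -->
  <name>mapreduce.map.memory.mb</name>
  <value>1024</value>
</property>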
List all DataNode hostnames or IP addresses in the etc/hadoop/workers file, one per line. The helper scripts use this file to run commands on many hosts at once; it is not used for any of the Java-based Hadoop configuration.
$ sudo vi $HADOOP_HOME/etc/hadoop/workers
hadoop-slave1
hadoop-slave2
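On a multi-node cluster the same configuration must be present on every node. A minimal sketch of one way to push the configuration directory from the master to the workers with rsync, assuming Hadoop is installed at the same path on all nodes and using the hostnames from the workers file above:
$ rsync -av $HADOOP_HOME/etc/hadoop/ hadoopuser@hadoop-slave1:$HADOOP_HOME/etc/hadoop/
$ rsync -av $HADOOP_HOME/etc/hadoop/ hadoopuser@hadoop-slave2:$HADOOP_HOME/etc/hadoop/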
6. Start Hadoop Cluster
To start the Hadoop cluster, you need to start both the HDFS and YARN services. The first time you bring up HDFS, it must be formatted.
$ hdfs namenode -format
Then, start the HDFS NameNode with the following command on the designated node as the hdfs user.
$ hdfs --daemon start namenode
Start an HDFS DataNode with the following command on each designated node as the hdfs user.
$ hdfs --daemon start datanode
Alternatively, if SSH access and the workers file are configured, all of the HDFS daemons can be started at once with the helper script from $HADOOP_HOME/sbin.
$ ./start-dfs.sh
Next, start the Hadoop YARN services.
$ ./start-yarn.sh
To verify that all the Hadoop daemons have started, we can use the following command.
$ jps
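The output should list the running Hadoop daemons; a rough sketch of what it might look like on the master node (the process IDs will differ) is:
12045 NameNode
12351 SecondaryNameNode
12710 ResourceManager
13125 Jps
On the worker nodes, jps should instead show DataNode and NodeManager processes.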
7. Access Hadoop Cluster
You can now access the Hadoop cluster and use the master and worker servers for your data needs.
The web UIs are enabled by default for the NameNode, ResourceManager, and MapReduce JobHistory server.
NameNode – http://masterhost:9870
ResourceManager – http://resourcemanagerhost:8088
MapReduce JobHistory – http://jsserver:19888
DataNode – http://datanodehost:9864
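The cluster state can also be checked from the command line; for example, these standard commands report the HDFS capacity per DataNode and the YARN nodes registered with the ResourceManager:
$ hdfs dfsadmin -report
$ yarn node -list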
Make sure you stop all the Hadoop services before shutting down the cluster. You can use the command below (from $HADOOP_HOME/sbin) for this.
$ ./stop-all.sh
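Note that stop-all.sh is considered deprecated in recent Hadoop releases; an equivalent approach is to stop YARN and HDFS separately with the dedicated scripts in $HADOOP_HOME/sbin:
$ ./stop-yarn.sh
$ ./stop-dfs.sh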
Reference - Apache Hadoop Documentation (https://hadoop.apache.org/docs/r3.2.1/)