Integration of LVM with a Hadoop Cluster to Contribute Limited DataNode Storage on AWS
Deepak Sharma
3 x RedHat Certified Engineer (EX200, EX294, EX180) || DevOps Engineer || Docker || K8s || Ansible || Linux || Git || Github || Gitlab || Terraform || Jenkins || Cloud Computing || AWS
In this article, we learn how to contribute limited DataNode storage to the NameNode using the LVM partitioning concept.
First, let's understand the purpose of this task.
In a Hadoop cluster, slaves/DataNodes normally contribute their full storage to the master node. But suppose a slave wants to contribute only a limited amount of storage; it can do that too. So let's see how a slave can contribute limited storage to the master node.
In this article, I will also cover:
- How to create a Hadoop cluster on AWS
- How to configure the NameNode/Master on AWS
- How to configure the DataNode/Slave on AWS
- How to integrate LVM with Hadoop
- How to increase the contributed storage of a DataNode
- How to decrease the contributed storage of a DataNode
Before starting, let's first go through a short introduction. In this section, you will learn some definitions that will be used in this article.
Introduction :-
- Hadoop NameNode :- The NameNode is the centralized component of an HDFS file system. It keeps the directory tree of all files in the file system and tracks where across the cluster the file data is kept. In short, it holds the metadata about the DataNodes.
- DataNode :- The DataNode is responsible for storing the actual data in HDFS. NameNode and DataNode are in constant communication. When a DataNode starts up, it announces itself to the NameNode along with the list of blocks it is responsible for.
- LVM :- Logical Volume Management enables combining multiple individual hard drives or disk partitions into a single volume group (VG). That volume group can then be subdivided into logical volumes (LVs) or used as a single large volume.
- LVM provides elasticity to storage devices and is an advanced alternative to static partitioning.
Let's Get Started :-
In this task, I use two AWS instances: one for the master/NameNode and the other for the DataNode. To build the Hadoop cluster, we need to install the JDK and Hadoop on top of both the NameNode and the DataNode.
Step-A Install JDK and Hadoop on Namenode
Here I use RedHat 8.0 as the base OS. As we know, Hadoop runs on Java, and our instance is running on AWS, so I used a RedHat 8.0 AMI to launch the instance. First we transfer the JDK and Hadoop (version 1) software to the AWS instance using the WinSCP software.
Hadoop (version 1) is not compatible with every Java version, so here I provide links to the JDK and Hadoop (version 1) that I used.
Get JDK software | Get Hadoop software | Get WinSCP software
1. Install JDK :-
After copying the JDK and Hadoop software to the AWS instance (NameNode), first go to the location where the software was copied. In my case, the location is /home/ec2-user. Then install the JDK on the NameNode with the command below:
rpm -ivh jdk-8u171-linux-x64.rpm
We can verify the installation with the jps command.
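For reference, two quick checks that the JDK landed correctly (the exact version string will differ with your build):
java -version   # prints the installed Java version
jps             # lists running JVM processes; we will use it again later to verify the Hadoop daemons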
2. Install Hadoop :-
Hadoop runs on Java, which we have just installed. Now we can install Hadoop with:
rpm -ivh hadoop-1.2.1-1.x86_64.rpm --force
Now we have to check that Hadoop version 1 is compatible with our Java version. We can verify with:
hadoop version
If we do not get any error here, our Hadoop is working fine with Java.
Step-B Configure Namenode :-
There are a few steps to configure the NameNode:-
- Install JDK and Hadoop
- Create a directory in the / drive of RedHat
- Configure the Hadoop files
- Format the NameNode directory
- Start the NameNode service
1. Install JDK and Hadoop :- Installation of the JDK and Hadoop is already done in Step-A above.
2. Create a directory in the / drive of RedHat :- Here I create a directory /nn on the NameNode
mkdir /nn
3. Configure the Hadoop files :- The Hadoop configuration files are located in /etc/hadoop. In /etc/hadoop we find many files, but we have to configure only two of them: hdfs-site.xml and core-site.xml
hdfs-site.xml
core-site.xml
- AWS gives two IPs to an instance: a public one and a private one.
- The public IP is reachable from outside AWS, while inside the instance Hadoop only knows about the private IP.
- So if I bind the NameNode to the private IP, nobody from outside can connect to it. But if I give the public IP, Hadoop does not work, because the instance itself does not own the public IP.
That's why I use 0.0.0.0 here. Now anybody can connect to the NameNode using either the public or the private IP.
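The screenshots show the exact files; as a reference only, here is a minimal sketch of what the two NameNode files typically look like for this setup. It assumes the Hadoop 1.x property names dfs.name.dir and fs.default.name, the /nn directory created in step 2, and port 9001 used later in this article; adapt the values to your own cluster.
cat > /etc/hadoop/hdfs-site.xml <<'EOF'
<?xml version="1.0"?>
<configuration>
  <!-- directory created in step 2 where the NameNode keeps its metadata -->
  <property>
    <name>dfs.name.dir</name>
    <value>/nn</value>
  </property>
</configuration>
EOF
cat > /etc/hadoop/core-site.xml <<'EOF'
<?xml version="1.0"?>
<configuration>
  <!-- bind to 0.0.0.0 so the NameNode is reachable on both the public and the private IP -->
  <property>
    <name>fs.default.name</name>
    <value>hdfs://0.0.0.0:9001</value>
  </property>
</configuration>
EOF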
4. Format namenode directory:-
Just like a fresh storage device, the NameNode directory has to be formatted before it can hold data permanently. So we format the NameNode directory with:
hadoop namenode -format
5. Start namenode service
Now we start the NameNode service so that DataNodes can connect to our NameNode on port 9001.
hadoop-daemon.sh start namenode
We can verify with the jps command. In the screenshot above, we can see that our NameNode service is running.
Step-C Configure datanode
There are a few steps to configure the DataNode:-
- Install JDK and Hadoop software
- Create a directory in the / drive of RedHat
- Configure the Hadoop files
- Start the DataNode service
1. Install JDK and Hadoop software :- Installation of the JDK and Hadoop is already done in Step-A above. The installation process is the same on the NameNode and the DataNode.
2. Create a directory in the / drive of RedHat :- Here I create a directory /dn on the DataNode
mkdir /dn
3. Configure the Hadoop files :- The Hadoop configuration files are located in /etc/hadoop. In /etc/hadoop we find many files, but we have to configure only two of them: hdfs-site.xml and core-site.xml
hdfs-site.xml
core-site.xml
On the DataNode, in core-site.xml we add the public IP of the NameNode given by AWS.
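Again as a reference only, a minimal sketch of the two DataNode files, assuming the Hadoop 1.x property names dfs.data.dir and fs.default.name; 1.2.3.4 is a placeholder for your NameNode's public IP.
cat > /etc/hadoop/hdfs-site.xml <<'EOF'
<?xml version="1.0"?>
<configuration>
  <!-- directory created in step 2 where this DataNode stores HDFS blocks -->
  <property>
    <name>dfs.data.dir</name>
    <value>/dn</value>
  </property>
</configuration>
EOF
cat > /etc/hadoop/core-site.xml <<'EOF'
<?xml version="1.0"?>
<configuration>
  <!-- replace 1.2.3.4 with the public IP of the NameNode instance -->
  <property>
    <name>fs.default.name</name>
    <value>hdfs://1.2.3.4:9001</value>
  </property>
</configuration>
EOF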
4. Start datanode service
To start the DataNode service, we use
hadoop-daemon.sh start datanode
We can verify with the jps command. In the screenshot above, we can see that our DataNode service is running.
Note :- If the DataNode does not connect to the NameNode, it may be an AWS security group issue. In that case, allow the DataNode in the NameNode's security group by adding a rule.
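If you prefer the AWS CLI over the console, a hedged sketch of adding such a rule; the security group ID and the DataNode IP are placeholders, and port 9001 is the NameNode port used in this article:
aws ec2 authorize-security-group-ingress \
    --group-id sg-0123456789abcdef0 \
    --protocol tcp --port 9001 \
    --cidr 203.0.113.10/32    # the DataNode's IP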
From the NameNode, we can also see how many DataNodes are connected and how much storage they share with the NameNode:
hadoop dfsadmin -report
Here we see that our DataNode shares its entire storage with the NameNode. In the next step, we will see how we can contribute only limited storage to the NameNode using LVM.
Step-D Integration of Hadoop and LVM
It is good practice to add an external storage device to store our critical data, so that if the system gets corrupted, the data is not lost.
Here I attach extra volumes from the AWS EBS service. To keep things simple, I divide this process into steps so you can easily follow along.
Step-1 Attach two extra volumes to the DataNode instance
First create two volumes with EBS: Volume-1 (4 GB) and Volume-2 (8 GB)
Now attach both volumes to our DataNode instance (both the extra volumes and the DataNode instance must be in the same availability zone, e.g. 1a)
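The AWS console is enough for this step; for reference, a hedged AWS CLI sketch of the same thing (availability zone, volume ID and instance ID are placeholders):
# create an 8 GB gp2 volume in the same availability zone as the DataNode instance
aws ec2 create-volume --size 8 --volume-type gp2 --availability-zone ap-south-1a
# attach it to the DataNode instance; /dev/sdf shows up inside the instance as /dev/xvdf
aws ec2 attach-volume --volume-id vol-0123456789abcdef0 \
    --instance-id i-0123456789abcdef0 --device /dev/sdf
# repeat both commands with --size 4 for the second volume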
Step-2 Now use the LVM concept to partition the extra volumes
By using the LVM concept, we can combine both extra volumes and use them as one pool of storage.
- In LVM, we first create a PV from each extra volume and then create a VG from them. From the VG, we access the storage through an LV.
- The size of the LV depends on our requirement. The maximum size of an LV equals the size of the VG, and we can grow the VG by attaching more PVs, so the pool can keep growing as needed.
- Now let's see how we can implement this setup.
In LVM, there are a few steps to get the storage:-
a. Create PV
b. Create VG
c. Create LV
d. Format the LV
e. Mount the LV to a folder
We can see all the volumes attached to the instance with "fdisk -l"
Here we see both our volumes, 4 GB and 8 GB.
Step-a Create PV
To create a PV, first install the LVM software using
yum install lvm2
Now we create a PV on each of the two volumes (8 GB and 4 GB):
pvcreate /dev/xvdf   # for the 8 GB volume
pvcreate /dev/xvdg   # for the 4 GB volume
We can verify by pvdisplay command.
Step-b Create VG
vgcreate iiec /dev/xvdf /dev/xvdg
We can see iiec vg by "vgdisplay iiec" command (in above screenshot).
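As noted above, the VG can be grown later by adding more PVs. A minimal sketch of that, assuming a new disk /dev/xvdh were attached to the instance:
pvcreate /dev/xvdh        # turn the new disk into a PV
vgextend iiec /dev/xvdh   # add it to the iiec volume group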
Step-c Create LV
First, I want to share only 5 GiB with the NameNode, so I create an LV of 5 GiB.
lvcreate --size 5GiB --name MyLV iiec
We can see our created lv (MyLV) by using
lvdisplay iiec/MyLV
Step-d Format LV
mkfs.ext4 /dev/iiec/MyLV
Step-e Mount LV to a folder
First create a directory/folder by using
mkdir /root/limited
Then mount the LV to the folder by
mount /dev/iiec/MyLV /root/limited
We can see all the mount points with the 'df -h' command
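One caveat: a plain mount does not survive a reboot. A hedged sketch of making it persistent via /etc/fstab (the ext4 type matches the mkfs above; the defaults options are an assumption):
echo '/dev/iiec/MyLV  /root/limited  ext4  defaults  0 0' >> /etc/fstab
mount -a    # re-reads fstab and confirms the entry mounts cleanly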
Step-3 Replace the directory in the configuration file with our mounted directory
It is good practice to stop the DataNode service before changing the Hadoop configuration files. We can stop the DataNode service with hadoop-daemon.sh stop datanode.
Now go to the /etc/hadoop location and update the two files again:
hdfs-site.xml
core-site.xml
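Only hdfs-site.xml changes here; a minimal sketch, assuming the same dfs.data.dir property as before (core-site.xml keeps pointing at the NameNode):
cat > /etc/hadoop/hdfs-site.xml <<'EOF'
<?xml version="1.0"?>
<configuration>
  <!-- point the DataNode at the LVM-backed mount instead of /dn -->
  <property>
    <name>dfs.data.dir</name>
    <value>/root/limited</value>
  </property>
</configuration>
EOF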
Start datanode service
hadoop-daemon.sh start datanode
Now, from the NameNode, we can see how much storage we got from the DataNode (again with hadoop dfsadmin -report).
In this way, our slave can contribute limited storage to the master node.
Step-E How we can increase / decrease the contributed storage of the DataNode
1. Increase LVM storage
- Let's see how we can increase the contributed storage of the DataNode without losing our older data.
- We shared 5 GiB with the NameNode. Now I want to share 10 GiB.
- For this, we have to increase the size of the LV (MyLV).
We can increase the LV size with the command below:
lvextend --size +5GiB /dev/iiec/MyLV
We can also see our LV size by
lvdisplay iiec/MyLV
- We have successfully increased the size of the LV (MyLV). But when we check with "df -h", we still see only the previous size of the LV.
- That is because at this point only the original part of the LV carries a filesystem; the newly added space is not formatted yet, and we can only store data on formatted space.
- But if we re-format the whole LV, all of our older data will be lost. So we extend the filesystem over only the unformatted part by using:
resize2fs /dev/iiec/MyLV
This command grows the filesystem over the remaining part of the LV, and we do not lose our older data.
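As a convenience, lvextend can also grow the filesystem in the same step with its -r (--resizefs) option; a hedged one-liner equivalent of the two commands above:
lvextend -r --size +5GiB /dev/iiec/MyLV   # extends the LV and resizes the ext4 filesystem in one go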
Now we can also see from the NameNode that our storage increased on the fly, without stopping the DataNode or NameNode services.
2. Decrease LVM Storage/Shrink Volume of Datanode :-
In this step, you will learn how we can reduce the volume that the DataNode shares with the NameNode.
To reduce LVM storage, we have to follow a few steps so that our previous data stays safe.
- Stop datanode service
- Unmount the Logical volume
- Scan/clean the filesystem
- Resize (shrink) the filesystem
- Reduce the LV storage
- Mount back to the directory
- Start datanode Service
1. Stop datanode service
hadoop-daemon.sh stop datanode
2. Unmount the Logical volume
Before unmounting the directory, I store some files in it to show that our data stays safe: I create a file lw.html, and some folders were already created inside the LVM storage.
Before running umount, close all open files of that directory (limited). Now we unmount the limited directory with:
umount /root/limited
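If the unmount fails with a "target is busy" error, a hedged way to find which processes still hold files open under the mount (fuser comes from the psmisc package):
fuser -vm /root/limited   # lists processes using the mount; close them, then retry umount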
3. Scan/Clean
While storing and reading files, the filesystem may accumulate inconsistencies (garbage). So before shrinking, we first check and clean the filesystem. For this we run
e2fsck -f /dev/iiec/MyLV
4. Resize (shrink) the filesystem
This is the most important step. Our data is not stored contiguously on the volume, so if we reduce the LV directly, we might lose some of it. So first we resize (shrink) the filesystem, which packs the data and its inode table into the first 6G:
resize2fs /dev/iiec/MyLV 6G
Now all our data sits within the first 6 GB. So as long as we only remove the remaining space (10 GB - 6 GB = 4 GB), our data is safe; reducing below 6 GB would lose or corrupt it. In other words, the last 4 GB is now empty, and removing that space does not touch our data.
5. Reduce the LV storage
Here I shrink the volume down to 6 GB of storage, so that our older data stays safe.
lvreduce -L 6G /dev/iiec/MyLV        # set the LV size to 6G
lvreduce --size -4G /dev/iiec/MyLV   # or: remove 4G from the current size
Either command gives the same result: it reduces the LV size from 10 G to 6 G.
We can see the new LV size with "lvdisplay iiec/MyLV"
6. Mount back to the directory
Now again mount the LV storage to the directory /root/limited
mount /dev/iiec/MyLV /root/limited
7. Start datanode service
hadoop-daemon.sh start datanode
Now we can verify from the NameNode.
We can see that our data is also safe
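A quick way to confirm this on the DataNode itself (lw.html is the file created before the shrink):
ls -l /root/limited   # lw.html and the earlier folders should still be listed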
Hence, in this way, a company can increase or decrease the DataNode's contributed storage without losing the older data.
Thanks..