Integration of LVM with a Hadoop Cluster to Contribute Limited Storage from a DataNode on AWS

In this article, we learn how to contribute only a limited amount of a DataNode's storage to the NameNode by using the LVM partitioning concept.

First, let's understand the purpose of this task.

In a Hadoop cluster, a slave/DataNode normally contributes its full storage to the master node. In some cases, however, a slave may want to contribute only a limited amount of storage. Let's see how a slave can contribute limited storage to the master node.

In this article, I will also cover:-

  1. How to create a Hadoop cluster on AWS
  2. Configure the NameNode/master on AWS
  3. Configure the DataNode/slave on AWS
  4. How to integrate LVM with Hadoop
  5. How to increase the contributed storage of a DataNode
  6. How to decrease the contributed storage of a DataNode

Before we start, let's go through a short introduction. In this section, you will learn some definitions that will be used in this article.

Introduction :-

  1. Hadoop NameNode :- The Hadoop NameNode is the centerpiece of an HDFS file system. It keeps the directory tree of all files in the file system and tracks where across the cluster the file data is kept. In short, it keeps the metadata related to the DataNodes.
  2. DataNode :- The DataNode is responsible for storing the actual data in HDFS. The NameNode and DataNodes are in constant communication. When a DataNode starts up, it announces itself to the NameNode along with the list of blocks it is responsible for.
  3. LVM :- Logical Volume Management enables combining multiple individual hard drives or disk partitions into a single volume group (VG). That volume group can then be subdivided into logical volumes (LVs) or used as a single large volume.


  • LVM provides elasticity to the storage device; it is an advanced form of partitioning.

Let's Get Started:-

In this task, I use two AWS instances: one for the master/NameNode and the other for the DataNode. To build a Hadoop cluster, we need to install the JDK and Hadoop on both the NameNode and the DataNode.

Step-A Install JDK and Hadoop on the NameNode

Here I use Red Hat 8.0 as the base OS. As we know, Hadoop runs on Java, and our instance is running on AWS, so I used a Red Hat 8.0 AMI to launch the instance. First we transfer the JDK and Hadoop (version 1) software to the AWS instance using WinSCP.

Hadoop (version 1) is not compatible with every Java version, so here are the links to the JDK and Hadoop (version 1).

Get the JDK software | Get the Hadoop software | Get the WinSCP software

  1. Install JDK

After copying the JDK and Hadoop software to the AWS instance (NameNode), first go to the location where the software was copied. In my case, the location is /home/ec2-user. Then install the JDK on the NameNode with the command below:-

rpm -ivh jdk-8u171-linux-x64.rpm

We can verify the installation with the jps command, which ships with the JDK.

2. Install Hadoop :-

Hadoop runs on Java, which we have just installed. Now we can install Hadoop with:-

rpm -ivh  hadoop-1.2.1-1.x86_64.rpm  --force

Now we have to check that this Hadoop version 1 is compatible with our Java installation. We can verify by:-

hadoop version

Here we did not get any error, so our Hadoop is working fine with Java.

Step-B Configure the NameNode :-

There are a few steps to configure the NameNode:-

  1. Install JDK and Hadoop
  2. Create a directory in the / drive of Red Hat
  3. Configure the Hadoop files
  4. Format the NameNode directory
  5. Start the NameNode service

1. Install JDK and Hadoop :- Installation of the JDK and Hadoop is already done in the step above.

2. Create a directory in the / drive of Red Hat :- Here I create a directory /nn on the NameNode

mkdir /nn

3. Configure the Hadoop files :- The Hadoop configuration files are located in /etc/hadoop. In /etc/hadoop we find many files, but we have to configure only two of them: hdfs-site.xml and core-site.xml (a sketch of both files follows the notes below).

hdfs-site.xml


core-site.xml  

  • AWS gives two IPs to an instance: a public IP and a private IP.
  • The public IP is reachable from outside, while the private IP is reachable only from inside the network.
  • So if I bind the NameNode to the private IP, clients from outside cannot connect to it. But if I give the public IP, Hadoop does not work, because the instance itself only knows its private IP.

That's why I use 0.0.0.0 here: now anybody can connect to the NameNode using either the public or the private IP.
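
For reference, here is a minimal sketch of what these two files can contain on the NameNode with this setup. The property names are the Hadoop 1.x ones (dfs.name.dir, fs.default.name); the /nn directory and port 9001 come from this article, while everything else is illustrative:

# Sketch: write a minimal hdfs-site.xml pointing the NameNode metadata at /nn
cat > /etc/hadoop/hdfs-site.xml <<'EOF'
<configuration>
  <property>
    <name>dfs.name.dir</name>
    <value>/nn</value>
  </property>
</configuration>
EOF

# Sketch: write a minimal core-site.xml binding HDFS on all interfaces at port 9001
cat > /etc/hadoop/core-site.xml <<'EOF'
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://0.0.0.0:9001</value>
  </property>
</configuration>
EOF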

4. Format the NameNode directory :-

As we know, to store data permanently we first have to format the storage. So we format the NameNode directory by:-

hadoop namenode  -format

5. Start namenode service

Now we start the NameNode service so that DataNodes can connect to our NameNode on port 9001.

hadoop-daemon.sh start namenode

We can verify with the jps command that the NameNode service is now running.

Step-C Configure the DataNode

There are a few steps to configure the DataNode:-

  1. Install JDK and Hadoop software
  2. Create a directory in the / drive of Red Hat
  3. Configure the Hadoop files
  4. Start the DataNode service

1. Install JDK and Hadoop software :- Installation of the JDK and Hadoop is already covered in Step-A above; the installation process is the same on the NameNode and the DataNode.

2. Create a directory in the / drive of Red Hat :- Here I create a directory /dn on the DataNode

mkdir /dn

3. Configure the Hadoop files :- The Hadoop configuration files are located in /etc/hadoop. Again we have to configure only two files: hdfs-site.xml and core-site.xml.

hdfs-site.xml


core-site.xml 


In the DataNode's core-site.xml, we add the NameNode's public IP given by AWS; a sketch of both files follows.
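
A minimal sketch of the two files on the DataNode, assuming the /dn directory created above and the same Hadoop 1.x property names; <namenode-public-ip> is a placeholder for the NameNode's public IP:

# Sketch: point the DataNode's block storage at /dn
cat > /etc/hadoop/hdfs-site.xml <<'EOF'
<configuration>
  <property>
    <name>dfs.data.dir</name>
    <value>/dn</value>
  </property>
</configuration>
EOF

# Sketch: tell the DataNode where the NameNode is (replace the placeholder with the real public IP)
cat > /etc/hadoop/core-site.xml <<'EOF'
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://<namenode-public-ip>:9001</value>
  </property>
</configuration>
EOF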

4. Start datanode service

To start the DataNode service, we use:-

hadoop-daemon.sh start datanode

We can verify with the jps command that the DataNode service is now running.

Note :- If the DataNode does not connect to the NameNode, it may be an AWS security group issue. In that case, allow the DataNode in the NameNode's security group by adding a rule (see the sketch below).
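
If you prefer the AWS CLI over the console, a hedged sketch of adding such a rule could look like the following; the security group ID and the DataNode IP are placeholders, and the port is the 9001 used in this article:

# Sketch: open TCP port 9001 in the NameNode's security group for the DataNode's IP
# (sg-0123456789abcdef0 and 203.0.113.10 are placeholder values)
aws ec2 authorize-security-group-ingress --group-id sg-0123456789abcdef0 --protocol tcp --port 9001 --cidr 203.0.113.10/32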

From the NameNode we can also see how many DataNodes are connected and how much storage they share with the NameNode.

hadoop dfsadmin -report

Here we see that our DataNode shares its entire storage with the NameNode. In the next step, we will see how to contribute only limited storage to the NameNode using LVM.

Step-D Integration of Hadoop and LVM

It is a good practice to add an external storage device to store critical data, so that if our system gets corrupted, the data is not lost.

Here I attach extra volumes using the AWS EBS service. To keep things simple, I divide this process into steps so you can easily follow along.

Step-1 Attach two extra volumes to the DataNode instance

First create two EBS volumes:- Volume-1 (4 GB) and Volume-2 (8 GB)

Now attach both volumes to our DataNode instance (the extra volumes and the DataNode instance must be in the same availability zone, e.g. 1a). If you prefer the CLI, a sketch follows.
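
A hedged sketch of the same thing with the AWS CLI; the availability zone, volume IDs, and instance ID are placeholders, and the sizes match the 4 GB and 8 GB volumes used here:

# Sketch: create the two EBS volumes in the DataNode's availability zone (placeholder: us-east-1a)
aws ec2 create-volume --availability-zone us-east-1a --size 4 --volume-type gp2
aws ec2 create-volume --availability-zone us-east-1a --size 8 --volume-type gp2

# Sketch: attach each volume to the DataNode instance (vol-... and i-... are placeholders)
# Inside the instance these show up as /dev/xvdf and /dev/xvdg, which are used with pvcreate below
aws ec2 attach-volume --volume-id vol-0123456789abcdef0 --instance-id i-0123456789abcdef0 --device /dev/sdf
aws ec2 attach-volume --volume-id vol-0fedcba9876543210 --instance-id i-0123456789abcdef0 --device /dev/sdg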



Step-2 Now use the LVM concept to partition the extra volumes

By using LVM, we can combine both extra volumes and use them as one pool of storage.

  • In the LVM concept, we first create PVs from the extra volumes and then create a VG. From the VG, we carve out storage as LVs.
  • The size of an LV depends on our requirement. The maximum size of an LV equals the size of its VG, and we can grow the VG by attaching more PVs, so the storage can keep growing as needed.
  • Now let's see how we can implement this setup.

In LVM, there are a few steps to get usable storage:-

a . Create PV

b. Create VG

c. Create LV

d. Format LV

e. Mount LV to a folder

We can see all the volumes attached to the instance with "fdisk -l"


Here we can see both of our volumes, 4 GB and 8 GB.

Step-a Create PV

To create a PV, first install the LVM software using (a plain "yum install lvm2" also works):-

yum install lvm2-8:2.03.09-5.el8.x86_64


Now we create PVs for both volumes (4 GB and 8 GB):-

pvcreate /dev/xvdf    # for the 8 GB volume
pvcreate /dev/xvdg    # for the 4 GB volume

We can verify with the pvdisplay command.

Step-b Create VG

vgcreate iiec /dev/xvdf /dev/xvdg

We can check the iiec VG with the "vgdisplay iiec" command.

Step-c Create LV

First I want to share only 5 GiB with the NameNode, so I create an LV of 5 GiB.

lvcreate --size 5GiB --name MyLV iiec

We can see the created LV (MyLV) by using:-

lvdisplay iiec/MyLV

Step-d Format LV

mkfs.ext4 /dev/iiec/MyLV

Step-e Mount LV to a folder

First create a directory/folder by using

mkdir /root/limited

Then mount the LV to the folder by

mount /dev/iiec/MyLV   /root/limited

We can see all mounted filesystems with the 'df -h' command.

Step-3 Replace the directory in the Hadoop configuration with our mounted directory

It is good practice to stop the DataNode service (hadoop-daemon.sh stop datanode) before changing the Hadoop configuration files.

Now go to the /etc/hadoop location and edit the two files again; a sketch of the change follows.

hdfs-site.xml 


core-site.xml  

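
A minimal sketch of the change, assuming the same Hadoop 1.x property names as before: only dfs.data.dir in hdfs-site.xml has to point at the mounted folder, while core-site.xml keeps the NameNode address from Step-C.

# Sketch: the DataNode now stores its HDFS blocks on the LVM-backed mount
cat > /etc/hadoop/hdfs-site.xml <<'EOF'
<configuration>
  <property>
    <name>dfs.data.dir</name>
    <value>/root/limited</value>
  </property>
</configuration>
EOF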

Then start the DataNode service again:-

hadoop-daemon.sh start datanode

Now from the NameNode we can see how much storage we received from the DataNode (again with hadoop dfsadmin -report).


In this way, our slave can contribute limited storage to the master node.


Step-E How to increase/decrease the contributed storage of a DataNode

1. Increase LVM storage

  • Let's see how we can increase the contributed storage of the DataNode without losing our older data.
  • We shared 5 GiB with the NameNode; now I want to share 10 GiB.
  • For this we have to increase the size of the LV (MyLV).

We can increase the LV size using the command below:-

lvextend --size +5GiB  /dev/iiec/MyLV


We can check the new LV size with:-

lvdisplay  iiec/MyLV

  • We successfully increased the size of the LV (MyLV). But when we check with the "df -h" command, we still see only the previous size of the LV.
  • That is because, at this moment, only part of the LV is formatted; the newly added part is not formatted yet, so it cannot hold data.
  • If we reformat the whole LV, all of our older data will be lost. So we format only the unformatted part by using:-

resize2fs  /dev/iiec/MyLV

This command extends the filesystem over the newly added part of the LV only, so we do not lose our older data.

Now we can also see from the NameNode that our storage increased on the fly, without stopping the DataNode or NameNode services.

2. Decrease LVM Storage/Shrink Volume of Datanode :-

In this step, you will learn how we can reduce the volume that the DataNode shares with the NameNode.

To reduce LVM storage, we have to follow a few steps so that our previous data stays safe.

  1. Stop datanode service
  2. Unmount the Logical volume
  3. Clean/Scan
  4. Resize the LV
  5. Reduce the LV storage
  6. Mount back to the directory
  7. Start datanode Service

1. Stop datanode service

hadoop-daemon.sh stop datanode

2. Unmount the Logical volume

Before unmounting the directory, I store some files in it to verify afterwards that our data stays safe. I create a file lw.html (see the sketch below); some folders were already present inside the LVM storage.
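
A quick sketch of creating and checking such a test file; lw.html is the file name used in this article, and its contents are arbitrary:

# Sketch: drop a small test file into the mounted LVM directory, then list what is stored there
echo "test data for the shrink experiment" > /root/limited/lw.html
ls -l /root/limited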


Before unmounting, close all open files in that directory (limited). Now we unmount the limited directory by:-

umount /root/limited

3. Scan/Clean

When we store and read files, garbage values may also creep in, so first we check and clean the filesystem. For this we run:-

e2fsck -f /dev/iiec/MyLV

4. Resize the LV

This is the most important step. Our data is not necessarily stored contiguously, so if we reduce the LV directly, we might lose some data. So first we resize the filesystem, which means recreating the inode table so that everything fits within 6 GB.

resize2fs  /dev/iiec/MyLV 6G

Now all of our data fits inside 6 GB. The remaining space (10 GB - 6 GB = 4 GB) is empty, so if we remove that space our data stays safe; reducing below 6 GB would lose or corrupt it.

5. Reduce the LV storage

Here I shrink the whole volume to 6 GB so that our older data stays safe.

lvreduce  -L 6G /dev/iiec/MyLV
or
lvreduce  --size -4G /dev/iiec/MyLV

Both commands give the same result: they reduce the LV size from 10 GB to 6 GB.


We can see the new LV size with "lvdisplay iiec/MyLV".

6. Mount back to the directory

Now mount the LV back onto the directory /root/limited:-

mount /dev/iiec/MyLV   /root/limited

7. Start datanode service

hadoop-daemon.sh start datanode

Now we can verify from the NameNode.


We can see that our data is still safe.


Hence, in this way, a company can increase and decrease DataNode storage without losing older data.

Thanks..
