Integration of LVM with a Hadoop Cluster to Contribute Limited DataNode Storage on AWS
Deepak Sharma
3 x RedHat Certified Engineer (EX200, EX294, EX180) || DevOps Engineer || Docker || K8s || Ansible || Linux || Git || Github || Gitlab || Terraform || Jenkins || Cloud Computing || AWS
In this article, we learn how to contribute limited DataNode storage to the NameNode using the LVM partitioning concept.
First, let's understand the purpose of this task.
In a Hadoop cluster, slaves/DataNodes normally contribute their full storage to the master node. But suppose a slave wants to contribute only a limited amount of storage; it can do that too. So let's see how a slave can contribute limited storage to the master node.
In this article, I will also cover:
- How to create a Hadoop cluster on AWS
- How to configure the NameNode/Master on AWS
- How to configure the DataNode/Slave on AWS
- How to integrate LVM with Hadoop
- How to increase the contributed storage of a DataNode
- How to decrease the contributed storage of a DataNode
Before starting, let's first go through a short introduction. In this section, you will learn some definitions that will be used in this article.
Introduction :-
- Hadoop NameNode :- The NameNode is the centralized component of an HDFS file system. It keeps the directory tree of all files in the file system and tracks where across the cluster the file data is kept. In short, it holds the metadata about the DataNodes.
- DataNode :- The DataNode is responsible for storing the actual data in HDFS. NameNode and DataNode are in constant communication. When a DataNode starts up, it announces itself to the NameNode along with the list of blocks it is responsible for.
- LVM :- Logical Volume Management enables combining multiple individual hard drives or disk partitions into a single volume group (VG). That volume group can then be subdivided into logical volumes (LVs) or used as a single large volume.
- LVM provides elasticity to storage devices and is an advanced alternative to static partitioning.
Let's Get Started :-
In this task, I use two AWS instances: one for the master/NameNode and the other for the DataNode. To build the Hadoop cluster, we need to install the JDK and Hadoop on top of both the NameNode and the DataNode.
Step-A Install JDK and Hadoop on Namenode
Here I use RedHat 8.0 as the base OS. As we know, Hadoop runs on Java, and our instance is running on AWS, so I used a RedHat 8.0 AMI to launch the instance. First we transfer the JDK and Hadoop (version 1) software to the AWS instance using the WinSCP software.
Hadoop (version 1) is not compatible with every Java version, so here I provide links to the JDK and Hadoop (version 1) that I used.
Get JDK software | Get Hadoop software | Get WinSCP software
1. Install JDK :-
After copying the JDK and Hadoop software to the AWS instance (NameNode), first go to the location where the software was copied. In my case, the location is /home/ec2-user. Then install the JDK on the NameNode with the command below:
rpm -ivh jdk-8u171-linux-x64.rpm
We can verify the installation with the jps command.
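For reference, two quick checks that the JDK landed correctly (the exact version string will differ with your build):
java -version   # prints the installed Java version
jps             # lists running JVM processes; we will use it again later to verify the Hadoop daemons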
2. Install Hadoop :-
Hadoop runs on Java, which we have just installed. Now we can install Hadoop with:
rpm -ivh hadoop-1.2.1-1.x86_64.rpm --force
Now we have to check that Hadoop version 1 is compatible with our Java version. We can verify with:
hadoop version
If we do not get any error here, our Hadoop is working fine with Java.
Step-B Configure Namenode :-
There are a few steps to configure the NameNode:-
- Install JDK and Hadoop
- Create a directory in the / drive of RedHat
- Configure the Hadoop files
- Format the NameNode directory
- Start the NameNode service
1. Install JDK and Hadoop :- Installation of the JDK and Hadoop is already done in Step-A above.
2. Create a directory in the / drive of RedHat :- Here I create a directory /nn on the NameNode
mkdir /nn
3. Configure the Hadoop files :- The Hadoop configuration files are located in /etc/hadoop. In /etc/hadoop we find many files, but we have to configure only two of them: hdfs-site.xml and core-site.xml
hdfs-site.xml
core-site.xml
- AWS gives two IPs to an instance: a public one and a private one.
- The public IP is reachable from outside AWS, while inside the instance Hadoop only knows about the private IP.
- So if I bind the NameNode to the private IP, nobody from outside can connect to it. But if I give the public IP, Hadoop does not work, because the instance itself does not own the public IP.
That's why I use 0.0.0.0 here. Now anybody can connect to the NameNode using either the public or the private IP.
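The screenshots show the exact files; as a reference only, here is a minimal sketch of what the two NameNode files typically look like for this setup. It assumes the Hadoop 1.x property names dfs.name.dir and fs.default.name, the /nn directory created in step 2, and port 9001 used later in this article; adapt the values to your own cluster.
cat > /etc/hadoop/hdfs-site.xml <<'EOF'
<?xml version="1.0"?>
<configuration>
  <!-- directory created in step 2 where the NameNode keeps its metadata -->
  <property>
    <name>dfs.name.dir</name>
    <value>/nn</value>
  </property>
</configuration>
EOF
cat > /etc/hadoop/core-site.xml <<'EOF'
<?xml version="1.0"?>
<configuration>
  <!-- bind to 0.0.0.0 so the NameNode is reachable on both the public and the private IP -->
  <property>
    <name>fs.default.name</name>
    <value>hdfs://0.0.0.0:9001</value>
  </property>
</configuration>
EOF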
4. Format namenode directory:-
Just like a fresh storage device, the NameNode directory has to be formatted before it can hold data permanently. So we format the NameNode directory with:
hadoop namenode -format
5. Start namenode service
Now we start the NameNode service so that DataNodes can connect to our NameNode on port 9001.
hadoop-daemon.sh start namenode
We can verify with the jps command. In the screenshot above, we can see that our NameNode service is running.
Step-C Configure datanode
There are a few steps to configure the DataNode:-
- Install JDK and Hadoop software
- Create a directory in the / drive of RedHat
- Configure the Hadoop files
- Start the DataNode service
1. Install JDK and Hadoop software :- Installation of the JDK and Hadoop is already done in Step-A above. The installation process is the same on the NameNode and the DataNode.
2. Create a directory in the / drive of RedHat :- Here I create a directory /dn on the DataNode
mkdir /dn
3. Configure the Hadoop files :- The Hadoop configuration files are located in /etc/hadoop. In /etc/hadoop we find many files, but we have to configure only two of them: hdfs-site.xml and core-site.xml
hdfs-site.xml
core-site.xml
On the DataNode, in core-site.xml we add the public IP of the NameNode given by AWS.
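Again as a reference only, a minimal sketch of the two DataNode files, assuming the Hadoop 1.x property names dfs.data.dir and fs.default.name; 1.2.3.4 is a placeholder for your NameNode's public IP.
cat > /etc/hadoop/hdfs-site.xml <<'EOF'
<?xml version="1.0"?>
<configuration>
  <!-- directory created in step 2 where this DataNode stores HDFS blocks -->
  <property>
    <name>dfs.data.dir</name>
    <value>/dn</value>
  </property>
</configuration>
EOF
cat > /etc/hadoop/core-site.xml <<'EOF'
<?xml version="1.0"?>
<configuration>
  <!-- replace 1.2.3.4 with the public IP of the NameNode instance -->
  <property>
    <name>fs.default.name</name>
    <value>hdfs://1.2.3.4:9001</value>
  </property>
</configuration>
EOF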
4. Start datanode service
To start the DataNode service, we use
hadoop-daemon.sh start datanode
We can verify with the jps command. In the screenshot above, we can see that our DataNode service is running.
Note :- If the DataNode does not connect to the NameNode, it may be an AWS security group issue. In that case, allow the DataNode in the NameNode's security group by adding a rule.
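If you prefer the AWS CLI over the console, a hedged sketch of adding such a rule; the security group ID and the DataNode IP are placeholders, and port 9001 is the NameNode port used in this article:
aws ec2 authorize-security-group-ingress \
    --group-id sg-0123456789abcdef0 \
    --protocol tcp --port 9001 \
    --cidr 203.0.113.10/32    # the DataNode's IP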
From the NameNode, we can also see how many DataNodes are connected and how much storage they share with the NameNode:
hadoop dfsadmin -report
Here we see that our DataNode shares its entire storage with the NameNode. In the next step, we will see how we can contribute only limited storage to the NameNode using LVM.
Step-D Integration of Hadoop and LVM
It is good practice to add an external storage device to store our critical data, so that if the system gets corrupted, the data is not lost.
Here I attach extra volumes from the AWS EBS service. To keep things simple, I divide this process into steps so you can easily follow along.
Step-1 Attach two extra volumes to the DataNode instance
First create two volumes with EBS: Volume-1 (4 GB) and Volume-2 (8 GB)
Now attach both volumes to our DataNode instance (both the extra volumes and the DataNode instance must be in the same availability zone, e.g. 1a)
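The AWS console is enough for this step; for reference, a hedged AWS CLI sketch of the same thing (availability zone, volume ID and instance ID are placeholders):
# create an 8 GB gp2 volume in the same availability zone as the DataNode instance
aws ec2 create-volume --size 8 --volume-type gp2 --availability-zone ap-south-1a
# attach it to the DataNode instance; /dev/sdf shows up inside the instance as /dev/xvdf
aws ec2 attach-volume --volume-id vol-0123456789abcdef0 \
    --instance-id i-0123456789abcdef0 --device /dev/sdf
# repeat both commands with --size 4 for the second volume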
Step-2 Now use the LVM concept to partition the extra volumes
By using the LVM concept, we can combine both extra volumes and use them as one pool of storage.
- In LVM, we first create a PV from each extra volume and then create a VG from them. From the VG, we access the storage through an LV.
- The size of the LV depends on our requirement. The maximum size of an LV equals the size of the VG, and we can grow the VG by attaching more PVs, so the pool can keep growing as needed.
- Now let's see how we can implement this setup.
In LVM, there are a few steps to get the storage:-
a. Create PV
b. Create VG
c. Create LV
d. Format the LV
e. Mount the LV to a folder
We can see all the volumes attached to the instance with "fdisk -l"
Here we see both our volumes, 4 GB and 8 GB.
Step-a Create PV
To create a PV, first install the LVM software using
yum install lvm2
Now we create a PV on each of the two volumes (8 GB and 4 GB):
pvcreate /dev/xvdf   # for the 8 GB volume
pvcreate /dev/xvdg   # for the 4 GB volume
We can verify by pvdisplay command.
Step-b Create VG
vgcreate iiec /dev/xvdf /dev/xvdg
We can see iiec vg by "vgdisplay iiec" command (in above screenshot).
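As noted above, the VG can be grown later by adding more PVs. A minimal sketch of that, assuming a new disk /dev/xvdh were attached to the instance:
pvcreate /dev/xvdh        # turn the new disk into a PV
vgextend iiec /dev/xvdh   # add it to the iiec volume group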
Step-c Create LV
First, I want to share only 5 GiB with the NameNode, so I create an LV of 5 GiB.
lvcreate --size 5GiB --name MyLV iiec
We can see our created lv (MyLV) by using
lvdisplay iiec/MyLV
Step-d Format LV
mkfs.ext4 /dev/iiec/MyLV
Step-e Mount LV to a folder
First create a directory/folder by using
mkdir /root/limited
Then mount the LV to the folder by
mount /dev/iiec/MyLV /root/limited
We can see all the mount points with the 'df -h' command
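One caveat: a plain mount does not survive a reboot. A hedged sketch of making it persistent via /etc/fstab (the ext4 type matches the mkfs above; the defaults options are an assumption):
echo '/dev/iiec/MyLV  /root/limited  ext4  defaults  0 0' >> /etc/fstab
mount -a    # re-reads fstab and confirms the entry mounts cleanly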
Step-3 Replace the directory in the configuration file with our mounted directory
It is good practice to stop the DataNode service before changing the Hadoop configuration files. We can stop the DataNode service with hadoop-daemon.sh stop datanode.
Now go to the /etc/hadoop location and update the two files again:
hdfs-site.xml
core-site.xml
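Only hdfs-site.xml changes here; a minimal sketch, assuming the same dfs.data.dir property as before (core-site.xml keeps pointing at the NameNode):
cat > /etc/hadoop/hdfs-site.xml <<'EOF'
<?xml version="1.0"?>
<configuration>
  <!-- point the DataNode at the LVM-backed mount instead of /dn -->
  <property>
    <name>dfs.data.dir</name>
    <value>/root/limited</value>
  </property>
</configuration>
EOF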
Start datanode service
hadoop-daemon.sh start datanode
Now, from the NameNode, we can see how much storage we got from the DataNode (again with hadoop dfsadmin -report).
In this way, our slave can contribute limited storage to the master node.
Step-E How we can increase / decrease the contributed storage of the DataNode
1. Increase LVM storage
- Let's see how we can increase the contributed storage of the DataNode without losing our older data.
- We shared 5 GiB with the NameNode. Now I want to share 10 GiB.
- For this, we have to increase the size of the LV (MyLV).
We can increase the LV size with the command below:
lvextend --size +5GiB /dev/iiec/MyLV
We can also see our LV size by
lvdisplay iiec/MyLV
- We have successfully increased the size of the LV (MyLV). But when we check with "df -h", we still see only the previous size of the LV.
- That is because at this point only the original part of the LV carries a filesystem; the newly added space is not formatted yet, and we can only store data on formatted space.
- But if we re-format the whole LV, all of our older data will be lost. So we extend the filesystem over only the unformatted part by using:
resize2fs /dev/iiec/MyLV
This command grows the filesystem over the remaining part of the LV, and we do not lose our older data.
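As a convenience, lvextend can also grow the filesystem in the same step with its -r (--resizefs) option; a hedged one-liner equivalent of the two commands above:
lvextend -r --size +5GiB /dev/iiec/MyLV   # extends the LV and resizes the ext4 filesystem in one go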
Now we can also see from the NameNode that our storage increased on the fly, without stopping the DataNode or NameNode services.
2. Decrease LVM Storage/Shrink Volume of Datanode :-
In this step, you will learn how we can reduce the volume that the DataNode shares with the NameNode.
To reduce LVM storage, we have to follow a few steps so that our previous data stays safe.
- Stop datanode service
- Unmount the Logical volume
- Scan/clean the filesystem
- Resize (shrink) the filesystem
- Reduce the LV storage
- Mount back to the directory
- Start datanode Service
1. Stop datanode service
hadoop-daemon.sh stop datanode
2. Unmount the Logical volume
Before unmounting the directory, I store some files in it to show that our data stays safe: I create a file lw.html, and some folders were already created inside the LVM storage.
Before running umount, close all open files of that directory (limited). Now we unmount the limited directory with:
umount /root/limited
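If the unmount fails with a "target is busy" error, a hedged way to find which processes still hold files open under the mount (fuser comes from the psmisc package):
fuser -vm /root/limited   # lists processes using the mount; close them, then retry umount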
3. Scan/Clean
While storing and reading files, the filesystem may accumulate inconsistencies (garbage). So before shrinking, we first check and clean the filesystem. For this we run
e2fsck -f /dev/iiec/MyLV
4. Resize (shrink) the filesystem
This is the most important step. Our data is not stored contiguously on the volume, so if we reduce the LV directly, we might lose some of it. So first we resize (shrink) the filesystem, which packs the data and its inode table into the first 6G:
resize2fs /dev/iiec/MyLV 6G
Now all our data sits within the first 6 GB. So as long as we only remove the remaining space (10 GB - 6 GB = 4 GB), our data is safe; reducing below 6 GB would lose or corrupt it. In other words, the last 4 GB is now empty, and removing that space does not touch our data.
5. Reduce the LV storage
Here I shrink the volume down to 6 GB of storage, so that our older data stays safe.
lvreduce -L 6G /dev/iiec/MyLV        # set the LV size to 6G
lvreduce --size -4G /dev/iiec/MyLV   # or: remove 4G from the current size
Either command gives the same result: it reduces the LV size from 10 G to 6 G.
We can see the new LV size with "lvdisplay iiec/MyLV"
6. Mount back to the directory
Now again mount the LV storage to the directory /root/limited
mount /dev/iiec/MyLV /root/limited
7. Start datanode service
hadoop-daemon.sh start datanode
Now we can verify from the NameNode.
We can see that our data is also safe
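A quick way to confirm this on the DataNode itself (lw.html is the file created before the shrink):
ls -l /root/limited   # lw.html and the earlier folders should still be listed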
Hence, in this way, a company can increase or decrease the DataNode's contributed storage without losing the older data.
Thanks..