Leveraging The Power of Logical Volume Manager (LVM) In Hadoop-DFS Cluster
Chetan Vyas
MLOps | DevOps | Hybrid MultiCloud | Ansible | Flutter | Red Hat Linux | OpenStack
In this article, we are going to practically integrate Logical Volume Manager (LVM) concepts with the Data-Nodes of a Hadoop Distributed File System (HDFS) cluster to provide elasticity to Data-Node storage, so that we can dynamically scale the storage capacity of our Data-Nodes and, with it, the storage capacity of the cluster.
Hadoop Distributed File System (HDFS)
The Hadoop Distributed File System (HDFS) is the primary data storage system used by Hadoop applications. HDFS employs a NameNode and DataNode architecture to implement a distributed file system that provides high-performance access to data across highly scalable Hadoop clusters.
Hadoop itself is an open-source distributed processing framework that manages data processing and storage for big data applications. HDFS is a key part of the Hadoop ecosystem of technologies. It provides a reliable means of managing pools of big data and supporting related big data analytics applications.
HDFS architecture, NameNodes and DataNodes
HDFS uses a primary/secondary architecture. The NameNode is the primary server and the central component of the cluster: it manages the file system namespace, controls client access to files and enforces the right access permissions. The system's DataNodes manage the storage that is attached to the nodes they run on.
HDFS exposes a file system namespace and enables user data to be stored in files. A file is split into one or more blocks, which are stored in a set of DataNodes. The NameNode performs file system namespace operations, including opening, closing and renaming files and directories, and also governs the mapping of blocks to DataNodes. The DataNodes serve read and write requests from the clients of the file system. In addition, they perform block creation, deletion and replication when the NameNode instructs them to do so.
HDFS supports a traditional hierarchical file organization. An application or user can create directories and then store files inside these directories. The file system namespace hierarchy is like most other file systems -- a user can create, remove, rename or move files from one directory to another.
The NameNode records any change to the file system namespace or its properties. An application can stipulate the number of replicas of a file that the HDFS should maintain. The NameNode stores the number of copies of a file, called the replication factor of that file.
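For example, the replication factor of an existing file can be changed from the command line (a small sketch; the path /data/sample.txt is hypothetical):
hdfs dfs -setrep -w 3 /data/sample.txt   # set this file's replication factor to 3 and wait for re-replication to finish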
LOGICAL VOLUME MANAGER (LVM)
LVM is a tool for logical volume management which includes allocating disks, striping, mirroring and resizing logical volumes.
With LVM, a hard drive or set of hard drives is allocated to one or more physical volumes. LVM physical volumes can also be placed on other block devices, such as a software RAID device, which might span two or more disks.
The physical volumes are combined into volume groups, with the exception of the /boot partition. The /boot partition cannot be on a logical volume because the boot loader cannot read it. If the root (/) partition is on a logical volume, create a separate /boot partition which is not a part of a volume group.
Since a physical volume cannot span over multiple drives, to span over more than one drive, create one or more physical volumes per drive.
A volume group can be divided into logical volumes, which are assigned mount points, such as /home and /, and file system types, such as ext2 or ext3. When these "partitions" reach their full capacity, free space from the volume group can be added to the logical volume to increase the size of the partition. When a new hard drive is added to the system, it can be added to the volume group, and the partitions that are logical volumes can be increased in size.
On the other hand, if a system is partitioned with the ext3 file system, the hard drive is divided into partitions of defined sizes. If a partition becomes full, it is not easy to expand the size of the partition. Even if the partition is moved to another hard drive, the original hard drive space has to be reallocated as a different partition or not used.
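Before we start, note that the current LVM layout can be inspected at any time with LVM's reporting commands; we will rely on these to verify each of the steps below:
pvs   # list physical volumes
vgs   # list volume groups and their free space
lvs   # list logical volumes and their sizes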
Integrating LVM (LOGICAL VOLUME) with Hadoop DFS to provide Elasticity to DataNode Storage
We can integrate LVM with Hadoop DFS by using a Logical Volume (LV) as the storage of the Data-Nodes of the HDFS cluster, so that in the future we can increase (extend) the storage capacity of the Data-Nodes by attaching more block devices and extending the Logical Volume. (We can also reduce the storage size of the Data-Nodes as per our need.)
Implementation:
Step 1: Attaching Block Device (Hard Drive) to the Data-Node
I have a 3 GB hard disk (/dev/sda) attached to the Data-Node system.
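We can confirm that the disk is visible to the operating system (the device name may differ on your system):
lsblk   # lists all block devices with their sizes
fdisk -l /dev/sda   # prints the details of this particular disk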
Step 2: Creating Physical Volume ( PV )
pvcreate /dev/sda
This creates a physical volume (PV) on the 3 GB disk (/dev/sda).
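We can verify the PV with:
pvdisplay /dev/sda   # shows the size and attributes of the new physical volume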
Step 3: Creating Volume Group ( VG )
Create the Volume Group (VG) using the previously created physical volume (PV), /dev/sda:
vgcreate HadoopVG /dev/sda
A Volume Group 'HadoopVG' of just under 3 GB is created (LVM reports it as <3 GiB because some space is reserved for metadata).
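We can verify the VG with:
vgdisplay HadoopVG   # shows the VG size and its free physical extents (PE)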
Step 4: Create Logical Volume ( LV )
Create a Logical Volume (LV) of the desired size using the previously created Volume Group (HadoopVG). I am going to use the complete size of our Volume Group:
lvcreate --name DatanodeLV -l 100%FREE HadoopVG
A Logical Volume 'DatanodeLV' of just under 3 GB is created.
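We can verify the LV with:
lvdisplay /dev/HadoopVG/DatanodeLV   # shows the LV path, size and status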
Step 5: Format LV and Mount to Data-Node Directory
To use this LV, we first have to format it and then mount it on the directory that we are going to use as the Data-Node directory.
Formatting the LV
mkfs.ext4 /dev/mapper/HadoopVG-DatanodeLV
Mounting LV
mkdir /lvdir
mount /dev/mapper/HadoopVG-DatanodeLV /lvdir
Our Logical Volume is now mounted on the /lvdir directory.
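We can confirm the mount with:
df -h /lvdir   # shows the backing device, its size and the available space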
Step 6: Configuring Data-Node for HDFS Cluster
Now configure the Data-Node to use /lvdir as the Data-Node directory (see the configuration sketch below) and start the DataNode daemon.
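A minimal sketch of the relevant hdfs-site.xml property is shown below; the property name assumes Hadoop 2.x/3.x (on Hadoop 1.x it is dfs.data.dir), and the rest of the cluster configuration, such as the NameNode address in core-site.xml, is assumed to be in place already:
<property>
    <name>dfs.datanode.data.dir</name>
    <value>/lvdir</value>
</property>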
Starting the Data-Node Daemon
hadoop-daemon.sh start datanode
Our Data-Node is now configured, and if we check our HDFS cluster we can see the Data-Node with just under 3 GB of storage capacity.
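We can verify this from any node in the cluster with the HDFS admin report:
hdfs dfsadmin -report   # shows the configured capacity of each Data-Node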
Extending the Storage Capacity of the Data-Node Dynamically While the Cluster Is Up
Because we use a Logical Volume in our Data-Node, we can increase its storage capacity dynamically, without unmounting anything and while the HDFS cluster is up, by attaching more block devices to the Data-Node and extending the VG and LV.
Step 1: Attach Another Hard Drive to the Data-Node
We could directly extend the Logical Volume (LV), but I don't have any more free space in my Volume Group (VG), so I first have to extend the VG. To extend the VG we need a Physical Volume, and for that I am going to attach one more 2 GB hard drive to the Data-Node.
Step 2: Create PV so we can extend our VG
To extend our Volume Group we need a Physical Volume, so I am going to create a PV from the recently attached 2 GB disk, /dev/sdb:
pvcreate /dev/sdb
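The new PV now shows up alongside the first one:
pvs   # /dev/sdb appears as a new ~2 GB PV, not yet assigned to any VG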
Step 3: Extend Volume Group (VG)
Extending our VG (HadoopVG) using the PV (/dev/sdb):
vgextend HadoopVG /dev/sdb
The VG is extended and now has just under 2 GB of free space.
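We can confirm the free space with:
vgdisplay HadoopVG   # 'Free PE / Size' should now show just under 2 GB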
Step 4: Extend Logical Volume (LV)
We extended our VG and it now contains just under 2 GB of free space (free physical extents), so now we can finally extend our Logical Volume (DatanodeLV) into that free space:
lvextend -l +100%FREE /dev/HadoopVG/DatanodeLV
Now we can see that our LV has been extended to roughly 5 GB (4.99 GB).
Step 5: Update FileSystem Using resize2fs
After extending the LV, we need to grow the ext4 file system to match the new LV size using resize2fs. resize2fs supports online growing, so this works while /lvdir stays mounted:
resize2fs /dev/HadoopVG/DatanodeLV
We can see that the volume backing our Data-Node directory (/lvdir) has been extended.
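We can confirm with df -h; as a side note, lvextend can also resize the file system in the same step via its -r (--resizefs) option:
df -h /lvdir   # now reports roughly 5 GB
lvextend -r -l +100%FREE /dev/HadoopVG/DatanodeLV   # alternative for future extensions: grow the LV and the file system in one command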
Checking our Hadoop-DFS Cluster Report
Now, finally, we can see that the storage capacity of our Data-Node has been extended to roughly 5 GB (4.85 GB).
And from now on, using LVM, we can scale the storage capacity of our Data-Nodes, and with it the storage capacity of the HDFS cluster, on demand.