Hadoop Cluster Availability
What is Hadoop?
Hadoop is an open-source software framework for storing data and running applications on clusters of commodity hardware. Hadoop is written in Java and developed by the Apache Software Foundation.
Hadoop Cluster
The figure above shows a simple Hadoop cluster. Let's start with the client: the client is anyone who uploads, deletes, reads, or writes files on the Hadoop cluster.
NameNode: It is also known as the master node and is the main component of a Hadoop cluster. All DataNodes connect to the single master node.
DataNode: It is also known as a slave node. Each slave node shares its own storage with the cluster through the master node.
Availability means the system and its data remain accessible in the wake of a component failure in the system.
My team members:
- Hritik Kumar (Me - Team Leader)
- Ekansh (Learner success head)
- Satyansh Srivastava (Team member)
- Pravat Kumar Nath Sharma (Team member)
- Vijay (Team member)
We used the AWS EC2 service to launch 6 instances: 1 for the master node, 1 for the client, and 4 for slave nodes. All of these instances run Red Hat Enterprise Linux 8. We installed Hadoop 1.2.1 and JDK 8u171 on all 6 instances and used PuTTY to connect to them.
Once the installation was complete, we configured the hdfs-site.xml and core-site.xml files on every instance except the client; on the client we only configured core-site.xml. (All these files are under the /etc/hadoop folder.)
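Below is a rough sketch of the installation steps on one instance; the exact RPM file names are assumptions based on the versions we used:
rpm -ivh jdk-8u171-linux-x64.rpm       # install the JDK (file name is an assumption)
rpm -ivh hadoop-1.2.1-1.x86_64.rpm     # install Hadoop 1.2.1 from the Apache RPM
java -version                          # verify Java is installed
hadoop version                         # verify Hadoop is installed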
hdfs-site.xml configuration for the NameNode:
<configuration>
    <property>
        <name>dfs.name.dir</name>
        <value>/nn</value>
    </property>
</configuration>
core-site.xml configuration for the NameNode:
<configuration>
    <property>
        <name>fs.default.name</name>
        <value>hdfs://0.0.0.0:9001</value>
    </property>
</configuration>
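With these two files in place on the master node, the NameNode can be formatted and started. A minimal sketch of the commands (the /nn directory matches dfs.name.dir above):
hadoop namenode -format          # creates the metadata store under /nn
hadoop-daemon.sh start namenode  # starts the NameNode service
jps                              # should now list a NameNode process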
hdfs-site.xml configuration for all the DataNodes:
<configuration>
    <property>
        <name>dfs.data.dir</name>
        <value>/dn</value>
    </property>
</configuration>
core-site.xml configuration for all the DataNodes and client:
<configuration>
    <property>
        <name>fs.default.name</name>
        <value>hdfs://54.145.79.242:9001</value>
    </property>
</configuration>
Here 54.145.79.242 was the public IP of our NameNode at the time of testing.
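Once core-site.xml on every slave points at the NameNode, each DataNode is started the same way. A sketch of bringing up a slave and confirming from the master that all four have joined:
hadoop-daemon.sh start datanode   # run on each slave node
hadoop dfsadmin -report           # run on the master; should report 4 live DataNodes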
For testing, we uploaded a 140 MB text file from the client. This file was stored in 3 blocks because the default block size in Hadoop version 1 is 64 MB; since 140 MB is more than two full blocks (2 × 64 MB = 128 MB), the file needs three blocks.
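A sketch of the upload and of checking the block layout afterwards; the file name testfile.txt is just an example:
hadoop fs -put testfile.txt /testfile.txt             # run on the client
hadoop fsck /testfile.txt -files -blocks -locations   # run on the master; lists the 3 blocks and the DataNodes holding them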
The first tool we used for this task is tcpdump.
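A sketch of the kind of filter we can run on the client to see which DataNode it is actually talking to; port 50010 is the default DataNode data-transfer port in Hadoop 1.x, and eth0 is an assumption about the instance's network interface:
tcpdump -i eth0 -n tcp port 50010   # shows the client's connections to DataNodes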
We observed that when we stopped one DataNode instance, another instance connected and kept serving the data. When we stopped a second instance, a third one connected and continued serving the client.
In this practical we saw seamless availability because of replication: replicas of each block are stored on different DataNodes, and the master node handles DataNode failures.
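As a simple availability check, the file can still be read from the client even after a DataNode holding one of its replicas is stopped; a sketch, reusing the example file name from above:
hadoop fs -cat /testfile.txt | wc -c   # still prints the full byte count of the 140 MB file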
I would like to thank all my team members for actively participating and completing this task.