Let’s research and let the world know about the Myths of Hadoop
Nishant Singh
Software Engineer@HCL Tech | Red Hat Certified System Administrator | AWS Certified Solution Architect-Associate | AWS Certified Developer Associate | AWS Cloud Practitioner Certified
A Hadoop cluster is a special type of computational cluster designed specifically for storing and analyzing huge amounts of unstructured data in a distributed computing environment. Such clusters run Hadoop's open source distributed processing software on low-cost commodity computers.
Task 4.1 :- Individual/Team task:
In a Hadoop cluster, find out how to contribute a limited/specific amount of storage as a slave node to the cluster.
Task 4.2 :- Team task:
According to popular articles, Hadoop uses the concept of parallelism to upload the split data while fulfilling the Velocity problem.
Research with your team and conclude whether this statement holds, with proper proof.
Solution:
In a Hadoop cluster, find out how to contribute a limited/specific amount of storage as a slave node to the cluster.
First of all, we have to set up the Hadoop cluster. Here I am using 1 NameNode and 1 DataNode.
Configure the NameNode:
We need the JDK and Hadoop software to set up the Hadoop cluster. I have already transferred both packages to the EC2 instance.
Now install the JDK and Hadoop software. I install the JDK first because it must be present before Hadoop is installed.
rpm -ivh jdk-8u171-linux-x64.rpm
java -version # check that Java is installed
rpm -ivh hadoop-1.2.1-1.x86_64.rpm --force
hadoop version # check that Hadoop is installed
Now we update the hdfs-site.xml and core-site.xml files in /etc/hadoop.
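For reference, here is a minimal sketch of the entries these two files typically carry on a Hadoop 1.x NameNode; the directory /nn matches the one created in the next step, while the port 9001 is an assumption, so substitute whatever port your setup uses.
hdfs-site.xml:
<configuration>
  <property>
    <name>dfs.name.dir</name>
    <value>/nn</value>
  </property>
</configuration>
core-site.xml:
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://0.0.0.0:9001</value>
  </property>
</configuration>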
Now we have to create one directory named /nn and format the NameNode:
mkdir /nn
hadoop namenode -format
Now start the NameNode by typing hadoop-daemon.sh start namenode, and see the report of DataNodes connected to the NameNode by typing hadoop dfsadmin -report.
hadoop-daemon.sh start namenode # start the NameNode daemon
hadoop dfsadmin -report # show information about the connected DataNodes
Configure the DataNode:
On the DataNode instance, I attach an additional 4GiB EBS volume, and I want to contribute this volume's storage to the NameNode.
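If you attach the volume from the AWS CLI rather than the console, the step looks roughly like this; the volume ID and instance ID below are placeholders for your own values:
aws ec2 attach-volume --volume-id vol-0123456789abcdef0 --instance-id i-0123456789abcdef0 --device /dev/xvdf # placeholder IDs; /dev/xvdf is the device name used below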
Now type the fdisk -l command to see the total hard disks attached to the instance.
fdisk -l
Now create a partition on this 4GiB volume. To create partitions on a block device, use the fdisk command; the interactive keystrokes are sketched after the command below.
fdisk /dev/xvdf # create a partition
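Inside fdisk, the usual sequence for one primary partition covering the whole disk is sketched below; press Enter to accept the defaults for the partition number and the first/last sectors:
n # new partition
p # primary partition type
  (press Enter to accept the default partition number, first sector, and last sector)
w # write the partition table and exit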
Now check the disks again by typing the fdisk -l command; the new partition /dev/xvdf1 should appear.
fdisk -l
Now format this partition:
mkfs.ext4 /dev/xvdf1
Now I create one directory named /dn at the / location, mount the 4GiB partition on this directory, and share it with the NameNode; the commands are sketched below.
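A sketch of those commands, assuming the partition /dev/xvdf1 created above:
mkdir /dn # directory in which the DataNode will store blocks
mount /dev/xvdf1 /dn # mount the 4GiB partition on it
df -h /dn # verify the mount point and its size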
Now come to the DataNode setup. For this I again need the JDK and Hadoop software, and I have already transferred both packages to this EC2 instance.
Now install the JDK and Hadoop software. I install the JDK first because it must be present before Hadoop is installed.
rpm -ivh jdk-8u171-linux-x64.rpm
java -version # check that Java is installed
rpm -ivh hadoop-1.2.1-1.x86_64.rpm --force
hadoop version # check that Hadoop is installed
Now we update the hdfs-site.xml and core-site.xml files in /etc/hadoop, pointing the DataNode's storage directory at the /dn mount created above so that only this 4GiB partition is shared with the NameNode; a sketch of the entries follows.
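A minimal sketch of the DataNode-side entries for Hadoop 1.x; NameNodeIP and the port 9001 below are placeholders for your own values:
hdfs-site.xml:
<configuration>
  <property>
    <name>dfs.data.dir</name>
    <value>/dn</value>
  </property>
</configuration>
core-site.xml:
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://NameNodeIP:9001</value>
  </property>
</configuration>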
Now start the DataNode by typing hadoop-daemon.sh start datanode, and on the NameNode see the report of connected DataNodes by typing hadoop dfsadmin -report.
hadoop-daemon.sh start datanode # start the DataNode daemon
hadoop dfsadmin -report # show the connected DataNodes and their capacity
In this way the task "Contribute a Limited Amount of Storage as a DataNode in a Hadoop Cluster" is successfully completed: because dfs.data.dir points to /dn, the mount point of the 4GiB partition, the DataNode contributes only that partition's capacity to the cluster, which the dfsadmin report confirms.
Solution:
According to popular articles, Hadoop uses the concept of parallelism to upload the split data while fulfilling the "Velocity" problem. I will prove here that this is a wrong statement/assumption.
For this I set up a Hadoop cluster with 1 NameNode, 4 DataNodes, and 1 client, configured everything, and made it ready for use.
Monitor port 50010, the default DataNode data-transfer port in Hadoop 1.x, for data packet transfer.
Running tcpdump -i eth0 port 50010 shows the incoming and outgoing traffic on port 50010.
tcpdump -i eth0 port 50010
Here I have the complete cluster running, with the tcpdump command running on every DataNode as well as the NameNode. I have already created a file a.txt (182MiB in size), and now I put this file into the Hadoop cluster through the client; the upload command is sketched below.
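On the client, the upload would look roughly like this; the target path / is an assumption about where the file was placed:
hadoop fs -put a.txt / # upload a.txt from the client into the cluster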
In the tcpdump captures, the first block of data is received by DataNode 2.
Looking further at the captures, you can see the order of the transfers: only after DataNode 2 has saved its block does the next block get stored on DataNode 1, and the block after that goes to DataNode 2 again. The blocks arrive one after another, not simultaneously.
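This sequence also matches the block arithmetic, assuming the default Hadoop 1.x block size of 64 MB was left unchanged:
182 MiB / 64 MiB per block ≈ 2.84, so the file splits into 3 blocks
Three blocks, three sequential transfers: DataNode 2, then DataNode 1, then DataNode 2 again.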
Hence I proved that Hadoop does not use the concept of parallelism to upload the split data: the client writes the blocks of a split file serially, one block at a time.
Thanks for Reading the Article !!!