Let's research and let the world know about the Myths of Hadoop

A Hadoop cluster is a special type of computational cluster designed specifically for storing and analyzing huge amounts of unstructured data in a distributed computing environment. Such clusters run Hadoop's open source distributed processing software on low-cost commodity computers.

Task 4.1 :- Individual/Team task:

In a Hadoop cluster, how do you contribute a limited/specific amount of storage as a slave (DataNode) to the cluster?

Task 4.2 :- Team task:

According to popular articles, Hadoop uses the concept of parallelism to upload the split data while addressing the Velocity problem.

Research with your team and evaluate this statement with proper proof.

Solution:

In a Hadoop cluster, how do you contribute a limited/specific amount of storage as a slave (DataNode) to the cluster?

First of all, we have to set up a Hadoop cluster. Here I am using 1 NameNode and 1 DataNode.

Configure the NameNode:

We need the JDK and Hadoop packages to set up the Hadoop cluster. I have already transferred both packages to the EC2 instance.


Now install the JDK and Hadoop packages. The JDK is installed first because Hadoop requires it.

rpm -ivh jdk-8u171-linux-x64.rpm

java -version      //for checking that Java is installed

rpm -ivh hadoop-1.2.1-1.x86_64.rpm --force

hadoop version      //for checking that Hadoop is installed

Now we update the hdfs-site.xml and core-site.xml files in /etc/hadoop.

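The screenshots of these files are not available here, so below is a minimal sketch of what the two NameNode files typically look like in Hadoop 1.x. The port 9001 and the 0.0.0.0 bind address are assumptions; /nn is the metadata directory created in the next step.

/etc/hadoop/hdfs-site.xml:

<configuration>
   <property>
      <name>dfs.name.dir</name>
      <value>/nn</value>   <!-- NameNode metadata directory, created below -->
   </property>
</configuration>

/etc/hadoop/core-site.xml:

<configuration>
   <property>
      <name>fs.default.name</name>
      <value>hdfs://0.0.0.0:9001</value>   <!-- assumed port; 0.0.0.0 accepts connections on any interface -->
   </property>
</configuration>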

Now we have to create a directory named /nn, which the NameNode will use for its metadata, and format the NameNode.

mkdir /nn

hadoop namenode -format

Now start the NameNode by typing hadoop-daemon.sh start namenode, and see the report of DataNodes connected to the NameNode by typing hadoop dfsadmin -report.

hadoop-daemon.sh start namenode  //for starting the namenode

hadoop dfsadmin -report  //for showing information about the connected datanodes

Configure the DataNode:

On the DataNode, I attached an additional 4 GiB EBS volume to the instance; this is the storage I want to contribute to the cluster.


Now type the fdisk -l command to see all the disks attached to the instance.

fdisk -l

Now create a partition on this 4 GiB volume. fdisk is the command used to create partitions on a block device; the interactive session is sketched after the command below.

fdisk /dev/xvdf   //for creating the partition
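The interactive session was shown in a screenshot; a typical sequence to create one primary partition spanning the whole 4 GiB disk looks like this (the keystrokes follow standard fdisk prompts, not the original output):

n         //new partition
p         //primary partition
1         //partition number 1
(Enter)   //accept the default first sector
(Enter)   //accept the default last sector, using the whole disk
w         //write the partition table and exit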

Now check the disks again by typing the fdisk -l command; the new partition /dev/xvdf1 should be listed.

fdisk -l

Now format this partition:

mkfs.ext4  /dev/xvdf1

Now I create a directory named /dn and mount the 4 GiB partition on it; this directory is the storage that the DataNode will contribute to the cluster.

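The exact commands were in the screenshot; a minimal sketch of this step is:

mkdir /dn               //create the directory for datanode storage
mount /dev/xvdf1 /dn    //mount the 4GiB partition on /dn
df -h /dn               //verify that /dn now sits on the 4GiB partition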

Now we come to the DataNode setup. For this I again need the JDK and Hadoop packages; I have already transferred both to the EC2 instance.


Now install the JDK and Hadoop packages. As before, the JDK is installed first because Hadoop requires it.

rpm -ivh jdk-8u171-linux-x64.rpm

java -version      //for checking that Java is installed

rpm -ivh hadoop-1.2.1-1.x86_64.rpm --force

hadoop version      //for checking that Hadoop is installed

Now we update the hdfs-site.xml and core-site.xml files in /etc/hadoop so that the DataNode contributes the /dn directory mounted above and connects to the NameNode.

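Again the screenshots are unavailable, so here is a minimal sketch of the two DataNode files for Hadoop 1.x. <NameNode-IP> is a placeholder for the NameNode's public IP, and the port must match the NameNode's core-site.xml.

/etc/hadoop/hdfs-site.xml:

<configuration>
   <property>
      <name>dfs.data.dir</name>
      <value>/dn</value>   <!-- the mounted 4GiB partition; caps the storage this node can contribute -->
   </property>
</configuration>

/etc/hadoop/core-site.xml:

<configuration>
   <property>
      <name>fs.default.name</name>
      <value>hdfs://<NameNode-IP>:9001</value>   <!-- placeholder IP; port must match the NameNode -->
   </property>
</configuration>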

Now start the DataNode by typing hadoop-daemon.sh start datanode, and see the report of DataNodes connected to the NameNode by typing hadoop dfsadmin -report.

hadoop-daemon.sh start datanode  //for starting the datanode

hadoop dfsadmin -report  //the report should now show this datanode with roughly 4GiB of configured capacity

Because the DataNode stores its blocks in /dn, which is a mount point on the 4 GiB partition, the node can contribute at most 4 GiB to the cluster. In this way, the task of contributing a limited/specific amount of storage from a DataNode to a Hadoop cluster is successfully completed.

Solution:

According to popular articles, Hadoop uses the concept of parallelism to upload the split data while addressing the Velocity problem.

I will prove here that this is a wrong statement/assumption.

For this I set up a Hadoop cluster with 1 NameNode, 4 DataNodes, and 1 client, configured everything, and made it ready for use.

Monitor port 50010, the port DataNodes use to transfer block data, for data packet transfer.

tcpdump -i eth0 port 50010 shows the incoming and outgoing traffic on port 50010.

tcpdump -i eth0 port 50010

Here I set up the complete cluster and ran the tcpdump command on all the DataNodes as well as the NameNode. I have already created a file a.txt (182 MiB in size). Now I put this file into the Hadoop cluster through the client.

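The upload itself is a standard HDFS put from the client; the screenshot with the exact command is not available, and the target path / is an assumption:

hadoop fs -put a.txt /   //upload the 182MiB file from the client into HDFS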

In the tcpdump output, you can see that the data is received by DataNode 2.


From the tcpdump output you can see the sequence: only after DataNode 2 has saved a block does the next block get stored on DataNode 1, and the block after that is again stored on DataNode 2. Assuming the default Hadoop 1.x block size of 64 MB, the 182 MiB file splits into three blocks, and the client writes them one after another rather than simultaneously.

Hence I proved that Hadoop does not use the concept of parallelism to upload the split data: the blocks are written serially, one at a time.

Thanks for reading the article!!!

