Hadoop Cluster Revealed

The Problem Statement

To understand how Hadoop works internally as a whole.

According to popular articles, Hadoop uses the concept of parallelism to upload the split data, thereby addressing the velocity problem.

The Setup

Why did I use Docker to run all the nodes?

To keep the time delay between any two nodes minimal, to stay handy, and to use fewer resources: less resource consumption == more research && fewer lags.

Best of all, each container is an independent environment, which is ideal for analysis.

How did it go?

Step 1 - Creating a Docker Image

Creating a basic container named hadoop_datanode from the CentOS image (we will commit it to an image later):

docker run -it --name hadoop_datanode -v /home/leo/Desktop/dockhaddop:/home/ centos

Steps while inside the container

Configuring Essentials

Python 2 is used to install gdown, which makes downloading files from Google Drive easy.

Nano is a CLI text editor.

initscripts is a dependency for Hadoop.

wireshark-cli provides Tshark on CentOS.

yum install python2 nano initscripts wireshark-cli
python2 -m pip install gdown
gdown --id 17UWQNVdBdGlyualwWX4Cc96KyZhD-lxz
gdown --id 1541gbFeGZZJ5k9Qx65D04lpeNBw87rM5
rpm -ivh jdk-8u171-linux-x64.rpm
rpm -ivh hadoop-1.2.1-1.x86_64.rpm --force
mkdir /dn2

Here we first install all the dependencies, then install gdown via python2, download the Google Drive files (the JDK and Hadoop RPMs) into the container, and install them.

We also create a /dn2 directory under / so we don't need to create it again in each node.

Configuring core-site.xml & hdfs-site.xml

nano /etc/hadoop/core-site.xml

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://0.0.0.0:9001</value>
  </property>
</configuration>

nano /etc/hadoop/hdfs-site.xml

<configuration>
  <property>
    <name>dfs.data.dir</name>
    <value>/dn2</value>
  </property>
</configuration>

Creating an Image

Now, back on the base system, we commit this container as a Docker image so we can use it to run our name node and data nodes.

 docker commit hadoop_datanode hadoop_centos:v2

Launching the containers

-Launching name node

IP of the NameNode container is 172.17.0.2

docker run -it --name hadoop_namenode -v /home/leo/Desktop/dockhaddop:/home/ --cap-add=NET_RAW --cap-add=NET_ADMIN hadoop_centos:v2

-Launching datanode1

IP of the DataNode container is 172.17.0.3

docker run -it --name hadoop_datanode1 -v /home/leo/Desktop/dockhaddop:/home/ --cap-add=NET_RAW --cap-add=NET_ADMIN hadoop_centos:v2

-Launching datanode2

IP of the DataNode container is 172.17.0.4

  docker run -it --name hadoop_datanode2 -v /home/leo/Desktop/dockhaddop:/home/ --cap-add=NET_RAW --cap-add=NET_ADMIN hadoop_centos:v2

-Launching datanode3

IP of the DataNode container is 172.17.0.5

 docker run -it --name hadoop_datanode3 -v /home/leo/Desktop/dockhaddop:/home/ --cap-add=NET_RAW --cap-add=NET_ADMIN hadoop_centos:v2

So here we bind-mount /home/leo/Desktop/dockhaddop from the host to /home inside each container (please ignore the spelling mistake in the directory name).

--cap-add=NET_RAW --cap-add=NET_ADMIN

These arguments enable us to capture packets inside the containers without errors.
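The container IPs quoted above (172.17.0.2 through 172.17.0.5) can be confirmed from the host; a quick sketch:

# Print the bridge-network IP assigned to each container (they are handed out in launch order here).
docker inspect -f '{{.Name}} -> {{.NetworkSettings.IPAddress}}' \
    hadoop_namenode hadoop_datanode1 hadoop_datanode2 hadoop_datanode3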

Set the IPs according to your cluster (name node and data nodes) in the files below; a minimal sketch follows the list.

/etc/hadoop/core-site.xml
/etc/hadoop/hdfs-site.xml
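For example, on each node core-site.xml can be rewritten to point fs.default.name at the NameNode container's IP (172.17.0.2 in this setup) instead of 0.0.0.0; a minimal sketch using the values from this article:

# Point the cluster at the NameNode container's IP.
cat > /etc/hadoop/core-site.xml <<'EOF'
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://172.17.0.2:9001</value>
  </property>
</configuration>
EOF

The data nodes keep dfs.data.dir set to /dn2 in hdfs-site.xml; the name node only needs a metadata directory (dfs.name.dir) if you want it somewhere other than the default.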

Running the services

As this is a basic step, I am skipping the details here (format the NameNode file structure first, then launch the NameNode and DataNode services).
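For reference, a minimal sketch of the usual Hadoop 1.x commands, assuming the RPM placed the hadoop binary and the daemon scripts on the PATH:

# On the name node: format the HDFS metadata once, then start the NameNode daemon.
hadoop namenode -format
hadoop-daemon.sh start namenode

# On each data node: start the DataNode daemon.
hadoop-daemon.sh start datanode

# From any node: check that all three data nodes have registered.
hadoop dfsadmin -report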

Understanding Tshark

Tshark uses one of the best capture engines around and is very handy; it is the CLI version of Wireshark, which is undoubtedly the best in its league.

tshark -D

This lists the available capture interfaces; inside a container we will only have two: eth0 and the loopback interface.

We use a capture filter to filter the packets as they are captured.

-f "tcp port 50010"

We filter on TCP port 50010, the DataNode data-transfer port.

-P -w namenode2.pcap

-w writes the packets to a file, and -P prints the packet summaries on screen while they are being written to the file.

cd /home

To move into the bind-mounted directory so we can see the capture files from the main OS.

Create the pcap files with a different name on each node.

On data node1

  tshark  -P -f "tcp port 50010" -w namenode1.pcap

On data node2

  tshark  -P -f "tcp port 50010" -w namenode2.pcap

On data node3

  tshark  -P -f "tcp port 50010" -w namenode3.pcap


Here we hit an issue with Tshark on RedHat/CentOS: the permissions in the base Linux system and in the containers differ, so Tshark throws an error when writing into the mounted directory. To work around it, we create the files on the host under

/home/leo/Desktop/dockhaddop

Because of this I ran into a small Tshark bug that had been reported and closed earlier; logically the same thing happens in the container because of the conflicting permissions, as can be seen in the image below as well.

Resolving the Problem with tshark

We create the pcap files beforehand and give them suitable permissions.

cd /home/leo/Desktop/dockhaddop
touch namenode1.pcap namenode2.pcap namenode3.pcap
chmod 666 *

Now that the setup part is done, let's come to the analysis.

The whole procedure is shown in the video, so I am not repeating the practical steps here.

video url

The observations below are from the 2nd video on the blog; the observations from the LinkedIn video are given at the end, together with the 2nd video on the blog.

The second video can be viewed there, since video quality and size are restricted on LinkedIn :(

Visit the link below (I was not able to add it as a hyperlink; something to do with LinkedIn and Google).

https://isdatabig.blogspot.com/2020/10/the-problem-statement-according-to.html

The Observations

The first and last packet arrival times are given below (we will do a sort of relative packet analysis here for nodes 1, 2 & 3 respectively).

For node1

Arrival time of 1st packet

Oct 16, 2020 05:01:48.520364822 IST

Arrival time of Last Packet

Oct 16, 2020 05:01:52.955903596 IST

For node 2

Arrival time of 1st packet

Oct 16, 2020 05:01:48.483643564 IST

Arrival time of Last Packet

Oct 16, 2020 05:01:52.955843610 IST

For node 3

Arrival time of 1st packet

Oct 16, 2020 05:01:48.522872353 IST

Arrival time of Last Packet

Oct 16, 2020 05:01:52.949596427 IST
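For reference, the same first and last packet timestamps can be pulled straight from the capture files; a minimal sketch (capinfos ships in the same wireshark-cli package as Tshark):

# -a prints the capture start time, -e the capture end time.
capinfos -a -e namenode1.pcap namenode2.pcap namenode3.pcap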

So here we can clearly see that the first packet arrives at Node2, then at Node1, then at Node3.

Node3 finishes the writing part first, then Node2, then Node1.

From the start times we can infer that the connections were not opened simultaneously.

But from the end times we can infer that the connections overlapped: data was still flowing to one node while another node was connected.

So the connections are not made strictly in parallel; a connection may be delayed according to decisions made by the cluster.

From the deep packet analysis I found that the master node contacted Node2 first, and Node2 then contacted Node1. Node3 was never contacted by the master node; it was contacted first by Node2 and then by Node1, purely to create the replicas.

The nodes communicate among themselves first and then create the blocks and the replicas, making and breaking connections as needed, either in parallel or in series.

So we infer that the master node initially contacts only a single data node and tells it where the other copies should go; that slave node then writes the block to the other nodes (here Node1), the master contacts the other nodes only to sync or partially transfer data, and the replicas are made by the slaves themselves.

The capture shows Node3 was never contacted by 172.17.0.2, i.e. the master node's IP, as there is no packet from that IP.
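This can also be checked without opening the capture in a GUI; a sketch that re-reads Node3's capture and keeps only packets whose source is the master node's IP (use -R in place of -Y on very old Tshark builds):

# An empty output supports the claim that 172.17.0.2 never sent a packet to Node3.
tshark -r namenode3.pcap -Y "ip.src == 172.17.0.2"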

Screenshots of namenode3.pcap captured from node3

The screenshot above shows Node3 was contacted first by Node2 and then by Node1, but never by the master node at any point during the transfer of data (it was used only to hold the replicas).

Screenshots of namenode1.pcap captured from node1

Node1 was contacted by Node2 first (image above).

Node1 was contacted by the master node in the middle of the whole session (screenshot above).

Screenshots of namenode2.pcap captured from node2

Node2 was the only node that the master node contacted first.
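To see who opened each connection, it is enough to filter Node2's capture down to the TCP SYNs (connection openings); a sketch:

# The earliest SYN from 172.17.0.2 confirms the master contacted Node2 first.
tshark -r namenode2.pcap -Y "tcp.flags.syn == 1 && tcp.flags.ack == 0" \
       -T fields -e frame.time -e ip.src -e ip.dst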

So we can infer that there is no purely serial or purely parallel communication in Hadoop. Because the master communicates with only a single data node, the upload is not completely parallel; but because the data nodes do communicate with the other nodes, it is a sort of parallel. This cannot solve the data velocity problem to a great extent if the client only connects to the master node, but since the client may be instructed by the master node to contact a data node directly, the velocity problem can still be addressed.

Hence proved: myth busted.

The myth, according to popular articles: Hadoop uses the concept of parallelism to upload the split data, thereby addressing the velocity problem.

Get all the pcap files here:

Blog Url

https://isdatabig.blogspot.com/2020/10/the-problem-statement-according-to.html


Video 1 url


Observations from the 1st (above) video

Terminal pane (1,0): hadoop_datanode1, 172.17.0.3 (it took about 1 second)

Arrival time of 1st packet: Oct 16, 2020 04:45:34.778896426 IST

The connection is closed here with an RST.

Arrival time of last packet: Oct 16, 2020 04:45:35.121781426 IST

Terminal pane (0,1): hadoop_datanode2, 172.17.0.4 (it also took about 1 second)

Arrival time of 1st packet: Oct 16, 2020 04:45:34.810703728 IST

Arrival time of last packet: Oct 16, 2020 04:45:35.121750686 IST

Terminal pane (1,1): hadoop_datanode3, 172.17.0.5

Arrival time of 1st packet: Oct 16, 2020 04:45:34.845470669 IST

Arrival time of last packet: Oct 16, 2020 04:45:35.114580650 IST

In the above observation Node1 was contacted first, as Hadoop uses a round-robin algorithm. The reasoning from the earlier observation can also be applied here.

So the result of Observation 1 matches that of Observation 2; hence I can say the analysis is logically consistent.

Video 2 is on the blog.

Thank You for your time

