Hadoop Cluster Revealed
Vaibhav S.
Lead Cybersecurity Engineer | Cybersecurity Engineering | ex-PwC | Helping Companies Prevent Cyberattacks | RHCSA | RHCE | eJPT | CEH(P) | ICCA | RHCSSMA | CCP | CSA | CIAP-DIAT| CSIL-CDWI | CSIL-COA
The Problem Statement
To understand how Hadoop works internally as a whole.
According to popular articles, Hadoop uses the concept of parallelism to upload the split data, thereby solving the velocity problem.
The Setup
Why did I use Docker to run all the nodes?
To keep the time delay between any two nodes minimal, to stay handy, and because less resource consumption == more research && fewer lags.
Best of all, each container is an independent environment, which is ideal for analysis.
How It Went
Step 1 - Creating a Docker Image
Create a basic container from the CentOS image, named hadoop_datanode:
docker run -it --name hadoop_datanode -v /home/leo/Desktop/dockhaddop:/home/ centos
Steps while inside the container
Configuring Essentials
Python is used so that gdown can download the files easily.
Nano is a CLI editor.
Initscripts is a dependency of Hadoop.
wireshark-cli provides Tshark on CentOS.
yum install python2 nano initscripts wireshark-cli
python2 -m pip install gdown
gdown --id 17UWQNVdBdGlyualwWX4Cc96KyZhD-lxz
gdown --id 1541gbFeGZZJ5k9Qx65D04lpeNBw87rM5
rpm -ivh jdk-8u171-linux-x64.rpm
rpm -ivh hadoop-1.2.1-1.x86_64.rpm --force
mkdir /dn2
Here we first install all the dependencies, then install gdown via python2's pip, download the Google Drive files into the container, and install the JDK and Hadoop RPMs.
We also create a dn2 directory under / so we don't need to create it again in each node.
Configuring core-site.xml & hdfs-site.xml
nano /etc/hadoop/core-site.xml
<configuration><property><name>fs.default.name</name><value>hdfs://0.0.0.0:9001</value></property></configuration>
nano /etc/hadoop/hdfs-site.xml
<configuration><property><name>dfs.data.dir</name><value>/dn2</value></property></configuration>
Creating an Image
Now, on the base system, we will commit this container to a Docker image so that we can use it to run our name node and data nodes.
docker commit hadoop_datanode hadoop_centos:v2
Launching the containers
-Launching the name node
The IP of the NameNode container is 172.17.0.2
docker run -it --name hadoop_namenode -v /home/leo/Desktop/dockhaddop:/home/ --cap-add=NET_RAW --cap-add=NET_ADMIN hadoop_centos:v2
-Launching datanode1
The IP of this DataNode container is 172.17.0.3
docker run -it --name hadoop_datanode1 -v /home/leo/Desktop/dockhaddop:/home/ --cap-add=NET_RAW --cap-add=NET_ADMIN hadoop_centos:v2
-Launching datanode2
The IP of this DataNode container is 172.17.0.4
docker run -it --name hadoop_datanode2 -v /home/leo/Desktop/dockhaddop:/home/ --cap-add=NET_RAW --cap-add=NET_ADMIN hadoop_centos:v2
-Launching datanode3
The IP of this DataNode container is 172.17.0.5
docker run -it --name hadoop_datanode3 -v /home/leo/Desktop/dockhaddop:/home/ --cap-add=NET_RAW --cap-add=NET_ADMIN hadoop_centos:v2
Here we are bind-mounting /home/leo/Desktop/dockhaddop on the host to /home in each container (please ignore the spelling mistake in the directory name).
--cap-add=NET_RAW --cap-add=NET_ADMIN
These arguments enable us to capture packets inside the containers without permission errors.
Set the IPs according to your cluster (name node and data nodes) in
/etc/hadoop/core-site.xml and /etc/hadoop/hdfs-site.xml
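For example, on each data node core-site.xml would point at the NameNode container's IP rather than 0.0.0.0 — a sketch assuming the NameNode keeps listening on port 9001 as configured earlier:

```xml
<configuration><property><name>fs.default.name</name><value>hdfs://172.17.0.2:9001</value></property></configuration>
```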
Run the services
As this is a basic step, I am skipping the details here (format the NameNode file structure first, then launch the NameNode and DataNode services).
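For reference, a minimal sketch of that skipped step, assuming the stock Hadoop 1.2.1 scripts installed by the RPM are on the PATH (the format and namenode commands run in the NameNode container, the datanode command in each DataNode container):

```shell
# On the name node: format the HDFS metadata once, then start the NameNode daemon
hadoop namenode -format
hadoop-daemon.sh start namenode

# On each data node: start the DataNode daemon
hadoop-daemon.sh start datanode

# Back on the name node: confirm all three data nodes have registered
hadoop dfsadmin -report
```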
Understanding Tshark
Tshark is the CLI version of Wireshark, which is undoubtedly the best in its league; it uses one of the best capture engines and is very handy.
tshark -D
This lists the available capture interfaces, but in a container we will only have two: eth0 and the loopback interface.
We use a Capture Filter in order to filter out the packets.
-f "tcp port 50010"
We use it on port 50010, the default HDFS DataNode data-transfer port.
-P -w namenode2.pcap
-w writes the packets to a file, and -P prints them on screen as well.
cd /home
This moves into the mounted directory so we can see the files from the main OS.
Create the pcap files with a different name for each node.
On data node1
tshark -P -f "tcp port 50010" -w namenode1.pcap
On data node2
tshark -P -f "tcp port 50010" -w namenode2.pcap
On data node3
tshark -P -f "tcp port 50010" -w namenode3.pcap
There is an issue with Tshark on RedHat/CentOS here: the permissions in the base Linux host and in the containers differ, so tshark will throw an error when creating the capture files. We therefore create the files on the host under
/home/leo/Desktop/dockhaddop
This is due to a small bug in tshark that was reported and closed earlier, but logically the same permission conflict applies to the container, as can be seen in the image below as well.
Resolving the Problem with tshark
We create the pcap files beforehand and give them suitable permissions:
cd /home/leo/Desktop/dockhaddop
touch namenode1.pcap namenode2.pcap namenode3.pcap
chmod 666 *
Now that we are done with the setup part, let's come to the analysis.
The whole part is shown in the video, so I am not repeating the practical steps here.
video url
These are the observations from the 2nd video on the blog; the observations from the LinkedIn video are given at the end of the blog, along with the 2nd video.
The second video can be viewed there, as video quality and size are restricted on LinkedIn :(
Visit the link below (I was not able to add it as a hyperlink; something to do with LinkedIn and Google).
https://isdatabig.blogspot.com/2020/10/the-problem-statement-according-to.html
The Observations
The first and last packet arrival times are given below (we will do a sort of relative packet analysis here with nodes 1, 2 & 3 respectively).
For node 1
Arrival time of 1st packet
Oct 16, 2020 05:01:48.520364822 IST
Arrival time of Last Packet
Oct 16, 2020 05:01:52.955903596 IST
For node 2
Arrival time of 1st packet
Oct 16, 2020 05:01:48.483643564 IST
Arrival time of Last Packet
Oct 16, 2020 05:01:52.955843610 IST
For node 3
Arrival time of 1st packet
Oct 16, 2020 05:01:48.522872353 IST
Arrival time of Last Packet
Oct 16, 2020 05:01:52.949596427 IST
So here we can clearly see the first packet arrives at node 2, then at node 1, then at node 3.
Node 3 finishes the writing part first, then node 2, then node 1.
From the start times we can infer the connections were not opened in parallel.
But from the end times we can infer the transfers overlapped: data was still flowing to one node while another was connected.
So the connections are not made strictly in parallel; a connection may be delayed according to decisions made by the cluster.
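From the timestamps above we can also compute how long each node was writing; a quick sketch with awk, with the values copied from the captures (keeping only the seconds field, since all packets fall within the same minute):

```shell
# Per-node write duration = last packet time - first packet time
# (seconds-within-minute taken from the three captures above)
for entry in "node1 48.520364822 52.955903596" \
             "node2 48.483643564 52.955843610" \
             "node3 48.522872353 52.949596427"; do
  set -- $entry
  awk -v n="$1" -v a="$2" -v b="$3" 'BEGIN { printf "%s wrote for %.3f s\n", n, b - a }'
done
# → node1 wrote for 4.436 s
# → node2 wrote for 4.472 s
# → node3 wrote for 4.427 s
```

All three durations lie within about 45 ms of each other, which supports the reading that the transfers overlapped heavily even though the connections were opened one after another.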
From deep packet analysis I found that the master node contacted node 2 first, then node 2 contacted node 1; node 3 was never contacted by the master node. For the replicas, node 3 was contacted first by node 2 and then by node 1, purely to create the copies.
The nodes communicate among themselves first, then create the blocks and the replicas, making and breaking connections as needed, either in parallel or in series.
So we infer that the master node initially contacts only a single node and instructs it where to send the other blocks; that slave node then creates the other copies of the block on the other nodes (here on node 1). The master contacts the other nodes only to sync data or to partially transfer data, and the replicas are made by the slaves themselves.
This shows node 3 was never contacted by IP 172.17.0.2, i.e. the master node's IP, as there is no packet from that IP.
Screenshots of namenode3.pcap captured from node3
The upper pic shows node 3 was contacted first by node 2 and then by node 1, but never by the master node at any point in the data transfer (it was only used to create the replicas).
Screenshots of namenode1.pcap captured from node1
Node 1 was contacted by node 2 first (upper image).
Node 1 was contacted by the master node in the middle of the whole session (upper screenshot).
Screenshots of namenode2.pcap captured from node2
Node 2 was the only node contacted first by the master node.
So we can infer that there is no strict serial or parallel communication in Hadoop. Since the master communicates with only a single data node, the upload is not completely parallel; but since the data nodes do communicate with each other, it is a sort of parallelism. It cannot solve the data velocity problem to a great extent if the client connects only to the master node, but since the client can be instructed by the master node to contact a data node directly, the velocity problem can still be addressed.
Hence proved: myth busted.
"According to popular articles, Hadoop uses the concept of parallelism to upload the split data while fulfilling the velocity problem."
Get all the pcap files here.
Blog URL
https://isdatabig.blogspot.com/2020/10/the-problem-statement-according-to.html
Video 1 URL
Observation of 1st (above) video
Terminal pane 1,0 - hadoop_datanode1 (172.17.0.3):
Arrival time: Oct 16, 2020 04:45:34.778896426 IST
The connection is closed here with a RST.
Last end time: Oct 16, 2020 04:45:35.121781426 IST
Terminal pane 0,1 - hadoop_datanode2 (172.17.0.4):
Oct 16, 2020 04:45:34.810703728 IST
Oct 16, 2020 04:45:35.121750686 IST
Terminal pane 1,1 - hadoop_datanode3 (172.17.0.5):
Oct 16, 2020 04:45:34.845470669 IST
Oct 16, 2020 04:45:35.114580650 IST
(The whole transfer took well under a second.)
In the above observation node 1 was contacted first, as Hadoop picks nodes in a round-robin fashion. The same reasoning as in the earlier observation can be applied here.
So the result of observation 1 matches that of observation 2; hence I can say the analysis is logically consistent.
Video2 on blog
Thank You for your time