Hadoop Cluster Revealed
Vaibhav S.
Lead Cybersecurity Engineer | Cybersecurity Engineering | ex-PwC | Helping Companies Prevent Cyberattacks | RHCSA | RHCE | eJPT | CEH(P) | ICCA | RHCSSMA | CCP | CSA | CIAP-DIAT| CSIL-CDWI | CSIL-COA
The Problem Statement
To understand how Hadoop works internally as a whole.
According to popular articles, Hadoop uses the concept of parallelism to upload the split data, thereby solving the velocity problem.
The Setup
Why did I use Docker to run all the nodes?
To keep the time delay between any two nodes minimal, to stay handy, and because less resource consumption == more research && fewer lags.
Best of all, each container is an independent environment, which is ideal for analysis.
How It Went
Step 1 - Creating a Docker Image
Create a basic container from the CentOS image, named hadoop_datanode:
docker run -it --name hadoop_datanode -v /home/leo/Desktop/dockhaddop:/home/ centos
Steps while inside the container
Configuring Essentials
Python is used so that gdown can download the files easily.
Nano is a CLI editor.
Initscripts is a dependency of Hadoop.
wireshark-cli provides Tshark on CentOS.
yum install python2 nano initscripts wireshark-cli
python2 -m pip install gdown
gdown --id 17UWQNVdBdGlyualwWX4Cc96KyZhD-lxz
gdown --id 1541gbFeGZZJ5k9Qx65D04lpeNBw87rM5
rpm -ivh jdk-8u171-linux-x64.rpm
rpm -ivh hadoop-1.2.1-1.x86_64.rpm --force
mkdir /dn2
Here we first install all the dependencies, then install gdown via python2's pip, download the Google Drive files into the container, and install the JDK and Hadoop RPMs.
We also create a dn2 directory under / so we don't need to create it again in each node.
Configuring core-site.xml & hdfs-site.xml
nano /etc/hadoop/core-site.xml
<configuration><property><name>fs.default.name</name><value>hdfs://0.0.0.0:9001</value></property></configuration>
nano /etc/hadoop/hdfs-site.xml
<configuration><property><name>dfs.data.dir</name><value>/dn2</value></property></configuration>
Creating an Image
Now, on the base system, we will commit this container to a Docker image so that we can use it to run our name node and data nodes.
docker commit hadoop_datanode hadoop_centos:v2
Launching the containers
-Launching the name node
The IP of the NameNode container is 172.17.0.2
docker run -it --name hadoop_namenode -v /home/leo/Desktop/dockhaddop:/home/ --cap-add=NET_RAW --cap-add=NET_ADMIN hadoop_centos:v2
-Launching datanode1
The IP of this DataNode container is 172.17.0.3
docker run -it --name hadoop_datanode1 -v /home/leo/Desktop/dockhaddop:/home/ --cap-add=NET_RAW --cap-add=NET_ADMIN hadoop_centos:v2
-Launching datanode2
The IP of this DataNode container is 172.17.0.4
docker run -it --name hadoop_datanode2 -v /home/leo/Desktop/dockhaddop:/home/ --cap-add=NET_RAW --cap-add=NET_ADMIN hadoop_centos:v2
-Launching datanode3
The IP of this DataNode container is 172.17.0.5
docker run -it --name hadoop_datanode3 -v /home/leo/Desktop/dockhaddop:/home/ --cap-add=NET_RAW --cap-add=NET_ADMIN hadoop_centos:v2
Here we are bind-mounting /home/leo/Desktop/dockhaddop on the host to /home in each container (please ignore the spelling mistake in the directory name).
--cap-add=NET_RAW --cap-add=NET_ADMIN
These arguments enable us to capture packets inside the containers without permission errors.
Set the IPs according to your cluster (name node and data nodes) in
/etc/hadoop/core-site.xml and /etc/hadoop/hdfs-site.xml
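For example, on each data node core-site.xml would point at the NameNode container's IP rather than 0.0.0.0 — a sketch assuming the NameNode keeps listening on port 9001 as configured earlier:

```xml
<configuration><property><name>fs.default.name</name><value>hdfs://172.17.0.2:9001</value></property></configuration>
```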
Run the services
As this is a basic step, I am skipping the details here (format the NameNode file structure first, then launch the NameNode and DataNode services).
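For reference, a minimal sketch of that skipped step, assuming the stock Hadoop 1.2.1 scripts installed by the RPM are on the PATH (the format and namenode commands run in the NameNode container, the datanode command in each DataNode container):

```shell
# On the name node: format the HDFS metadata once, then start the NameNode daemon
hadoop namenode -format
hadoop-daemon.sh start namenode

# On each data node: start the DataNode daemon
hadoop-daemon.sh start datanode

# Back on the name node: confirm all three data nodes have registered
hadoop dfsadmin -report
```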
Understanding Tshark
Tshark is the CLI version of Wireshark, which is undoubtedly the best in its league; it uses one of the best capture engines and is very handy.
tshark -D
This lists the available capture interfaces, but in a container we will only have two: eth0 and the loopback interface.
We use a Capture Filter in order to filter out the packets.
-f "tcp port 50010"
We use it on port 50010, the default HDFS DataNode data-transfer port.
-P -w namenode2.pcap
-w writes the packets to a file, and -P prints them on screen as well.
cd /home
This moves into the mounted directory so we can see the files from the main OS.
Create the pcap files with a different name for each node.
On data node1
tshark -P -f "tcp port 50010" -w namenode1.pcap
On data node2
tshark -P -f "tcp port 50010" -w namenode2.pcap
On data node3
tshark -P -f "tcp port 50010" -w namenode3.pcap
There is an issue with Tshark on RedHat/CentOS here: the permissions in the base Linux host and in the containers differ, so tshark will throw an error when creating the capture files. We therefore create the files on the host under
/home/leo/Desktop/dockhaddop
This is due to a small bug in tshark that was reported and closed earlier, but logically the same permission conflict applies to the container, as can be seen in the image below as well.
Resolving the Problem with tshark
We create the pcap files beforehand and give them suitable permissions:
cd /home/leo/Desktop/dockhaddop
touch namenode1.pcap namenode2.pcap namenode3.pcap
chmod 666 *
Now that we are done with the setup part, let's come to the analysis.
The whole part is shown in the video, so I am not repeating the practical steps here.
video url
These are the observations from the 2nd video on the blog; the observations from the LinkedIn video are given at the end of the blog, along with the 2nd video.
The second video can be viewed there, as video quality and size are restricted on LinkedIn :(
Visit the link below (I was not able to add it as a hyperlink; something to do with LinkedIn and Google).
https://isdatabig.blogspot.com/2020/10/the-problem-statement-according-to.html
The Observations
The first and last packet arrival times are given below (we will do a sort of relative packet analysis here with nodes 1, 2 & 3 respectively).
For node 1
Arrival time of 1st packet
Oct 16, 2020 05:01:48.520364822 IST
Arrival time of Last Packet
Oct 16, 2020 05:01:52.955903596 IST
For node 2
Arrival time of 1st packet
Oct 16, 2020 05:01:48.483643564 IST
Arrival time of Last Packet
Oct 16, 2020 05:01:52.955843610 IST
For node 3
Arrival time of 1st packet
Oct 16, 2020 05:01:48.522872353 IST
Arrival time of Last Packet
Oct 16, 2020 05:01:52.949596427 IST
So here we can clearly see the first packet arrives at node 2, then at node 1, then at node 3.
Node 3 finishes the writing part first, then node 2, then node 1.
From the start times we can infer the connections were not opened in parallel.
But from the end times we can infer the transfers overlapped: data was still flowing to one node while another was connected.
So the connections are not made strictly in parallel; a connection may be delayed according to decisions made by the cluster.
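From the timestamps above we can also compute how long each node was writing; a quick sketch with awk, with the values copied from the captures (keeping only the seconds field, since all packets fall within the same minute):

```shell
# Per-node write duration = last packet time - first packet time
# (seconds-within-minute taken from the three captures above)
for entry in "node1 48.520364822 52.955903596" \
             "node2 48.483643564 52.955843610" \
             "node3 48.522872353 52.949596427"; do
  set -- $entry
  awk -v n="$1" -v a="$2" -v b="$3" 'BEGIN { printf "%s wrote for %.3f s\n", n, b - a }'
done
# → node1 wrote for 4.436 s
# → node2 wrote for 4.472 s
# → node3 wrote for 4.427 s
```

All three durations lie within about 45 ms of each other, which supports the reading that the transfers overlapped heavily even though the connections were opened one after another.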
From deep packet analysis I found that the master node contacted node 2 first, then node 2 contacted node 1; node 3 was never contacted by the master node. For the replicas, node 3 was contacted first by node 2 and then by node 1, purely to create the copies.
The nodes communicate among themselves first, then create the blocks and the replicas, making and breaking connections as needed, either in parallel or in series.
So we infer that the master node initially contacts only a single node and instructs it where to send the other blocks; that slave node then creates the other copies of the block on the other nodes (here on node 1). The master contacts the other nodes only to sync data or to partially transfer data, and the replicas are made by the slaves themselves.
This shows node 3 was never contacted by IP 172.17.0.2, i.e. the master node's IP, as there is no packet from that IP.
Screenshots of namenode3.pcap captured from node3
The upper pic shows node 3 was contacted first by node 2 and then by node 1, but never by the master node at any point in the data transfer (it was only used to create the replicas).
Screenshots of namenode1.pcap captured from node1
Node 1 was contacted by node 2 first (upper image).
Node 1 was contacted by the master node in the middle of the whole session (upper screenshot).
Screenshots of namenode2.pcap captured from node2
Node 2 was the only node contacted first by the master node.
So we can infer that there is no strict serial or parallel communication in Hadoop. Since the master communicates with only a single data node, the upload is not completely parallel; but since the data nodes do communicate with each other, it is a sort of parallelism. It cannot solve the data velocity problem to a great extent if the client connects only to the master node, but since the client can be instructed by the master node to contact a data node directly, the velocity problem can still be addressed.
Hence proved: myth busted.
"According to popular articles, Hadoop uses the concept of parallelism to upload the split data while fulfilling the velocity problem."
Get all the pcap files here.
Blog URL
https://isdatabig.blogspot.com/2020/10/the-problem-statement-according-to.html
Video 1 URL
Observation of 1st (above) video
Terminal pane 1,0 - hadoop_datanode1 (172.17.0.3):
Arrival time: Oct 16, 2020 04:45:34.778896426 IST
The connection is closed here with a RST.
Last end time: Oct 16, 2020 04:45:35.121781426 IST
Terminal pane 0,1 - hadoop_datanode2 (172.17.0.4):
Oct 16, 2020 04:45:34.810703728 IST
Oct 16, 2020 04:45:35.121750686 IST
Terminal pane 1,1 - hadoop_datanode3 (172.17.0.5):
Oct 16, 2020 04:45:34.845470669 IST
Oct 16, 2020 04:45:35.114580650 IST
(The whole transfer took well under a second.)
In the above observation node 1 was contacted first, as Hadoop picks nodes in a round-robin fashion. The same reasoning as in the earlier observation can be applied here.
So the result of observation 1 matches that of observation 2; hence I can say the analysis is logically consistent.
Video2 on blog
Thank You for your time