登录查看更多内容

Demystifying the concept Of Parallelism when Upload the BigData in Hadoop

Mohamed Afrid

DevOps Engineer | 3x RedHat | 1x AWS | 3x Azure | CKA | CKS | Terraform Certified

发布日期: 2020年10月23日

Task Description :

??According to popular articles, Hadoop uses the concept of parallelism to upload the split data while fulfilling Velocity problem.

?? In this article We will research some more things like Who upload Data/file at DataNode & How replication works.

???To perform this article We will follow below steps -

A.. NameNode Configuration

B. DataNode Configuration

C. Client Node Configuration

D. Find Who Upload Data at DataNode ( Client or NameNode ) & How Replication works.

E. Check Hadoop Use Concept Of Parallelism To Upload The Split Data While Fulfilling Velocity Problem Is Right Or Not.

We have to create the below setup to test

A. NameNode Configuration ("NN")-

A (1) Create "/nn" Directory -

# mkdir /nn

A (2) "hdfs-site.xml" file configuration in /etc/hadoop/ directory

A (3) "core-site.xml" configuration in /etc/hadoop/ directory

A (4) Format NameNode -

A (5) Stop Firewalld -

# systemctl stop firewalld

A (6) Start NameNode -

B. DataNode Configuration ("DN1" , "DN2" , "DN3" ) -

B (1) Create "/dn" -

B (2) "hdfs-site.xml" file configuration -

B (3) "core-site.xml" file configuration -

B (4) Stop Firewalld -

# systemctl stop firewalld

B (5) Start DataNode -

C. Client Configuration ("Clt1")-

C (1) "hdfs-site.xml" configuration

C (2) Stop Firewalld -

# systemctl stop firewalld

C (3) Check Client is Ready or Not -

In my case it is ready .

D. Find Who Upload Data at DataNode ( "Client" or "NameNode" ) & How Replication works -

* Connection between NameNode and Client port 9001 works because NameNode is working on 9001 port & we used "9001" port at Client "core-site.xml" file.

* port "50010" port is used to transfer data at DataNode. Now we want to know that Client directly transfer data to DataNode or Client transfer data to DataNode through NameNode . You can better understand through below pic "what we want to know" -

* To perform this task We will use Three terminals of Client Node -

> "Client Terminal - 1" - To check connection at client on Port 9001

> "Client Terminal - 2" - To check connection at client on Port 50010

> "Client Terminal - 3" - To upload file .

First We will check Case -2 then after we check Case-1

D > Case - 2

* In this case We will check "50010" port connection at NameNode because in Hadoop Cluster data is transfer at "50010" by default. If any packets will pass through port 50010 at NameNode then we can say that data is transferring through NameNode.

* At Client Node We have a file "dn.txt" .

D > 2 (i). File content is -

# cat > test.txt

Hey , How are you?

D > 2 (ii). Connection On Ports "9001" & "50010" -

* In "Client Terminal - 1" We will check connection of client on Port "9001" & In "Client Terminal - 3" We will upload file & on NameNode we will check connection on Port "50010".

4 > 2 (C) Now We are uploading file -

* When We upload file "dn.txt" then at NameNode on Port "50010" no network packets are passing ,

So we can say that In Hadoop Cluster NameNode don't upload the file to DataNode.

D Case - 1

D > 1 (i) Information About File Which Will Upload By Client -

* In this case We will check all connection at Client Node on ports "9001" & "50010". For this I have three terminal of Client Node.

* At "Client Terminal - 1" We will check connection for port "9001" and "Client Terminal - 2" We will check connection for port "50010" & "Client Terminal - 3" We will upload file "text-client.txt".

* Content of "hello.txt" file -

hello
what are you doing?

D > 1 (ii) Connection at Port 9001 & 50010 -

* We use "-X" to see network packet content.

* Now We are running Command at "Client Terminal - 1" & "Client Terminal - 2"

D > 1 (iii) Upload File "text-client.txt" -

* When We upload file from "Client Terminal - 3" we see that many network packet are going through "9001" & "50010" ports . We can see these packet in "Client Terminal - 1" & "Client Terminal - 2" respectively.

When we see in "Client Terminal - 2" then we find that Client is connecting to DataNode - "DN2" & transferring directly.

* Now we can say that Client is one who upload data at DataNode.

D > 1 (iv) Find How Client knows IPs of DataNodes -

* But here A issue will raise that How Client Node know that "what is the IPs of DataNodes".

* To solve this issue when we see Network Packets of port "9001" where Client is connecting to NameNode . then we find that Client Node is taking IPs of DataNodes from NameNode.

Now this issue is solved

Till here we draw a connection diagram

4 > 1(E) How Replication Process Works -

* Another Issue will raise that at Terminal when We see whole network packets then we find that Client is connecting Only to "DataNode - DN1" but file is uploading on all three DataNodes because by default value of replica is 3 but Client is connecting to only one DataNode then "How can be possible that file is uploading on all remaining DataNodes ?". ( Now How Replication is possible ) - { here by default block size is 64 MiB & file size is very smaller then 64MiB so only one block will create }

* To solve this when we see again Network Packets of "Client Terminal " for port "50010" then we find that Client is sending remaining DataNode's IP's to DataNode - "DN" Now We can think that it can possible that DataNode "DN2" is connecting to another DataNode for uploading file.

* To know this We will upload a another file "gb.txt" from "Client Terminal" & at that time We will check connection on port "50010" all DataNodes & also check connection on "50010" port at "Client Terminal"

* We are running tcpdump command on DataNode - "DN1" , DataNode - "DN2" , DataNode - "DN3" & "Client Terminal".

* Now We are uploading "gb.txt" file

* We can see on "Client Terminal " Client is connecting to "DN1"and for transferring files

* Client is also sending remaining DataNode's IP's ( DD2 ) along with the data

, we have already proved it . Now we want to know that if client is not connecting to other DataNode then who is transferring data to those two DataNodes.

here you can see the received data in the DataNode-1

* For this when We see Network packets on DataNode - "DN2" then we find that DataNode - "DN1" is connecting to DataNode - "DN2" and the packets receiving from DD1 --> DD2 and packets transferring from DD2--> DD3

* When we see Network Packets at DataNode - "DN1" then we find that DataNode - "DN2" is also sending the remaining DataNode - "DN3" IP .

* Till here we draw the connection diagram -

* Now We see Network Packets of DataNode - "DN3" then we find that DataNode - "DN2" is connecting to DataNode - "DN3" and sending the data to DD3

* Now we draw again connection diagram :

* Replication works according to above connection diagram between DataNodes.

In this case block size is 1 because our file size is less than 64MiB & we didn't change by default block size . So with the help of above connection diagram we can say that to store one block at DataNodes Client only connect with one DataNode .

Now this DataNode will create replica of that block on other DataNodes.

If Client will connect other Datanode then we can say that it will definitely uploading new block because client is the one who upload the block at DataNode directly.

Task Description :

A.. NameNode Configuration

B. DataNode Configuration

C. Client Node Configuration

D. Find Who Upload Data at DataNode ( Client or NameNode ) & How Replication works.

E. Check Hadoop Use Concept Of Parallelism To Upload The Split Data While Fulfilling Velocity Problem Is Right Or Not.

We have to create the below setup to test

A. NameNode Configuration ("NN")-

A (1) Create "/nn" Directory -

A (2) "hdfs-site.xml" file configuration in /etc/hadoop/ directory

A (3) "core-site.xml" configuration in /etc/hadoop/ directory

A (4) Format NameNode -

A (5) Stop Firewalld -

A (6) Start NameNode -

B. DataNode Configuration ("DN1" , "DN2" , "DN3" ) -

B (1) Create "/dn" -

B (2) "hdfs-site.xml" file configuration -

B (3) "core-site.xml" file configuration -

B (4) Stop Firewalld -

B (5) Start DataNode -

C. Client Configuration ("Clt1")-

C (1) "hdfs-site.xml" configuration

C (2) Stop Firewalld -

C (3) Check Client is Ready or Not -

In my case it is ready .

D. Find Who Upload Data at DataNode ( "Client" or "NameNode" ) & How Replication works -

> "Client Terminal - 1" - To check connection at client on Port 9001

> "Client Terminal - 2" - To check connection at client on Port 50010

> "Client Terminal - 3" - To upload file .

D > Case - 2

D > 2 (i). File content is -

D > 2 (ii). Connection On Ports "9001" & "50010" -

4 > 2 (C) Now We are uploading file -

D Case - 1

D > 1 (i) Information About File Which Will Upload By Client -

D > 1 (ii) Connection at Port 9001 & 50010 -

* Now We are running Command at "Client Terminal - 1" & "Client Terminal - 2"

D > 1 (iii) Upload File "text-client.txt" -

D > 1 (iv) Find How Client knows IPs of DataNodes -

Now this issue is solved

Till here we draw a connection diagram

4 > 1(E) How Replication Process Works -

* Now we draw again connection diagram :

This is how the replication works in solving the velocity problem in BigData world.

Thanks for Reading....

Will try Demystify more in future :-)

WORDPRESS AND MYSQL integration USING AWS RDS AND KUBERNETES

2020年10月21日

Auto-configuration of ASG launched instances as webserver behind the Loadbalancer ( Reverse proxy using ha-proxy )

2020年8月26日

Dynamic inventory for AWS instance ( provisioning and configuration by ansible )

2020年8月21日

Multi-node Kube-cluster configuration using Ansible

2020年8月12日

Ansible to configure web-server as a container in docker

2020年8月11日

CloudOps Automation Using Terraform:

2020年7月17日

Isolated Multi-tenant architecture via Terraform ( IaaC ). v2

2020年7月13日

Isolated Multi-tenant architecture via Terraform ( IaaC )

2020年7月13日

AWS EKS - Multi-Tier HA

2020年7月11日

DevOps Assembly ( CI / CD / CM )

2020年6月26日

社区洞察

其他会员也浏览了

Optimize your EMR cluster

BigData-Hadoop

Contribute Limited Amount Of Storage Of DataNode In Hadoop Cluster

HCatalog and Pig Integration | Accessing Pig With HCatalog

Hadoop Cluster Revealed

Integrating LVM with Hadoop and providing Elasticity to DataNode Storage

All You Need To Know About Parquet File Structure In Depth

Let’s research and the world the know about the Myths of Hadoop

How To Provide Elasticity Storage To Hadoop Slave From LVM (Logical Volume Management) ? BigData

How Client put the file, read the file in Hadoop Cluster and How it retrieve data when DataNode is crashed