Demystifying the Concept of Parallelism When Uploading Big Data in Hadoop


Task Description:

According to popular articles, Hadoop uses parallelism to upload the split data, and that is how it addresses the Velocity problem.

In this article we will also look at who actually uploads the data/file to the DataNodes and how replication works.


We will follow the steps below -

A. NameNode Configuration

B. DataNode Configuration

C. Client Node Configuration

D. Find who uploads data to the DataNode (Client or NameNode) and how replication works.

E. Check whether the claim that Hadoop uses parallelism to upload the split data, while addressing the Velocity problem, is correct.


We need the following setup for the test: one NameNode ("NN"), three DataNodes ("DN1", "DN2", "DN3") and one Client ("Clt1").


A. NameNode Configuration ("NN")-

A (1) Create "/nn" Directory -

# mkdir /nn

A (2) "hdfs-site.xml" file configuration in /etc/hadoop/ directory
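
As a rough sketch of what this file can look like: assuming a Hadoop 1.x installation (the port 50010 and the 64 MiB default block size mentioned later in this article point that way), the NameNode's storage directory is set with the `dfs.name.dir` property and a DataNode's with `dfs.data.dir`. The helper name `write_hdfs_site` is hypothetical and only serves this illustration.

```shell
# Hypothetical helper: writes a minimal hdfs-site.xml with one property.
# Assumption: Hadoop 1.x property names (dfs.name.dir / dfs.data.dir).
write_hdfs_site() {
    conf_dir="$1"    # configuration directory, e.g. /etc/hadoop
    prop_name="$2"   # dfs.name.dir on the NameNode, dfs.data.dir on a DataNode
    prop_value="$3"  # storage directory, e.g. /nn
    cat > "$conf_dir/hdfs-site.xml" <<EOF
<?xml version="1.0"?>
<configuration>
  <property>
    <name>${prop_name}</name>
    <value>${prop_value}</value>
  </property>
</configuration>
EOF
}
```

On the NameNode this would be called as `write_hdfs_site /etc/hadoop dfs.name.dir /nn`. The same helper covers a DataNode by passing `dfs.data.dir` and its own storage directory.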


A (3) "core-site.xml" configuration in /etc/hadoop/ directory
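
A minimal sketch of this file, again assuming Hadoop 1.x, where the NameNode address is set with the `fs.default.name` property. Port 9001 is the one this article uses. The helper name `write_core_site` and the sample address in the usage note are hypothetical.

```shell
# Hypothetical helper: writes a minimal core-site.xml that points at the
# NameNode on port 9001.
# Assumption: Hadoop 1.x property name fs.default.name.
write_core_site() {
    conf_dir="$1"  # configuration directory, e.g. /etc/hadoop
    nn_host="$2"   # IP or hostname of the NameNode
    cat > "$conf_dir/core-site.xml" <<EOF
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://${nn_host}:9001</value>
  </property>
</configuration>
EOF
}
```

For example, `write_core_site /etc/hadoop 192.0.2.10` (a placeholder address). On the NameNode itself the host is usually its own IP; many tutorial setups use 0.0.0.0 there so that it listens on all interfaces.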



A (4) Format NameNode -



A (5) Stop Firewalld -

# systemctl stop firewalld

A (6) Start NameNode -
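
On a Hadoop 1.x installation, steps A (4) and A (6) usually come down to two commands. The sketch below wraps them in small functions. The function names are hypothetical, and it assumes the Hadoop `bin` directory is on the PATH.

```shell
# Assumption: Hadoop 1.x command names, with hadoop and hadoop-daemon.sh on PATH.

# Formats the directory configured in dfs.name.dir.
# Formatting wipes existing HDFS metadata, so run it only on a fresh NameNode.
format_namenode() {
    hadoop namenode -format
}

# Starts one Hadoop daemon on the local machine: "namenode" or "datanode".
start_daemon() {
    hadoop-daemon.sh start "$1"
}
```

On the NameNode you would run `format_namenode` once and then `start_daemon namenode`. The JDK's `jps` command should then list a NameNode process.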



B. DataNode Configuration ("DN1" , "DN2" , "DN3" ) -

B (1) Create "/dn" -

# mkdir /dn

B (2) "hdfs-site.xml" file configuration -



B (3) "core-site.xml" file configuration -



B (4) Stop Firewalld -

# systemctl stop firewalld

B (5) Start DataNode -



C. Client Configuration ("Clt1")-


C (1) "core-site.xml" configuration



C (2) Stop Firewalld -

# systemctl stop firewalld


C (3) Check Client is Ready or Not -
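
One common way to do this check on Hadoop 1.x is the `dfsadmin -report` command, run from the Client. The sketch below is an assumption about how the check could be scripted, not necessarily the exact command used for this article. The function names are hypothetical, and counting lines that start with `Name:` assumes the 1.x report layout, where each DataNode entry begins with such a line.

```shell
# Assumption: Hadoop 1.x CLI on PATH and core-site.xml pointing at the NameNode.

# Prints the cluster summary as seen from this machine.
cluster_report() {
    hadoop dfsadmin -report
}

# Counts the DataNode entries in the report.
# Assumption: each entry starts with a line beginning with "Name:".
count_datanodes() {
    cluster_report | grep -c '^Name:'
}
```

If `count_datanodes` prints 3 on the Client, the Client can reach the NameNode and all three DataNodes have registered.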

In my case it is ready.


D. Find Who Uploads Data to the DataNode ( "Client" or "NameNode" ) & How Replication Works -

* The connection between the NameNode and the Client uses port 9001, because the NameNode listens on port 9001 and we set port "9001" in the Client's "core-site.xml" file.

* Port "50010" is used to transfer data to a DataNode. We now want to know whether the Client transfers data to the DataNode directly (Case 1) or through the NameNode (Case 2).


* To perform this task we will use three terminals on the Client Node -

> "Client Terminal - 1" - To check the connection at the Client on port 9001

> "Client Terminal - 2" - To check the connection at the Client on port 50010

> "Client Terminal - 3" - To upload the file.

First we will check Case 2, and after that Case 1.
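
The three client terminals described above could run something like the following. This is a sketch under assumptions: tcpdump is installed and run as root, the Hadoop 1.x CLI is on the PATH, and the function names and the interface name in the usage note are hypothetical.

```shell
# Terminals 1 and 2: print every TCP packet seen on one port.
#   -n  do not resolve addresses to names
#   -X  print the packet contents in hex and ASCII
watch_port() {
    iface="$1"  # capture interface
    port="$2"   # 9001 (NameNode) or 50010 (DataNode data transfer)
    tcpdump -i "$iface" -n -X tcp port "$port"
}

# Terminal 3: upload a local file into the HDFS root directory.
upload_file() {
    hadoop fs -put "$1" /
}
```

With an interface called `eth0`, Terminal 1 would run `watch_port eth0 9001`, Terminal 2 `watch_port eth0 50010`, and Terminal 3 `upload_file dn.txt`. The same `watch_port` call works on the NameNode or a DataNode when the capture has to happen there.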

D > Case - 2

* In this case we check for connections on port "50010" at the NameNode, because in a Hadoop cluster data is transferred on port "50010" by default. If any packets pass through port 50010 at the NameNode, we can say that the data is transferred through the NameNode.

* On the Client Node we have a file "dn.txt".

D > 2 (i). File content is -

# cat dn.txt

Hey , How are you?

D > 2 (ii). Connection On Ports "9001" & "50010" -

* In "Client Terminal - 1" we check the Client's connection on port "9001", in "Client Terminal - 3" we upload the file, and on the NameNode we check for connections on port "50010".

D > 2 (iii). Now we upload the file -

* When we upload the file "dn.txt", no network packets pass through port "50010" on the NameNode.

So we can say that in a Hadoop cluster the NameNode does not upload the file to the DataNodes.




D > Case - 1

D > 1 (i) Information About the File That the Client Will Upload -


* In this case we check all connections at the Client Node on ports "9001" and "50010". For this I have three terminals open on the Client Node.

* In "Client Terminal - 1" we check connections on port "9001", in "Client Terminal - 2" we check connections on port "50010", and in "Client Terminal - 3" we upload the file "text-client.txt".

* Content of the "text-client.txt" file -

hello
what are you doing?

D > 1 (ii) Connection at Port 9001 & 50010 -

* We use the "-X" option of tcpdump to see the content of the network packets.

* Now we run the command in "Client Terminal - 1" and "Client Terminal - 2".


D > 1 (iii) Upload File "text-client.txt" -

* When we upload the file from "Client Terminal - 3", we see many network packets going through ports "9001" and "50010". We can see these packets in "Client Terminal - 1" and "Client Terminal - 2" respectively.

In "Client Terminal - 2" we find that the Client connects to DataNode "DN2" and transfers the data directly.


So we can say that the Client is the one that uploads the data to the DataNode.



D > 1 (iv) Find How Client knows IPs of DataNodes -

* This raises a question: how does the Client Node know the IPs of the DataNodes?

* To answer it, we look at the network packets on port "9001", where the Client connects to the NameNode. There we find that the Client Node gets the IPs of the DataNodes from the NameNode.


That answers the question.

The connections found so far: the Client gets the DataNode IPs from the NameNode on port 9001, then sends the data directly to a DataNode on port 50010.




D > 1 (v) How the Replication Process Works -

This raises another question. In the captured packets the Client connects to only one DataNode, yet the file ends up on all three DataNodes, because the default replication factor is 3. If the Client connects to only one DataNode, how does the file reach the remaining DataNodes? In other words, how does replication happen? (Here the default block size is 64 MiB and the file is much smaller than 64 MiB, so only one block is created.)
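
To make the block arithmetic in the note above concrete, here is a small sketch. It only does the ceiling division: a file needs as many blocks as it takes to cover its size, and each block is stored once per replica. The function names are hypothetical, and the file sizes in the usage note are made-up examples.

```shell
# Number of HDFS blocks needed for a file: ceiling of size / block size.
# Both arguments are in bytes.
blocks_needed() {
    file_size="$1"
    block_size="$2"
    echo $(( (file_size + block_size - 1) / block_size ))
}

# Total block copies stored in the cluster for that file.
# Arguments: file size in bytes, block size in bytes, replication factor.
replicas_stored() {
    echo $(( $(blocks_needed "$1" "$2") * $3 ))
}
```

With a 64 MiB block size (67108864 bytes), `blocks_needed 19 67108864` gives 1 and `replicas_stored 19 67108864 3` gives 3: one block, stored on three DataNodes. A 200 MiB file (209715200 bytes) would need 4 blocks.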

* To answer this, we look again at the network packets on port "50010" in the "Client Terminal". We find that the Client sends the IPs of the remaining DataNodes to the DataNode it connects to. So it may be that this DataNode connects to another DataNode to pass the file on.

* To check this, we upload another file, "gb.txt", from the "Client Terminal" and at the same time watch port "50010" on all DataNodes and on the "Client Terminal".


* We run the tcpdump command on DataNode "DN1", DataNode "DN2", DataNode "DN3" and the "Client Terminal".

* Now we upload the "gb.txt" file.

* On the "Client Terminal" we can see that the Client connects to "DN1" to transfer the file.

* The Client also sends the IPs of the remaining DataNodes ( DN2 ) along with the data.

We have already shown this. Now we want to know: if the Client does not connect to the other DataNodes, who transfers the data to those two DataNodes?

The capture on DataNode "DN1" shows the data it received.


* To find out, we look at the network packets on DataNode "DN2". We find that DataNode "DN1" connects to DataNode "DN2": packets are received DN1 --> DN2 and passed on DN2 --> DN3.

* In the network packets captured on DataNode "DN1" we find that "DN1" also passes the IP of the remaining DataNode, "DN3", on to "DN2".

* So far the connection chain is: Client --> DN1 --> DN2.


* In the network packets on DataNode "DN3" we find that DataNode "DN2" connects to DataNode "DN3" and sends the data to DN3.


* The complete connection diagram: Client --> DN1 --> DN2 --> DN3.


* Replication between the DataNodes follows the connection diagram above.
In this case the number of blocks is 1, because our file is smaller than 64 MiB and we did not change the default block size. From the connection diagram we can say that, to store one block on the DataNodes, the Client connects to only one DataNode.

That DataNode then takes care of replicating the block to the other DataNodes.

If the Client connects to another DataNode, we can say it is uploading a new block, because the Client is the one that uploads a block to a DataNode directly.
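
The chain observed in the packet captures can be written down as a tiny simulation. It does not talk to Hadoop at all. It only prints, for one block, who sends to whom and which DataNodes each receiver is still told about. The function name `pipeline_hops` is hypothetical.

```shell
# Prints the hops of a replication chain for one block.
# $1 is the first sender (the Client); the remaining arguments are the
# DataNodes in the order the block travels through them.
pipeline_hops() {
    sender="$1"
    shift
    while [ "$#" -gt 0 ]; do
        receiver="$1"
        shift
        remaining="$*"
        [ -n "$remaining" ] || remaining="none"
        echo "$sender --> $receiver (remaining targets: $remaining)"
        sender="$receiver"
    done
}
```

Calling `pipeline_hops Client DN1 DN2 DN3` prints three hops: Client to DN1, DN1 to DN2, and DN2 to DN3. The Client appears as a sender only once, which matches what the captures showed for a single block.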


This is how replication works while addressing the velocity problem in the Big Data world.


Thanks for reading.

I will try to demystify more in the future :-)


