Demystifying the Concept of Parallelism When Uploading Big Data in Hadoop
Mohamed Afrid
DevOps Engineer | 3x RedHat | 1x AWS | 3x Azure | CKA | CKS | Terraform Certified
Task Description :
According to popular articles, Hadoop uses the concept of parallelism to upload split data, which solves the velocity problem of big data.
In this article we will investigate a little further: who uploads the data/file to the DataNodes, and how does replication work?
To do this, we will follow the steps below -
A. NameNode Configuration
B. DataNode Configuration
C. Client Node Configuration
D. Find Out Who Uploads Data to the DataNodes ( Client or NameNode ) & How Replication Works.
E. Check Whether "Hadoop Uses the Concept of Parallelism to Upload the Split Data While Solving the Velocity Problem" Is Right or Not.
We have to create the setup below to test this:
A. NameNode Configuration ("NN")-
A (1) Create "/nn" Directory -
# mkdir /nn
A (2) "hdfs-site.xml" file configuration in /etc/hadoop/ directory
A (3) "core-site.xml" configuration in /etc/hadoop/ directory
A (4) Format NameNode -
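With Hadoop 1.x this is a single command, run once before the first start:

```shell
# Initialize the HDFS metadata directory (/nn) before the first start
hadoop namenode -format
```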
A (5) Stop Firewalld -
# systemctl stop firewalld
A (6) Start NameNode -
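A sketch of the start command (Hadoop 1.x style):

```shell
# Start the NameNode daemon and verify it is running
hadoop-daemon.sh start namenode
jps          # the output should list a "NameNode" process
```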
B. DataNode Configuration ("DN1" , "DN2" , "DN3" ) -
B (1) Create "/dn" -
B (2) "hdfs-site.xml" file configuration -
B (3) "core-site.xml" file configuration -
B (4) Stop Firewalld -
# systemctl stop firewalld
B (5) Start DataNode -
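Putting B(1)-B(5) together, here is a sketch of what the DataNode-side files and commands could look like (Hadoop 1.x property names; the NameNode IP 192.168.1.10 is a placeholder for your own):

```shell
mkdir -p /dn                       # B(1): storage directory for HDFS blocks

# B(2): point dfs.data.dir at /dn
cat > /etc/hadoop/hdfs-site.xml <<'EOF'
<configuration>
  <property>
    <name>dfs.data.dir</name>
    <value>/dn</value>
  </property>
</configuration>
EOF

# B(3): tell the DataNode where the NameNode listens (IP is a placeholder)
cat > /etc/hadoop/core-site.xml <<'EOF'
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://192.168.1.10:9001</value>
  </property>
</configuration>
EOF

# B(4)-B(5): stop the firewall and start the DataNode daemon
systemctl stop firewalld
hadoop-daemon.sh start datanode
```

The same three files/commands are repeated on "DN1", "DN2" and "DN3".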
C. Client Configuration ("Clt1")-
C (1) "hdfs-site.xml" configuration
C (2) Stop Firewalld -
# systemctl stop firewalld
C (3) Check Whether the Client Is Ready -
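One way to check this (Hadoop 1.x commands; assumes the client's "core-site.xml" points at the NameNode):

```shell
# Ask the cluster for a report; if the NameNode answers and lists the
# three DataNodes, the client is configured correctly.
hadoop dfsadmin -report

# Listing the HDFS root also confirms basic connectivity:
hadoop fs -ls /
```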
In my case it is ready.
D. Find Out Who Uploads Data to the DataNodes ( "Client" or "NameNode" ) & How Replication Works -
* The connection between the NameNode and the Client works on port 9001 because the NameNode listens on port 9001 and we used port "9001" in the Client's "core-site.xml" file.
* Port "50010" is used to transfer data to the DataNodes. Now we want to know whether the Client transfers data directly to the DataNodes or transfers it through the NameNode. You can understand "what we want to know" better through the picture below -
* To perform this task we will use three terminals on the Client Node -
> "Client Terminal - 1" - To watch connections at the client on port 9001
> "Client Terminal - 2" - To watch connections at the client on port 50010
> "Client Terminal - 3" - To upload the file.
First we will check Case 2, and after that Case 1.
D > Case - 2
* In this case we will watch port "50010" at the NameNode, because in a Hadoop cluster data is transferred on port "50010" by default. If any packets pass through port 50010 at the NameNode, then we can say that the data is transferred through the NameNode.
* At the Client Node we have a file "dn.txt".
D > 2 (i). The file content is -
# cat dn.txt
Hey , How are you?
D > 2 (ii). Connections on Ports "9001" & "50010" -
* In "Client Terminal - 1" we will watch the client's connection on port "9001", in "Client Terminal - 3" we will upload the file, and on the NameNode we will watch for connections on port "50010".
D > 2 (iii). Now we upload the file -
* When we upload the file "dn.txt", no network packets pass through port "50010" at the NameNode,
so we can say that in a Hadoop cluster the NameNode does not upload the file to the DataNodes.
D > Case - 1
D > 1 (i) Information About the File the Client Will Upload -
* In this case we will watch all connections at the Client Node on ports "9001" & "50010". For this we have three terminals on the Client Node.
* In "Client Terminal - 1" we will watch the connection on port "9001", in "Client Terminal - 2" the connection on port "50010", and from "Client Terminal - 3" we will upload the file "text-client.txt".
* Content of the "text-client.txt" file -
hello what are you doing?
D > 1 (ii) Connections on Ports 9001 & 50010 -
* We use the "-X" flag to see the network packet contents.
* Now we run the watch commands in "Client Terminal - 1" & "Client Terminal - 2".
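A sketch of the two commands (interface name eth0 is an assumption):

```shell
# "Client Terminal - 1": metadata traffic to the NameNode, with packet contents
tcpdump -i eth0 -X port 9001

# "Client Terminal - 2": data traffic to the DataNodes, with packet contents
tcpdump -i eth0 -X port 50010
```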
D > 1 (iii) Upload the File "text-client.txt" -
* When we upload the file from "Client Terminal - 3", we see many network packets going through ports "9001" & "50010". We can see these packets in "Client Terminal - 1" & "Client Terminal - 2" respectively.
Looking at "Client Terminal - 2", we find that the Client is connecting to DataNode "DN2" and transferring the data directly.
* Now we can say that the Client is the one who uploads data to the DataNodes.
D > 1 (iv) Find How the Client Knows the IPs of the DataNodes -
* An issue arises here: how does the Client Node know the IPs of the DataNodes?
* To solve this, we look at the network packets on port "9001", where the Client connects to the NameNode. There we find that the Client Node gets the IPs of the DataNodes from the NameNode.
Now this issue is solved.
Up to this point we can draw a connection diagram.
D > 1 (v) How the Replication Process Works -
* Another issue arises. When we look at the whole set of network packets in the terminal, we find that the Client connects to only one DataNode, yet the file is uploaded to all three DataNodes, because the default replication factor is 3. If the Client connects to only one DataNode, "how can the file reach all the remaining DataNodes?" ( In other words, how is replication possible? ) - { Here the default block size is 64 MiB and the file is much smaller than 64 MiB, so only one block is created. }
* To solve this, we look again at the network packets in the Client Terminal on port "50010". There we find that the Client sends the IPs of the remaining DataNodes to DataNode "DN2". So we can suspect that DataNode "DN2" connects to another DataNode to upload the file.
* To verify this, we will upload another file "gb.txt" from the Client Terminal, and while it uploads we will watch port "50010" on all three DataNodes as well as on the Client Terminal.
* We run the tcpdump command on DataNode "DN1", DataNode "DN2", DataNode "DN3" & the Client Terminal.
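The same command is run on each node (interface name eth0 is an assumption):

```shell
# Run on DN1, DN2, DN3 and the Client Terminal while gb.txt uploads;
# -n skips DNS lookups so the raw IPs of the pipeline stay visible
tcpdump -i eth0 -n port 50010
```

The upload itself, from "Client Terminal - 3", would be `hadoop fs -put gb.txt /`.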
* Now we upload the "gb.txt" file.
* On the Client Terminal we can see that the Client is connecting to "DN1" to transfer the file.
* Along with the data, the Client is also sending the IPs of the remaining DataNodes ( "DN2" ),
as we have already shown. Now we want to know: if the Client does not connect to the other DataNodes, then who transfers the data to those two DataNodes?
* Here you can see the received data on DataNode "DN1".
* For this, when we look at the network packets on DataNode "DN2", we find that DataNode "DN1" is connecting to DataNode "DN2": packets are received from DN1 → DN2 and then transferred from DN2 → DN3.
* When we look at the network packets on DataNode "DN1", we find that it is also forwarding the IP of the remaining DataNode "DN3" to DataNode "DN2".
* Up to this point we can draw the connection diagram -
* When we look at the network packets on DataNode "DN3", we find that DataNode "DN2" is connecting to DataNode "DN3" and sending the data to DN3.
* Now we draw the connection diagram again :
* Replication works between the DataNodes according to the connection diagram above.
In this case the block count is 1, because our file is smaller than 64 MiB and we did not change the default block size. So with the help of the connection diagram above, we can say that to store one block on the DataNodes the Client connects to only one DataNode.
That DataNode then creates the replicas of the block on the other DataNodes.
If the Client connects to another DataNode, then it is definitely uploading a new block, because the Client is the one who uploads each block to a DataNode directly. So for a big file that splits into many blocks, the Client can upload different blocks to different DataNodes at the same time - this is the parallelism that helps Hadoop solve the velocity problem.
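A quick sanity check of the block count: with the 64 MiB default block size from this cluster, a file needs ceil(size / 64 MiB) blocks (the 1000-byte and 200 MiB sizes below are example values, not measurements from the article):

```shell
# How many HDFS blocks does a file need with a 64 MiB block size?
BLOCK_SIZE=$((64 * 1024 * 1024))   # 64 MiB in bytes
FILE_SIZE=1000                     # a small text file, well under 64 MiB
BLOCKS=$(( (FILE_SIZE + BLOCK_SIZE - 1) / BLOCK_SIZE ))  # ceiling division
echo "blocks needed: $BLOCKS"      # prints: blocks needed: 1

# A 200 MiB file, by contrast, splits into 4 blocks (3 full + 1 partial),
# and the Client opens a separate pipeline for each block.
BIG=$((200 * 1024 * 1024))
echo "blocks needed: $(( (BIG + BLOCK_SIZE - 1) / BLOCK_SIZE ))"
```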