Replication in Hadoop
Replication of data is extremely important in today's world. It gives you durability of data, removes the single point of failure, and makes your setup fault tolerant.
Replication Factor (RF) is the number of DataNodes a block of data is copied to.
It is up to the Client to decide the replication factor for a file, depending on how important the file is. By default its value is 3. The Client can set the RF while uploading the file, or configure it in hdfs-site.xml:
#hadoop fs -Ddfs.replication=4 -put t3.txt /
or in /etc/hadoop/hdfs-site.xml:
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>4</value>
  </property>
</configuration>
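If the file is already in HDFS, its replication factor can also be changed afterwards with the standard setrep command (shown here on the same test file; -w simply waits until the new factor is actually reached):

#hadoop fs -setrep -w 4 /t3.txt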
I explained earlier how data transfer takes place directly between the Client and the DataNodes: https://www.dhirubhai.net/pulse/hadoop-breaking-myths-proof-ishan-singhal
How does replication happen?
The Client receives the IPs of the DataNodes from the NameNode and copies the data block to one DataNode. That DataNode then replicates the block to the other DataNodes until the replication factor is met.
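You can verify which DataNodes actually hold the replicas of each block with fsck (a standard HDFS command, run here against the same test file):

#hadoop fsck /t3.txt -files -blocks -locations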
Reading a file: what happens if a DataNode shuts down or crashes during the read?
#hadoop fs -cat /t3.txt
By capturing traffic on port 9001 with tcpdump, we observe that the Client first gets the block locations from the NameNode.
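A minimal sketch of the captures used here, assuming the NameNode RPC port is 9001, the DataNode transfer port is 50010, and eth0 is the interface carrying the traffic (adjust these to your cluster):

#tcpdump -i eth0 port 9001
#tcpdump -i eth0 port 50010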
By observing port 50010, I see that the Client is reading the data from DataNode 3, so I decide to terminate that DataNode midway through the read.
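One way to simulate the crash is to stop the DataNode daemon on that machine (a sketch, assuming a hadoop-daemon.sh based deployment; killing the process or powering off the node works just as well):

#hadoop-daemon.sh stop datanode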
The data read stops. After a short while, the data transfer continues again, this time from DataNode 2, which holds another replica of the same block.
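The cluster's own view of the failure can be checked with the standard admin commands; dfsadmin -report will eventually list the node as dead, and fsck will report the affected blocks as under-replicated until the NameNode re-replicates them to the remaining DataNodes:

#hadoop dfsadmin -report
#hadoop fsck /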