Replication in Hadoop

Replication of data is extremely important in today's world. It ensures durability of data, eliminates single points of failure, and makes your setup fault tolerant.

The replication factor is the number of DataNodes a block of data is copied to.

It is up to the client to decide the replication factor for a file, depending on the file's importance; by default its value is 3. The client can set it while uploading the file:

# hadoop fs -Ddfs.replication=4 -put t3.txt /

Or configure it in /etc/hadoop/hdfs-site.xml:

...
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>4</value>
  </property>
</configuration>
...
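
The replication factor of a file that is already in HDFS can also be changed afterwards with setrep. A minimal sketch, reusing the t3.txt file from above (-w waits until the new replication level is reached):

# hadoop fs -setrep -w 4 /t3.txt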

I explained earlier how data transfer takes place directly between the client and DataNodes: https://www.dhirubhai.net/pulse/hadoop-breaking-myths-proof-ishan-singhal


How does replication happen?

The client receives the IPs of the target DataNodes from the NameNode and copies the data block to the first DataNode. That DataNode then forwards the block to the next DataNode, and so on, until all replicas are written.
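
One way to verify where the replicas actually landed is to ask the NameNode directly with fsck. A quick check, reusing the t3.txt file uploaded above:

# hdfs fsck /t3.txt -files -blocks -locations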


Reading a file: what happens if a DataNode shuts down or crashes during a read?

# hadoop fs -cat /t3.txt

By capturing traffic on port 9001 with tcpdump, we can observe the client fetching the block locations from the NameNode.
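
A sketch of the capture used here, assuming the NameNode RPC service is on port 9001 and the network interface is eth0:

# tcpdump -n -i eth0 tcp port 9001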


By observing port 50010 (the DataNode data-transfer port), I see that the client is reading the data from DataNode 3, so I decide to terminate that node midway through the read.
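
The corresponding capture on the data-transfer side, under the same interface assumption:

# tcpdump -n -i eth0 tcp port 50010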


The data read stops.


After a while, I see that the data transfer continues from DataNode 2, which holds another replica of the block.
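
To confirm the cluster's view of the terminated node, one can list the DataNodes the NameNode currently knows about, along with their last-contact times (a sketch, assuming admin access on the NameNode):

# hdfs dfsadmin -report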


Hence we see how replicas provide fault tolerance.
