HDFS Architecture in Depth

Hadoop consists of two main core components: HDFS and MapReduce. HDFS, the Hadoop Distributed File System, is where the data is stored. It uses a master-slave architecture to distribute, store, and retrieve data efficiently.

As part of this blog, I will explain how the architecture is designed to be fault tolerant, covering details such as the replication factor, block locations, racks, block ids, block size, and the health status of a file.

The default replication factor can be set via hdfs-site.xml.
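
For reference, a minimal sketch of the relevant hdfs-site.xml entry; dfs.replication is the standard property and 3 is the usual out-of-the-box default, so adjust the value to suit your cluster:

    <!-- hdfs-site.xml: cluster-wide default replication factor -->
    <property>
      <name>dfs.replication</name>
      <value>3</value>
    </property>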

We can also change the replication factor on a per-file basis using the Hadoop FS shell.
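
For example, something along these lines should work (the file path is just a placeholder; the -w flag makes the command wait until replication completes):

    # set the replication factor of a single file to 2
    hdfs dfs -setrep -w 2 /user/hadoop/sample.txt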

Alternatively, you can change the replication factor of all the files under a directory.
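
A sketch of the same command pointed at a directory (the path is illustrative); setrep applied to a directory changes the replication factor of every file under it:

    # set the replication factor of all files under a directory to 2
    hdfs dfs -setrep -w 2 /user/hadoop/data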

On copying a file to HDFS, it is split according to the block size and distributed across the data nodes. The default block size can be changed using the below configuration.
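
A minimal hdfs-site.xml sketch; dfs.blocksize is the property on Hadoop 2.x and later, and 134217728 bytes (128 MB) is the usual default:

    <!-- hdfs-site.xml: default block size for newly created files -->
    <property>
      <name>dfs.blocksize</name>
      <value>134217728</value>  <!-- 128 MB -->
    </property>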

Now let's copy a file from the local file system (LFS) to the Hadoop Distributed File System (HDFS) and see how the data is copied and what happens internally.
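
As an illustration (the paths are placeholders), the copy itself is a single command:

    # copy a local file into HDFS; -copyFromLocal works the same way
    hdfs dfs -put /home/user/sample.txt /user/hadoop/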

The NameNode holds all the metadata related to a file, such as the replication factor, block locations, and racks. We can view this information by executing the command below.
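
The command in question is fsck; a typical invocation (the path is a placeholder) looks like this:

    # report files, block ids, replica locations and rack placement
    hdfs fsck /user/hadoop/sample.txt -files -blocks -locations -racks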

On running the above command, the gateway node runs fsck and connects to the NameNode. The NameNode checks for the file and the time it was created.

Next, the NameNode goes to the particular block pool id that contains the metadata information for the file.

Based on the block pool id, it looks up the block ids, the data nodes holding each replica, and details such as the racks on which the data is stored, according to the replication factor.

Further, it reports the over-replicated, under-replicated, and corrupt blocks, the number of data nodes and racks used, and the health status of the file system.
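
To give a rough idea of the summary fsck prints (the numbers here are purely illustrative, and the exact fields vary slightly between Hadoop versions):

     Total size:                  134217728 B
     Total blocks (validated):    1 (avg. block size 134217728 B)
     Minimally replicated blocks: 1 (100.0 %)
     Over-replicated blocks:      0 (0.0 %)
     Under-replicated blocks:     0 (0.0 %)
     Mis-replicated blocks:       0 (0.0 %)
     Default replication factor:  3
     Corrupt blocks:              0
     Missing replicas:            0 (0.0 %)
     Number of data-nodes:        3
     Number of racks:             1
    The filesystem under path '/user/hadoop/sample.txt' is HEALTHY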

Apart from this, the scheduler also plays a role in distributing resources and scheduling the jobs that store data into HDFS. In this case, I'm using the YARN architecture. The details related to scheduling are present in yarn-site.xml. The default scheduler is the Capacity Scheduler.
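
For reference, the scheduler is selected in yarn-site.xml via the yarn.resourcemanager.scheduler.class property; with the Capacity Scheduler (the default) it would look roughly like this:

    <!-- yarn-site.xml: ResourceManager scheduler implementation -->
    <property>
      <name>yarn.resourcemanager.scheduler.class</name>
      <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value>
    </property>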

The commands executed as part of this post are available in my Git account.

Note: Similarly, you can also read about Hive Architecture in Depth with code.

If you found this article useful, please click the like and share buttons and let others know about it. Further, if you would like me to add anything else, please feel free to leave a response.

Daniel Beach, Senior Data Engineer at Rippleshot:

Great post! I just spent hours troubleshooting Spark jobs running inside YARN. Turns out even though I could do passwordless SSH, telnet into my Hadoop data nodes wasn't working. I'm learning Hadoop can be tricky and the devil's in the details!
