HDFS Architecture in Depth
Jayvardhan Reddy Vanchireddy
Senior Data Engineer at Cognizant | Ex-Honeywell | #ONO | #Azure | #German B1 Level Certified | Writer@Medium | #BigData Engineer
Hadoop consists of two core components: HDFS and MapReduce. HDFS (Hadoop Distributed File System) is where the data is stored. It uses a master-slave architecture to distribute, store, and retrieve data efficiently.
As part of this blog, I will explain how the architecture is designed to be fault tolerant and cover details such as the replication factor, block locations, racks, block IDs, block size, and the health status of a file.
The default replication factor can be set via hdfs-site.xml.
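As a minimal sketch, the property looks like this in hdfs-site.xml (3 is the standard default; the value shown is illustrative):

    <!-- hdfs-site.xml: cluster-wide default replication factor -->
    <property>
        <name>dfs.replication</name>
        <value>3</value>
    </property>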
We can also change the replication factor on a per-file basis using the Hadoop FS shell.
Alternatively, you can change the replication factor of all the files under a directory.
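For example, both cases can be handled with the setrep command of the Hadoop FS shell (the paths below are placeholders for your own files):

    # Set the replication factor of a single file to 2 (-w waits until the change completes)
    hadoop fs -setrep -w 2 /user/hadoop/sample.txt

    # When the path is a directory, the new factor is applied to every file under it
    hadoop fs -setrep -w 2 /user/hadoop/data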
When a file is copied to HDFS, it is split according to the block size and distributed across the DataNodes. The default block size can be changed using the configuration below.
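A minimal sketch of that setting in hdfs-site.xml (134217728 bytes = 128 MB, the usual default; adjust the value to your needs):

    <!-- hdfs-site.xml: default block size for newly created files -->
    <property>
        <name>dfs.blocksize</name>
        <value>134217728</value>
    </property>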
Now let’s copy a file from the local file system (LFS) to the Hadoop Distributed File System (HDFS) and see how the data is copied and what happens internally.
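The copy itself can be done with the standard HDFS shell; the local file and target path below are placeholders:

    # Copy a file from the local file system into HDFS
    hdfs dfs -copyFromLocal /home/user/sample.txt /user/hadoop/sample.txt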
The NameNode holds all the metadata related to the file, such as the replication factor, block locations, and racks. We can view this information by executing the command below.
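The metadata can be inspected with hdfs fsck; the path below is a placeholder for the file copied above:

    # Show the files, block IDs, DataNode locations and rack placement for a given path
    hdfs fsck /user/hadoop/sample.txt -files -blocks -locations -racks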
On running the above command, the gateway node runs fsck and connects to the NameNode. The NameNode checks for the file and the time it was created.
Next, the NameNode looks up the block pool ID that holds the metadata for the file.
Based on the block pool ID, it retrieves the block IDs, the DataNodes holding each replica, and details such as the rack on which each replica is stored, according to the replication factor.
Further, it reports the over-replicated blocks, under-replicated blocks, corrupt blocks, the number of DataNodes and racks used, and the overall health status of the file system.
Apart from this, the scheduler also plays a role in distributing resources and scheduling jobs that write data into HDFS. In this case, I’m using the YARN architecture. The scheduling details are configured in yarn-site.xml, and the default scheduler is the Capacity Scheduler.
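As an illustrative sketch, the scheduler is selected in yarn-site.xml via the following property (the CapacityScheduler class shown is the default value):

    <!-- yarn-site.xml: scheduler used by the ResourceManager -->
    <property>
        <name>yarn.resourcemanager.scheduler.class</name>
        <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value>
    </property>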
The commands executed as part of this post are available in my GitHub account: https://github.com/Jayvardhan-Reddy/BigData-Ecosystem-Architecture
Note: Similarly, you can also read about Hive Architecture in Depth with code.
If you found this article useful, please like and share it to let others know about it. Further, if you would like me to add anything else, please feel free to leave a response.