HDFS Architecture (Basic concepts)
HDFS is a block-structured file system: each file is split into blocks of a predefined size, and these blocks are stored across a cluster of one or more machines. The file system follows a master/slave architecture, in which a cluster consists of a single NameNode (the master node) and all of the other nodes are DataNodes (the slave nodes). HDFS can be deployed on any machine that supports Java. Although one can run several DataNodes on a single machine, in practice the DataNodes are distributed across different machines.
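As a small illustration of this master/slave split, here is a minimal sketch using the Hadoop Java client API. The NameNode address hdfs://namenode-host:9000 is a placeholder assumption, not a fixed value; a client always bootstraps by contacting the single NameNode, while the file bytes themselves are later served by the DataNodes.

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class ConnectToHdfs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // The client starts by contacting the single NameNode (master).
        // "namenode-host:9000" is a hypothetical address; substitute your cluster's.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode-host:9000"), conf);
        System.out.println("Connected to: " + fs.getUri());
        fs.close();
    }
}
```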
NameNode
The NameNode is the master node in HDFS; it maintains and manages the blocks stored on the DataNodes. It is a highly available server that manages the file system namespace and controls clients' access to files. The HDFS architecture is designed so that user data never resides on the NameNode; the data resides only on the DataNodes.
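To make the "metadata only" role concrete, the sketch below asks for a file's metadata via the Java API (the path /user/hadoop/example.txt is hypothetical). Everything printed here is answered from the NameNode's namespace; no file content is transferred and no DataNode is contacted.

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class NamespaceLookup {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode-host:9000"), new Configuration());
        // getFileStatus is served entirely by the NameNode's namespace metadata;
        // no user data flows through the NameNode.
        FileStatus status = fs.getFileStatus(new Path("/user/hadoop/example.txt"));
        System.out.println("size (bytes): " + status.getLen());
        System.out.println("block size  : " + status.getBlockSize());
        System.out.println("replication : " + status.getReplication());
        fs.close();
    }
}
```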
DataNode
DataNodes are the slave nodes in HDFS. Unlike the NameNode, a DataNode is commodity hardware, that is, an inexpensive machine of modest quality. The DataNode is a block server that stores the data in a local file system such as ext3 or ext4.
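When a client reads a file, the NameNode only tells it where the blocks live; the bytes then stream directly from the DataNodes that hold them. A minimal read sketch, using the same hypothetical cluster address and path as above:

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class ReadFromDataNodes {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode-host:9000"), new Configuration());
        // open() asks the NameNode for block locations; the returned stream
        // then pulls the bytes straight from the DataNodes storing the blocks.
        try (FSDataInputStream in = fs.open(new Path("/user/hadoop/example.txt"))) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
        fs.close();
    }
}
```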
Blocks
Blocks are nothing but the smallest contiguous locations on your hard disk where data is stored. In general, every file system stores data as a set of blocks. Similarly, HDFS stores each file as blocks that are scattered throughout the Apache Hadoop cluster. The default size of each block is 128 MB in Apache Hadoop 2.x (64 MB in Apache Hadoop 1.x), and you can configure it as needed.
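The block size can be configured cluster-wide through the dfs.blocksize property (the same key used in hdfs-site.xml), or per file at creation time. A sketch of both options via the Java API; the sizes and path here are illustrative, not recommendations:

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeConfig {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Client-side default for new files: 256 MB (same key as in hdfs-site.xml).
        conf.set("dfs.blocksize", "268435456");
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode-host:9000"), conf);

        // Per-file override: this create() overload takes the block size explicitly.
        long blockSize = 64L * 1024 * 1024;          // 64 MB for this file only
        try (FSDataOutputStream out = fs.create(
                new Path("/user/hadoop/small-blocks.txt"),
                true,       // overwrite if it exists
                4096,       // buffer size
                (short) 3,  // replication factor
                blockSize)) {
            out.writeUTF("hello HDFS");
        }
        fs.close();
    }
}
```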
It is not necessary for every file in HDFS to occupy an exact multiple of the configured block size. Take, for example, a file example.txt of 514 MB. Assuming the default block size setting of 128 MB, how many blocks will be created? Five: the first four blocks are 128 MB each, but the last block is only 2 MB (514 − 4 × 128 = 2).
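You can verify this split on a live cluster by asking for the file's block locations. The sketch below (hypothetical path, real getFileBlockLocations API) would print five entries for the 514 MB example, the last one 2 MB long:

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListBlocks {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode-host:9000"), new Configuration());
        FileStatus status = fs.getFileStatus(new Path("/user/hadoop/example.txt"));
        // One BlockLocation per block; for a 514 MB file with 128 MB blocks
        // this loop prints 5 lines: four 128 MB blocks and a final 2 MB block.
        for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.printf("offset=%d length=%d hosts=%s%n",
                    block.getOffset(), block.getLength(), String.join(",", block.getHosts()));
        }
        fs.close();
    }
}
```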
Well, whenever we talk about HDFS, we are talking about very large data sets, that is, terabytes and petabytes of data. If we used a block size of 4 KB, as in a typical Linux file system, we would need an enormous number of blocks and, with them, an enormous amount of metadata. Managing that many blocks and that much metadata would create huge overhead on the NameNode, and that is exactly what we want to avoid.
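Some back-of-the-envelope arithmetic makes this overhead concrete. The sizes below are illustrative, and the per-block memory cost on the NameNode is a commonly cited rule of thumb (roughly 150 bytes per block object), not an exact figure:

```java
public class BlockOverhead {
    public static void main(String[] args) {
        long fileSize = 1L << 40;                 // 1 TB of data
        long smallBlock = 4L * 1024;              // 4 KB, as in many Linux file systems
        long hdfsBlock = 128L * 1024 * 1024;      // 128 MB, the HDFS 2.x default

        // Rule-of-thumb assumption: each block costs the NameNode ~150 bytes of RAM.
        long perBlockMeta = 150;

        long smallCount = fileSize / smallBlock;  // ~268 million blocks
        long hdfsCount = fileSize / hdfsBlock;    // 8,192 blocks

        System.out.printf("4 KB blocks  : %,d blocks, ~%,d MB of NameNode metadata%n",
                smallCount, smallCount * perBlockMeta / (1024 * 1024));
        System.out.printf("128 MB blocks: %,d blocks, ~%,d KB of NameNode metadata%n",
                hdfsCount, hdfsCount * perBlockMeta / 1024);
    }
}
```

With 4 KB blocks, a single terabyte would cost the NameNode tens of gigabytes of metadata; with 128 MB blocks, the same terabyte costs about a megabyte. That difference is the reason HDFS uses such large blocks.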