HDFS (Hadoop Distributed File System) is the distributed file system that ships with Hadoop. It is designed to store and manage very large datasets, and it provides fault-tolerant storage for big data processing.
- Fault-tolerant: HDFS detects faults on any machine in the cluster and recovers from them automatically.
- High throughput: HDFS is designed to store large volumes of data, on the order of millions of rows, and to scan them at a high rate.
- Cost-effective: HDFS runs on commodity hardware, which is inexpensive and easy to obtain.
- Parallel processing: Because the blocks of a file are spread across many machines, they can be read and processed in parallel.
- Data locality: HDFS moves the processing to the node that holds the data, rather than moving the data to the processing unit.
- Name Node: The central controller of HDFS. It manages the file system namespace and regulates access to files.
- Data Node: A slave node that stores the data on its local file system, such as ext3 or ext4.
HDFS divides each file into fixed-size blocks and stores every block on one or more Data Nodes.
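The block split described above can be sketched in a few lines. This is only an illustration: the function name `split_into_blocks` is hypothetical, and the 128 MiB block size is an assumption (it is the default in recent Hadoop releases, but it is configurable and is not stated in this article).

```python
from typing import List, Tuple

# Assumption: 128 MiB blocks. The real value is a cluster setting.
DEFAULT_BLOCK_SIZE = 128 * 1024 * 1024


def split_into_blocks(file_size: int,
                      block_size: int = DEFAULT_BLOCK_SIZE) -> List[Tuple[int, int]]:
    """Return (offset, length) pairs that cover a file of file_size bytes."""
    if file_size < 0 or block_size <= 0:
        raise ValueError("file_size must be >= 0 and block_size must be > 0")
    blocks = []
    offset = 0
    while offset < file_size:
        # Every block is full-sized except possibly the last one.
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks


if __name__ == "__main__":
    mib = 1024 * 1024
    for offset, length in split_into_blocks(300 * mib):
        print(f"offset={offset // mib} MiB, length={length // mib} MiB")
```

Under these assumptions a 300 MiB file becomes three blocks: two full blocks of 128 MiB and a final block of 44 MiB. The last block occupies only as much space as the data it holds.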
Nodes: An HDFS cluster is typically made up of master and slave nodes.
- Name Node (Master Node): Manages all the slave nodes and assigns work to them. It executes file system namespace operations such as opening, closing, and renaming files and directories. It should be deployed on reliable, high-specification hardware, not on commodity hardware.
- Data Node (Slave Node): The actual worker nodes, which do the work of reading, writing, and processing data. They also create, delete, and replicate blocks on instruction from the master. They can be deployed on commodity hardware.
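To make the division of labour concrete, here is a toy model of the bookkeeping the Name Node does. It is a sketch only: the class `NamespaceSketch` and its methods are hypothetical names chosen for illustration and are not part of any Hadoop API. The point it shows is that namespace operations such as a rename touch only the master's records, while the block data stays on the Data Nodes.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class NamespaceSketch:
    """Toy stand-in for the records a Name Node keeps (hypothetical)."""
    # file path -> ordered list of block IDs
    file_blocks: Dict[str, List[int]] = field(default_factory=dict)
    # block ID -> names of the Data Nodes that hold a replica
    block_locations: Dict[int, Set[str]] = field(default_factory=dict)

    def add_file(self, path: str, block_ids: List[int]) -> None:
        self.file_blocks[path] = list(block_ids)
        for block_id in block_ids:
            self.block_locations.setdefault(block_id, set())

    def report_replica(self, block_id: int, datanode: str) -> None:
        # In this sketch a Data Node "reports" each block it stores.
        self.block_locations.setdefault(block_id, set()).add(datanode)

    def rename(self, old_path: str, new_path: str) -> None:
        # Only the path entry changes; no block is moved or copied.
        self.file_blocks[new_path] = self.file_blocks.pop(old_path)

    def locate(self, path: str) -> List[Set[str]]:
        # For each block of the file, the Data Nodes that can serve it.
        return [self.block_locations[b] for b in self.file_blocks[path]]
```

A client would ask the master where the blocks of a file live and then read them from the listed Data Nodes:

```python
ns = NamespaceSketch()
ns.add_file("/logs/a.txt", [1, 2])
ns.report_replica(1, "dn1")
ns.report_replica(1, "dn2")
ns.report_replica(2, "dn3")
ns.rename("/logs/a.txt", "/logs/b.txt")
print(ns.locate("/logs/b.txt"))
```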
HDFS daemons: Daemons are processes that run in the background.
- NameNode: Runs on the master node. Stores metadata (data about data) such as file paths, the number of blocks, and block IDs. Requires a large amount of RAM, because the metadata is kept in RAM for fast retrieval, i.e., to avoid disk seeks. A persistent copy is also kept on disk.
- DataNodes: Run on the slave nodes. Require a large amount of disk space, since the data itself is stored here.
- Heartbeat: The signal that each datanode sends to the namenode at regular intervals. If the namenode stops receiving heartbeats from a datanode, it considers that datanode dead.
- Balancing: If a datanode crashes, the blocks stored on it are lost with it, and those blocks become under-replicated compared to the rest. The master node (namenode) then signals the datanodes that hold replicas of the lost blocks to replicate them, so that the overall distribution of blocks is balanced again.
- Replication: The copying itself is carried out by the datanodes.
Note: No two replicas of the same block are stored on the same datanode.
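The three terms above fit together as one loop: missed heartbeats mark a datanode as dead, its blocks become under-replicated, and the namenode tells surviving datanodes to copy them elsewhere. The sketch below simulates that loop. The function names, the 30-second timeout, and the target of 3 replicas are all assumptions made for the example, not values taken from HDFS itself.

```python
from typing import Dict, List, Set, Tuple

# Assumptions for this sketch only.
HEARTBEAT_TIMEOUT = 30.0  # seconds of silence before a datanode counts as dead
TARGET_REPLICAS = 3       # desired number of replicas per block


def dead_datanodes(last_heartbeat: Dict[str, float], now: float,
                   timeout: float = HEARTBEAT_TIMEOUT) -> Set[str]:
    """Datanodes whose most recent heartbeat is older than the timeout."""
    return {dn for dn, seen in last_heartbeat.items() if now - seen > timeout}


def plan_re_replication(block_locations: Dict[int, Set[str]],
                        live: Set[str],
                        target: int = TARGET_REPLICAS) -> List[Tuple[int, str, str]]:
    """Return (block_id, source, destination) copy instructions.

    A destination is never a datanode that already holds the block, so no
    two replicas of the same block end up on the same datanode.
    """
    plan = []
    for block_id, holders in sorted(block_locations.items()):
        live_holders = holders & live
        if not live_holders:
            continue  # every replica is gone; there is nothing to copy from
        missing = max(target - len(live_holders), 0)
        candidates = sorted(live - live_holders)
        source = sorted(live_holders)[0]
        for destination in candidates[:missing]:
            plan.append((block_id, source, destination))
    return plan
```

For example, with four datanodes of which `dn3` has been silent for too long:

```python
last_heartbeat = {"dn1": 100.0, "dn2": 98.0, "dn3": 40.0, "dn4": 99.0}
dead = dead_datanodes(last_heartbeat, now=101.0)
live = set(last_heartbeat) - dead
blocks = {1: {"dn1", "dn2", "dn3"}, 2: {"dn1", "dn2", "dn4"}}
print(plan_re_replication(blocks, live))
```

Block 1 has lost a replica, so the plan copies it from a surviving holder to `dn4`, the only live datanode that does not already hold it. Block 2 still has all of its replicas and is left alone.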
- Distributed data storage.
- Blocks reduce seek time.
- The data is highly available as the same block is present at multiple datanodes.
- Even if multiple datanodes are down we can still do our work, thus making it highly reliable.
- High fault tolerance.
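The availability claim can be checked with a small sketch: a block stays readable as long as at least one datanode holding a replica is still up. The function name `unreadable_blocks` and the sample placement are hypothetical and serve only to illustrate the idea.

```python
from typing import Dict, Set


def unreadable_blocks(block_locations: Dict[int, Set[str]],
                      down: Set[str]) -> Set[int]:
    """Blocks whose every replica sits on a datanode that is down."""
    return {block_id for block_id, holders in block_locations.items()
            if holders <= down}


if __name__ == "__main__":
    placement = {1: {"dn1", "dn2", "dn3"}, 2: {"dn2", "dn3", "dn4"}}
    # Two datanodes down: each block still has a live replica.
    print(unreadable_blocks(placement, {"dn1", "dn2"}))
    # Block 2 is lost only when all three of its holders are down.
    print(unreadable_blocks(placement, {"dn2", "dn3", "dn4"}))
```

With three replicas per block in this example, any two datanodes can fail without making a block unreadable.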
Limitations: Although HDFS provides many features, there are some areas where it does not work well.
- Low-latency data access: Applications that need low-latency access to data, i.e., in the range of milliseconds, will not work well with HDFS, because HDFS is designed for high data throughput, even at the cost of latency.
- Small file problem: A large number of small files leads to many seeks and many hops from one datanode to another to retrieve each file, which is a very inefficient data access pattern.
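A rough count shows why small files hurt. The sketch below compares the same volume of data stored as one large file and as many small ones. The function names are hypothetical, and two assumptions go beyond this article: a 128 MiB block size, and about 150 bytes of namenode memory per file or block object, which is a commonly quoted rule of thumb rather than an exact figure.

```python
import math

# Assumptions for this estimate only.
BLOCK_SIZE = 128 * 1024 * 1024
BYTES_PER_OBJECT = 150  # rough rule of thumb, not an exact figure


def namenode_objects(file_count: int, file_size: int,
                     block_size: int = BLOCK_SIZE) -> int:
    """Count one object per file plus one per block (directories ignored)."""
    blocks_per_file = math.ceil(file_size / block_size)
    return file_count * (1 + blocks_per_file)


def estimated_metadata_bytes(file_count: int, file_size: int) -> int:
    """Approximate namenode memory used to track these files."""
    return namenode_objects(file_count, file_size) * BYTES_PER_OBJECT


if __name__ == "__main__":
    gib = 1024 ** 3
    mib = 1024 ** 2
    print("one 10 GiB file:    ", namenode_objects(1, 10 * gib), "objects")
    print("10240 files of 1 MiB:", namenode_objects(10240, 1 * mib), "objects")
```

Under these assumptions, 10 GiB stored as a single file needs 81 objects, while the same 10 GiB stored as 10240 files of 1 MiB needs 20480. Every one of those small files is also a separate lookup and a separate read, which matches the access pattern described above.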