HDFS

Rohit Singh

Associate Project Manager @ HuQuo

Published: February 3, 2025

HDFS stands for Hadoop Distributed File System, the distributed file system that is part of Hadoop. It is designed to store and manage large datasets and provides fault-tolerant storage for big data processing.

Features of HDFS

Fault-tolerant: HDFS can detect faults on any of the machines in the cluster and recover from them automatically.
High throughput: HDFS is designed to store and scan very large volumes of data, on the order of millions of rows.
Cost-effective: HDFS can be built on commodity hardware, which is low-priced and easily available.
Parallel processing: HDFS spreads data across machines so that it can be processed in parallel.
Data locality: HDFS moves the processing unit to the data, rather than the data to the processing unit.
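
To make the data locality idea concrete, here is a minimal sketch in Python. It is not Hadoop's scheduler: the function name `pick_worker` and the node names are hypothetical, and the only rule it models is "prefer a worker that already holds the block".

```python
def pick_worker(block_replicas, free_workers):
    """Choose a worker for a task that reads one block.

    block_replicas: set of node names that store a replica of the block.
    free_workers:   list of node names that can run a task right now.
    Returns (worker, is_local).
    """
    # Prefer a free worker that already stores the block: the computation
    # moves to the data and no block has to cross the network.
    for worker in free_workers:
        if worker in block_replicas:
            return worker, True
    # Otherwise fall back to any free worker, which must fetch the block.
    return free_workers[0], False


print(pick_worker({"dn1", "dn3"}, ["dn2", "dn3"]))
print(pick_worker({"dn1"}, ["dn2"]))
```

The first call finds a free worker that holds a replica, so the task runs where the data is; the second has to fall back to a remote read.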

Components of HDFS

Name Node: The central controller of HDFS, which manages the file system namespace and regulates access to files.
Data Node: A slave node that stores the data on its local file system, such as ext3 or ext4.

HDFS divides files into blocks and stores each block on a Data Node.
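
As a rough illustration of block splitting, the sketch below computes the block sizes for a file. The 128 MB block size is an assumption (it is a common default in recent Hadoop releases, but the article does not state one), and `split_into_blocks` is a hypothetical helper, not an HDFS API.

```python
MB = 1024 * 1024
BLOCK_SIZE = 128 * MB  # assumed block size; configurable in a real cluster


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the sizes (in bytes) of the blocks a file would be split into."""
    if file_size <= 0:
        return []
    full_blocks, remainder = divmod(file_size, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        # The last block only holds the leftover bytes.
        sizes.append(remainder)
    return sizes


blocks = split_into_blocks(300 * MB)
print([size // MB for size in blocks])  # block sizes in MB
```

Under these assumptions a 300 MB file becomes two full 128 MB blocks plus one 44 MB block, and each of the three can be placed on a different Data Node.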

Nodes: An HDFS cluster is typically formed of master and slave nodes.

Name Node (Master Node): Manages all the slave nodes and assigns work to them. It executes file system namespace operations such as opening, closing, and renaming files and directories. It should be deployed on reliable, high-specification hardware, not on commodity hardware.
Data Node (Slave Node): The actual worker nodes, which do the actual work such as reading, writing, and processing. They also perform block creation, deletion, and replication upon instruction from the master. They can be deployed on commodity hardware.

HDFS daemons: Daemons are processes that run in the background.

NameNode: Runs on the master node. Stores metadata (data about data) such as file paths, the number of blocks, and block IDs. Requires a large amount of RAM, because the metadata is held in RAM for fast retrieval, i.e. to reduce seek time; a persistent copy is also kept on disk.
DataNodes: Run on the slave nodes. Require a large amount of storage (disk), as the data is actually stored here.
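
The kind of metadata the NameNode keeps in memory can be pictured with a small sketch. The class and field names below are hypothetical and greatly simplified; the real NameNode data structures are more involved.

```python
from dataclasses import dataclass, field


@dataclass
class NameNodeMetadata:
    """Toy in-memory metadata: which blocks make up a file, and where they live."""
    files: dict = field(default_factory=dict)      # file path -> ordered block IDs
    locations: dict = field(default_factory=dict)  # block ID -> set of datanode names

    def add_file(self, path, blocks):
        """blocks: list of (block_id, datanode_names) in file order."""
        self.files[path] = [block_id for block_id, _ in blocks]
        for block_id, nodes in blocks:
            self.locations[block_id] = set(nodes)

    def locate(self, path):
        """Answer a client lookup: each block of the file and the nodes holding it."""
        return [(b, sorted(self.locations[b])) for b in self.files[path]]


meta = NameNodeMetadata()
meta.add_file("/logs/a.txt", [("blk_1", ["dn1", "dn2"]), ("blk_2", ["dn2", "dn3"])])
print(meta.locate("/logs/a.txt"))
```

Because lookups like `locate` are plain dictionary reads in RAM, they are fast; the file contents themselves never pass through this structure and stay on the DataNodes.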

Terms related to HDFS:

Heartbeat: The signal that each datanode continuously sends to the namenode. If the namenode does not receive a heartbeat from a datanode, it considers that datanode dead.
Balancing: If a datanode crashes, the blocks stored on it are lost with it, and those blocks become under-replicated compared with the rest. The master node (namenode) then signals the datanodes that hold replicas of the lost blocks to replicate them, so that the overall distribution of blocks is balanced again.
Replication: Performed by the datanodes, upon instruction from the namenode.

Note: No two replicas of the same block are present on the same datanode.
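
The heartbeat, balancing, and replication terms above can be tied together in a short simulation. This is a sketch under stated assumptions, not HDFS code: the 30-second timeout, the replication factor of 3, and the function names are all illustrative.

```python
HEARTBEAT_TIMEOUT = 30.0  # seconds; illustrative value, not an HDFS default


def dead_datanodes(last_heartbeat, now, timeout=HEARTBEAT_TIMEOUT):
    """Datanodes whose most recent heartbeat is older than the timeout."""
    return {dn for dn, seen in last_heartbeat.items() if now - seen > timeout}


def plan_rereplication(block_locations, live_nodes, replication=3):
    """Map each under-replicated block to the live nodes that should get a new copy.

    A target is never a node that already holds the block, so no two
    replicas of the same block end up on the same datanode.
    """
    plan = {}
    for block_id, holders in block_locations.items():
        alive = holders & live_nodes
        missing = replication - len(alive)
        if missing <= 0 or not alive:
            continue  # fully replicated, or no surviving replica to copy from
        candidates = sorted(live_nodes - alive)
        plan[block_id] = candidates[:missing]
    return plan


last_heartbeat = {"dn1": 100.0, "dn2": 95.0, "dn3": 40.0, "dn4": 99.0}
dead = dead_datanodes(last_heartbeat, now=100.0)
live = set(last_heartbeat) - dead
blocks = {"blk_1": {"dn1", "dn2", "dn3"}, "blk_2": {"dn1", "dn2", "dn4"}}
print(dead, plan_rereplication(blocks, live))
```

Here `dn3` has been silent for 60 seconds and is treated as dead, so `blk_1` drops to two live replicas and is scheduled for a new copy on `dn4`, while `blk_2` is untouched.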

Features:

Distributed data storage.
Blocks reduce seek time.
The data is highly available as the same block is present at multiple datanodes.
Even if multiple datanodes are down we can still do our work, thus making it highly reliable.
High fault tolerance.

Limitations: Though HDFS provides many features, there are some areas where it does not work well.

Low-latency data access: Applications that require low-latency access to data, i.e. in the range of milliseconds, will not work well with HDFS, because HDFS is designed to deliver high data throughput even at the cost of latency.
Small file problem: Storing lots of small files results in lots of seeks and lots of hopping from one datanode to another to retrieve each small file, which is a very inefficient data access pattern.
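
A quick calculation shows why small files hurt. The sketch assumes a 128 MB block size (not stated in the article) and uses a hypothetical helper, `blocks_needed`, to count how many separate blocks must be located and read for the same amount of data.

```python
import math

MB = 1024 * 1024


def blocks_needed(file_sizes, block_size=128 * MB):
    """Total number of blocks for a collection of files (each file has its own blocks)."""
    return sum(math.ceil(size / block_size) for size in file_sizes)


small_files = [1 * MB] * 10_000   # 10,000 files of 1 MB each
one_big_file = [10_000 * MB]      # the same data volume as a single file

print(blocks_needed(small_files))
print(blocks_needed(one_big_file))
```

Under these assumptions the small files occupy 10,000 blocks while the single large file needs only 79, so reading the same data volume takes far fewer block lookups and seeks when it is stored as one large file.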
