In this article, we'll dive deeper into the Hadoop Distributed File System (HDFS), focusing on its intricate mechanisms. We'll explore its write and read operations, data storage architecture, data pipeline, and fault tolerance features. Understanding these technical aspects is crucial for effectively utilizing HDFS in large-scale data processing and analytics tasks. Let's unravel the inner workings of HDFS and how it ensures efficient and reliable management of massive datasets in distributed computing environments.
Read Mechanism in HDFS
The read process in HDFS involves several steps to retrieve data stored across multiple nodes:
- Client Request: The client interacts with the HDFS client library to read a file.
- Namenode Interaction: The client requests the namenode to fetch the metadata of the file, which includes the block locations.
- Block Location: The namenode responds with the block locations and the datanodes that store the replicas of these blocks.
- Data Retrieval: The client connects to the closest datanode to start reading the data blocks.
- Sequential Read: If the file is large and spans multiple blocks, the client continues to read from different datanodes as specified by the namenode.
Detailed Steps of Read Operation:
- Open File: The client calls the open() method on the FileSystem object to get the FSDataInputStream.
- Request Metadata: The client sends an open request containing the file path to the namenode.
- Namenode Response: The namenode looks up the metadata for the file and returns the block locations (list of blocks and datanodes).
- Read Blocks:
- Block 1: The client connects to the first datanode that holds the first block and reads the data.
- Block 2: After finishing Block 1, the client connects to the datanode holding the second block and reads the data.
- Error Handling: If a datanode fails, the client connects to another datanode holding a replica of the block.
- Checksum Verification: The client verifies checksums to ensure data integrity. If a mismatch is detected, it reads from another replica.
- Complete Read: The process continues until the client reads all blocks of the file.
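To make these steps concrete, here is a minimal read sketch using the Hadoop FileSystem API referenced above. The namenode URI and file path are illustrative assumptions, not values from any particular cluster:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed namenode address; replace with your cluster's fs.defaultFS.
        conf.set("fs.defaultFS", "hdfs://namenode:8020");
        try (FileSystem fs = FileSystem.get(conf);
             // open() contacts the namenode for block locations behind the scenes.
             FSDataInputStream in = fs.open(new Path("/user/data/file.txt"));
             BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
            String line;
            // Reading streams blocks from the nearest datanodes; checksums are
            // verified automatically, and failed datanodes are skipped in
            // favor of replicas.
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}
```

Note that the block-by-block mechanics, checksum verification, and failover to replicas all happen inside the client library; the application simply reads a stream.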
Write Mechanism in HDFS
The write process in HDFS is designed to ensure data reliability and integrity:
- Client Request: The client requests the namenode to create a new file.
- Namenode Response: The namenode verifies that the file does not already exist and that the client has permission to create it, then records the new file in its namespace; block IDs are allocated as blocks are written.
- Data Block Division: The data is divided into blocks of the configured block size (128 MB by default; 256 MB is a common override).
- Block Assignment: The namenode assigns a list of datanodes for each block, which will store the replicas.
- Write Pipeline: The client writes the data to the first datanode, which then forwards it to the next datanode, forming a pipeline until the final datanode in the replication chain.
- Acknowledgment: Each datanode sends an acknowledgment back to the previous node and ultimately to the client.
Detailed Steps of Write Operation:
- Create File: The client calls the create() method on the FileSystem object to create a new file in HDFS and obtain an FSDataOutputStream for writing.
- Request Block Locations: The client sends a request to the namenode to create the file and obtain block locations.
- Namenode Assigns Blocks: The namenode assigns blocks and provides the list of datanodes for replication.
- Write to Pipeline:
- Pipeline Setup: The client streams the first block to the first datanode in the pipeline.
- Data Transfer: The first datanode writes the block to its local storage and forwards the block to the second datanode.
- Replication: The second datanode writes the block to its local storage and forwards it to the third datanode, which stores the final replica, completing the replication process.
- Acknowledgment Process:
- Block Storage: Each datanode stores the block and sends an acknowledgment back through the pipeline.
- Client Confirmation: The client receives the final acknowledgment, confirming the block's successful write.
- Next Block: The client proceeds to the next block and repeats the process until the entire file is written.
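The client-side code for a write mirrors the read case. A minimal sketch, again with an assumed namenode URI and file path:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed namenode address; adjust for your cluster.
        conf.set("fs.defaultFS", "hdfs://namenode:8020");
        try (FileSystem fs = FileSystem.get(conf);
             // create() asks the namenode to create the file entry; blocks and
             // datanode pipelines are allocated as data is streamed.
             FSDataOutputStream out = fs.create(new Path("/user/data/file.txt"))) {
            // Writes are buffered into packets and pushed down the datanode
            // pipeline; replication happens transparently to the client.
            out.writeUTF("Hello, HDFS!");
        }
    }
}
```

Block allocation, pipeline setup, and acknowledgments are all handled inside FSDataOutputStream; the application just writes bytes.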
Write Pipeline and Acknowledgment:
- Write Pipeline: The client writes the data to the first datanode, which then forwards it to the second datanode, and so on, creating a pipeline.
- Acknowledgment: Each datanode sends an acknowledgment back to the previous node and ultimately to the client once the data block is successfully written and replicated.
Example Sequence for Read:
- Client requests to read file.txt.
- Namenode responds with metadata: Block1 (Datanode1, Datanode2), Block2 (Datanode3, Datanode4).
- Client connects to Datanode1 to read Block1.
- Client reads Block1 and verifies checksum.
- Client connects to Datanode3 to read Block2.
- Client reads Block2 and verifies checksum.
- Client assembles the complete file data from the blocks.
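The block-to-datanode mapping in step 2 is visible to applications through the getFileBlockLocations() call. A small sketch (the file path is an assumption):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        try (FileSystem fs = FileSystem.get(conf)) {
            FileStatus status = fs.getFileStatus(new Path("/user/data/file.txt"));
            // One BlockLocation per block, listing the datanodes holding replicas.
            BlockLocation[] blocks =
                fs.getFileBlockLocations(status, 0, status.getLen());
            for (BlockLocation block : blocks) {
                System.out.println("Offset " + block.getOffset()
                    + ", length " + block.getLength()
                    + ", hosts " + String.join(",", block.getHosts()));
            }
        }
    }
}
```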
Example Sequence for Write:
- Client requests to write file.txt.
- Namenode assigns Block1 to Datanode1, Datanode2, and Datanode3.
- Client writes Block1 to Datanode1.
- Datanode1 stores Block1 and forwards it to Datanode2.
- Datanode2 stores Block1 and forwards it to Datanode3.
- Datanode3 stores Block1.
- Acknowledgments flow back from Datanode3 to Datanode2, from Datanode2 to Datanode1, and finally from Datanode1 to the client.
- Client proceeds with Block2 and repeats the process.
This read and write mechanism ensures that HDFS provides reliable, scalable, and efficient access to large datasets in a distributed computing environment.
Data Block in HDFS
HDFS stores files by dividing them into blocks:
- Default Block Size: A data block in HDFS is 128 MB by default, configurable via the dfs.blocksize property (256 MB is a common setting for very large files).
- Fixed Size: Each file is split into fixed-size blocks, which simplifies storage management and helps in handling large files efficiently.
- Distributed Storage: These blocks are distributed across the nodes in a Hadoop cluster.
- Parallel Processing: Blocks can be processed in parallel across different nodes, enhancing performance and speed.
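Because block size is a per-file property, it can be set cluster-wide through dfs.blocksize or overridden when a file is created. A sketch with illustrative sizes:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Cluster-wide default block size (128 MB); usually set in hdfs-site.xml.
        conf.setLong("dfs.blocksize", 128L * 1024 * 1024);
        try (FileSystem fs = FileSystem.get(conf)) {
            Path path = new Path("/user/data/large-file.bin");
            long blockSize = 256L * 1024 * 1024; // 256 MB for this file only
            // Per-file override: overwrite, buffer size, replication, block size.
            try (FSDataOutputStream out =
                     fs.create(path, true, 4096, (short) 3, blockSize)) {
                out.write(new byte[]{1, 2, 3});
            }
            System.out.println("Block size: " + fs.getFileStatus(path).getBlockSize());
        }
    }
}
```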
Fault Tolerance and Replication in HDFS
HDFS ensures data reliability and fault tolerance through replication:
- Replication Factor: Each block of data is replicated across multiple datanodes (default replication factor is 3).
- Rack Awareness: Replicas are placed across racks so that data survives a whole-rack failure. With the default placement policy, the first replica is written to the client's node (or a random node), the second to a node on a different rack, and the third to another node on that second rack.
- Heartbeat and Block Reports: Datanodes send periodic heartbeats and block reports to the namenode to confirm their status and the blocks they are storing.
- Re-replication: If a datanode fails, the namenode detects the missing blocks and initiates replication to maintain the specified replication factor.
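The replication factor can likewise be set as a cluster default (dfs.replication) or adjusted per file, after which the namenode schedules the extra copies in the background. A minimal sketch; the target factor of 5 is purely illustrative:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Default replication factor for new files (3 is the HDFS default).
        conf.setInt("dfs.replication", 3);
        try (FileSystem fs = FileSystem.get(conf)) {
            Path path = new Path("/user/data/file.txt");
            // Raise the replication factor of an existing file to 5; the
            // namenode replicates the additional copies asynchronously.
            boolean scheduled = fs.setReplication(path, (short) 5);
            System.out.println("Re-replication scheduled: " + scheduled);
            System.out.println("Current factor: "
                + fs.getFileStatus(path).getReplication());
        }
    }
}
```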
Write Pipeline in HDFS
The write pipeline ensures efficient data distribution and replication:
- Client Initiates Write: The client initiates the write operation by contacting the namenode.
- Pipeline Formation: The namenode returns the datanodes that will store the block replicas. These datanodes form a pipeline.
- Data Streaming: The client streams the data to the first datanode in the pipeline.
- Pipeline Forwarding:
- First Datanode: The first datanode stores the block and forwards the data to the second datanode in the pipeline.
- Second Datanode: The second datanode receives the data, stores the block, and forwards it to the third datanode.
- Third Datanode: The third datanode receives the data and stores the block, completing the replication process.
Acknowledgment in Write Pipeline
Acknowledgments ensure data integrity and successful writes:
- Sequential Acknowledgment: After a datanode stores a block, it sends an acknowledgment back to the previous datanode in the pipeline.
- Client Acknowledgment: The acknowledgment travels back through the pipeline from the last datanode to the client.
- Success Confirmation: When the client receives acknowledgments from all the datanodes in the pipeline, it considers the write operation successful.
- Error Handling: If any datanode fails to send an acknowledgment, the client retries the write operation or the namenode chooses new datanodes to complete the replication.
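Applications that need a durability guarantee at a specific point, rather than waiting for close(), can force pipeline acknowledgments with hflush() or hsync() on the output stream. A minimal sketch:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FlushExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        try (FileSystem fs = FileSystem.get(conf);
             FSDataOutputStream out = fs.create(new Path("/user/data/log.txt"))) {
            out.writeBytes("event-1\n");
            // hflush(): blocks until every datanode in the pipeline has
            // acknowledged the data (visible to new readers, not yet on disk).
            out.hflush();
            out.writeBytes("event-2\n");
            // hsync(): like hflush(), but also asks the datanodes to persist
            // the data to disk before acknowledging.
            out.hsync();
        }
    }
}
```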
This write mechanism ensures that data is reliably written to HDFS with proper replication, providing high availability and fault tolerance.