Hadoop — Distributed File System(HDFS)
Credits - Shoukath-Ali


A high-level overview, focusing on the distributed storage architecture.

Large organizations face a common problem: storing and processing huge volumes of data, whether historical records or data merged from multiple sources. Hadoop is one solution to this problem.

Hadoop is an open-source framework that specializes in handling the challenges of large-scale data processing.

It stores huge data and performs big data processing, allowing organizations to efficiently manage and derive insights from massive datasets.

Hadoop Ecosystem

A framework is an ecosystem: a combination of multiple tools and technologies.

Hadoop consists of 3 core components —

Storage | Compute | Resource Manager

HDFS | MapReduce | YARN

HDFS (Hadoop Distributed File System) — Distributed Storage

Map Reduce — Distributed Processing

Why MapReduce?
Because the data is stored in a distributed environment, conventional single-machine processing doesn't work; the computation itself must be distributed and brought to where the data lives.
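To make the idea concrete, here is a minimal sketch of the MapReduce pattern in plain Python (an illustration of the concept only, not the Hadoop API): each mapper would run on the node holding its data block, and the reducer aggregates the emitted pairs.

```python
from collections import defaultdict

def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in an input split."""
    return [(word, 1) for word in line.split()]

def reduce_phase(pairs):
    """Reduce step: sum the counts for each word."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

# Each "split" stands in for a data block stored on a different node.
splits = ["big data big", "data big"]
pairs = [kv for line in splits for kv in map_phase(line)]
print(reduce_phase(pairs))  # {'big': 3, 'data': 2}
```

In real Hadoop, the framework handles splitting the input, scheduling the mappers near their blocks, and shuffling the intermediate pairs to the reducers.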

YARN (Yet Another Resource Negotiator) — Resource manager or negotiator

Hadoop Core Components

Hadoop works in distributed environments, or clusters: imagine multiple computational resources working together to solve a problem. This introduces parallelism, which speeds up computation over huge datasets, and is implemented through the Master-Slave (Primary-Secondary) architecture described below.

In HDFS, data is stored in the form of blocks. The default block size is 128 MB (configurable, depending on the use case).
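For reference, the block size is controlled by the `dfs.blocksize` property in `hdfs-site.xml`; the value below is just an example of raising it to 256 MB.

```xml
<!-- hdfs-site.xml: example only; tune for your workload -->
<property>
  <name>dfs.blocksize</name>
  <!-- 256 MB, expressed in bytes (the default is 134217728, i.e. 128 MB) -->
  <value>268435456</value>
</property>
```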

Pros and cons of increasing or decreasing block size —

Block Size VS Parallelism VS Burden on Name Node

When block size increases —

Pros — Fewer blocks per file, so less metadata and less burden on the Name Node.

Cons — Fewer blocks means fewer tasks can run in parallel, so we compromise on parallelism.

When block size decreases —

Pros — More blocks means more tasks can run in parallel, so higher parallelism.

Cons — More burden on the Name Node. (Newer versions of Hadoop resolve this with Name Node federation: more than one Name Node shares responsibility for the growing metadata. With this, we no longer have to compromise on parallelism.)

Block Size VS Parallelism
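The trade-off above can be made concrete with a little arithmetic. The sketch below (illustrative numbers, not an HDFS API) counts how many blocks a file splits into, which is a rough proxy for both the achievable parallelism and the metadata the Name Node must track.

```python
import math

def block_stats(file_size_mb, block_size_mb, replication=3):
    """Illustration: how block size affects the number of blocks
    (parallel tasks) and Name Node metadata entries for one file."""
    num_blocks = math.ceil(file_size_mb / block_size_mb)
    # Roughly one metadata record per block replica tracked by the Name Node.
    metadata_entries = num_blocks * replication
    return num_blocks, metadata_entries

# A 1 GB (1024 MB) file:
print(block_stats(1024, 128))  # default block size -> (8, 24)
print(block_stats(1024, 32))   # smaller blocks -> (32, 96)
```

With 32 MB blocks the file yields 4x as many parallel tasks, but also 4x the metadata, which is exactly the tension the section describes.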

Master-slave architecture —

The architecture generally has one master node and multiple data nodes.

Master Node — Stores the mapping, or metadata, of the data (e.g., which data node holds each block).

Data Node — Stores the actual data.

Master-slave architecture
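A toy model of this split of responsibilities (illustrative only, not the real HDFS API; all file, block, and node names below are made up):

```python
# The master (NameNode) keeps only metadata: a mapping from each block
# to the data nodes holding it. It never stores the file contents.
name_node = {
    "file.csv": {
        "blk_0001": ["datanode1", "datanode2", "datanode3"],
        "blk_0002": ["datanode2", "datanode3", "datanode4"],
    }
}

# The data nodes store the actual bytes of each block.
data_nodes = {
    "datanode1": {"blk_0001": b"...block bytes..."},
}

# To read a file, a client first asks the NameNode where the blocks live,
# then fetches the bytes directly from the data nodes.
locations = name_node["file.csv"]["blk_0001"]
print(locations)  # ['datanode1', 'datanode2', 'datanode3']
```

Because clients stream data directly from the data nodes, the master handles only small metadata lookups and does not become a bandwidth bottleneck.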

Rack — a unit of physical infrastructure consisting of a group of nodes (compute servers). Adding racks lets a cluster scale horizontally.

Racks are generally placed across different geo-locations to prevent data loss in case of a natural calamity.

Note — Data is replicated across at least two different racks to enhance resilience and prevent data loss; for protection against site-wide failures, replicas can additionally be placed in geographically separate data centers.
Cluster Architecture

Cluster refers to a collection of interconnected computers, or nodes, that work together to store and process large volumes of data.

Characteristics of a cluster — interconnected, scalable, load-balancing, reliable, parallel processing.

Advantages of Cluster Architecture:

  • Rack Awareness: Nodes are often organized into racks to optimize network traffic. HDFS uses this information to replicate data across different racks, ensuring data reliability and improving fault tolerance.
  • High Availability: HDFS High Availability deployments include a standby NameNode that provides failover, ensuring the system remains operational if the active NameNode fails. (The older Secondary NameNode, by contrast, only performs metadata checkpointing and is not a failover node.)

Data Replication

What if the data node fails?

We know that by default, data blocks are replicated 3 times across the data nodes. But what if an entire rack fails, say due to a natural calamity? Because the blocks are replicated 3 times across racks, at least one replica will survive on a different rack, potentially in a different geo-location.

Common approach — one rack stores one copy of the data, and the other 2 copies are stored on a different rack (on different nodes), ideally in a different geo-location.

This approach combines the key elements of a solid storage strategy: multiple copies of the data, rack-level distribution, and geographical distribution.

Fault Tolerance
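The placement rule described above can be sketched as follows. This is a simplified illustration in the spirit of HDFS's rack-aware policy, not the actual implementation; the rack and node names are invented.

```python
import random

def place_replicas(racks, replication=3):
    """Place one replica on a first rack and the remaining replicas on
    distinct nodes of a second, different rack (simplified sketch)."""
    first_rack = random.choice(list(racks))
    second_rack = random.choice([r for r in racks if r != first_rack])
    placements = [(first_rack, racks[first_rack][0])]
    # Remaining replicas go on different nodes of the second rack.
    for node in racks[second_rack][:replication - 1]:
        placements.append((second_rack, node))
    return placements

racks = {
    "rack-A": ["a1", "a2", "a3"],
    "rack-B": ["b1", "b2", "b3"],
}
placement = place_replicas(racks)
print(placement)  # e.g. [('rack-A', 'a1'), ('rack-B', 'b1'), ('rack-B', 'b2')]
```

The invariant worth noticing: whatever the random choices, the three replicas always span exactly two racks, so losing any single rack still leaves at least one live copy.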

Finally, we'll wrap up the Hadoop HDFS blog. Hope you like it!

About me -

Hello! I’m Shoukath Ali, an aspiring data professional, with a Master’s in Data Science and a Bachelor’s in Computer Science and Engineering.

If you have any queries or suggestions, please feel free to reach out to me at [email protected]

Connect with me on LinkedIn — https://www.dhirubhai.net/in/shoukath-ali-b6650576/

Disclaimer -

The views and opinions expressed on this blog are purely my own. Any product claim, statistic, quote, or other representation about a product or service should be verified with the manufacturer, provider, or party in question.
