登录查看更多内容

Hadoop — Distributed File System(HDFS)

Shoukath Ali Shaik

MSc in Data Science @ Indiana University Bloomington | 2x Microsoft Azure Certified | Aspiring Data Engineer | Data Scientist | Big Data Developer | Software Engineer | PySpark | GenAI | MLOps

发布日期: 2024年5月25日

High-level Overview, Focusing on Distributed Storage Architecture.

Large organizations have a typical problem of storing and processing huge data ( historical data or merging multiple data sources), look at Hadoop as a solution.

Hadoop is an open-source framework that specializes in handling the challenges of large-scale data processing.

It stores huge data and performs big data processing, allowing organizations to efficiently manage and derive insights from massive datasets.

Framework — is an ecosystem or combination of multiple tools and technologies.

Hadoop consists of 3 core components —

Storage | Compute | Resource Manager

HDFS | MapReduce| YARN

Hadoop distributed file system — Distributed Storage

Map Reduce — Distributed Processing

Why MapReduce?

As data is stored in a distributed environment, conventional computing resources don’t work.

YARN ( Yet another resource negotiator) — Resource manager or negotiator

Hadoop works in distributed environments or clusters. Imagine multiple computational resources working together to solve the problem. Introduces parallelism, which helps in faster computation across huge data, by implementing the Master-Slave or Primary-Secondary Architecture (Scroll below).

In HDFS, Data is generally stored in the form of a Block, the default size of a block — is 128MB (Can be configured, depending on the use case).

Pros and cons of increasing or decreasing block size —

Block Size VS Parallelism VS Burden on Name Node

When block size increases —

Pros — Less burden on the Name Node.

Cons — Compromise on parallelism.

When block size decreases —

Pros — Higher parallelism.

Cons — More burden on the Name Node. ( This is resolved in newer versions of Hadoop — Name Node federation is known as having more than one Name Node to handle the growing metadata) With this, we can stop compromising on parallelism.

领英推荐

Data Analysis Using Apache Hadoop and Apache Spark

Tharani Vadde 2 年前

Hadoop Distributed File Storage

Ayush Srivastava 1 年前

Hadoop vs Spark: Which Big Data Framework is the Best…

Rakesh V. 1 年前

Master-slave architecture —

Architecture generally has one master node and multiple data nodes.

Master Node — Stores the mapping or metadata of data. ( Like where the actual data or block is stored in the data node)

Data Node — Stores the actual data.

Rack — is a physical infrastructure, that consists of a bunch of nodes or compute servers, it is capable of scaling horizontally while defining a cluster.

Racks are generally placed across different geo-locations, to prevent data loss in case of natural calamity.

Note — Data is replicated across at least two different racks located in different geographical data centers to enhance resilience and prevent data loss.

Cluster refers to a collection of interconnected computers, or nodes, that work together to store and process large volumes of data.

Characteristics of a cluster — Interconnected, Scalable, load balancing, reliable, parallel processing.

Advantages of Cluster Architecture:

Rack Awareness: Nodes are often organized into racks to optimize network traffic. HDFS uses this information to replicate data across different racks, ensuring data reliability and improving fault tolerance.
High Availability: The latest HDFS often includes a secondary NameNode or standby NameNode to provide failover capabilities, ensuring the system remains operational if the primary NameNode fails.

What if the data node fails?

We know that by default data blocks are replicated 3 times across the data nodes. What if the rack fails, due to natural calamity? As mentioned the data blocks are replicated 3 times, there would be at least one replica located at a different rack having a different geo-location.

Common approach — A rack stores one copy of data and the other 2 copies are stored at other racks, having different geo-location.

This approach has the best data storage strategy — Consisting of multiple Copies of Data, Rack-Level Distribution, and Geographical Distribution.

Finally, We'll wrap up the Hadoop - HDFS blog, Hope you like it! ??

About me -

Hello! I’m Shoukath Ali, an aspiring data professional, with a Master’s in Data Science and a Bachelor’s in Computer Science and Engineering.

If you have any queries or suggestions, please feel free to reach out to me at [email protected]

Connect me on LinkedIn?—?https:// www.dhirubhai.net/in/shoukath-ali-b6650576/

Disclaimer -

The views and opinions expressed on this blog are purely my own. Any product claim, statistic, quote, or other representation about a product or service should be verified with the manufacturer, provider, or party in question.

要查看或添加评论，请登录

Shoukath Ali Shaik的更多文章

Apache Spark?-?Data Engineering

2024年6月28日

Apache Spark?-?Data Engineering

Overview of how data and compute resources are distributed across clusters, with optimization. Spark is an alternative…
Prompting - Prompt Engineering

2024年6月4日

Prompting - Prompt Engineering

Part one — Convincing LLM, how to generate outputs. ( Brainwash LLM models ??) As discussed in the previous blog…
LLM — Large Language Models

2024年5月21日

LLM — Large Language Models

Every language model has their own Vocabulary, also known as a group of words, which is used during the pre-training or…

5 条评论
5 V’s?—?Big Data

2024年5月16日

5 V’s?—?Big Data

What is Big Data? Big data is a descriptive definition of data (system) that cannot be stored (processed) using…

1 条评论
A Day in the Life of —a Big Data Engineer

2024年5月11日

A Day in the Life of —a Big Data Engineer

Big Data Engineer — A Big Data Engineer is a professional who specializes in preparing ‘big data’ for analytical or…

See all articles

Hadoop — Distributed File System(HDFS)