Hadoop: Distributed File System (HDFS)
Shoukath Ali Shaik
MSc in Data Science @ Indiana University Bloomington | 2x Microsoft Azure Certified | Aspiring Data Engineer | Data Scientist | Big Data Developer | Software Engineer | PySpark | GenAI | MLOps
High-level Overview, Focusing on Distributed Storage Architecture.
Large organizations commonly struggle to store and process very large volumes of data (for example, years of historical data or data merged from multiple sources), and many look to Hadoop as a solution.
Hadoop is an open-source framework that specializes in handling the challenges of large-scale data processing.
It stores very large volumes of data and processes them in parallel, allowing organizations to efficiently manage and derive insights from massive datasets.
A framework, in this sense, is an ecosystem: a combination of multiple tools and technologies.
Hadoop consists of three core components:
Storage | Compute | Resource Manager
HDFS | MapReduce | YARN
HDFS (Hadoop Distributed File System): distributed storage
MapReduce: distributed processing
Why MapReduce?
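Because the data is spread across many machines, a conventional program that assumes all of its input sits on one machine does not work well. MapReduce instead sends the computation to the nodes that hold the data, runs it on each piece in parallel (map), and then combines the partial results (reduce).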
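To make the map and reduce idea concrete, here is a minimal single-machine sketch of a word count in plain Python. It is not Hadoop's API: the function names (map_phase, shuffle, reduce_phase, word_count) and the sample chunks are made up for illustration, and each chunk simply stands in for a block that would live on a different node.

```python
from collections import defaultdict

def map_phase(chunk):
    """Emit (word, 1) pairs for one chunk of text (a chunk stands in for one block)."""
    return [(word.lower(), 1) for word in chunk.split()]

def shuffle(pairs):
    """Group values by key, the step that sits between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Sum the counts collected for each word."""
    return {word: sum(counts) for word, counts in grouped.items()}

def word_count(chunks):
    # On a real cluster each map_phase call would run on the node holding that chunk;
    # here they run one after another in a single process.
    mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
    return reduce_phase(shuffle(mapped))

# Hypothetical input: two chunks, as if stored on two different nodes.
chunks = ["big data big cluster", "data node data block"]
print(word_count(chunks))
```

The point of the split is that map_phase needs only its own chunk, so many chunks can be processed at the same time; only the much smaller (word, count) pairs have to travel over the network.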
YARN (Yet Another Resource Negotiator): the resource manager, which allocates cluster resources to jobs
Hadoop runs in a distributed environment, or cluster: multiple machines working together on the same problem. This introduces parallelism, which speeds up computation over huge datasets. HDFS organizes these machines in a master-slave (also called primary-secondary) architecture, described further down.
In HDFS, data is stored as blocks. The default block size is 128 MB, and it can be configured depending on the use case.
Pros and cons of increasing or decreasing block size:
When block size increases:
Pros: Less burden on the Name Node, because there are fewer blocks to track.
Cons: Less parallelism, because there are fewer blocks to process at the same time.
When block size decreases:
Pros: Higher parallelism.
Cons: More burden on the Name Node, because it has to keep metadata for more blocks. Newer versions of Hadoop ease this with Name Node federation, which means running more than one Name Node to share the growing metadata. With this, smaller blocks put less pressure on any single Name Node, so there is less need to compromise on parallelism.
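The trade-off is easy to see with a little arithmetic. The sketch below counts how many blocks a file needs at different block sizes; the helper name block_count and the 10 GB file are assumptions chosen for illustration, not part of any Hadoop API.

```python
MB = 1024 * 1024

def block_count(file_size_bytes, block_size_bytes=128 * MB):
    """Number of blocks needed for a file; the last block may be only partly full."""
    if file_size_bytes <= 0:
        return 0
    # Integer ceiling division avoids floating-point rounding.
    return (file_size_bytes + block_size_bytes - 1) // block_size_bytes

file_size = 10 * 1024 * MB  # a hypothetical 10 GB file
for size_mb in (64, 128, 256):
    blocks = block_count(file_size, size_mb * MB)
    print(f"{size_mb} MB blocks -> {blocks} blocks to track and to process in parallel")
```

Halving the block size doubles the number of blocks: more pieces that can be processed in parallel, but also more entries for the Name Node to keep track of.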
Master-slave architecture:
The architecture generally has one master node (the Name Node) and multiple data nodes.
Master Node: stores the metadata, that is, the mapping that records which data nodes hold each block of a file.
Data Node: stores the actual data blocks.
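The division of labor can be pictured with a toy model. This is only a sketch under stated assumptions: the class ToyNameNode, its methods, the file path, and the block and node names are all hypothetical, and a real Name Node tracks far more than this.

```python
class ToyNameNode:
    """Toy master node: keeps only metadata, never the file contents."""

    def __init__(self):
        self.file_to_blocks = {}   # file path -> ordered list of block ids
        self.block_to_nodes = {}   # block id -> names of data nodes holding a replica

    def add_block(self, path, block_id, nodes):
        """Record that the next block of a file is stored on the given data nodes."""
        self.file_to_blocks.setdefault(path, []).append(block_id)
        self.block_to_nodes[block_id] = list(nodes)

    def locate(self, path):
        """Return [(block_id, [nodes...]), ...] in block order; empty if the file is unknown."""
        return [(b, self.block_to_nodes[b]) for b in self.file_to_blocks.get(path, [])]

# Hypothetical file split into two blocks, each replicated on three data nodes.
nn = ToyNameNode()
nn.add_block("/logs/2024.csv", "blk_1", ["dn1", "dn4", "dn5"])
nn.add_block("/logs/2024.csv", "blk_2", ["dn2", "dn4", "dn6"])
print(nn.locate("/logs/2024.csv"))
```

A client asks the master where the blocks of a file are, then reads the bytes directly from the data nodes it was pointed to.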
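Rack: a physical unit of infrastructure that holds a group of nodes (compute servers). A cluster scales horizontally by adding more nodes and racks.
The racks of one HDFS cluster usually sit in the same data center, so spreading replicas across racks protects against a rack-level failure such as a lost switch or power supply. Protection against a natural calamity that takes out a whole site generally requires copying the data to a separate cluster in another location.
Note: with the default replication factor of 3, HDFS places the replicas of a block across two different racks to enhance resilience and prevent data loss.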
Cluster refers to a collection of interconnected computers, or nodes, that work together to store and process large volumes of data.
Characteristics of a cluster: interconnected, scalable, load-balanced, reliable, and capable of parallel processing.
Advantages of Cluster Architecture:
What if a data node fails?
By default, each data block is replicated 3 times across the data nodes, so the other replicas still hold the data and the block stays readable. The Name Node notices the failed node and has the affected blocks copied again from the surviving replicas. What if a whole rack fails? Because the replicas are spread over more than one rack, at least one replica sits on a different rack and survives.
Common approach: one copy of a block is stored on one rack, and the other 2 copies are stored on different nodes of another rack.
This strategy combines multiple copies of the data with rack-level distribution. Geographical distribution, where it is needed, is handled separately by copying data to a cluster at another site.
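The placement idea can be sketched in a few lines of Python. This is a simplified illustration and not HDFS's actual placement code: it follows the pattern of one replica on the writer's node and two replicas on different nodes of one other rack, and the function names, rack names, and node names are all hypothetical. It models racks only, not locations.

```python
import random

def place_replicas(topology, writer_node, rng=random):
    """Pick 3 data nodes: the writer's node, then two distinct nodes on one other rack.

    topology: dict of rack name -> list of node names (a hypothetical layout).
    """
    local_racks = [r for r, nodes in topology.items() if writer_node in nodes]
    if not local_racks:
        raise ValueError(f"unknown node: {writer_node}")
    local_rack = local_racks[0]
    # Only other racks with at least two nodes can hold replicas 2 and 3.
    remote_racks = [r for r, nodes in topology.items()
                    if r != local_rack and len(nodes) >= 2]
    if not remote_racks:
        raise ValueError("need another rack with at least two nodes")
    remote_rack = rng.choice(remote_racks)
    second, third = rng.sample(topology[remote_rack], 2)
    return [writer_node, second, third]

def surviving_replicas(replicas, topology, failed_rack):
    """Replicas still readable after every node in failed_rack goes down."""
    lost = set(topology[failed_rack])
    return [node for node in replicas if node not in lost]

# Hypothetical three-rack cluster.
topology = {
    "rack1": ["dn1", "dn2", "dn3"],
    "rack2": ["dn4", "dn5", "dn6"],
    "rack3": ["dn7", "dn8"],
}
replicas = place_replicas(topology, "dn1")
for rack in topology:
    print(rack, "fails ->", surviving_replicas(replicas, topology, rack))
```

Whichever single rack fails, at least one replica is left, because the three copies never share one rack.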
That wraps up this Hadoop HDFS blog. Hope you like it!
About me -
Hello! I’m Shoukath Ali, an aspiring data professional, with a Master’s in Data Science and a Bachelor’s in Computer Science and Engineering.
If you have any queries or suggestions, please feel free to reach out to me at [email protected]
Connect with me on LinkedIn: https://www.dhirubhai.net/in/shoukath-ali-b6650576/
Disclaimer -
The views and opinions expressed on this blog are purely my own. Any product claim, statistic, quote, or other representation about a product or service should be verified with the manufacturer, provider, or party in question.