Basics of Hadoop
What is Hadoop?
As per Wikipedia, “Apache Hadoop is a collection of open-source software utilities that facilitates using a network of many computers to solve problems involving massive amounts of data and computation. It provides a software framework for distributed storage and processing of big data using the MapReduce programming model.”
In terms of storage, Hadoop manages data up to terabytes or even petabytes (1 petabyte = 1000 terabytes). Hadoop is designed to scale from a single machine to clusters of thousands of machines, with each machine offering local computation and storage.
The framework is developed as an open-source project under the Apache Software Foundation, which is why it is called Apache Hadoop.
Please note that "Hadoop" has become a fairly generic term these days. Hadoop is a large open-source framework that serves as a baseline, and various technologies are built on top of it or used alongside it, such as Hive, HBase, Spark, ZooKeeper, Storm, etc.
4 Key Hadoop Modules
Four key modules comprise the primary Hadoop framework and work collectively to form the Hadoop ecosystem:
HDFS (Hadoop Distributed File System): the storage layer that distributes data across the nodes of the cluster.
YARN (Yet Another Resource Negotiator): manages cluster resources and schedules jobs.
MapReduce: the programming model for processing the distributed data in parallel.
Hadoop Common: the shared Java libraries and utilities that support the other modules.
While the core Hadoop framework is written in Java, you can write MapReduce programs in other languages such as C++ or Python; interfaces like Hadoop Streaming and Hadoop Pipes handle the communication between your code and the core Java framework.
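To make the MapReduce model concrete, below is a minimal word-count mapper and reducer written against the standard org.apache.hadoop.mapreduce API. This is an illustrative sketch of my own (the class and variable names are not from any particular distribution), not code from the original article.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Map phase: for every word in an input line, emit the pair (word, 1).
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(line.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: sum the counts emitted for each distinct word.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) {
                sum += c.get();
            }
            result.set(sum);
            context.write(word, result);
        }
    }
}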
How does Hadoop Work?
Hadoop distributes datasets across a cluster of commodity hardware. Commodity hardware is any physical hardware that can easily be replaced with equivalent hardware, irrespective of its origin or manufacturer. Processing is then performed in parallel across multiple servers.
Software clients input data into Hadoop. HDFS handles the distributed storage and its metadata. MapReduce then processes and transforms the data. Finally, YARN divides the work across the computing cluster.
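To put that flow into code, here is a sketch of the client-side driver for the word-count job above (it assumes the WordCount class from the earlier snippet). The input and output paths live in HDFS, and once the job is submitted, YARN schedules the map and reduce tasks across the cluster.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Describe the job: which mapper and reducer to run and the output types.
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCount.TokenizerMapper.class);
        job.setCombinerClass(WordCount.IntSumReducer.class);
        job.setReducerClass(WordCount.IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Input and output are HDFS paths passed on the command line;
        // YARN schedules the actual map and reduce tasks on the cluster nodes.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}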
All Hadoop modules are designed with a fundamental assumption that hardware failures of individual machines or racks of machines are common and should be automatically handled in software by the framework.
HDFS Architecture
(Diagram: the HDFS architecture. "Block ops" in the diagram are operations performed on blocks, the basic unit of storage in HDFS.)
HDFS has a master/slave architecture. An HDFS cluster consists of a single NameNode, a master server that manages the file system namespace and regulates access to files by clients. In addition, there are a number of DataNodes, usually one per node in the cluster, which manage storage attached to the nodes that they run on. HDFS exposes a file system namespace and allows user data to be stored in files.
Let’s try to understand each component of the HDFS architecture:
1. NameNode (Master):
The NameNode acts as the central authority (like the brain), managing the file system namespace. This namespace directory structure keeps track of all files and folders within the HDFS. It regulates access to files by clients, ensuring data security.
It doesn't store the actual data itself; it just tracks where that data resides on the DataNodes. The NameNode executes file system namespace operations like opening, closing, and renaming files and directories.
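These namespace operations are exactly what a client triggers through the org.apache.hadoop.fs.FileSystem API. Below is a minimal, illustrative sketch (the /data paths are made up); every call in it is resolved by the NameNode, and no file data is transferred.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class NamespaceOps {
    public static void main(String[] args) throws Exception {
        // fs.defaultFS (core-site.xml) points the client at the NameNode.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Namespace operations: create, rename, and list directories.
        fs.mkdirs(new Path("/data/raw"));
        fs.rename(new Path("/data/raw"), new Path("/data/input"));
        for (FileStatus status : fs.listStatus(new Path("/data"))) {
            System.out.println(status.getPath() + "  " + status.getLen());
        }

        fs.close();
    }
}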
2. DataNode (Slaves):
The DataNodes are responsible for serving read and write requests from the file system’s clients. The DataNodes also perform block creation, deletion, and replication upon instruction from the NameNode. Also, the DataNodes report back to the NameNode periodically, keeping it updated on block health.
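To see the division of labour between the two, here is a small sketch that writes and then reads a file (the path is just an example). The create and open calls go to the NameNode for metadata, but the bytes themselves stream directly to and from the DataNodes that hold the blocks.

import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class ReadWriteExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/data/input/greeting.txt");

        // Write: the NameNode chooses the DataNodes for each block,
        // then the client streams the bytes to those DataNodes.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("hello hadoop\n".getBytes(StandardCharsets.UTF_8));
        }

        // Read: the NameNode returns the block locations,
        // and the DataNodes holding the replicas serve the bytes.
        try (FSDataInputStream in = fs.open(file)) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }

        fs.close();
    }
}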
3. Blocks:
Files in HDFS are split into fixed-size blocks (128 MB by default, and configurable) for efficient storage and distribution. Blocks are the fundamental unit of storage in HDFS.
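A client can actually inspect this layout. The sketch below (the file path is illustrative) asks the NameNode where the blocks of a file live and which DataNodes hold them.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlocks {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(new Path("/data/input/big.log"));

        // One BlockLocation per block; each one lists the DataNodes
        // that hold a replica of that block.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset=" + block.getOffset()
                    + " length=" + block.getLength()
                    + " hosts=" + String.join(",", block.getHosts()));
        }

        fs.close();
    }
}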
4. Replication:
To ensure fault tolerance and data availability, HDFS replicates each block across multiple DataNodes. The number of replicas (the replication factor) is configurable and defaults to 3. If a DataNode fails, the data remains accessible from the other replicas. That is why HDFS is considered fault-tolerant.
The NameNode makes all decisions regarding the replication of blocks. It periodically receives a Heartbeat and a Blockreport from each of the DataNodes in the cluster. Receipt of a Heartbeat implies that the DataNode is functioning properly. A Blockreport contains a list of all blocks on a DataNode.
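Both the block size and the replication factor can be set cluster-wide (dfs.blocksize and dfs.replication in hdfs-site.xml) or per file from a client. Here is a sketch of the per-file route, with illustrative paths and values.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/data/input/important.log");

        // Create a file with an explicit replication factor (3) and
        // block size (128 MB); otherwise the cluster defaults apply.
        try (FSDataOutputStream out = fs.create(
                file,
                true,               // overwrite if it exists
                4096,               // client-side buffer size
                (short) 3,          // replication factor
                128L * 1024 * 1024  // block size in bytes
        )) {
            out.writeBytes("replicated across three DataNodes\n");
        }

        // Raise the replication factor of an existing file to 4;
        // the NameNode instructs DataNodes to create the extra replica.
        fs.setReplication(file, (short) 4);

        fs.close();
    }
}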
5. Secondary NameNode (Optional):
Despite its name, the Secondary NameNode is not a hot standby for the NameNode. It periodically (hourly by default) fetches the NameNode's edit log, merges it with the current namespace image, and writes out a new checkpoint file called fsimage. If the NameNode machine fails or crashes, this checkpoint can be copied to a new machine to restore the file system namespace.
FsImage is a file stored on the NameNode's local OS filesystem that contains the complete directory structure (namespace) of HDFS along with the mapping of files to their blocks. The mapping of blocks to DataNodes is not persisted in the FsImage; the NameNode reconstructs it from the block reports sent by the DataNodes.
A typical file in HDFS is gigabytes to terabytes in size and thus HDFS is tuned to support large files. It should provide high aggregate data bandwidth and scale to hundreds of nodes in a single cluster. It should support tens of millions of files in a single instance.
A computation requested by an application is much more efficient if it is executed near the data it operates on. This is especially true when the size of the data set is huge. This minimizes network congestion and increases the overall throughput of the system. The assumption is that it is often better to migrate the computation closer to where the data is located rather than moving the data to where the application is running. “HDFS provides interfaces for applications to move themselves closer to where the data is located”.
Hadoop Tools
Many open-source tools extend the capabilities of the core Hadoop modules. A few that were mentioned earlier:
Hive: provides a SQL-like query language (HiveQL) over data stored in HDFS.
HBase: a distributed NoSQL database that runs on top of HDFS.
Spark: a general-purpose, in-memory data processing engine that can use HDFS and YARN.
ZooKeeper: a coordination service used by many components of the ecosystem.
Storm: a distributed real-time stream processing system.
Presto (not part of the Hadoop umbrella): acts as a distributed SQL query engine. It can sit on top of Hadoop, allowing users to run interactive SQL queries against data stored in HDFS. Presto doesn't manage its own storage system and relies on connectors to access data from various sources, including HDFS.
Benefits of using Hadoop
To recap the points above, the main benefits are:
Scalability: Hadoop scales from a single machine to clusters of thousands of machines.
Fault tolerance: block replication and automatic handling of hardware failures keep data available.
Cost-effectiveness: it runs on commodity hardware rather than specialized machines.
High throughput: computation is moved close to the data, minimizing network congestion.
A rich ecosystem: tools like Hive, HBase, Spark, and Presto build on top of the core framework.
That’s it, folks, for this edition of the newsletter. Please consider liking and sharing it with your friends, as it motivates me to keep bringing you good content for free. If you think I am doing a decent job, share this article with your network along with a short summary. Connect with me on LinkedIn or Twitter for more technical posts in the future!
Book exclusive 1:1 with me here.