登录查看更多内容

Basics of Hadoop

Vivek Bansal

Senior Software Engineer at Uber | Ex-Grab | Ex-Directi

发布日期: 2024年4月7日

If you like the free content I put out, consider subscribing to my newsletter on substack to get a well-researched article every week delivered straight to your inbox.

What is Hadoop?

As per Wikipedia, “Apache Hadoop is a collection of open-source software utilities that facilitates using a network of many computers to solve problems involving massive amounts of data and computation. It provides a software framework for distributed storage and processing of big data using the MapReduce programming model.”

In terms of storage, Hadoop manages data up to terabytes or even petabytes (1 Petabyte = 1000 Terabytes). Hadoop is designed to scale from one machine to thousands of machines forming multiple clusters with each machine offering local computation and storage.

This framework was first designed by Apache and hence it’s called Apache Hadoop.

Please note that using Hadoop is a very generic word these days. Hadoop is a big open-source framework that is used as baseline and various technologies are built on top of Hadoop or can be used alongside Hadoop such as Hive, HBase, Spark, Zookeeper, Storm, etc.

Thanks for reading Curious Engineer! Subscribe for free to receive new posts and support my work.

4 Key Hadoop Modules

Four key modules comprise the primary Hadoop framework and work collectively to form the Hadoop ecosystem:

Hadoop Distributed File System: HDFS is the primary storage system used by Hadoop applications. HDFS is a distributed file system in which every individual node performs operations on its local storage dataset. HDFS is designed more for batch processing rather than interactive use by users. The emphasis is on the high throughput of data access rather than the low latency of data access.
Yet Another Resource Negotiator: YARN is a resource-management platform responsible for managing the compute resources(for example: RAM, CPU, GPU, etc) across the clusters and using these resources to efficiently schedule users’ applications across the Hadoop ecosystem.
MapReduce: This is a programming model for processing data at a large scale. In MapReduce, small subsets of a large dataset and instructions for processing the subsets are sent to multiple machines in the Hadoop cluster. Each node processes the subset of data along with the instructions sent to it parallelly along with other nodes and then combines the dataset into a smaller more manageable dataset.
Hadoop Common: Hadoop Common includes the libraries and utilities used and shared by other Hadoop modules

While the primary Hadoop framework is written in Java language, you can use Hadoop to write MapReduce programs in other languages like C++/Python and then Hadoop will take care of converting your code to Java understandable format and then interact with the core framework

How does Hadoop Work?

Hadoop distributes the datasets across a cluster of commodity hardware. Commodity hardware is any physical hardware that can be easily replaced with any other hardware irrespective of its origin or manufacturer. Processing is performed in parallel on multiple servers simultaneously.

Software clients input data into Hadoop. HDFS handles metadata and the distributed file system. MapReduce then processes and converts the data. Finally, YARN divides the jobs across the computing cluster.

All Hadoop modules are designed with a fundamental assumption that hardware failures of individual machines or racks of machines are common and should be automatically handled in software by the framework.

HDFS Architecture

Block Ops(operations performed on the block - a single unit of storage in HDFS)

HDFS has a master/slave architecture. An HDFS cluster consists of a single NameNode, a master server that manages the file system namespace and regulates access to files by clients. In addition, there are a number of DataNodes, usually one per node in the cluster, which manage storage attached to the nodes that they run on. HDFS exposes a file system namespace and allows user data to be stored in files.

Let’s try to understand each component of the HDFS architecture:

1. NameNode (Master):

The NameNode acts as the central authority (like the brain), managing the file system namespace. This namespace directory structure keeps track of all files and folders within the HDFS. It regulates access to files by clients, ensuring data security.

It doesn't store actual data itself; it just manages where data resides on DataNodes. The NameNode executes file system namespace operations like opening, closing, and renaming files and directories

2. DataNode (Slaves):

The DataNodes are responsible for serving read and write requests from the file system’s clients. The DataNodes also perform block creation, deletion, and replication upon instruction from the NameNode. Also, the DataNodes report back to the NameNode periodically, keeping it updated on block health.

领英推荐

Why do we need Hadoop for Data Science - NareshIT

Naresh i Technologies 2 年前

The Evolution of Apache Hadoop: A Revolutionary Big…

Sachin D N ???? 1 年前

Introduction to Hadoop

Simran Rai 1 个月前

3. Blocks:

Files in HDFS are split into fixed-size blocks (configurable) for efficient storage and distribution. Blocks are the fundamental unit of storage in HDFS.

4. Replication:

To ensure fault tolerance and data availability, HDFS replicates each block across multiple DataNodes. The number of replicas (replication factor) is configurable. In case a DataNode fails, the data remains accessible from other replicas. That’s why, HDFS is considered to be fault-tolerant.

The NameNode makes all decisions regarding the replication of blocks. It periodically receives a Heartbeat and a Blockreport from each of the DataNodes in the cluster. Receipt of a Heartbeat implies that the DataNode is functioning properly. A Blockreport contains a list of all blocks on a DataNode.

5. Secondary NameNode (Optional):

Secondary NameNode is used for taking the hourly backup of the data. In case the Hadoop cluster fails, or crashes, the secondary Namenode will take the hourly backup or checkpoints of that data and store this data in a file name fsimage. This file then gets transferred to a new system.

FsImage is a file stored on the OS filesystem that contains the complete directory structure (namespace) of the HDFS with details about the location of the data on the Data Blocks and which blocks are stored on which node.

A typical file in HDFS is gigabytes to terabytes in size and thus HDFS is tuned to support large files. It should provide high aggregate data bandwidth and scale to hundreds of nodes in a single cluster. It should support tens of millions of files in a single instance.

A computation requested by an application is much more efficient if it is executed near the data it operates on. This is especially true when the size of the data set is huge. This minimizes network congestion and increases the overall throughput of the system. The assumption is that it is often better to migrate the computation closer to where the data is located rather than moving the data to where the application is running. “HDFS provides interfaces for applications to move themselves closer to where the data is located”.

Hadoop Tools

Many open-source tools extend the capabilities of the core Hadoop modules:

Apache Hive: The Apache Hive is a distributed, fault-tolerant data warehouse system that enables analytics at a massive scale and facilitates reading, writing, and managing petabytes of data residing in HDFS using HiveQL(Hive Query Language), which is similar to SQL.
Apache HBase: Apache HBase is the Hadoop database, a distributed, scalable, big data store. This is a NoSQL database that sits on top of HDFS. It's built for fast read/write access to large datasets, especially when you need to access specific records quickly. HBase uses a column-oriented format for data storage, allowing for efficient retrieval of individual records.
Apache Pig: A tool used as an abstraction layer over MapReduce to analyze large sets of data and enables functions like filter, sort, load, and join
Apache Zookeeper: An open-source server that enables reliable distributed coordination in Hadoop; a service for, "maintaining configuration information, naming, providing distributed synchronization, and providing group services"
Apache Oozie: This is a workflow management system that helps schedule and coordinate Hadoop jobs. It allows you to define dependencies between jobs and automate the execution process.

Presto(Not a part of Hadoop Tech Umbrella): Acts as a distributed SQL query engine. It can sit on top of Hadoop, allowing users to run interactive SQL queries against data stored in HDFS. Presto doesn't manage its own storage system and relies on connectors to access data from various sources, including HDFS.

Benefits of using Hadoop

Scalability: Hadoop can easily scale up by adding more nodes to a cluster, allowing it to handle ever-growing data volumes.
Cost-Effectiveness: Hadoop leverages commodity hardware, making it a relatively inexpensive solution for big data storage and processing.
Fault Tolerance: Hadoop allows for fault tolerance and system resilience, meaning if one of the hardware nodes fail, jobs are redirected to other nodes. Data stored on one Hadoop cluster is replicated across other nodes within the system to fortify against the possibility of hardware or software failure.
Flexibility: Hadoop allows for flexibility in data storage as data does not require preprocessing before storing it which means that an organization can store as much data as they like and then utilize it later.

That’s it, folks for this edition of the newsletter. Please consider liking and sharing with your friends as it motivates me to bring you good content for free. If you think I am doing a decent job, share this article in a nice summary with your network. Connect with me on Linkedin or Twitter for more technical posts in the future!

Book exclusive 1:1 with me here.

Thanks for reading Curious Engineer! Subscribe for free to receive new posts and support my work.

Resources

HDFS Architecture by Apache

What is Hadoop by Google Cloud

Apache Hadoop by Wikipedia

Hadoop Distributed File System by Databricks

Curious Engineer

19,866 位关注者

要查看或添加评论，请登录

Vivek Bansal的更多文章

How to implement a Circuit Breaker

2025年1月12日

How to implement a Circuit Breaker

If you like the free content I put out, consider subscribing to my newsletter on substack to get a well-researched…

7 条评论
How to implement Consistent Hashing

2024年12月29日

How to implement Consistent Hashing

If you like the free content I put out, consider subscribing to my newsletter on substack to get a well-researched…

3 条评论
Optimistic Locking Implementation

2024年12月1日

Optimistic Locking Implementation

If you like the free content I put out, consider subscribing to my newsletter on substack to get a well-researched…
1 year to Curious Engineer ??

2024年11月18日

1 year to Curious Engineer ??

If you like the free content I put out, consider subscribing to my newsletter on substack to get a well-researched…
Message Queues vs Message Brokers

2024年11月9日

Message Queues vs Message Brokers

If you like the free content I put out, consider subscribing to my newsletter on substack to get a well-researched…

4 条评论
Introduction to gRPC

2024年11月2日

Introduction to gRPC

If you like the free content I put out, consider subscribing to my newsletter on substack to get a well-researched…

7 条评论
Non-Functional Requirements

2024年10月26日

Non-Functional Requirements

Brief Introduction Let’s say you are building a website that allows users to book flight tickets. The requirements for…

4 条评论
QuadTrees

2024年10月19日

QuadTrees

If you like the free content I put out, consider subscribing to my newsletter on substack to get a well-researched…

2 条评论
Text Based Search: ElasticSearch

2024年10月12日

Text Based Search: ElasticSearch

If you like the free content I put out, consider subscribing to my newsletter on substack to get a well-researched…

3 条评论
Sharding vs Partitioning

2024年10月5日

Sharding vs Partitioning

If you like the free content I put out, consider subscribing to my newsletter on substack to get a well-researched…

5 条评论

See all articles

Basics of Hadoop

Vivek Bansal

Senior Software Engineer at Uber | Ex-Grab | Ex-Directi

What is Hadoop?

4 Key Hadoop Modules

How does Hadoop Work?

HDFS Architecture

领英推荐

Hadoop Tools

Benefits of using Hadoop

Resources

Curious Engineer

19,866 位关注者

Vivek Bansal的更多文章

社区洞察

其他会员也浏览了

Hadoop Ecosystem and Their Components

Unleashing the Power of Big Data with Hadoop

?? Hadoop Made Easy: Fix Common Errors and Install it Like a Pro!"

Understanding Hadoop: The Backbone of Big Data Processing

APACHE HADOOP & HDFS

Now You’re a Hadoop Expert

Introduction to Hadoop Ecosystem: Understanding HDFS, MapReduce, and YARN

Hadoop vs Hive

Apache Hadoop vs Apache Spark

9 issues I’ve encountered when setting up a Hadoop/Spark cluster for the first time

What is Hadoop?

4 Key Hadoop Modules

How does Hadoop Work?

HDFS Architecture

领英推荐

Hadoop Tools

Benefits of using Hadoop

Resources

Curious Engineer

19,866 位关注者

Vivek Bansal的更多文章

How to implement a Circuit Breaker

How to implement Consistent Hashing

Optimistic Locking Implementation

1 year to Curious Engineer ??

Message Queues vs Message Brokers

Introduction to gRPC

Non-Functional Requirements

QuadTrees

Text Based Search: ElasticSearch

Sharding vs Partitioning

社区洞察

其他会员也浏览了

Hadoop Ecosystem and Their Components

Unleashing the Power of Big Data with Hadoop

?? Hadoop Made Easy: Fix Common Errors and Install it Like a Pro!"

Understanding Hadoop: The Backbone of Big Data Processing

APACHE HADOOP & HDFS

Now You’re a Hadoop Expert

Introduction to Hadoop Ecosystem: Understanding HDFS, MapReduce, and YARN

Hadoop vs Hive

Apache Hadoop vs Apache Spark

9 issues I’ve encountered when setting up a Hadoop/Spark cluster for the first time