HDFS: The world’s most reliable storage system

Objective

This comprehensive guide to HDFS takes you through an introduction to HDFS, the different types of nodes, how data is stored, the HDFS architecture, and HDFS features such as distributed storage, fault tolerance, high availability, reliability, and blocks. We will also discuss how to write and read data in HDFS, and rack awareness. The objective of this HDFS tutorial is to cover all the concepts of HDFS in detail.

To install Hadoop, follow this installation guide.

HDFS Nodes


As Hadoop works in a master-slave fashion, HDFS also has 2 types of nodes that work in the same manner: there are namenode(s) and datanodes in the cluster.

1) Master node (also called namenode) – As the name suggests, this node manages all the slave nodes and assigns work to them. It should be deployed on reliable hardware, as it is the centerpiece of HDFS.

2) Slave node (also called datanode) – Datanodes are the slaves that are deployed on each machine and provide the actual storage. They are the actual worker nodes, responsible for serving read and write requests from clients. They can be deployed on commodity hardware. If any slave node goes down, the namenode automatically replicates the blocks that were present on that datanode to other nodes in the cluster.

HDFS Master (Namenode)

It regulates clients’ access to files. It maintains and manages the slave nodes and assigns tasks to them. It executes file system namespace operations such as opening, closing, and renaming files and directories. It should be deployed on reliable hardware.

HDFS Slave (Datanode)

There are a number of slaves, or datanodes, in HDFS that manage the storage of data. These slave nodes are the actual worker nodes that do the tasks and serve read and write requests from the file system’s clients. They also perform block creation, deletion, and replication upon instruction from the namenode. Once a block is written on a datanode, the datanode replicates it to another datanode, and the process continues until the configured number of replicas is created. Datanodes can be deployed on commodity hardware; we need not deploy them on very reliable hardware.

HDFS Daemons

There are 2 daemons that run for HDFS to store data.

  • Namenode: the daemon that runs on the master node(s). The namenode stores metadata such as file names, the number of blocks, the number of replicas, the locations of blocks, and block IDs. This metadata is kept in memory on the master for fast retrieval, and a copy is kept on the local disk for persistence. The namenode’s memory should therefore be sized according to the metadata it must hold.
  • Datanode: the daemon that runs on the slaves. These are the actual worker nodes that store the data.
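
As a quick illustration, the per-file metadata the namenode tracks can be read back through Hadoop’s Java FileSystem API. A minimal sketch, assuming a reachable cluster; the path /user/hadoop is only an example:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class MetadataDemo {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration(); // reads core-site.xml / hdfs-site.xml
            FileSystem fs = FileSystem.get(conf);
            // For each file, print metadata the namenode tracks:
            // path, length, block size and replication factor.
            for (FileStatus st : fs.listStatus(new Path("/user/hadoop"))) {
                System.out.printf("%s len=%d blockSize=%d replicas=%d%n",
                    st.getPath(), st.getLen(), st.getBlockSize(), st.getReplication());
            }
            fs.close();
        }
    }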

Data storage in HDFS

Whenever a file has to be written to HDFS, it is broken into small pieces of data known as blocks. HDFS has a default block size of 128 MB, which can be increased as per requirements. These blocks are stored in the cluster in a distributed manner, on different nodes. This provides a mechanism for MapReduce to process the data in parallel across the cluster.

Multiple copies of each block are stored across the cluster on different nodes. This is replication of data. By default, HDFS has a replication factor of 3, which provides fault tolerance, reliability, and high availability.

A large file is split into a number of small blocks, which are stored on different nodes in the cluster in a distributed manner. Each block is replicated, and the replicas are stored on different nodes across the cluster.
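
To make this concrete, here is a minimal sketch using Hadoop’s Java FileSystem API that writes a small file and then asks the namenode where each block and its replicas live. The path /demo/sample.txt is illustrative:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockDemo {
        public static void main(String[] args) throws Exception {
            // Configuration is read from core-site.xml / hdfs-site.xml on the classpath.
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            Path file = new Path("/demo/sample.txt"); // illustrative path

            // Write a small file; larger files are split into 128 MB blocks.
            try (FSDataOutputStream out = fs.create(file, true)) {
                out.writeUTF("hello hdfs");
            }

            // Ask the namenode where each block (and each of its replicas) lives.
            FileStatus st = fs.getFileStatus(file);
            for (BlockLocation loc : fs.getFileBlockLocations(st, 0, st.getLen())) {
                System.out.println(loc); // prints offset, length and replica hosts
            }
            fs.close();
        }
    }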

To practice HDFS commands, follow this command guide.

HDFS Architecture


The namenode stores metadata, while the datanodes store the actual data. The client interacts with the namenode for any task to be performed, as the namenode is the centerpiece of the cluster.

There are several datanodes in the cluster that store HDFS data on their local disks. Each datanode periodically sends a heartbeat message to the namenode to indicate that it is alive. A datanode also replicates data to other datanodes as required by the replication factor.
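
As a rough sketch of the timing involved: assuming the two standard heartbeat properties and their defaults, the commonly documented formula for when the namenode declares a datanode dead works out to about 10.5 minutes of silence:

    // Dead-node timeout from the two standard heartbeat properties and their
    // defaults (values in milliseconds); the formula is the commonly documented one.
    long heartbeatIntervalMs = 3 * 1000L;       // dfs.heartbeat.interval (3 s)
    long recheckIntervalMs   = 5 * 60 * 1000L;  // dfs.namenode.heartbeat.recheck-interval (300 000 ms)
    long deadTimeoutMs = 2 * recheckIntervalMs + 10 * heartbeatIntervalMs;
    // deadTimeoutMs = 630 000 ms: a datanode is declared dead after
    // roughly 10.5 minutes without a heartbeat.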

Features of HDFS

Let us now see the various features of HDFS. We will understand each of them in detail in the sections below.

  • Distributed Storage – Data is stored in a distributed manner
  • Blocks – Data is split into blocks
  • Replication – Blocks are replicated on different nodes
  • High Availability – Data is highly available due to replication
  • Data Reliability – Data is stored reliably in HDFS
  • Fault Tolerance – Data replication provides fault tolerance
  • Scalability – Nodes can be added to an HDFS cluster on the fly
  • High Throughput – Parallel processing provides high throughput access to application data

To learn more about HDFS features, follow this feature guide.

Distributed storage

HDFS stores data in a distributed manner: it divides the data into small pieces and stores them on different nodes of the cluster. In this manner, HDFS gives MapReduce a way to process subsets of a large dataset, broken into smaller pieces and stored on multiple nodes, in parallel on several nodes. MapReduce is the heart of Hadoop, but it is HDFS that provides it all these capabilities.

Blocks

HDFS splits huge files into small chunks known as blocks. A block is the smallest unit of data in the filesystem. We (clients and admins) do not have any control over blocks, such as their location; the namenode decides all such things.

The HDFS default block size is 128 MB, which can be increased as per requirements. This is unlike an OS filesystem, where the block size is typically 4 KB.

If the data is smaller than the block size, the block will only be as large as the data. For example, for a file of 129 MB, 2 blocks will be created: one block of the default 128 MB size, and another of only 1 MB rather than 128 MB, since padding it out would waste space (here the block size equals the data size). Hadoop is intelligent enough not to waste the remaining 127 MB, so it allocates a 1 MB block for the 1 MB of data.
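
The arithmetic of this example, as a small Java fragment:

    // Block-count arithmetic from the 129 MB example above (sizes in bytes).
    long blockSize = 128L * 1024 * 1024;                          // 128 MB default
    long fileSize  = 129L * 1024 * 1024;                          // 129 MB file
    long fullBlocks = fileSize / blockSize;                       // 1 full 128 MB block
    long lastBlockBytes = fileSize % blockSize;                   // 1 MB left over
    long totalBlocks = fullBlocks + (lastBlockBytes > 0 ? 1 : 0); // 2 blocks in total
    // The last block occupies only lastBlockBytes (1 MB), not a full 128 MB.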

A major advantage of storing data in blocks of this size is that it saves disk seek time. Another advantage appears during processing: a mapper processes 1 block at a time, so each mapper handles a large amount of data at once.

A file is split into blocks, and each block is stored with 3 replicas by default, each replica on a different node, to provide fault tolerance. The placement of these blocks on different nodes is decided by the namenode, which distributes them as evenly as possible. While placing a block on a datanode, it considers how heavily loaded that datanode is at the time.

Replication

HDFS creates duplicate copies of each block. This is called replication. All blocks are replicated and stored on different nodes across the cluster, and the default placement policy spreads the replicas over more than one rack, so that no single rack holds all copies of a block.

What do you mean by rack?

Datanodes are arranged in racks. All the nodes in a rack are connected through a single switch, so if a switch or a complete rack goes down, the data can still be accessed from another rack. We will see this further in the rack awareness section.

As seen earlier, the default replication factor is 3, and it can be changed to the required value by editing the configuration file (hdfs-site.xml).
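
For illustration, the same setting (the dfs.replication property from hdfs-site.xml) can also be supplied from client code, and the replication factor of an existing file can be changed through the Java API. A minimal sketch with an illustrative path:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReplicationDemo {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Client-side default for files created by this client; the same
            // property set in hdfs-site.xml applies cluster-wide.
            conf.setInt("dfs.replication", 3);

            FileSystem fs = FileSystem.get(conf);
            // Change the replication factor of an existing file (illustrative path).
            fs.setReplication(new Path("/demo/sample.txt"), (short) 2);
            fs.close();
        }
    }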

High availability

Replicating data blocks and storing them on multiple nodes across the cluster provides high availability of data. Even if a network link, a node, or some hardware goes down, we can still get the data via a different path or from a different node, as the data is replicated on 3 nodes by default. This is how HDFS supports high availability. Learn more about high availability.

Data Reliability

As we saw under high availability, data in HDFS is replicated, so it is also stored reliably. Due to replication, blocks remain highly available even if a node crashes or some hardware fails. If a node goes down, the blocks that were stored on it become under-replicated; if a node that had gone down suddenly becomes active again, its blocks become over-replicated. In such cases we can balance the cluster, creating or removing replicas to bring the replication factor back to the desired value with just a few commands. This is how data is stored reliably while providing fault tolerance and high availability.

Fault Tolerance

HDFS provides a fault-tolerant storage layer for Hadoop and the other components in the ecosystem. HDFS works with commodity hardware (systems with average configurations) that can crash at any time. Thus, to make the entire system highly fault-tolerant, HDFS replicates data and stores it in different places: any data on HDFS is stored at 3 different locations by default. So even if one copy is corrupted and another is unavailable for some time, the data can still be accessed from the third. Hence, the chance of losing data is very small. This replication is what gives Hadoop its fault tolerance. Learn more about fault tolerance.

Scalability

Scalability means expanding or contracting the cluster. HDFS can be scaled in 2 ways.

  • We can add more disks to the nodes of the cluster (vertical scaling).

For this, we need to edit the configuration files and add entries for the newly added disks. This requires downtime, though very little, which is why people generally prefer the second way of scaling, horizontal scaling.

  • The other option is to add more nodes to the cluster on the fly, without any downtime. This is known as horizontal scaling.

We can add as many nodes as we want to the cluster, on the fly, in real time, without any downtime. This is a distinctive feature of Hadoop.

High throughput access to application data

HDFS provides high throughput access to application data. Throughput is the amount of work done per unit of time; it describes how fast data can be accessed from the system and is usually used to measure a system’s performance. In HDFS, when we want to perform a task, the work is divided and shared among different systems, so all of them execute their assigned tasks independently and in parallel, and the work is completed in a short period of time. In this way HDFS gives good throughput: by reading data in parallel, we greatly decrease the actual time needed to read it.

File access and other operations

In Hadoop, we interact with the file system either programmatically or through the command line interface (CLI).

HDFS has many similarities with the Linux file system, so we can do almost all the operations we can do with a local file system: create directories, copy files, change permissions, and so on.

HDFS also provides access rights such as read, write, and execute for users, groups, and others.
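
A minimal sketch of these everyday operations through the Java API; all paths are illustrative:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.fs.permission.FsPermission;

    public class FileOpsDemo {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());

            fs.mkdirs(new Path("/demo/reports"));                        // create a directory
            fs.copyFromLocalFile(new Path("/tmp/report.csv"),            // copy a local file
                                 new Path("/demo/reports/report.csv"));  // into HDFS
            fs.setPermission(new Path("/demo/reports/report.csv"),
                             new FsPermission((short) 0644));            // rw-r--r--
            fs.close();
        }
    }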

We can also browse the file system with a web browser at http://master-IP:50070 (the namenode’s default web UI port). Pointing the browser at this URL shows cluster information such as space used and available, the number of live nodes, the number of dead nodes, and so on.

File Read Operation

Whenever a client wants to read a file from HDFS, it needs to interact with the namenode, as the namenode is the only place that stores the metadata about which datanodes hold the data. The namenode gives the client the addresses, or locations, of the slaves where the data is stored; the client then interacts with those datanodes and reads the data directly from them. For security and authentication, the namenode also gives the client a token, which the client presents to the datanode when reading the file.
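
A minimal read sketch using the Java API; the path is illustrative:

    import java.io.InputStream;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class ReadDemo {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            // open() contacts the namenode for block locations; the bytes are
            // then streamed directly from the datanodes that hold the replicas.
            try (InputStream in = fs.open(new Path("/demo/sample.txt"))) {
                IOUtils.copyBytes(in, System.out, 4096, false);
            }
            fs.close();
        }
    }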

Read the complete article>>



