HDFS Architecture (Basic concepts)
HDFS is a block-structured file system: each file is split into blocks of a predefined size, and these blocks are stored across a cluster of one or more machines. The file system follows a master/slave architecture, in which a cluster consists of a single NameNode (the master node) and all of the other nodes are DataNodes (the slave nodes). HDFS can be deployed on any machine that supports Java. Although one can run several DataNodes on a single machine, in practice the DataNodes are distributed across different machines.
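As a small illustration of this master/slave split, here is a minimal sketch using the Hadoop Java client API. The NameNode address hdfs://namenode-host:9000 is a placeholder assumption, not a fixed value; a client always bootstraps by contacting the single NameNode, while the file bytes themselves are later served by the DataNodes.

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class ConnectToHdfs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // The client starts by contacting the single NameNode (master).
        // "namenode-host:9000" is a hypothetical address; substitute your cluster's.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode-host:9000"), conf);
        System.out.println("Connected to: " + fs.getUri());
        fs.close();
    }
}
```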
NameNode
The NameNode is the master node in HDFS; it maintains and manages the blocks stored on the DataNodes. It is a highly available server that manages the file system namespace and controls clients' access to files. The HDFS architecture is designed so that user data never resides on the NameNode; the data resides only on the DataNodes.
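To make the "metadata only" role concrete, the sketch below asks for a file's metadata via the Java API (the path /user/hadoop/example.txt is hypothetical). Everything printed here is answered from the NameNode's namespace; no file content is transferred and no DataNode is contacted.

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class NamespaceLookup {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode-host:9000"), new Configuration());
        // getFileStatus is served entirely by the NameNode's namespace metadata;
        // no user data flows through the NameNode.
        FileStatus status = fs.getFileStatus(new Path("/user/hadoop/example.txt"));
        System.out.println("size (bytes): " + status.getLen());
        System.out.println("block size  : " + status.getBlockSize());
        System.out.println("replication : " + status.getReplication());
        fs.close();
    }
}
```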
DataNode
DataNodes are the slave nodes in HDFS. Unlike the NameNode, a DataNode is commodity hardware, that is, an inexpensive machine of modest quality. The DataNode is a block server that stores the data in a local file system such as ext3 or ext4.
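When a client reads a file, the NameNode only tells it where the blocks live; the bytes then stream directly from the DataNodes that hold them. A minimal read sketch, using the same hypothetical cluster address and path as above:

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class ReadFromDataNodes {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode-host:9000"), new Configuration());
        // open() asks the NameNode for block locations; the returned stream
        // then pulls the bytes straight from the DataNodes storing the blocks.
        try (FSDataInputStream in = fs.open(new Path("/user/hadoop/example.txt"))) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
        fs.close();
    }
}
```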
Blocks
Blocks are nothing but the smallest contiguous locations on your hard disk where data is stored. In general, every file system stores data as a set of blocks. Similarly, HDFS stores each file as blocks that are scattered throughout the Apache Hadoop cluster. The default size of each block is 128 MB in Apache Hadoop 2.x (64 MB in Apache Hadoop 1.x), and you can configure it as needed.
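The block size can be configured cluster-wide through the dfs.blocksize property (the same key used in hdfs-site.xml), or per file at creation time. A sketch of both options via the Java API; the sizes and path here are illustrative, not recommendations:

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeConfig {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Client-side default for new files: 256 MB (same key as in hdfs-site.xml).
        conf.set("dfs.blocksize", "268435456");
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode-host:9000"), conf);

        // Per-file override: this create() overload takes the block size explicitly.
        long blockSize = 64L * 1024 * 1024;          // 64 MB for this file only
        try (FSDataOutputStream out = fs.create(
                new Path("/user/hadoop/small-blocks.txt"),
                true,       // overwrite if it exists
                4096,       // buffer size
                (short) 3,  // replication factor
                blockSize)) {
            out.writeUTF("hello HDFS");
        }
        fs.close();
    }
}
```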
It is not necessary for every file in HDFS to occupy an exact multiple of the configured block size. Take, for example, a file example.txt of 514 MB. Assuming the default block size setting of 128 MB, how many blocks will be created? Five: the first four blocks are 128 MB each, but the last block is only 2 MB (514 − 4 × 128 = 2).
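You can verify this split on a live cluster by asking for the file's block locations. The sketch below (hypothetical path, real getFileBlockLocations API) would print five entries for the 514 MB example, the last one 2 MB long:

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListBlocks {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode-host:9000"), new Configuration());
        FileStatus status = fs.getFileStatus(new Path("/user/hadoop/example.txt"));
        // One BlockLocation per block; for a 514 MB file with 128 MB blocks
        // this loop prints 5 lines: four 128 MB blocks and a final 2 MB block.
        for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.printf("offset=%d length=%d hosts=%s%n",
                    block.getOffset(), block.getLength(), String.join(",", block.getHosts()));
        }
        fs.close();
    }
}
```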
Well, whenever we talk about HDFS, we are talking about very large data sets, that is, terabytes and petabytes of data. If we used a block size of 4 KB, as in a typical Linux file system, we would need an enormous number of blocks and, with them, an enormous amount of metadata. Managing that many blocks and that much metadata would create huge overhead on the NameNode, and that is exactly what we want to avoid.
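Some back-of-the-envelope arithmetic makes this overhead concrete. The sizes below are illustrative, and the per-block memory cost on the NameNode is a commonly cited rule of thumb (roughly 150 bytes per block object), not an exact figure:

```java
public class BlockOverhead {
    public static void main(String[] args) {
        long fileSize = 1L << 40;                 // 1 TB of data
        long smallBlock = 4L * 1024;              // 4 KB, as in many Linux file systems
        long hdfsBlock = 128L * 1024 * 1024;      // 128 MB, the HDFS 2.x default

        // Rule-of-thumb assumption: each block costs the NameNode ~150 bytes of RAM.
        long perBlockMeta = 150;

        long smallCount = fileSize / smallBlock;  // ~268 million blocks
        long hdfsCount = fileSize / hdfsBlock;    // 8,192 blocks

        System.out.printf("4 KB blocks  : %,d blocks, ~%,d MB of NameNode metadata%n",
                smallCount, smallCount * perBlockMeta / (1024 * 1024));
        System.out.printf("128 MB blocks: %,d blocks, ~%,d KB of NameNode metadata%n",
                hdfsCount, hdfsCount * perBlockMeta / 1024);
    }
}
```

With 4 KB blocks, a single terabyte would cost the NameNode tens of gigabytes of metadata; with 128 MB blocks, the same terabyte costs about a megabyte. That difference is the reason HDFS uses such large blocks.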