Introduction to Hadoop

Hadoop is an Apache open-source framework, written in Java, that allows large data sets to be processed across clusters of computers using simple programming models. Applications built on this framework run in an environment that provides distributed storage and computation across clusters of machines. Hadoop is designed to scale up from a single computer to thousands of machines, each offering its own local computation and storage.

Hadoop Architecture

At its core, Hadoop has two major layers:

• Processing/computation layer (MapReduce)

• Storage layer (Hadoop Distributed File System, or HDFS)

MapReduce

MapReduce is a parallel programming model, based on a design that originated at Google, for writing distributed applications that efficiently process large amounts of data on large clusters of commodity hardware in a reliable, fault-tolerant manner. The MapReduce program runs on Hadoop, which is an Apache open-source framework.
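
To make the model concrete, below is a minimal sketch of the classic word-count job written against Hadoop's standard Java MapReduce API (org.apache.hadoop.mapreduce); the input and output paths come from the command line and are placeholders, not part of any particular cluster setup.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

      // Map step: emit (word, 1) for every word in this mapper's input split.
      public static class TokenizerMapper
          extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);
          }
        }
      }

      // Reduce step: sum the counts emitted for each distinct word.
      public static class IntSumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) {
            sum += val.get();
          }
          result.set(sum);
          context.write(key, result);
        }
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // local pre-aggregation on each mapper
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

The map step emits a (word, 1) pair for every token it sees; the framework groups the pairs by word, and the reduce step sums them. Splitting the input, shuffling intermediate results, and re-running failed tasks are all handled by Hadoop itself.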

Hadoop Distributed File System

The Hadoop Distributed File System (HDFS) is based on the Google File System (GFS) and provides a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems, but the differences are significant: HDFS is highly fault-tolerant, is designed to be deployed on low-cost hardware, provides highly available access to application data, and is suitable for applications with large data sets.
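
As a small illustration of how an application talks to HDFS, the following sketch uses Hadoop's Java FileSystem API. It assumes the NameNode address is configured in core-site.xml on the classpath, and the file path /tmp/hello.txt is purely hypothetical.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsExample {
      public static void main(String[] args) throws Exception {
        // Assumes fs.defaultFS in core-site.xml points at the cluster's NameNode.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Write a small file; HDFS splits it into blocks and replicates them
        // across DataNodes behind the scenes.
        Path file = new Path("/tmp/hello.txt"); // hypothetical path
        try (FSDataOutputStream out = fs.create(file)) {
          out.writeUTF("Hello, HDFS");
        }

        // Inspect how the file is stored: block size and replication factor.
        FileStatus status = fs.getFileStatus(file);
        System.out.println("Block size:  " + status.getBlockSize());
        System.out.println("Replication: " + status.getReplication());
      }
    }

Note that the application never deals with blocks or replicas directly; it reads and writes ordinary streams, and HDFS takes care of splitting the file into blocks and replicating them across DataNodes.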

Besides the two layers mentioned above, the Hadoop framework includes the following two modules:

• Hadoop Common: This module contains the Java libraries and utilities required by the other Hadoop modules.

• Hadoop YARN: This module is a framework for job scheduling and cluster resource management.
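
Hadoop Common is what the other modules build on: a set of shared Java libraries loaded by every Hadoop process. A tiny sketch, assuming only a default installation with core-site.xml on the classpath:

    import org.apache.hadoop.conf.Configuration;

    public class ShowConfig {
      public static void main(String[] args) {
        // Configuration (from Hadoop Common) loads core-site.xml and related
        // files from the classpath; the other modules rely on these libraries.
        Configuration conf = new Configuration();
        // fs.defaultFS names the default file system, e.g. the HDFS NameNode URI.
        System.out.println("fs.defaultFS = " + conf.get("fs.defaultFS", "file:///"));
      }
    }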

How does Hadoop work?

Building ever-bigger servers with heavier configurations to handle large-scale processing is very expensive. As an alternative, you can tie together many commodity computers, each with a single processor, into a single distributed system; the clustered machines can read the data set in parallel and provide much higher throughput. This is a major motivation for using Hadoop, which runs across clusters of low-cost machines.

Hadoop runs code across a cluster of computers. In doing so, it performs the following main tasks:

• Data is first organized into directories and files. Files are then divided into uniformly sized blocks of 128 MB or 64 MB (128 MB is preferred); see the configuration sketch after this list.

• These blocks are then distributed across the different cluster nodes for further processing.

• HDFS, which sits on top of the local file system, supervises the processing.

• Blocks are replicated so that data remains available when hardware failures occur.

• Hadoop checks whether the code has been executed correctly.

• It performs the sorting that takes place between the Map and Reduce stages.

• The sorted data is sent to a specific computer.

• Debugging logs are written for each job.
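
The 128 MB block size and the replication factor mentioned above are ordinary Hadoop configuration properties. The sketch below sets them programmatically before copying a local file into the cluster; the property names dfs.blocksize and dfs.replication are standard, the file paths are hypothetical, and in practice these values are usually set cluster-wide in hdfs-site.xml rather than in code.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockSettingsExample {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Client-side overrides of the cluster defaults (illustrative only).
        conf.set("dfs.blocksize", "134217728"); // 128 MB blocks
        conf.set("dfs.replication", "3");       // keep three copies of each block

        FileSystem fs = FileSystem.get(conf);
        // Copy a local file into HDFS; it is split into blocks and replicated
        // according to the settings above. Both paths are hypothetical.
        fs.copyFromLocalFile(new Path("data/input.txt"),
                             new Path("/user/hadoop/input.txt"));
      }
    }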

Advantages of Hadoop

• The Hadoop framework allows the user to quickly write and test distributed systems. It is efficient, automatically distributing the data and work across the machines and, in turn, exploiting the underlying parallelism of the CPU cores.

• Hadoop does not rely on hardware to provide fault tolerance and high availability (FTHA); instead, the Hadoop library itself is designed to detect and handle failures at the application layer.

• Servers can be added to or removed from the cluster dynamically, and Hadoop continues to operate without interruption.

• Another great advantage of Hadoop is that, apart from being open source, it is compatible with all platforms because it is Java-based.
