Introduction to Hadoop
Hadoop is an Apache open-source framework, written in Java, that allows distributed processing of large data sets across clusters of computers using simple programming models. Applications built on this framework run in an environment that provides distributed storage and computation across clusters of machines. Hadoop is designed to scale up from a single computer to thousands of machines, each offering local computation and storage.
Hadoop Architecture
At its core, Hadoop has two major layers:
- Processing/computation layer (MapReduce)
- Storage layer (Hadoop Distributed File System, or HDFS)
MapReduce
MapReduce is a parallel programming model, originally developed at Google, for writing distributed applications that efficiently process large amounts of data on large clusters of commodity hardware in a reliable, fault-tolerant manner. MapReduce programs run on Hadoop, an Apache open-source framework.
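As a concrete illustration, below is a minimal word-count sketch using Hadoop's Java MapReduce API. The class and job names (WordCount, WordCountMapper, WordCountReducer, "word count") and the command-line input/output paths are assumptions made for this example, not part of any particular distribution.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Map step: emit (word, 1) for every word in the input split.
        public static class WordCountMapper
                extends Mapper<Object, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer tokens = new StringTokenizer(value.toString());
                while (tokens.hasMoreTokens()) {
                    word.set(tokens.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // Reduce step: sum the counts emitted for each word.
        public static class WordCountReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            private final IntWritable result = new IntWritable();

            @Override
            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) {
                    sum += v.get();
                }
                result.set(sum);
                context.write(key, result);
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(WordCountMapper.class);
            job.setCombinerClass(WordCountReducer.class);
            job.setReducerClass(WordCountReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
            FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Packaged as a JAR, such a job would typically be submitted with the hadoop jar command, passing an HDFS input directory and an output directory as arguments.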
Hadoop Distributed File System
The Hadoop Distributed File System (HDFS) is based on the Google File System (GFS) and provides a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems; however, its differences from other distributed file systems are significant. HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. It provides high-throughput access to application data and is suitable for applications with large data sets.
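To give a flavor of how an application talks to HDFS, here is a minimal sketch using Hadoop's Java FileSystem API. The NameNode address (hdfs://localhost:9000) and the file path (/user/demo/sample.txt) are assumptions for illustration; in a real cluster they would come from the site configuration.

    import java.nio.charset.StandardCharsets;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class HdfsExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Assumed NameNode address; normally taken from core-site.xml.
            conf.set("fs.defaultFS", "hdfs://localhost:9000");

            FileSystem fs = FileSystem.get(conf);
            Path file = new Path("/user/demo/sample.txt"); // hypothetical path

            // Write a small file to HDFS.
            try (FSDataOutputStream out = fs.create(file, true)) {
                out.write("Hello HDFS".getBytes(StandardCharsets.UTF_8));
            }

            // Read it back and copy the contents to standard output.
            try (FSDataInputStream in = fs.open(file)) {
                IOUtils.copyBytes(in, System.out, 4096, false);
            }

            fs.close();
        }
    }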
Apart from the two layers mentioned above, the Hadoop framework also includes the following two modules:
- Hadoop Common: this module contains the Java libraries and utilities required by the other Hadoop modules (see the short sketch after this list).
- Hadoop YARN: this module is a framework for job scheduling and cluster resource management.
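For instance, the Configuration class from Hadoop Common is how the other modules read cluster settings. The sketch below assumes the standard core-default.xml and core-site.xml resources on the classpath; the specific keys printed are only examples.

    import org.apache.hadoop.conf.Configuration;

    public class ShowConfig {
        public static void main(String[] args) {
            // Configuration (from Hadoop Common) loads core-default.xml and
            // core-site.xml from the classpath automatically.
            Configuration conf = new Configuration();

            // Read a couple of well-known settings, falling back to defaults.
            String defaultFs = conf.get("fs.defaultFS", "file:///");
            int replication = conf.getInt("dfs.replication", 3);

            System.out.println("fs.defaultFS    = " + defaultFs);
            System.out.println("dfs.replication = " + replication);
        }
    }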
How does Hadoop work?
Building ever-bigger servers with heavier configurations to handle large-scale processing is very expensive. As an alternative, you can tie together many commodity computers, each with a single CPU, into a single functional distributed system; the clustered machines can read the data set in parallel and provide much higher throughput. This is a major motivation for using Hadoop, which runs across clusters of low-cost machines.
Hadoop runs code across a cluster of computers. This process includes the following core tasks that Hadoop performs:
- Data is first divided into directories and files. Files are split into uniformly sized blocks of 128 MB or 64 MB (preferably 128 MB); see the block-inspection sketch after this list.
- These files are then distributed across different cluster nodes for further processing.
- HDFS, which sits on top of the local file system, supervises the processing.
- Blocks are replicated so that the data remains usable when a hardware failure occurs.
- Hadoop checks that the code has been executed successfully.
- It performs the sort that takes place between the Map and Reduce stages.
- The sorted data is sent to a specific computer.
- It writes debugging logs for each job.
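The block-inspection sketch referred to above uses the FileSystem API to report a file's block size, replication factor, and block locations. The file path is hypothetical; the calls used (getFileStatus, getFileBlockLocations) belong to Hadoop's public Java API.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockInfo {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path file = new Path("/user/demo/large-input.txt"); // hypothetical path

            FileStatus status = fs.getFileStatus(file);
            System.out.println("Block size  : " + status.getBlockSize() + " bytes");
            System.out.println("Replication : " + status.getReplication());

            // List the blocks of the file and the DataNodes holding each replica.
            BlockLocation[] blocks =
                    fs.getFileBlockLocations(status, 0, status.getLen());
            for (int i = 0; i < blocks.length; i++) {
                System.out.println("Block " + i + " hosts: "
                        + String.join(", ", blocks[i].getHosts()));
            }
        }
    }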
Advantages of Hadoop
- The Hadoop framework allows the user to quickly write and test distributed systems. It is efficient, and it automatically distributes the data and the work across the machines, in turn exploiting the underlying parallelism of the CPU cores.
- Hadoop does not rely on hardware to provide fault tolerance and high availability (FTHA); instead, the Hadoop library itself is designed to detect and handle failures at the application layer.
- Servers can be added to or removed from the cluster dynamically, and Hadoop continues to operate without interruption.
- Another great benefit of Hadoop is that, apart from being open source, it is compatible with all platforms because it is Java-based.