Hadoop Tutorial – A Comprehensive Guide for Beginners
Malini Shukla
1. Objective
This Hadoop tutorial provides a thorough introduction to Hadoop. It covers what Hadoop is, why it is needed, why it became so popular, the Hadoop architecture, data flow, Hadoop daemons, the different flavours of Hadoop, and an introduction to Hadoop components such as HDFS, MapReduce, and YARN.
2. Hadoop Introduction
Hadoop is an open-source tool from the ASF (Apache Software Foundation). Open source means it is freely available and its source code can be modified as per your requirements: if certain functionality does not fulfil your needs, you can change it. Much of Hadoop's code has been contributed by companies such as Yahoo, IBM, Facebook, and Cloudera.
It provides an efficient framework for running jobs on multiple nodes of a cluster, a cluster being a group of systems connected over a LAN. Hadoop processes data in parallel, as it works on multiple machines simultaneously.
It was inspired by Google, which published papers on the technologies it was using: the MapReduce programming model and the Google File System (GFS). Hadoop was originally written for the Nutch search engine project by Doug Cutting and his team, but it soon became a top-level Apache project due to its huge popularity.
Hadoop is an open-source framework written in Java, but this does not mean you can code only in Java. You can also write jobs in C, C++, Perl, Python, Ruby, and other languages (for example via Hadoop Streaming). Java is still recommended, as it gives you lower-level control over the code.
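As a rough illustration of what Java MapReduce code looks like, here is a minimal sketch of a word-count mapper. The class and field names are illustrative, not part of any particular build:

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits (word, 1) for every word in each input line; the framework
// then groups these pairs by word and hands them to a reducer.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);
        }
    }
}
```

A matching reducer would simply sum the 1s for each word; we will look at MapReduce in detail later in the tutorial.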
It efficiently processes large volumes of data on a cluster of commodity hardware. Hadoop is designed for processing huge volumes of data. Commodity hardware means low-end, inexpensive machines, which is what makes Hadoop so economical.
Hadoop can be set up on a single machine (pseudo-distributed mode), but its real power comes with a cluster of machines. It can be scaled to thousands of nodes on the fly, i.e., without any downtime: you do not need to take any system down to add more machines to the cluster.
Hadoop consists of three key parts: the Hadoop Distributed File System (HDFS), MapReduce, and YARN. HDFS is the storage layer, MapReduce is the processing layer, and YARN is the resource-management layer.
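To make the storage layer concrete, the sketch below reads a file through the HDFS Java client API. It assumes a running Hadoop installation whose configuration (core-site.xml) is on the classpath; the file path passed as an argument is illustrative:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Reads a file from HDFS and prints it to stdout.
public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();  // picks up core-site.xml from the classpath
        FileSystem fs = FileSystem.get(conf);      // handle to the configured file system (HDFS here)
        Path path = new Path(args[0]);             // e.g. /user/hadoop/input.txt (illustrative)

        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(fs.open(path), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}
```

The same code works against the local file system in pseudo-distributed mode or against a full cluster; only the configuration changes, not the program.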
3. Why Hadoop?
Let us now understand why Hadoop is so popular and why it has captured such a large share of the big data market.
Hadoop is not only a storage system but a platform for both data storage and processing. It is scalable (more nodes can be added on the fly), fault-tolerant (even if a node goes down, its data can be processed by another node), and open source (the source code can be modified if required).
The following characteristics make Hadoop a unique platform:
- Flexibility to store and mine any type of data, whether structured, semi-structured, or unstructured; it is not bound by a single schema.
- Excellence at processing complex data: its scale-out architecture divides workloads across many nodes, and its flexible file system eliminates ETL bottlenecks.
- Economical scaling: as discussed, it can be deployed on commodity hardware, and its open-source nature guards against vendor lock-in.