Apache Hadoop

Apache Hadoop is a collection of open-source software utilities that facilitates using a network of many computers to solve problems involving massive amounts of data and computation. It provides a software framework for distributed storage and processing of big data using the MapReduce programming model. Hadoop was originally designed for computer clusters built from commodity hardware, which is still the common use.[3] It has since also found use on clusters of higher-end hardware. All the modules in Hadoop are designed with a fundamental assumption that hardware failures are common occurrences and should be automatically handled by the framework.

The core of Apache Hadoop consists of a storage part, known as the Hadoop Distributed File System (HDFS), and a processing part, which is the MapReduce programming model. Hadoop splits files into large blocks and distributes them across nodes in a cluster. It then transfers packaged code to the nodes to process the data in parallel. This approach takes advantage of data locality, where nodes manipulate the data they have access to. This allows the dataset to be processed faster and more efficiently than it would be in a more conventional supercomputer architecture that relies on a parallel file system where computation and data are distributed via high-speed networking.
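
To make the model concrete, here is the canonical word-count job written against the Hadoop MapReduce Java API. This is a minimal sketch assuming a Hadoop 2.x client library on the classpath; the input and output paths are passed as command-line arguments and are placeholders.

```java
// Word count: map tasks run on the nodes holding each HDFS block (data
// locality); reduce tasks aggregate the partial counts.
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emits (word, 1) for every token in its input split.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reducer: sums the counts emitted for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // combiner cuts shuffle traffic
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input dir
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output dir
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

A job like this is typically packaged into a JAR and submitted with the hadoop jar command; the framework then schedules each map task on or near the DataNode that holds its block.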

History

According to its co-founders, Doug Cutting and Mike Cafarella, the genesis of Hadoop was the Google File System paper that was published in October 2003. This paper spawned another one from Google – "MapReduce: Simplified Data Processing on Large Clusters". Development started on the Apache Nutch project, but was moved to the new Hadoop subproject in January 2006. Doug Cutting, who was working at Yahoo! at the time, named it after his son's toy elephant. The initial code that was factored out of Nutch consisted of about 5,000 lines of code for HDFS and about 6,000 lines of code for MapReduce.

In March 2006, Owen O'Malley was the first committer added to the Hadoop project; Hadoop 0.1.0 was released in April 2006. It continues to evolve through contributions that are being made to the project. The first design document for the Hadoop Distributed File System was written by Dhruba Borthakur in 2007.

Architecture

Apache HDFS, or the Hadoop Distributed File System, is a block-structured file system where each file is divided into blocks of a predetermined size. These blocks are stored across a cluster of one or several machines. The Apache Hadoop HDFS architecture follows a master/slave design, where a cluster comprises a single NameNode (master node) and all the other nodes are DataNodes (slave nodes). HDFS can be deployed on a broad spectrum of machines that support Java. Though one can run several DataNodes on a single machine, in practice these DataNodes are spread across various machines.
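
As a quick illustration of how a client interacts with this architecture, the sketch below writes and then reads a file through the HDFS Java FileSystem API. The NameNode address and the paths are placeholder assumptions, not values from this article; adjust them for your cluster.

```java
// Minimal HDFS client sketch: the client contacts the NameNode only for
// metadata, while block data is streamed to and from the DataNodes.
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsClientExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Hypothetical NameNode endpoint.
    conf.set("fs.defaultFS", "hdfs://namenode-host:8020");

    FileSystem fs = FileSystem.get(conf);
    Path file = new Path("/user/demo/hello.txt");

    // Write a file; HDFS splits it into blocks behind the scenes.
    try (OutputStream out = fs.create(file, true)) {
      out.write("Hello, HDFS!\n".getBytes(StandardCharsets.UTF_8));
    }

    // Read it back and print to stdout.
    try (FSDataInputStream in = fs.open(file)) {
      IOUtils.copyBytes(in, System.out, 4096, false);
    }
    fs.close();
  }
}
```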

The NameNode is the master node in the Apache Hadoop HDFS architecture that maintains and manages the blocks present on the DataNodes (slave nodes). It is a highly available server that manages the file system namespace and controls access to files by clients. I will be discussing this High Availability feature of Apache Hadoop HDFS in my next blog. The HDFS architecture is built in such a way that user data never resides on the NameNode; the data resides on DataNodes only.

Functions of NameNode:

  • It is the master daemon that maintains and manages the DataNodes (slave nodes).
  • It records the metadata of all the files stored in the cluster, e.g. the location of stored blocks, the size of the files, permissions, hierarchy, etc. There are two files associated with the metadata:
  • FsImage: It contains the complete state of the file system namespace since the start of the NameNode.
  • EditLogs: It contains all the recent modifications made to the file system with respect to the most recent FsImage.
  • It records each change that takes place to the file system metadata. For example, if a file is deleted in HDFS, the NameNode will immediately record this in the EditLog.
  • It regularly receives a Heartbeat and a block report from all the DataNodes in the cluster to ensure that the DataNodes are live.
  • It keeps a record of all the blocks in HDFS and of which nodes these blocks are located on (see the sketch after this list).
  • The NameNode is also responsible for the replication factor of all the blocks, which we will discuss in detail later in this HDFS tutorial blog.
  • In case of DataNode failure, the NameNode chooses new DataNodes for new replicas, balances disk usage, and manages the communication traffic to the DataNodes.
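
The sketch referenced in the list above illustrates the NameNode's block map from the client side: FileSystem.getFileBlockLocations() returns, for each block of a file, the DataNodes holding its replicas. The file path is a placeholder assumption.

```java
// Asking the NameNode where a file's blocks live; answered from the
// NameNode's in-memory block map.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationsExample {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    FileStatus status = fs.getFileStatus(new Path("/user/demo/big-input.csv"));

    // One BlockLocation per block, each listing the DataNodes with a replica.
    BlockLocation[] blocks =
        fs.getFileBlockLocations(status, 0, status.getLen());
    for (BlockLocation block : blocks) {
      System.out.printf("offset=%d length=%d hosts=%s%n",
          block.getOffset(), block.getLength(),
          String.join(",", block.getHosts()));
    }
    fs.close();
  }
}
```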

DataNode:

DataNodes are the slave nodes in HDFS. Unlike the NameNode, a DataNode is commodity hardware, that is, an inexpensive system that is not of high quality or high availability. The DataNode is a block server that stores the data in a local file system such as ext3 or ext4.

Functions of DataNode:

  • These are slave daemons or processes that run on each slave machine.
  • The actual data is stored on the DataNodes.
  • The DataNodes serve the low-level read and write requests from the file system's clients.
  • They send heartbeats to the NameNode periodically to report the overall health of HDFS; by default, this interval is set to 3 seconds (a configuration sketch follows this list).
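
The configuration sketch promised above: the 3-second heartbeat is governed by the HDFS property dfs.heartbeat.interval. This is a minimal illustration assuming the standard Hadoop 2.x property name and default; production clusters normally set it in hdfs-site.xml rather than in code.

```java
// Inspecting (and, for a test setup, overriding) the DataNode heartbeat
// interval through Hadoop's Configuration API.
import org.apache.hadoop.conf.Configuration;

public class HeartbeatConfigExample {
  public static void main(String[] args) {
    Configuration conf = new Configuration();

    // Read the configured interval, falling back to the 3-second default.
    long intervalSeconds = conf.getLong("dfs.heartbeat.interval", 3);
    System.out.println("DataNode heartbeat interval: " + intervalSeconds + "s");

    // Illustrative override only; real clusters set this in hdfs-site.xml.
    conf.setLong("dfs.heartbeat.interval", 5);
  }
}
```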

By now, you must have realized that the NameNode is pretty important to us. If it fails, we are doomed. But don't worry, we will be talking about how Hadoop solved this single point of failure problem in the next Apache Hadoop HDFS Architecture blog. So, just relax for now and let's take one step at a time.

Secondary NameNode:

Apart from these two daemons, there is a third daemon or process called the Secondary NameNode. The Secondary NameNode works concurrently with the primary NameNode as a helper daemon. And don't be confused: the Secondary NameNode is not a backup NameNode.

Functions of Secondary NameNode:

  • The Secondary NameNode constantly reads the file system state and metadata from the RAM of the NameNode and writes it to the hard disk or the file system.
  • It is responsible for combining the EditLogs with the FsImage from the NameNode.
  • It downloads the EditLogs from the NameNode at regular intervals and applies them to the FsImage. The new FsImage is copied back to the NameNode and is used the next time the NameNode is started.

Hence, the Secondary NameNode performs regular checkpoints in HDFS, which is why it is also called the CheckpointNode.
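
As a rough sketch of how the checkpoint cadence is controlled, the two HDFS properties below set the time and transaction thresholds that trigger a checkpoint. The property names and defaults shown are the standard Hadoop 2.x ones; verify them against your version.

```java
// Reading the checkpoint-related HDFS properties with their usual defaults.
import org.apache.hadoop.conf.Configuration;

public class CheckpointConfigExample {
  public static void main(String[] args) {
    Configuration conf = new Configuration();

    // Maximum time between checkpoints (seconds), default 3600 (1 hour).
    long periodSeconds = conf.getLong("dfs.namenode.checkpoint.period", 3600);

    // A checkpoint is also triggered after this many uncheckpointed
    // edit-log transactions.
    long txnThreshold = conf.getLong("dfs.namenode.checkpoint.txns", 1_000_000);

    System.out.printf("Checkpoint every %ds or %d edit-log transactions%n",
        periodSeconds, txnThreshold);
  }
}
```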

Blocks:

Blocks are nothing but the smallest contiguous locations on your hard drive where data is stored. In general, in any file system, you store data as a collection of blocks. Similarly, HDFS stores each file as blocks, which are scattered throughout the Apache Hadoop cluster. The default size of each block is 128 MB in Apache Hadoop 2.x (64 MB in Apache Hadoop 1.x), which you can configure as per your requirements.
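
Below is a small sketch of setting the block size per file through the Java API, assuming a reachable HDFS cluster; the path and the 256 MB figure are illustrative.

```java
// Creating a file with an explicit 256 MB block size instead of the
// cluster-wide default (dfs.blocksize, 128 MB in Hadoop 2.x).
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeExample {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path file = new Path("/user/demo/large-file.dat");

    long blockSize = 256L * 1024 * 1024;  // 256 MB
    short replication = 3;                // default replication factor
    int bufferSize = 4096;

    // FileSystem.create lets the block size be set explicitly per file.
    try (FSDataOutputStream out =
             fs.create(file, true, bufferSize, replication, blockSize)) {
      out.writeBytes("data written into 256 MB blocks\n");
    }

    System.out.println("Cluster default block size: "
        + fs.getDefaultBlockSize(file) + " bytes");
    fs.close();
  }
}
```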

Replication Management:

HDFS provides a reliable way to store huge volumes of data in a distributed environment as data blocks. The blocks are also replicated to provide fault tolerance. The default replication factor is 3, which is again configurable. So, with the default replication factor, each block is replicated three times and stored on different DataNodes.

The NameNode also ensures that the replicas of a block are not all stored on a single rack. It follows a built-in Rack Awareness Algorithm to reduce latency as well as provide fault tolerance. Considering a replication factor of 3, the Rack Awareness Algorithm places the first replica of a block on the local rack and the next two replicas on a different (remote) rack, but on different DataNodes within that remote rack. If you have more replicas, the rest of the replicas are placed on random DataNodes, provided that no more than two replicas reside on the same rack, if possible.
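
For completeness, here is a minimal sketch of changing a file's replication factor through the Java API; the NameNode then schedules additional replicas (or trims extras) subject to rack awareness. The path and the factor of 5 are placeholders.

```java
// Raising the replication factor of one file from the default 3 to 5.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path file = new Path("/user/demo/important.parquet");

    // Request the new replication factor for this file only.
    boolean accepted = fs.setReplication(file, (short) 5);

    FileStatus status = fs.getFileStatus(file);
    System.out.printf("setReplication accepted=%b, current factor=%d%n",
        accepted, status.getReplication());
    fs.close();
  }
}
```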

Pros:

1. Cost

Hadoop is open-source and uses cost-effective commodity hardware, which provides a cost-efficient model, unlike traditional relational databases that require expensive hardware and high-end processors to deal with Big Data. The problem with traditional relational databases is that storing massive volumes of data is not cost-effective, so companies started discarding raw data, which may not give a correct picture of their business. Hadoop thus provides two main cost benefits: it is open-source and free to use, and it runs on commodity hardware, which is also inexpensive.

2. Scalability

Hadoop is a highly scalable model. A large amount of data is divided across multiple inexpensive machines in a cluster and processed in parallel. The number of these machines or nodes can be increased or decreased as per the enterprise's requirements. Traditional RDBMS (Relational Database Management System) setups cannot be scaled to handle such large amounts of data.

3. Flexibility

Hadoop is designed in such a way that it can deal with any kind of dataset, whether structured (MySQL data), semi-structured (XML, JSON), or unstructured (images and videos), very efficiently. This means it can easily process any kind of data independent of its structure, which makes it highly flexible. This is very useful for enterprises, as they can easily process large datasets and extract valuable insights from sources like social media, email, etc. With this flexibility, Hadoop can be used for log processing, data warehousing, fraud detection, and more.

4. Speed

Hadoop uses a distributed file system, HDFS (Hadoop Distributed File System), to manage its storage. In a DFS (Distributed File System), a large file is broken into small file blocks that are distributed among the nodes available in a Hadoop cluster. Because this massive number of file blocks is processed in parallel, Hadoop is faster and provides higher performance than traditional database management systems. When you are dealing with large amounts of unstructured data, speed is an important factor; with Hadoop you can easily access terabytes of data in just a few minutes.

5. Fault Tolerance

Hadoop uses commodity hardware (inexpensive systems) that can crash at any moment. In Hadoop, data is replicated on various DataNodes in a cluster, which ensures the availability of data even if one of your systems crashes. If the machine you are reading data from faces a technical issue, the data can also be read from other nodes in the Hadoop cluster, because the data is copied or replicated by default. Hadoop makes 3 copies of each file block and stores them on different nodes.

6. High Throughput

Hadoop works on a distributed file system where various jobs are assigned to various DataNodes in a cluster, and this data is processed in parallel across the Hadoop cluster, which produces high throughput. Throughput is nothing but the amount of work or the number of jobs done per unit time.

7. Minimum Network Traffic

In Hadoop, each task is divided into various small sub-tasks, which are then assigned to the DataNodes available in the Hadoop cluster. Each DataNode processes a small amount of data, which leads to low traffic in the Hadoop cluster.

Cons:

1. Problem with Small files

Hadoop can perform efficiently over a small number of large files. It stores files in the form of file blocks ranging from 128 MB (by default) to 256 MB in size. Hadoop struggles when it needs to access a large number of small files: since the NameNode keeps metadata for every file in memory, so many small files overload the NameNode and make it difficult to work efficiently.

2. Vulnerability

Hadoop is a framework written in Java, and Java is one of the most commonly used programming languages, which makes it more insecure, as it can be more easily exploited by cyber-criminals.

3. Low Performance In Small Data Surrounding

Hadoop is mainly designed for dealing with large datasets, so it can be efficiently utilized by organizations that generate a massive volume of data. Its efficiency decreases when it is used in small-data settings.

4. Lack of Security

Data is everything for an organization, yet by default the security features in Hadoop are disabled. So the data administrator needs to be careful with this security aspect and take appropriate action. Hadoop uses Kerberos for security, which is not easy to manage. Storage and network encryption are missing in Kerberos, which makes us more concerned about it.

5. High Processing Overhead

Read/write operations in Hadoop are expensive, since we are dealing with data measured in terabytes or petabytes. In Hadoop, data is read from and written to disk, which makes it difficult to perform in-memory computation and leads to processing overhead.

6. Supports Only Batch Processing

A batch process is nothing but a process that runs in the background and does not have any kind of interaction with the user. The engines used for these processes inside the Hadoop core are not that efficient for interactive workloads; producing output with low latency is not possible with them.

Features of Apache Hadoop:

1) Distributed Processing & Storage :

The framework itself provides great flexibility and manages the distributed processing and distributed storage by itself, leaving only the custom data-processing logic to be built by users. This is what made Apache Hadoop different from other distributed systems and helped it become highly popular so quickly.

2) Highly Available & Fault Tolerant :

Hadoop is highly available and provides fault tolerance both in terms of data availability and distributed processing. The data is stored in HDFS, where it automatically gets replicated at two other locations. So, even if one or two of the systems collapse, the file is still available on at least a third system. This brings a high level of fault tolerance. The highly distributed MapReduce batch processing engine provides high availability in the face of processing failures caused by hardware or machine failure.

3) Highly & Easily Scalable :

Both vertical and horizontal scaling are possible. The differentiator, however, is horizontal scaling, where new nodes can easily be added to the system on the fly as data volumes or processing needs grow, without altering anything in the existing systems or programs.

4) Data Reliability :

The data is stored reliably due to data replication in the cluster, where multiple copies are maintained on different nodes. The framework itself provides mechanisms to ensure data reliability through the Block Scanner, Volume Scanner, Directory Scanner, and Disk Checker. In case of data corruption or hardware failure, data integrity and availability are maintained.

5) Robust Ecosystem :

Hadoop has a very robust ecosystem that is well suited to meet the analytical needs of developers and small to large organizations. The Hadoop ecosystem comes with a suite of tools and technologies, making it well suited to deliver on a variety of data processing needs. Just to name a few, the Hadoop ecosystem includes projects such as MapReduce, YARN, Hive, HBase, ZooKeeper, Pig, Flume, Avro, etc., and many new tools and technologies are being added to the ecosystem as the market grows.

6) Very Cost effective :

Hadoop generates cost benefits by bringing massively parallel computing to commodity servers, resulting in a substantial reduction in the cost per terabyte of storage, which in turn makes it reasonable to model all your data.

7) Open Source :

There are no licensing costs to worry about, and the open-source community support is very strong. Hadoop can easily be adapted and built for custom requirements.

Here are examples of Hadoop use cases:

  1. Financial services companies use analytics to assess risk, build investment models, and create trading algorithms; Hadoop has been used to help build and run those applications.
  2. Retailers use it to help analyze structured and unstructured data to better understand and serve their customers.
  3. In the asset-intensive energy industry Hadoop-powered analytics are used for predictive maintenance, with input from Internet of Things (IoT) devices feeding data into big data programs.
  4. There are numerous public sector programs, ranging from anticipating and preventing disease outbreaks to crunching numbers to catch tax cheats.

Job Predictions in Big Data Analysis

By 2023, the big data analytics market is predicted to reach $103 billion. IBM predicts that demand for data scientists will soar by 28%.

The finance, insurance, and IT industries in particular account for 59% of all data scientist jobs.

Hadoop Architect

A Hadoop Architect is the one who plans and designs the Big Data Hadoop architecture. He or she carries out requirement analysis and manages development and deployment across Hadoop applications.

Big Data Analyst

A Big Data Analyst analyzes big data to evaluate a company's technical performance and give recommendations on system enhancement. They execute big data processes like text annotation, parsing, filtering, and enrichment.

Hadoop Developer

The main task of a Hadoop Developer is to develop applications on Hadoop technologies using Java, HQL, and scripting languages.

Hadoop Tester

A Hadoop Tester tests for errors and bugs and fixes them. He or she makes sure that MapReduce jobs, HiveQL scripts, and Pig Latin scripts work properly.

As far as predictions go, the big data trend is going to stretch across the globe. By picking up skills with open-source tools like Hadoop, Spark, Kafka, and Flink, one can land promising big data jobs.






