Hadoop

What is Apache Hadoop?

Apache Hadoop software is an open source framework that allows for the distributed storage and processing of large datasets across clusters of computers using simple programming models. Hadoop is designed to scale up from a single computer to thousands of clustered computers, with each machine offering local computation and storage. In this way, Hadoop can efficiently store and process large datasets ranging in size from gigabytes to petabytes of data.

Hadoop history


Hadoop has its origins in the early era of the World Wide Web. As the web grew to millions and then billions of pages, the task of searching and returning results became one of its most prominent challenges. Companies like Google, Yahoo, and AltaVista began building frameworks to automate search. One such project, Nutch, was built by computer scientists Doug Cutting and Mike Cafarella based on Google’s early work on MapReduce (more on that later) and the Google File System. Nutch eventually moved to the Apache Software Foundation, and the project was split in two: Nutch kept the web crawler and search components, while the distributed storage and processing portion became Hadoop. Yahoo, where Cutting began working in 2006, open sourced Hadoop in 2008.

While Hadoop is sometimes referred to as an acronym for High Availability Distributed Object Oriented Platform, it was originally named after Cutting’s son’s toy elephant.

Hadoop defined


Hadoop is an open source framework based on Java that manages the storage and processing of large amounts of data for applications. Hadoop uses distributed storage and parallel processing to handle big data and analytics jobs, breaking each workload down into smaller tasks that can be run at the same time.

Four modules comprise the primary Hadoop framework and work collectively to form the Hadoop ecosystem:

Hadoop Distributed File System (HDFS): As the primary component of the Hadoop ecosystem, HDFS is a distributed file system in which individual Hadoop nodes operate on data that resides in their local storage. This data locality reduces network latency, providing high-throughput access to application data. In addition, administrators don’t need to define schemas up front.

Yet Another Resource Negotiator (YARN): YARN is a resource-management platform responsible for managing compute resources in clusters and using them to schedule users’ applications. It performs scheduling and resource allocation across the Hadoop system.

MapReduce: MapReduce is a programming model for large-scale data processing. In the MapReduce model, subsets of a larger dataset and instructions for processing those subsets are dispatched to multiple nodes, and each subset is processed in parallel with the others. After processing, the results from the individual subsets are combined into a smaller, more manageable result (see the word-count sketch after this list).

Hadoop Common: Hadoop Common includes the libraries and utilities used and shared by other Hadoop modules.
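
To make the MapReduce model concrete, below is the classic word-count example written against Hadoop’s Java MapReduce API: the map step turns each node’s slice of the input into (word, 1) pairs, and the reduce step sums the counts for each word. The input and output paths are illustrative placeholders passed on the command line.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map step: each node scans its subset of the input and emits a (word, 1) pair per word.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce step: all counts emitted for the same word are combined into a single total.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // pre-aggregates on each node to cut shuffle traffic
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory (for example, on HDFS)
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory; must not already exist
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}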

Beyond HDFS, YARN, and MapReduce, the entire Hadoop open source ecosystem continues to grow and includes many tools and applications to help collect, store, process, analyze, and manage big data. These include Apache Pig, Apache Hive, Apache HBase, Apache Spark, Presto, and Apache Zeppelin.

How does Hadoop work?


Hadoop allows for the distribution of datasets across a cluster of commodity hardware. Processing is performed in parallel on multiple servers simultaneously.

Software clients input data into Hadoop. HDFS handles the metadata and the distributed file system, YARN schedules and divides the jobs across the computing cluster, and MapReduce then processes and converts the data.
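
As a minimal sketch of the first step, a client can copy a local file into HDFS through Hadoop’s Java FileSystem API. The NameNode address and file paths below are placeholder assumptions; a real client would normally pick them up from the cluster’s core-site.xml.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsPut {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Placeholder NameNode URI; in practice this comes from core-site.xml (fs.defaultFS).
    FileSystem fs = FileSystem.get(new URI("hdfs://namenode:8020"), conf);
    // HDFS splits the file into blocks and replicates each block across DataNodes.
    fs.copyFromLocalFile(new Path("/tmp/sales.csv"), new Path("/data/sales.csv"));
    fs.close();
  }
}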

All Hadoop modules are designed with a fundamental assumption that hardware failures of individual machines or racks of machines are common and should be automatically handled in software by the framework.
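
HDFS, for example, tolerates machine failures by keeping multiple copies of every block on different DataNodes. As a minimal sketch, the standard dfs.replication property in hdfs-site.xml controls how many copies are kept (3 is the HDFS default):

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value> <!-- each HDFS block is stored on 3 different DataNodes -->
  </property>
</configuration>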

What are the benefits of Hadoop?

Scalability

Hadoop is important as one of the primary tools to store and process huge amounts of data quickly. It does this by using a distributed computing model whose processing capacity can be rapidly scaled by adding computing nodes.

Low cost

As an open source framework that can run on commodity hardware and has a large ecosystem of tools, Hadoop is a low-cost option for the storage and management of big data.

Flexibility

Hadoop allows for flexibility in data storage, as data does not require preprocessing before it is stored. This means an organization can store as much data as it likes and decide how to use it later.
