What is Hadoop?

Hadoop is an open source distributed processing framework that manages data processing and storage for big data applications in scalable clusters of computer servers. It's at the center of an ecosystem of big data technologies that are primarily used to support data science and advanced analytics initiatives, including predictive analytics, data mining, machine learning and deep learning.

Hadoop systems can handle various forms of structured, semistructured and unstructured data, giving users more flexibility for collecting, managing and analyzing data than relational databases and data warehouses provide. Hadoop's ability to process and store different types of data makes it a particularly good fit for big data environments. Such environments typically involve not only large amounts of data, but also a mix of transaction data, internet clickstream records, web server and mobile application logs, social media posts, customer emails, sensor data from the internet of things (IoT) and more.

Formally known as Apache Hadoop, the technology is developed as part of an open source project within the Apache Software Foundation. Multiple vendors offer commercial Hadoop distributions, although the number of Hadoop vendors has declined because of an overcrowded market and competitive pressures driven by the increased deployment of big data systems in the cloud.

The shift to the cloud also enables users to store data in lower-cost cloud object storage services instead of Hadoop's namesake file system. As a result, Hadoop's role has been reduced in many big data architectures and the framework has been partially eclipsed by other technologies, such as the Apache Spark processing engine and the Apache Kafka event streaming platform.


How does Hadoop work for big data management and analytics?

Hadoop runs on commodity servers and can scale up to support thousands of hardware nodes. Its file system is designed to provide rapid data access across the nodes in a cluster, plus fault-tolerant capabilities so applications can continue to run if individual nodes fail. Those features helped Hadoop become a foundational data management platform for big data analytics uses after it emerged in the mid-2000s.

Because Hadoop can process and store such a wide assortment of data, it enables organizations to set up data lakes as expansive reservoirs for incoming streams of information. In a Hadoop data lake, raw data is often stored as is so data scientists and other analysts can access the full data sets, if need be; the data is then filtered and prepared by analytics or data management teams to support different applications.
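
As a rough illustration of that ingest-first, process-later pattern, the sketch below uses Hadoop's Java FileSystem API to copy a local log file into a raw-zone directory in HDFS. It is a minimal sketch: the NameNode address, directory layout and file names are placeholder assumptions for illustration, not values taken from any particular deployment.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RawIngest {
    public static void main(String[] args) throws IOException {
        // Placeholder NameNode address; replace with your cluster's fs.defaultFS value.
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020");

        try (FileSystem fs = FileSystem.get(conf)) {
            // Hypothetical "raw zone" layout for a data lake; adjust to your own conventions.
            Path rawZone = new Path("/datalake/raw/clickstream/2024-01-01");
            fs.mkdirs(rawZone);

            // Copy a local log file into HDFS as is, with no upfront transformation.
            Path target = new Path(rawZone, "clickstream.log");
            fs.copyFromLocalFile(new Path("/var/log/app/clickstream.log"), target);

            // HDFS replicates each block (three copies by default), so the raw data
            // remains available if individual DataNodes fail.
            fs.setReplication(target, (short) 3);
        }
    }
}
```

Analytics or data management teams can later read the same raw files back for filtering and preparation, which is what lets the decision about how to process the data be deferred until it is actually needed.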

Data lakes generally serve different purposes than traditional data warehouses that hold cleansed sets of transaction data. But the growing role of big data analytics in business decision-making has made effective data governance and data security processes a priority in Hadoop deployments. Hadoop can also be used in data lakehouses, a newer type of platform that combines the key features of data lakes and data warehouses, although they're more commonly built on top of cloud object storage.

Hadoop's 4 main modules

Hadoop includes the following four modules as its primary components:

  • Hadoop Distributed File System (HDFS). It's the primary data storage system for Hadoop applications and also manages access to the data in clusters. HDFS is built around an architecture of NameNodes and DataNodes. Each cluster includes one NameNode, a master node that manages the file system's namespace and controls file access, plus multiple DataNodes that manage storage on the servers that make up the cluster.

  • Hadoop YARN. Short for Yet Another Resource Negotiator, but typically referred to by the acronym alone, YARN is Hadoop's cluster resource management and job scheduling technology. In Hadoop clusters, YARN sits between HDFS and the processing engines deployed by users. It uses a combination of containers, application coordinators and node-level monitoring agents to dynamically allocate cluster resources to applications and oversee the execution of processing jobs. YARN supports multiple job scheduling approaches, including a first-in, first-out queue and several methods that schedule jobs based on assigned cluster resources.
  • Hadoop MapReduce. This is a built-in programming framework and processing engine for running batch applications that group together various transactions or processes. As its name indicates, MapReduce uses map and reduce functions to split processing jobs into multiple tasks that run at the cluster nodes where data is stored and then to combine what each task produces into a coherent set of results (a minimal word count sketch follows this list). It supports parallel processing of large data sets across Hadoop clusters in a fault-tolerant way.
  • Hadoop Common. A set of shared utilities and libraries, Hadoop Common supports the other modules by providing various cluster configuration, management and security features. Examples include a key management server, a secure mode for user authentication and access, a registry of application services running in a cluster and a service-level authorization mechanism to help ensure that clients have permission to access particular services.
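
To make the map-and-reduce flow concrete, below is a minimal word count sketch written against Hadoop's Java MapReduce API, essentially the canonical introductory example. The input and output paths are assumed to be passed as command-line arguments; everything else follows the standard Mapper and Reducer pattern described above.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: runs on the nodes where the input blocks are stored and
    // emits a (word, 1) pair for every token it sees.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reduce phase: combines the per-task outputs into one total count per word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // optional local pre-aggregation
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. an HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // must not already exist
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Packaged into a JAR and launched with the hadoop jar command, the job is submitted to YARN, which allocates containers for the map and reduce tasks across the cluster and reschedules tasks from failed nodes elsewhere.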

Hadoop's benefits for users

Despite the emergence of alternative options, especially in the cloud, Hadoop can still benefit big data users for the following reasons:

  • It can store and process vast amounts of structured, semistructured and unstructured data.
  • It protects applications and data processing against hardware failures. If one node in a cluster goes down, Hadoop automatically redirects processing jobs to other nodes to ensure that applications continue to run.
  • It doesn't require that data be processed before being stored. Organizations can store raw data in HDFS and decide later how to process and filter it for specific analytics uses.
  • It's scalable -- companies can add more nodes to clusters when needed to handle more data or increased processing workloads.
  • It can support real-time analytics applications to help drive better operational decision-making, as well as batch workloads for historical analysis.

Overall, Hadoop enables organizations to collect, store and analyze more data, which can expand analytics applications and provide information to business executives, managers and workers that they previously couldn't get.
