Hadoop

What is Apache Hadoop?

Apache Hadoop software is an open source framework that allows for the distributed storage and processing of large datasets across clusters of computers using simple programming models. Hadoop is designed to scale up from a single computer to thousands of clustered computers, with each machine offering local computation and storage. In this way, Hadoop can efficiently store and process large datasets ranging in size from gigabytes to petabytes of data.

Hadoop history


Hadoop has its origins in the early era of the World Wide Web. As the web grew to millions and then billions of pages, the task of searching and returning results became one of its most prominent challenges. Companies like Google, Yahoo, and AltaVista began building frameworks to automate search. One such project, Nutch, was built by computer scientists Doug Cutting and Mike Cafarella based on Google’s early work on MapReduce (more on that later) and the Google File System. Nutch eventually moved to the Apache Software Foundation, and the project was split in two: Nutch kept the web crawler and search components, while the distributed storage and processing portion became Hadoop. Yahoo, where Cutting began working in 2006, open sourced Hadoop in 2008.

While Hadoop is sometimes referred to as an acronym for High Availability Distributed Object Oriented Platform, it was originally named after Cutting’s son’s toy elephant.

Hadoop defined


Hadoop is an open source framework based on Java that manages the storage and processing of large amounts of data for applications. Hadoop uses distributed storage and parallel processing to handle big data and analytics jobs, breaking each workload down into smaller tasks that can be run at the same time.

Four modules comprise the primary Hadoop framework and work collectively to form the Hadoop ecosystem:

Hadoop Distributed File System (HDFS): As the primary component of the Hadoop ecosystem, HDFS is a distributed file system in which individual Hadoop nodes operate on data that resides in their local storage. This data locality reduces network latency, providing high-throughput access to application data. In addition, administrators don’t need to define schemas up front.

Yet Another Resource Negotiator (YARN): YARN is a resource-management platform responsible for managing compute resources in clusters and using them to schedule users’ applications. It performs scheduling and resource allocation across the Hadoop system.

MapReduce: MapReduce is a programming model for large-scale data processing. In the MapReduce model, subsets of a larger dataset and instructions for processing those subsets are dispatched to multiple nodes, and each subset is processed in parallel with the others. After processing, the results from the individual subsets are combined into a smaller, more manageable result (see the word-count sketch after this list).

Hadoop Common: Hadoop Common includes the libraries and utilities used and shared by other Hadoop modules.
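
To make the MapReduce model concrete, below is the classic word-count example written against Hadoop’s Java MapReduce API: the map step turns each node’s slice of the input into (word, 1) pairs, and the reduce step sums the counts for each word. The input and output paths are illustrative placeholders passed on the command line.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map step: each node scans its subset of the input and emits a (word, 1) pair per word.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce step: all counts emitted for the same word are combined into a single total.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // pre-aggregates on each node to cut shuffle traffic
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory (for example, on HDFS)
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory; must not already exist
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}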

Beyond HDFS, YARN, and MapReduce, the entire Hadoop open source ecosystem continues to grow and includes many tools and applications to help collect, store, process, analyze, and manage big data. These include Apache Pig, Apache Hive, Apache HBase, Apache Spark, Presto, and Apache Zeppelin.

How does Hadoop work?


Hadoop allows for the distribution of datasets across a cluster of commodity hardware. Processing is performed in parallel on multiple servers simultaneously.

Software clients input data into Hadoop. HDFS handles the metadata and the distributed file system, YARN schedules and divides the jobs across the computing cluster, and MapReduce then processes and converts the data.
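
As a minimal sketch of the first step, a client can copy a local file into HDFS through Hadoop’s Java FileSystem API. The NameNode address and file paths below are placeholder assumptions; a real client would normally pick them up from the cluster’s core-site.xml.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsPut {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Placeholder NameNode URI; in practice this comes from core-site.xml (fs.defaultFS).
    FileSystem fs = FileSystem.get(new URI("hdfs://namenode:8020"), conf);
    // HDFS splits the file into blocks and replicates each block across DataNodes.
    fs.copyFromLocalFile(new Path("/tmp/sales.csv"), new Path("/data/sales.csv"));
    fs.close();
  }
}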

All Hadoop modules are designed with a fundamental assumption that hardware failures of individual machines or racks of machines are common and should be automatically handled in software by the framework.
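
HDFS, for example, tolerates machine failures by keeping multiple copies of every block on different DataNodes. As a minimal sketch, the standard dfs.replication property in hdfs-site.xml controls how many copies are kept (3 is the HDFS default):

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value> <!-- each HDFS block is stored on 3 different DataNodes -->
  </property>
</configuration>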

What are the benefits of Hadoop?

Scalability

Hadoop is important as one of the primary tools to store and process huge amounts of data quickly. It does this by using a distributed computing model whose processing capacity can be rapidly scaled by adding computing nodes.

Low cost

As an open source framework that can run on commodity hardware and has a large ecosystem of tools, Hadoop is a low-cost option for the storage and management of big data.

Flexibility

Hadoop allows for flexibility in data storage, as data does not require preprocessing before it is stored. This means an organization can store as much data as it likes and decide how to use it later.
