Big Data #1

Hadoop review

Hadoop is an open-source framework for distributed storage and processing of large datasets. It was originally created by Doug Cutting and Mike Cafarella and is now maintained by the Apache Software Foundation. Hadoop is designed to handle massive amounts of data across a cluster of commodity hardware, making it a key technology for big data processing and analytics.

The core components of Hadoop include:

Hadoop Distributed File System (HDFS): HDFS is a distributed file system designed to store large files across multiple machines in a fault-tolerant manner. It breaks data into blocks and replicates them across the cluster to ensure data durability and availability (a short client-side sketch follows this list).

MapReduce: MapReduce is a programming model and processing engine for distributed data processing in Hadoop. It allows users to write programs that process and analyze large datasets by dividing the work into smaller map and reduce tasks and distributing them across the cluster (see the word-count sketch after this list).

YARN (Yet Another Resource Negotiator): YARN is a resource management and job scheduling component in Hadoop. It helps manage and allocate cluster resources to different applications, including MapReduce, Spark, and others, making the cluster more versatile and efficient.

Hadoop Common: This component includes libraries and utilities that are common to all Hadoop modules. It provides essential tools and interfaces for Hadoop's functioning.
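To make the HDFS item above more concrete, here is a minimal sketch that uses the Java FileSystem client (provided by Hadoop Common) to look at how a file is stored. The path /data/events.txt is only a placeholder, and the block size and replication factor it prints depend on the cluster's configuration.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsBlockInfo {
    public static void main(String[] args) throws Exception {
        // Configuration picks up core-site.xml / hdfs-site.xml from the classpath;
        // both Configuration and the FileSystem abstraction come from Hadoop Common.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Placeholder path used only for illustration.
        Path file = new Path("/data/events.txt");
        FileStatus status = fs.getFileStatus(file);

        // HDFS splits the file into blocks and replicates each block
        // across DataNodes for durability and availability.
        System.out.println("length:      " + status.getLen());
        System.out.println("block size:  " + status.getBlockSize());
        System.out.println("replication: " + status.getReplication());

        fs.close();
    }
}
```

The same FileSystem abstraction also works against the local file system, which makes it convenient to test this kind of code outside a cluster.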
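For the MapReduce item, the sketch below is essentially the classic word-count example from the Hadoop tutorials: the mapper emits (word, 1) pairs, the reducer sums them per word, and the input and output paths are taken from the command line.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: emit (word, 1) for every word in the input split.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: sum the counts emitted for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Packaged into a jar and submitted with hadoop jar, a job like this has its map and reduce tasks scheduled across the cluster by YARN, which is how the components above work together.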

Hadoop is often used in conjunction with other technologies and frameworks, such as Apache Hive (for SQL-like queries), Apache Pig (for data transformation), and Apache Spark (for in-memory data processing), to perform various data processing and analytics tasks on large datasets.
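As a rough illustration of how Spark fits into that ecosystem, the sketch below reads a file from HDFS into memory and counts its lines using Spark's Java API. The path is again a placeholder; in practice such a job would typically be launched on YARN with spark-submit.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.SparkSession;

public class SparkOnHdfs {
    public static void main(String[] args) {
        // Session for illustration; on a cluster this would usually be
        // submitted with spark-submit --master yarn.
        SparkSession spark = SparkSession.builder()
                .appName("SparkOnHdfs")
                .getOrCreate();

        // Read a text file stored in HDFS into a Dataset (placeholder path).
        Dataset<String> lines = spark.read().textFile("hdfs:///data/events.txt");

        // cache() keeps the data in memory across repeated actions,
        // which is where Spark's in-memory processing pays off.
        lines.cache();
        System.out.println("line count: " + lines.count());

        spark.stop();
    }
}
```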

Hadoop has had a significant impact on the world of big data and has been instrumental in enabling organizations to store, process, and analyze vast amounts of data for insights and decision-making. It is worth noting, however, that the big data ecosystem has continued to evolve, and other technologies and frameworks have emerged alongside Hadoop or, in some use cases, superseded it.

In the next article I will write about the architecture of HDFS.
