Introduction to Hadoop Ecosystem: Understanding HDFS, MapReduce, and YARN

Introduction

Welcome to the world of Hadoop! If you're diving into big data, you've likely heard about Hadoop and its powerful ecosystem. In this blog, we'll explore the core components of Hadoop: HDFS, MapReduce, and YARN. We'll break down each component, discuss their roles, and provide practical examples to help you understand how they work together to process and manage vast amounts of data.

What is Hadoop?

Hadoop is an open-source framework designed to store and process large datasets across clusters of computers. It uses simple programming models and can scale up from a single server to thousands of machines. The Hadoop ecosystem comprises various tools and components, with HDFS, MapReduce, and YARN being the foundational pillars.

HDFS (Hadoop Distributed File System)

HDFS is the storage layer of Hadoop. It is designed to store very large datasets reliably and to stream them at high bandwidth to user applications. Let's break down its key features:

  • Distributed Storage: Data is split into blocks and distributed across multiple nodes in the cluster.
  • Fault Tolerance: Each block is replicated across multiple nodes (three copies by default) to ensure reliability and availability.
  • Scalability: Can easily scale out by adding more nodes to the cluster.
  • High Throughput: Optimized for high-throughput, sequential access to large files rather than low-latency random reads.

Practical Example: Uploading a File to HDFS

# Assuming Hadoop is installed and configured
hdfs dfs -mkdir -p /user/nivas/data   # -p creates parent directories as needed
hdfs dfs -put localfile.txt /user/nivas/data

In this example, we create a directory in HDFS and upload a local file to it.
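
Once the file is in HDFS, a few more hdfs dfs commands let you verify and retrieve it. This is a minimal sketch using the same paths as above:

# List the directory to confirm the upload
hdfs dfs -ls /user/nivas/data

# Print the file's contents to the terminal
hdfs dfs -cat /user/nivas/data/localfile.txt

# Copy the file back to the local filesystem
hdfs dfs -get /user/nivas/data/localfile.txt retrieved.txt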

MapReduce

MapReduce is the processing layer of Hadoop. It is a programming model used for processing large data sets with a parallel, distributed algorithm on a cluster. The process is divided into two phases:

  • Map Phase: Processes input data and converts it into a set of key-value pairs.
  • Reduce Phase: Takes the output from the Map phase and combines those key-value pairs into a smaller set of results.

Practical Example: Word Count Program

Let's write a simple MapReduce program to count the number of occurrences of each word in a text file.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// All three pieces live in WordCount.java so the file compiles as a single unit
public class WordCount {

  // Mapper: tokenizes each input line and emits a (word, 1) pair per token
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer: sums all counts emitted for the same word
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  // Driver: configures the job and submits it to the cluster
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // combiner pre-aggregates map output
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

This program reads text files, maps each word to a count of one, and reduces those pairs to a total count per word.
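
To try it on a cluster, one common workflow (assuming the hadoop command is on your PATH; the jar name and HDFS paths are example placeholders) is to compile against the Hadoop client libraries and submit the jar:

# Compile against the Hadoop client libraries available on this machine
javac -classpath "$(hadoop classpath)" WordCount.java

# Package the compiled classes into a jar
jar cf wc.jar WordCount*.class

# Submit: args[0] is the input directory, args[1] the output directory
hadoop jar wc.jar WordCount /user/nivas/data /user/nivas/wordcount-output

Note that the job will refuse to start if the output directory already exists, so remove it (hdfs dfs -rm -r) between runs.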

YARN (Yet Another Resource Negotiator)

YARN is the resource management layer of Hadoop. It decouples resource management and job scheduling from data processing, which lets multiple processing engines (batch, interactive, stream, and real-time) run side by side on the data stored in HDFS. Key features include:

  • Resource Management: Allocates resources to various applications.
  • Job Scheduling: Manages the scheduling of tasks across the cluster.
  • Scalability: Supports dynamic resource management for improved cluster utilization.

Practical Example: Running a YARN Job

# Submit a MapReduce job to YARN
hadoop jar /path/to/your/mapreduce/program.jar input_directory output_directory

This command submits a MapReduce job to YARN, which handles the resource allocation and job scheduling.
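
Once the job is submitted, YARN assigns it an application ID (printed when the job starts), which you can use to track it from the command line. The ID below is an example placeholder, and the last command requires log aggregation to be enabled:

# List running applications and their states
yarn application -list

# Show detailed status for one application
yarn application -status application_1700000000000_0001

# Fetch aggregated logs after the application finishes
yarn logs -applicationId application_1700000000000_0001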

Interactive Template

To enhance your understanding, here's an interactive template for exploring Hadoop:

  1. Set Up Hadoop Cluster: Use virtual machines or a cloud provider to set up a small Hadoop cluster.
  2. Upload Data to HDFS: Use the provided HDFS commands to upload and manage data.
  3. Write and Run MapReduce Jobs: Implement the provided MapReduce example and run it on your cluster.
  4. Monitor with YARN: Use YARN's web interface to monitor resource usage and job progress; a few command-line checks are sketched below.
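
By default the ResourceManager web UI is served on port 8088 (and the NameNode UI on port 9870 in Hadoop 3.x). The same cluster state can also be checked from the command line:

# Summarize HDFS capacity, usage, and live DataNodes
hdfs dfsadmin -report

# List the NodeManagers registered with YARN
yarn node -list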

Conclusion

The Hadoop ecosystem is a powerful suite of tools for managing and processing big data. By understanding HDFS, MapReduce, and YARN, you can leverage Hadoop's capabilities to handle large datasets efficiently. Start experimenting with Hadoop, upload some data, write MapReduce programs, and see YARN in action. Happy data processing!
