The Best Guide to Hadoop MapReduce.

1. Objective

This MapReduce tutorial describes the core concepts of MapReduce in detail: what the Mapper and Reducer are, and how shuffling and sorting work. This comprehensive guide to MapReduce also covers its internals, data flow, architecture, and data locality.

2. MapReduce Introduction

Let’s discuss what MapReduce is, how it divides work into sub-tasks, and why MapReduce is one of the best paradigms for processing data in a distributed fashion:

MapReduce is the processing layer of Hadoop. MapReduce is a programming model designed for processing large volumes of data in parallel by dividing the work into a set of independent tasks. You only need to express your business logic in the way MapReduce works; everything else is taken care of by the framework. The work (the complete job) submitted by the user to the master is divided into small pieces (tasks) and assigned to slaves.

MapReduce programs are written in a particular style influenced by functional programming constructs, specifically idioms for processing lists of data. In MapReduce, the input arrives as a list and is converted into an output that is again a list. It is the heart of Hadoop: Hadoop owes much of its power and efficiency to MapReduce, because this is where parallel processing happens.
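
To make this list-in, list-out idiom concrete, here is a tiny framework-free sketch in plain Java (an illustration only, not Hadoop API code): the "map" step turns a list of lines into individual words, and the "reduce" step folds the occurrences of each word into a count.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class ListIdiomDemo {
    public static void main(String[] args) {
        List<String> lines = List.of("hadoop map reduce", "map reduce");

        // "Map": each input line is split into a list of words.
        // "Reduce": the words are grouped and their occurrences summed.
        Map<String, Long> counts = lines.stream()
                .flatMap(line -> Stream.of(line.split("\\s+")))
                .collect(Collectors.groupingBy(w -> w, Collectors.counting()));

        System.out.println(counts); // e.g. {hadoop=1, map=2, reduce=2} (order may vary)
    }
}
```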

3. MapReduce – High-Level Understanding

Let’s understand the basics of MapReduce at a high level: what MapReduce looks like, and the what, why, and how of MapReduce.

MapReduce divides the work into small parts, each of which can be executed in parallel on a cluster of servers. The problem is divided into a large number of smaller problems, each of which is processed independently to produce an individual output. These individual outputs are then processed further to produce the final output.

MapReduce is highly scalable and can be used across many computers. Many small machines can be used to process jobs that could not normally be handled by a single large machine.

4. MapReduce Terminologies

Understand the different terminologies and concepts of MapReduce: what Map and Reduce are, and what a job, task, and task attempt are.

MapReduce is the data processing component of Hadoop. Conceptually, MapReduce programs transform lists of input data elements into lists of output data elements. A MapReduce program does this twice, using two different list-processing idioms:

  • Map
  • Reduce

In between Map and Reduce there is a small phase called Shuffle and Sort.

Let’s understand the basic terminologies used in MapReduce.

Job – A “full program”: an execution of a Mapper and Reducer across a data set. It is the execution of the two processing layers, i.e. mapper and reducer.

A MapReduce job is the work that the client wants to be performed. It consists of the input data, the MapReduce program, and configuration information. The client therefore needs to submit the input data, write the MapReduce program, and set the configuration information (some of it is provided during Hadoop setup in the configuration files, and some is specified in the program itself, specific to that MapReduce job).
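
As a rough sketch, a driver class tying those three pieces together (input data, the MapReduce program, and job configuration) might look like the following; WordCountMapper and WordCountReducer refer to the mapper and reducer sketched in sections 5 and 6 below.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration(); // picks up the cluster configuration files
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);

        // The MapReduce program: your Mapper and Reducer classes.
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // The input data and the output location.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Packaged into a jar, a driver like this would typically be run with something like `hadoop jar wordcount.jar WordCountDriver /input /output`.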

Task – An execution of a Mapper or a Reducer on a slice of data. It is also called a Task-In-Progress (TIP), meaning that the processing of data is in progress, either on a mapper or a reducer.

Task Attempt – A particular instance of an attempt to execute a task on a node. Any machine can go down at any time; for example, if a node fails while processing data, the framework reschedules the task on some other node. This rescheduling cannot go on indefinitely; there is an upper limit, and by default a task is attempted 4 times. If a task (mapper or reducer) fails 4 times, the job is considered a failed job. For a high-priority or very large job, this limit can be raised.
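
The retry limit can be raised per job through configuration. A minimal sketch, assuming the Hadoop 2.x property names mapreduce.map.maxattempts and mapreduce.reduce.maxattempts:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class RetryConfigDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Allow each map or reduce task up to 8 attempts (the default is 4)
        // before the whole job is marked as failed.
        conf.setInt("mapreduce.map.maxattempts", 8);
        conf.setInt("mapreduce.reduce.maxattempts", 8);
        Job job = Job.getInstance(conf, "high-priority job");
        // ... set the mapper, reducer, input and output paths as usual ...
    }
}
```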

Install Hadoop and play with MapReduce.

5. Map Abstraction

Let us understand the abstract form of Map, the first phase of the MapReduce paradigm: what a map/mapper is, what input the mapper receives, how it processes the data, and what output it produces:

Map takes a key/value pair as input. Whether the data is in a structured or unstructured format, the framework converts the incoming data into keys and values.

  • Key is a reference to the input value
  • Value is the data set on which to operate

Map Processing:

  • Function defined by the user – the user can write custom business logic, according to their requirements, to process the data.
  • It is applied to every value in the input

Map produces a new list of key/value pairs:

  • The output of Map is called the intermediate output
  • It can be a different type from the input pair
  • The output of the map is stored on local disk, from where it is shuffled to the reduce nodes
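
Putting the map abstraction into code: a minimal word-count Mapper sketch against Hadoop's Java API, assuming the default TextInputFormat (so the input key is the line's byte offset and the value is the line itself).

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Input: (byte offset, line of text). Output: (word, 1) for every word in the line.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // User-defined logic: split the line and emit one intermediate pair per word.
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE); // intermediate output, spilled to local disk
            }
        }
    }
}
```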

6. Reduce Abstraction

Now let’s discuss the second phase of MapReduce, the Reducer: what is given as input to the reducer, what work the reducer does, and where the reducer writes its output:

Reduce takes intermediate key/value pairs as input and processes the output of the mapper. Usually, the reducer performs aggregation or summation-style computation.

  • The input given to the reducer is generated by Map (the intermediate output)
  • The key/value pairs provided to the reducer are sorted by key

Reduce Processing:

  • Function defined by the user – here, too, the user can write custom business logic and produce the final output.
  • An iterator supplies the values for a given key to the Reduce function.

Reduce produces the final list of key/value pairs:

  • The output of Reduce is called the final output
  • It can be a different type from the input pair
  • The output of Reduce is stored in HDFS
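
A matching word-count Reducer sketch: the framework hands it each key together with an iterator over all the values emitted for that key, and it writes the summed count as final output.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Input: (word, [1, 1, ...]) grouped and sorted by key. Output: (word, total count).
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // User-defined aggregation: sum the counts supplied by the iterator.
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        result.set(sum);
        context.write(key, result); // final output, written to HDFS
    }
}
```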

7. How Map and Reduce Work Together

Let us understand how map and reduce work together in Hadoop:

The input data given to the mapper is processed by a user-defined function written at the mapper. All the required complex business logic is implemented at the mapper level, so that the heavy processing is done by mappers in parallel, since the number of mappers is much greater than the number of reducers. The mapper generates output, which is intermediate data, and this output goes as input to the reducer.

This intermediate result is then processed by a user-defined function written at the reducer, and the final output is generated. Usually, very light processing is done in the reducer. This final output is stored in HDFS, and replication is done as usual.

8. Data Flow in MapReduce

Now let’s understand the complete end-to-end data flow: how input is given to the mapper, how mappers process data in a distributed fashion, where mappers write their data, how data is shuffled from mapper to reducer nodes, where reducers run, and what type of processing should be done in the reducers.
