The Best Guide to Hadoop MapReduce.

1. Objective

This MapReduce tutorial describes the core concepts of MapReduce in detail: what the Mapper and Reducer are, and how shuffling and sorting work. This comprehensive guide to MapReduce also covers its internals, data flow, architecture, and data locality.

2. MapReduce Introduction

Let’s discuss what MapReduce is, how it divides work into sub-tasks, and why MapReduce is one of the best paradigms for processing data in a distributed fashion:

MapReduce is the processing layer of Hadoop. MapReduce is a programming model designed for processing large volumes of data in parallel by dividing the work into a set of independent tasks. You only need to express your business logic in the way MapReduce works; everything else is taken care of by the framework. The work (the complete job) submitted by the user to the master is divided into small pieces (tasks) and assigned to slaves.

MapReduce programs are written in a particular style influenced by functional programming constructs, specifically idioms for processing lists of data. In MapReduce, the input arrives as a list and is converted into an output that is again a list. It is the heart of Hadoop: Hadoop owes much of its power and efficiency to MapReduce, because this is where parallel processing happens.
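
To make this list-in, list-out idiom concrete, here is a tiny framework-free sketch in plain Java (an illustration only, not Hadoop API code): the "map" step turns a list of lines into individual words, and the "reduce" step folds the occurrences of each word into a count.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class ListIdiomDemo {
    public static void main(String[] args) {
        List<String> lines = List.of("hadoop map reduce", "map reduce");

        // "Map": each input line is split into a list of words.
        // "Reduce": the words are grouped and their occurrences summed.
        Map<String, Long> counts = lines.stream()
                .flatMap(line -> Stream.of(line.split("\\s+")))
                .collect(Collectors.groupingBy(w -> w, Collectors.counting()));

        System.out.println(counts); // e.g. {hadoop=1, map=2, reduce=2} (order may vary)
    }
}
```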

3. MapReduce – High-Level Understanding

Let’s understand the basics of MapReduce at a high level: what MapReduce looks like, and the what, why, and how of MapReduce.

MapReduce divides the work into small parts, each of which can be executed in parallel on a cluster of servers. The problem is divided into a large number of smaller problems, each of which is processed independently to produce an individual output. These individual outputs are then processed further to produce the final output.

MapReduce is highly scalable and can be used across many computers. Many small machines can be used to process jobs that could not normally be handled by a single large machine.

4. MapReduce Terminologies

Understand the different terminologies and concepts of MapReduce: what Map and Reduce are, and what a job, task, and task attempt are.

MapReduce is the data processing component of Hadoop. Conceptually, MapReduce programs transform lists of input data elements into lists of output data elements. A MapReduce program does this twice, using two different list-processing idioms:

  • Map
  • Reduce

In between Map and Reduce there is a small phase called Shuffle and Sort.

Let’s understand the basic terminologies used in MapReduce.

Job – A “full program”: an execution of a Mapper and Reducer across a data set. It is the execution of the two processing layers, i.e. mapper and reducer.

A MapReduce job is the work that the client wants to be performed. It consists of the input data, the MapReduce program, and configuration information. The client therefore needs to submit the input data, write the MapReduce program, and set the configuration information (some of it is provided during Hadoop setup in the configuration files, and some is specified in the program itself, specific to that MapReduce job).
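
As a rough sketch, a driver class tying those three pieces together (input data, the MapReduce program, and job configuration) might look like the following; WordCountMapper and WordCountReducer refer to the mapper and reducer sketched in sections 5 and 6 below.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration(); // picks up the cluster configuration files
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);

        // The MapReduce program: your Mapper and Reducer classes.
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // The input data and the output location.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Packaged into a jar, a driver like this would typically be run with something like `hadoop jar wordcount.jar WordCountDriver /input /output`.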

Task – An execution of a Mapper or a Reducer on a slice of data. It is also called a Task-In-Progress (TIP), meaning that the processing of data is in progress, either on a mapper or a reducer.

Task Attempt – A particular instance of an attempt to execute a task on a node. Any machine can go down at any time; for example, if a node fails while processing data, the framework reschedules the task on some other node. This rescheduling cannot go on indefinitely; there is an upper limit, and by default a task is attempted 4 times. If a task (mapper or reducer) fails 4 times, the job is considered a failed job. For a high-priority or very large job, this limit can be raised.
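
The retry limit can be raised per job through configuration. A minimal sketch, assuming the Hadoop 2.x property names mapreduce.map.maxattempts and mapreduce.reduce.maxattempts:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class RetryConfigDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Allow each map or reduce task up to 8 attempts (the default is 4)
        // before the whole job is marked as failed.
        conf.setInt("mapreduce.map.maxattempts", 8);
        conf.setInt("mapreduce.reduce.maxattempts", 8);
        Job job = Job.getInstance(conf, "high-priority job");
        // ... set the mapper, reducer, input and output paths as usual ...
    }
}
```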

Install Hadoop and play with MapReduce.

5. Map Abstraction

Let us understand the abstract form of Map, the first phase of the MapReduce paradigm: what a map/mapper is, what input the mapper receives, how it processes the data, and what output it produces:

Map takes a key/value pair as input. Whether the data is in a structured or unstructured format, the framework converts the incoming data into keys and values.

  • Key is a reference to the input value
  • Value is the data set on which to operate

Map Processing:

  • Function defined by the user – the user can write custom business logic, according to their requirements, to process the data.
  • It is applied to every value in the input

Map produces a new list of key/value pairs:

  • The output of Map is called the intermediate output
  • It can be a different type from the input pair
  • The output of the map is stored on local disk, from where it is shuffled to the reduce nodes
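
Putting the map abstraction into code: a minimal word-count Mapper sketch against Hadoop's Java API, assuming the default TextInputFormat (so the input key is the line's byte offset and the value is the line itself).

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Input: (byte offset, line of text). Output: (word, 1) for every word in the line.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // User-defined logic: split the line and emit one intermediate pair per word.
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE); // intermediate output, spilled to local disk
            }
        }
    }
}
```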

6. Reduce Abstraction

Now let’s discuss the second phase of MapReduce, the Reducer: what is given as input to the reducer, what work the reducer does, and where the reducer writes its output:

Reduce takes intermediate key/value pairs as input and processes the output of the mapper. Usually, the reducer performs aggregation or summation-style computation.

  • The input given to the reducer is generated by Map (the intermediate output)
  • The key/value pairs provided to the reducer are sorted by key

Reduce Processing:

  • Function defined by the user – here, too, the user can write custom business logic and produce the final output.
  • An iterator supplies the values for a given key to the Reduce function.

Reduce produces the final list of key/value pairs:

  • The output of Reduce is called the final output
  • It can be a different type from the input pair
  • The output of Reduce is stored in HDFS
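
A matching word-count Reducer sketch: the framework hands it each key together with an iterator over all the values emitted for that key, and it writes the summed count as final output.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Input: (word, [1, 1, ...]) grouped and sorted by key. Output: (word, total count).
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // User-defined aggregation: sum the counts supplied by the iterator.
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        result.set(sum);
        context.write(key, result); // final output, written to HDFS
    }
}
```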

7. How Map and Reduce Work Together

Let us understand how map and reduce work together in Hadoop:

The input data given to the mapper is processed by a user-defined function written at the mapper. All the required complex business logic is implemented at the mapper level, so that the heavy processing is done by mappers in parallel, since the number of mappers is much greater than the number of reducers. The mapper generates output, which is intermediate data, and this output goes as input to the reducer.

This intermediate result is then processed by a user-defined function written at the reducer, and the final output is generated. Usually, very light processing is done in the reducer. This final output is stored in HDFS, and replication is done as usual.

8. Data Flow in MapReduce

Now let’s understand the complete end-to-end data flow: how input is given to the mapper, how mappers process data in a distributed fashion, where mappers write their data, how data is shuffled from mapper to reducer nodes, where reducers run, and what type of processing should be done in the reducers.
