What is MapReduce?
MapReduce is a processing technique and a programming model for distributed computing, based on Java. The MapReduce algorithm contains two important tasks, namely Map and Reduce. Map takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs). Reduce takes the output from a Map as input and combines those tuples into a smaller set of tuples. As the name MapReduce implies, the Reduce task is always performed after the Map task.
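To make the two tasks concrete, below is a minimal sketch of the classic word-count example, written against Hadoop's `org.apache.hadoop.mapreduce` API. The class names `WordCountMapper` and `WordCountReducer` are illustrative, not part of Hadoop itself: the mapper emits a (word, 1) tuple for every word in a line, and the reducer combines all tuples that share a word into a single (word, total) tuple.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map: convert each input line into (word, 1) tuples.
class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(line.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE); // emit one key/value tuple per word
        }
    }
}

// Reduce: combine the tuples for each word into a smaller set of tuples.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : counts) {
            sum += count.get(); // add up the 1s emitted by the mappers
        }
        context.write(word, new IntWritable(sum)); // one (word, total) tuple
    }
}
```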
The major advantage of MapReduce is that it is easy to scale data processing over multiple computing nodes. Under the MapReduce model, the data processing primitives are called mappers and reducers. Decomposing a data processing application into mappers and reducers is sometimes nontrivial. But once we have written an application in the MapReduce form, scaling it to run over hundreds, thousands, or even tens of thousands of machines in a cluster is merely a configuration change. This simple scalability is what has attracted many programmers to the MapReduce model.
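As a small illustration of that configuration change, Hadoop's `mapreduce.framework.name` property selects where the very same, unmodified job runs; the sketch below assumes an otherwise default configuration.

```java
import org.apache.hadoop.conf.Configuration;

public class ScalingExample {
    public static Configuration clusterConf() {
        // The application code stays the same; this single property decides
        // whether the job runs in one local JVM ("local") or is distributed
        // across a YARN cluster ("yarn").
        Configuration conf = new Configuration();
        conf.set("mapreduce.framework.name", "yarn");
        return conf;
    }
}
```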
The Algorithm
* Generally, the MapReduce paradigm is based on sending the computation to the computers where the data resides.
* The MapReduce program executes in three stages, namely the map stage, the shuffle stage, and the reduce stage.
** Map stage - The map or mapper's job is to process the input data. Generally, the input data is in the form of a file or directory and is stored in the Hadoop Distributed File System (HDFS). The input file is passed to the mapper function line by line. The mapper processes the data and creates several small chunks of data.
** Reduce stage - This stage is the combination of the shuffle stage and the reduce stage. The reducer's job is to process the data that comes from the mapper. After processing, it produces a new set of output, which is stored in HDFS.
* During a MapReduce job, Hadoop sends the Map and Reduce tasks to the appropriate servers in the cluster; a minimal driver that configures and submits such a job is sketched after this list.
* The framework manages all the details of data-passing, such as issuing tasks, verifying task completion, and copying data around the cluster between the nodes.
* Most of the computing takes place on nodes with data on local disks, which reduces the network traffic.
* After completion of the given tasks, the cluster collects and reduces the data to form an appropriate result, and sends it back to the Hadoop server.
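Putting the steps above together, here is a minimal driver, assuming the `WordCountMapper` and `WordCountReducer` classes sketched earlier, that configures a job and submits it to the cluster. The framework handles the shuffle stage, task placement, and data movement on its own.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);

        // Wire the map and reduce stages together; the framework inserts
        // the shuffle stage between them automatically.
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Input and output live in HDFS; Hadoop schedules each map task on
        // a node that already holds the corresponding input block.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Submit the job to the cluster and wait for the result.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

A job like this is typically packaged into a jar and launched with `hadoop jar wordcount.jar WordCountDriver <input_dir> <output_dir>`, where the two directories are HDFS paths.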