When is MapReduce not suitable for processing?
[Header image credit: Hevodata.com]


A brief overview of MapReduce before answering this question:

MapReduce is a powerful programming model in the Hadoop framework, used to process and generate big data sets (large volumes of data) with a parallel, distributed algorithm on a Hadoop cluster. It has two phases - Map and Reduce (a minimal sketch follows the list below):

  • Map - applies the business logic: filtering, sorting, and transforming each input record into key-value pairs.
  • Reduce - performs aggregation/summary operations on the Map output.
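
To make the two phases concrete, here is a minimal, self-contained Python sketch of the classic word-count pattern. The in-process sort/group step stands in for Hadoop's shuffle; on a real cluster each phase runs in parallel across many nodes, and the sample data below is made up for illustration.

    from itertools import groupby
    from operator import itemgetter

    def map_phase(lines):
        # Map: emit a (key, value) pair for every word in the input.
        for line in lines:
            for word in line.split():
                yield (word.lower(), 1)

    def reduce_phase(pairs):
        # Shuffle/sort: group all values emitted for the same key,
        # as Hadoop does between the Map and Reduce phases.
        grouped = groupby(sorted(pairs, key=itemgetter(0)), key=itemgetter(0))
        # Reduce: aggregate the values for each key.
        for key, group in grouped:
            yield key, sum(value for _, value in group)

    if __name__ == "__main__":
        sample = ["big data needs big clusters", "big data is big"]
        for word, count in reduce_phase(map_phase(sample)):
            print(word, count)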

[Image credit: Edupristine.com]

Now coming to the question: MapReduce works great when used as intended, but there are a few scenarios where it is not suitable or recommended -

  1. When a response is needed within a few seconds, i.e. real-time processing.
  2. When streaming data needs to be handled. MapReduce is best for batch processing of huge amounts of data that already exist on HDFS.
  3. When the data volume is not large enough to require a distributed system. In that case a standalone system is the better choice, as it is easier to configure and manage.
  4. When graphs need to be processed.
  5. When data needs to be processed again and again, i.e. iteratively. Spark is more suitable in this case (see the sketch after this list).
  6. When a lot of data has to be shuffled over the network during processing.
  7. When there is an OLTP need.
  8. When the Map phase generates too many keys. In that case sorting takes a lot of time.
  9. When joining two large datasets with complex conditions.
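
For case 5, here is a hedged PySpark sketch, assuming PySpark is installed locally (pip install pyspark); the dataset and iteration count are hypothetical. A chain of MapReduce jobs would re-read the input from HDFS on every pass, while Spark reads it once and reuses it from memory.

    from pyspark import SparkContext

    sc = SparkContext("local[*]", "iteration-sketch")

    # Hypothetical dataset: a million numbers, pinned in memory with cache().
    numbers = sc.parallelize(range(1_000_000)).cache()

    total = 0.0
    for _ in range(10):  # ten full passes over the same cached data
        total += numbers.map(lambda x: x * 0.001).sum()

    print(total)
    sc.stop()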

There might be other cases as well, but much depends on how efficiently MapReduce is used.

To discuss further: the main reason for the long response/processing times in MR is that all intermediate results are written to disk, and to process an intermediate dataset further it has to be read back from disk. So many read/write I/O operations happen during MapReduce processing, and they consume a lot of time.

Spark takes care of this problem by processing and keeping all intermediate results in memory. With this simple architectural change alone, Spark became many times faster than MR. There are of course other features in Spark that make it faster, but this is the most fundamental one. I may cover Spark in a separate article.
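
As an illustration of that architectural difference, the sketch below (again PySpark, assumed installed, with made-up data) chains two aggregations. In MapReduce these would be two separate jobs, with the first job's output written to HDFS and re-read by the second; Spark keeps the intermediate result in memory.

    from pyspark import SparkContext

    sc = SparkContext("local[*]", "pipeline-sketch")

    counts = (
        sc.parallelize(["big data needs big clusters", "big data is big"])
          .flatMap(lambda line: line.split())   # tokenize
          .map(lambda word: (word, 1))          # emit pairs
          .reduceByKey(lambda a, b: a + b)      # job 1 in MapReduce terms
    )

    # Second aggregation: how many words occur with each frequency.
    # In MapReduce this would be a second job reading job 1's HDFS output;
    # Spark pipelines it over the in-memory 'counts'.
    freq_of_freq = (
        counts.map(lambda kv: (kv[1], 1))
              .reduceByKey(lambda a, b: a + b)  # job 2 in MapReduce terms
    )
    print(freq_of_freq.collect())
    sc.stop()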

#hadoop #mapreduce #bigdata #spark #learning

Satish Prasad

RPA Solutions Consultant specializing in Hyperautomation & Intelligent Automation

3y

Welcome to the club

Vivek Dabas

Associate(Technology) at Goldman Sachs | NSIT'18

3y

Apache Flink overcomes these disadvantages of MapReduce.
