When is MapReduce not suitable for processing?
[Header image credit: Hevodata.com]


A brief overview of MapReduce before answering this question:

MapReduce is a powerful programming model in the Hadoop framework, used to process and generate big data sets (large volumes of data) with a parallel, distributed algorithm on a Hadoop cluster. It has two phases - Map and Reduce (a minimal sketch follows the list below):

  • Map - applies the business logic: filtering, sorting, and transforming each input record into key-value pairs.
  • Reduce - performs aggregation/summary operations on the Map output.
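
To make the two phases concrete, here is a minimal, self-contained Python sketch of the classic word-count pattern. The in-process sort/group step stands in for Hadoop's shuffle; on a real cluster each phase runs in parallel across many nodes, and the sample data below is made up for illustration.

    from itertools import groupby
    from operator import itemgetter

    def map_phase(lines):
        # Map: emit a (key, value) pair for every word in the input.
        for line in lines:
            for word in line.split():
                yield (word.lower(), 1)

    def reduce_phase(pairs):
        # Shuffle/sort: group all values emitted for the same key,
        # as Hadoop does between the Map and Reduce phases.
        grouped = groupby(sorted(pairs, key=itemgetter(0)), key=itemgetter(0))
        # Reduce: aggregate the values for each key.
        for key, group in grouped:
            yield key, sum(value for _, value in group)

    if __name__ == "__main__":
        sample = ["big data needs big clusters", "big data is big"]
        for word, count in reduce_phase(map_phase(sample)):
            print(word, count)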

[Image credit: Edupristine.com]

Now coming to the question: MapReduce works great when used as intended, but there are a few scenarios where it is not suitable or recommended -

  1. When a response is needed within a few seconds, i.e. real-time processing.
  2. When streaming data needs to be handled. MapReduce is best for batch processing of huge amounts of data that already exist on HDFS.
  3. When the data volume is not large enough to require a distributed system. In that case a standalone system is the better choice, as it is easier to configure and manage.
  4. When graphs need to be processed.
  5. When data needs to be processed again and again, i.e. iteratively. Spark is more suitable in this case (see the sketch after this list).
  6. When a lot of data has to be shuffled over the network during processing.
  7. When there is an OLTP need.
  8. When the Map phase generates too many keys. In that case sorting takes a lot of time.
  9. When joining two large datasets with complex conditions.
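
For case 5, here is a hedged PySpark sketch, assuming PySpark is installed locally (pip install pyspark); the dataset and iteration count are hypothetical. A chain of MapReduce jobs would re-read the input from HDFS on every pass, while Spark reads it once and reuses it from memory.

    from pyspark import SparkContext

    sc = SparkContext("local[*]", "iteration-sketch")

    # Hypothetical dataset: a million numbers, pinned in memory with cache().
    numbers = sc.parallelize(range(1_000_000)).cache()

    total = 0.0
    for _ in range(10):  # ten full passes over the same cached data
        total += numbers.map(lambda x: x * 0.001).sum()

    print(total)
    sc.stop()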

There might be other cases as well, but much depends on how efficiently MapReduce is used.

To discuss further: the main reason for the long response/processing times in MR is that all intermediate results are written to disk, and to process an intermediate dataset further it has to be read back from disk. So many read/write I/O operations happen during MapReduce processing, and they consume a lot of time.

Spark takes care of this problem by processing and keeping all intermediate results in memory. With this simple architectural change alone, Spark became many times faster than MR. There are of course other features in Spark that make it faster, but this is the most fundamental one. I may cover Spark in a separate article.
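
As an illustration of that architectural difference, the sketch below (again PySpark, assumed installed, with made-up data) chains two aggregations. In MapReduce these would be two separate jobs, with the first job's output written to HDFS and re-read by the second; Spark keeps the intermediate result in memory.

    from pyspark import SparkContext

    sc = SparkContext("local[*]", "pipeline-sketch")

    counts = (
        sc.parallelize(["big data needs big clusters", "big data is big"])
          .flatMap(lambda line: line.split())   # tokenize
          .map(lambda word: (word, 1))          # emit pairs
          .reduceByKey(lambda a, b: a + b)      # job 1 in MapReduce terms
    )

    # Second aggregation: how many words occur with each frequency.
    # In MapReduce this would be a second job reading job 1's HDFS output;
    # Spark pipelines it over the in-memory 'counts'.
    freq_of_freq = (
        counts.map(lambda kv: (kv[1], 1))
              .reduceByKey(lambda a, b: a + b)  # job 2 in MapReduce terms
    )
    print(freq_of_freq.collect())
    sc.stop()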

#hadoop #mapreduce #bigdata #spark #learning

Satish Prasad

RPA Solutions Consultant specializing in Hyperautomation & Intelligent Automation

3y

Welcome to the club

Vivek Dabas

Associate(Technology) at Goldman Sachs | NSIT'18

3y

Apache Flink overcomes these disadvantages of MapReduce.
