MapReduce and Iteration
Dr. Alan Dennis
Author, Architect, Thought Leader, Certified Databricks Instructor, and Databricks Solution Architect and Developer Champion
Success can be a curse. The MapReduce framework has become a popular way to process large amounts of data in batch mode. Because of this popularity, it has been pressed into service on problems beyond its initial goal. This article discusses the challenges associated with iterative computation on the initial implementation of MapReduce, followed by a root cause analysis and mitigation strategies.
Iterative Computation Challenges with the MapReduce Framework
The initial implementation of MapReduce was intended to perform single-pass batch operations (Sakr & Gaber, 2014). For example, when processing a web server's usage log, each record in the log is typically addressed once. The idea of MapReduce is relatively simple: create a system that manages the set of nodes, so that the developer of an application need not be concerned with that detail, and implement a pattern in which the developer writes essentially two functions, one invoked with the raw data and another invoked to combine the results of the previous step. It was very successful in that regard. That success also meant that, as MapReduce grew in popularity, users wanted to apply it to problems outside of the original vision.
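To make the two-function pattern concrete, consider the canonical word-count example. The following is a minimal pure-Python sketch; the in-process driver merely stands in for the framework's distributed shuffle phase and is not the original MapReduce API.

    from collections import defaultdict

    # Map: invoked once per raw record; emits intermediate (key, value) pairs.
    def map_fn(record):
        for word in record.split():
            yield (word, 1)

    # Reduce: invoked once per key with every value emitted for that key.
    def reduce_fn(key, values):
        return (key, sum(values))

    # A tiny in-process driver standing in for the framework's shuffle phase.
    records = ["the quick brown fox", "the lazy dog"]
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)

    results = [reduce_fn(key, values) for key, values in groups.items()]
    print(results)  # [('the', 2), ('quick', 1), ('brown', 1), ...]

In a real deployment, the framework partitions the records across nodes, runs the map function on each partition, groups the intermediate pairs by key across the network, and runs the reduce function on each group.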
Root Cause
The fundamental issue is that once a map function has completed its operation, there is no persisted state of its output; the intermediate values are essentially discarded after the reduce phase consumes them. The lineage of values that arrive at a reducer is unknown. A secondary cause is that, at the time, no Big Data platform existed that was well suited to iterative programming on large data sets or to fast processing of individual data items as they arrived (streams).
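The consequence can be sketched with a hypothetical driver loop: to iterate on classic MapReduce, every pass must round-trip its entire state through the distributed file system, because nothing survives in memory between jobs. The run_mapreduce_job function and the HDFS paths below are illustrative placeholders, not a real API.

    # Hypothetical driver loop for iteration on classic MapReduce.
    def run_mapreduce_job(input_path, output_path):
        ...  # submit a job; its output lands on the distributed file system

    current = "hdfs:///pagerank/iter_0"
    for i in range(10):
        next_path = f"hdfs:///pagerank/iter_{i + 1}"
        # Each pass re-reads its entire input from disk and re-writes its
        # entire output to disk; no state survives in memory between jobs.
        run_mapreduce_job(input_path=current, output_path=next_path)
        current = next_path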
Mitigation Strategies
One approach to solving this problem was to create a series of MapReduce invocations, essentially chaining the results of one pass to the next (Malewicz et al., 2010). Because this approach is relatively computationally expensive, Google developed Pregel to perform iterative processing on graph data (such as when applying the PageRank algorithm). The idea was to avoid the communication and serialization overhead associated with producing results, saving them to an intermediate location, and then passing them back to another MapReduce job. Pregel utilized the Google File System and fit into the Google ecosystem.
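The flavor of Pregel's vertex-centric model can be conveyed with a small, self-contained simulation (an illustrative Python rendering of the structure, not Pregel's actual C++ API). In each superstep, every vertex receives the messages sent to it in the previous superstep, updates its own value, and sends new messages along its outgoing edges; the graph state stays resident in memory across supersteps.

    # A self-contained simulation of Pregel-style supersteps computing
    # PageRank on a toy graph.
    DAMPING = 0.85
    graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}  # vertex -> out-edges
    rank = {v: 1.0 / len(graph) for v in graph}        # initial vertex values

    for superstep in range(30):
        # Message phase: each vertex sends rank/out-degree along its edges.
        inbox = {v: [] for v in graph}
        for v, out_edges in graph.items():
            share = rank[v] / len(out_edges)
            for neighbor in out_edges:
                inbox[neighbor].append(share)
        # Compute phase: each vertex updates its value from its messages.
        rank = {v: (1 - DAMPING) / len(graph) + DAMPING * sum(messages)
                for v, messages in inbox.items()}

    print(rank)  # converges with c ranked highest, then a, then b

Because the graph and its ranks remain in memory between supersteps, no intermediate results need to be serialized to storage, which is precisely the overhead the chained-jobs approach incurs.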
Another approach to addressing the challenges of iterative programming in the MapReduce framework was the use of an alternative technology, such as Spark (Zaharia et al., 2012). Spark can run on top of Hadoop, as well as in other configurations. Spark utilizes Resilient Distributed Datasets (RDDs) as a way of storing data distributed across a cluster. RDDs keep track of the operations performed on the data, such as mapping, filtering, and grouping. Eventually, the contents of an RDD are persisted to a storage system, such as the Hadoop Distributed File System (HDFS). The key concept is that this lineage of transformations, along with the original data, persists and is available to any node performing processing. This design greatly improves iterative processing capability. Through the use of micro-batches, Spark can also be utilized to process data from a stream: the RDD remains in memory (for performance reasons) and is processed in short batches.
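A minimal PySpark sketch of the benefit follows (assuming a local Spark installation; the data here is synthetic): the RDD is cached once, and each pass of an iterative gradient descent reuses it from memory rather than re-reading it from storage.

    from pyspark import SparkContext

    sc = SparkContext("local[*]", "IterativeRDDSketch")

    # Synthetic (x, y) points for y = 2x + 1; real data would come from HDFS.
    points = sc.parallelize(
        [(float(x), 2.0 * x + 1.0) for x in range(10)]).cache()
    n = points.count()

    # Iterative gradient descent for y ≈ w*x + b. Each pass reuses the
    # cached RDD from memory instead of re-reading its input from storage.
    w, b = 0.0, 0.0
    lr = 0.02
    for _ in range(200):
        grad_w, grad_b = points.map(
            lambda p: ((w * p[0] + b - p[1]) * p[0], w * p[0] + b - p[1])
        ).reduce(lambda u, v: (u[0] + v[0], u[1] + v[1]))
        w -= lr * grad_w / n
        b -= lr * grad_b / n

    print(w, b)  # moves toward (2.0, 1.0) as iterations increase
    sc.stop()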
A third mitigation strategy was to extend MapReduce itself to handle unconventional workloads, such as iterative programs. The motivation for this effort was that organizations had invested heavily in Hadoop and stored large amounts of company data in it (Vavilapalli et al., 2013). Developers attempted to utilize Hadoop in interesting ways; for example, they would submit "map only" jobs to the cluster simply to claim its computational resources. To address the limitations of the original Hadoop implementation, YARN was developed and released. Since its release, it has served as the foundation for Giraph, Spark, Storm, and a host of other applications.
Conclusion
Selecting the appropriate tool for a problem is a critical skill. If a carpenter attempted to build a house using only a handsaw, most observers would not consider him an innovative carpenter. Big Data architects must likewise select the appropriate tools for a given problem. MapReduce is an excellent way to process large amounts of data in a single pass; if iteration is required, it may not be the best choice.
References
Malewicz, G., Austern, M. H., Bik, A. J. C., Dehnert, J. C., Horn, I., Leiser, N., & Czajkowski, G. (2010). Pregel: A system for large-scale graph processing. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, Indianapolis, Indiana, USA.
Sakr, S., & Gaber, M. (2014). Large scale and big data: Processing and management. CRC Press.
Vavilapalli, V. K., Murthy, A. C., Douglas, C., Agarwal, S., Konar, M., Evans, R., . . . Baldeschwieler, E. (2013). Apache Hadoop YARN: Yet another resource negotiator. In Proceedings of the 4th Annual Symposium on Cloud Computing, Santa Clara, California.
Zaharia, M., Chowdhury, M., Das, T., Dave, A., Ma, J., McCauley, M., . . . Stoica, I. (2012). Fast and interactive analytics over Hadoop data with Spark. USENIX ;login:, 37(4), 45-51.