MapReduce and Iteration

Success can be a curse. The MapReduce framework has become a popular way to process large amounts of data in batch mode. Because of this popularity, it has been pressed into service on problems beyond its initial goal. This article discusses the challenges of iterative computation on the initial implementation of MapReduce, followed by a root cause analysis and mitigation strategies.

Iterative Computation Challenges with the MapReduce Framework

The initial implementation of MapReduce was intended to perform single-pass batch operations (Sakr & Gaber, 2014). For example, when processing a web server's usage log, each record in the log is typically addressed once. The idea of MapReduce is relatively simple: create a system that manages the set of nodes, so that the developer of an application need not be concerned with that detail, and implement a pattern where the developer writes essentially two functions, one invoked with the raw data and another invoked to combine the results of the previous step. It was very successful in that regard. That success also meant that, as it grew in popularity, users of the system wanted to apply it to problems outside of the original vision.
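The two-function pattern can be sketched in plain Python. This is a conceptual illustration only, not an actual MapReduce runtime; the real framework performs the partitioning, shuffling, and fault tolerance across nodes that the in-memory `run_mapreduce` helper below merely simulates:

```python
from collections import defaultdict

def map_fn(record):
    # Invoked once per raw input record; emits (key, value) pairs.
    for word in record.split():
        yield (word, 1)

def reduce_fn(key, values):
    # Invoked once per key, with all values for that key grouped together.
    return (key, sum(values))

def run_mapreduce(records, map_fn, reduce_fn):
    # "Shuffle" phase: group mapped values by key. On a real cluster the
    # framework does this across nodes; here it is a simple dictionary.
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    return dict(reduce_fn(k, vs) for k, vs in groups.items())

counts = run_mapreduce(["the quick fox", "the lazy dog"], map_fn, reduce_fn)
# counts == {"the": 2, "quick": 1, "fox": 1, "lazy": 1, "dog": 1}
```

The developer only supplies `map_fn` and `reduce_fn`; everything else is the framework's concern, which is precisely what made the model so approachable.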

Root Cause

The fundamental issue is that once a map function has completed its operation, its output is not persisted as reusable state; it is essentially discarded. The lineage of the values that arrive at a reducer is unknown. A secondary cause is that, at the time, there did not exist a Big Data platform well suited to iterative programming on large data sets or to fast processing of individual data items as they arrived (streams).

Mitigation Strategies

One approach to this problem was to create a series of MapReduce invocations, essentially chaining the results of one pass to the next (Malewicz et al., 2010). Because this approach is computationally expensive, Google developed Pregel to perform iterative processing on graph data (such as when applying the PageRank algorithm). The idea was to avoid the communication and serialization overhead associated with producing results, saving them to an intermediate location, and then passing them back to another MapReduce job. Pregel utilized the Google File System and fit into the Google ecosystem.
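The chained-invocation approach that Pregel was designed to improve upon can be sketched as a driver loop that feeds each pass's reduce output back in as the next pass's map input. The example below is a simplified PageRank-style iteration; the tiny graph, the damping factor, and the in-memory `mapreduce_pass` helper are illustrative assumptions, and on a real cluster each pass would serialize its results to distributed storage and read them back, which is exactly the overhead being discussed:

```python
def mapreduce_pass(ranks, links, damping=0.85):
    # "Map": each page distributes its current rank evenly to its targets.
    contributions = {}
    for page, rank in ranks.items():
        share = rank / len(links[page])
        for target in links[page]:
            contributions.setdefault(target, []).append(share)
    # "Reduce": sum the contributions per page and apply damping.
    return {page: (1 - damping) + damping * sum(contributions.get(page, []))
            for page in ranks}

links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}   # toy link graph
ranks = {page: 1.0 for page in links}
for _ in range(20):          # one full MapReduce job per iteration
    ranks = mapreduce_pass(ranks, links)
```

Each trip around the loop corresponds to launching a complete MapReduce job, which is why iterating this way dozens of times becomes so expensive.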

Another approach to addressing the challenges of iterative programming in the MapReduce framework was the use of an alternative technology, such as Spark (Zaharia et al., 2012). Spark can run on top of Hadoop, as well as in other configurations. Spark utilizes Resilient Distributed Datasets (RDDs) as a way of storing data distributed across a cluster. An RDD keeps track of the operations performed on the data, such as mapping, filtering, and grouping. Eventually, the contents of the RDD are persisted to a storage system, such as the Hadoop Distributed File System (HDFS). The key concept is that the recorded transformations, along with the original data, remain available to a node performing processing. This design approach greatly improves iterative processing capabilities. Through the use of micro-batches, Spark can also be utilized to process data from a stream: the RDD remains in memory (for performance reasons) and is processed in short batches.
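The lineage idea can be sketched without a Spark installation. The toy class below records transformations lazily and only computes when a result is requested, which is, in highly simplified form, how an RDD can rebuild a lost partition from its lineage; the class and method names are illustrative and are not the actual Spark API:

```python
class TinyRDD:
    """Toy stand-in for an RDD: holds immutable source data plus a
    lineage of recorded (not yet executed) transformations."""

    def __init__(self, data, lineage=()):
        self.data = data          # original input, never mutated
        self.lineage = lineage    # transformations recorded so far

    def map(self, fn):
        # Transformations are lazy: they only extend the lineage.
        return TinyRDD(self.data, self.lineage + (("map", fn),))

    def filter(self, pred):
        return TinyRDD(self.data, self.lineage + (("filter", pred),))

    def collect(self):
        # An action: only now is the lineage executed. A lost result can
        # always be recomputed from (data, lineage) alone.
        result = list(self.data)
        for kind, fn in self.lineage:
            if kind == "map":
                result = [fn(x) for x in result]
            else:
                result = [x for x in result if fn(x)]
        return result

rdd = TinyRDD(range(10)).map(lambda x: x * x).filter(lambda x: x % 2 == 0)
# Nothing has been computed yet; the lineage holds two pending steps.
evens = rdd.collect()  # [0, 4, 16, 36, 64]
```

Because the lineage and source data are all that is needed to reproduce any derived result, iterative algorithms can keep intermediate data in memory and still recover from node failures without checkpointing every pass.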

A further mitigation strategy was to extend MapReduce itself to handle unconventional workloads, such as iterative programs. The motivation for this movement was that organizations had invested heavily in Hadoop and stored large amounts of company data in it (Vavilapalli et al., 2013). Developers attempted to utilize Hadoop in creative ways; for example, they would submit "map only" jobs to the cluster simply to obtain computational resources. To address the limitations of the original Hadoop implementation, YARN was developed and released. Since its release, it has served as the foundation for Giraph, Spark, Storm, and a host of other applications.

Conclusion

Selecting the appropriate tool for a problem is a critical skill. If a carpenter attempted to build a house using only a handsaw, most would not think him an innovative carpenter. Big Data architects must likewise select the appropriate tools for a given problem. MapReduce is an excellent way to process large amounts of data in a single pass; if iteration is required, it may not be the best choice.

References

Malewicz, G., Austern, M. H., Bik, A. J. C., Dehnert, J. C., Horn, I., Leiser, N., & Czajkowski, G. (2010). Pregel: A system for large-scale graph processing. Paper presented at the Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, Indianapolis, Indiana, USA.

Sakr, S., & Gaber, M. (2014). Large scale and Big Data: Processing and management. CRC Press.

Vavilapalli, V. K., Murthy, A. C., Douglas, C., Agarwal, S., Konar, M., Evans, R., . . . Baldeschwieler, E. (2013). Apache Hadoop YARN: Yet another resource negotiator. Paper presented at the Proceedings of the 4th Annual Symposium on Cloud Computing, Santa Clara, California.

Zaharia, M., Chowdhury, M., Das, T., Dave, A., Ma, J., Mccauley, M., . . . Stoica, I. (2012). Fast and interactive analytics over Hadoop data with Spark. USENIX Login, 37(4), 45-51.

