MapReduce and Iteration
Dr. Alan Dennis
Author, Architect, Thought Leader, Certified Databricks Instructor, and Databricks Solution Architect and Developer Champion
Success can be a curse. The MapReduce framework has become a popular way to process large amounts of data in batch mode. Because of this popularity, it has been pressed into service on problems beyond its initial goal. This article discusses the challenges associated with iterative computation on the initial implementation of MapReduce, followed by a root cause analysis and mitigation strategies.
Iterative Computation Challenges with the MapReduce Framework
The initial implementation of MapReduce was intended to perform single-pass batch operations (Sakr & Gaber, 2014). For example, when processing a web server's usage log, each record in the log is typically addressed once. The idea of MapReduce is relatively simple: create a system that manages the set of nodes, so that the developer of an application need not be concerned with that detail, and implement a pattern in which the developer writes essentially two functions, one invoked with the raw data and another invoked to combine the results of the previous step. It was very successful in that regard. That success also meant that, as MapReduce grew in popularity, users wanted to apply it to problems outside of the original vision.
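To make the two-function pattern concrete, consider the canonical word-count example. The following is a minimal pure-Python sketch; the in-process driver merely stands in for the framework's distributed shuffle phase and is not the original MapReduce API.

    from collections import defaultdict

    # Map: invoked once per raw record; emits intermediate (key, value) pairs.
    def map_fn(record):
        for word in record.split():
            yield (word, 1)

    # Reduce: invoked once per key with every value emitted for that key.
    def reduce_fn(key, values):
        return (key, sum(values))

    # A tiny in-process driver standing in for the framework's shuffle phase.
    records = ["the quick brown fox", "the lazy dog"]
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)

    results = [reduce_fn(key, values) for key, values in groups.items()]
    print(results)  # [('the', 2), ('quick', 1), ('brown', 1), ...]

In a real deployment, the framework partitions the records across nodes, runs the map function on each partition, groups the intermediate pairs by key across the network, and runs the reduce function on each group.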
Root Cause
The fundamental issue is that once a map function has completed its operation, there is no persisted state of its output; the intermediate values are essentially discarded after the reduce phase consumes them. The lineage of values that arrive at a reducer is unknown. A secondary cause is that, at the time, no Big Data platform existed that was well suited to iterative programming on large data sets or to fast processing of individual data items as they arrived (streams).
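The consequence can be sketched with a hypothetical driver loop: to iterate on classic MapReduce, every pass must round-trip its entire state through the distributed file system, because nothing survives in memory between jobs. The run_mapreduce_job function and the HDFS paths below are illustrative placeholders, not a real API.

    # Hypothetical driver loop for iteration on classic MapReduce.
    def run_mapreduce_job(input_path, output_path):
        ...  # submit a job; its output lands on the distributed file system

    current = "hdfs:///pagerank/iter_0"
    for i in range(10):
        next_path = f"hdfs:///pagerank/iter_{i + 1}"
        # Each pass re-reads its entire input from disk and re-writes its
        # entire output to disk; no state survives in memory between jobs.
        run_mapreduce_job(input_path=current, output_path=next_path)
        current = next_path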
Mitigation Strategies
One approach to solving this problem was to create a series of MapReduce invocations, essentially chaining the results of one pass to the next (Malewicz et al., 2010). Because this approach is relatively computationally expensive, Google developed Pregel to perform iterative processing on graph data (such as when applying the PageRank algorithm). The idea was to avoid the communication and serialization overhead associated with producing results, saving them to an intermediate location, and then passing them back to another MapReduce job. Pregel utilized the Google File System and fit into the Google ecosystem.
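The flavor of Pregel's vertex-centric model can be conveyed with a small, self-contained simulation (an illustrative Python rendering of the structure, not Pregel's actual C++ API). In each superstep, every vertex receives the messages sent to it in the previous superstep, updates its own value, and sends new messages along its outgoing edges; the graph state stays resident in memory across supersteps.

    # A self-contained simulation of Pregel-style supersteps computing
    # PageRank on a toy graph.
    DAMPING = 0.85
    graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}  # vertex -> out-edges
    rank = {v: 1.0 / len(graph) for v in graph}        # initial vertex values

    for superstep in range(30):
        # Message phase: each vertex sends rank/out-degree along its edges.
        inbox = {v: [] for v in graph}
        for v, out_edges in graph.items():
            share = rank[v] / len(out_edges)
            for neighbor in out_edges:
                inbox[neighbor].append(share)
        # Compute phase: each vertex updates its value from its messages.
        rank = {v: (1 - DAMPING) / len(graph) + DAMPING * sum(messages)
                for v, messages in inbox.items()}

    print(rank)  # converges with c ranked highest, then a, then b

Because the graph and its ranks remain in memory between supersteps, no intermediate results need to be serialized to storage, which is precisely the overhead the chained-jobs approach incurs.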
Another approach to addressing the challenges of iterative programming in the MapReduce framework was the use of an alternative technology, such as Spark (Zaharia et al., 2012). Spark can run on top of Hadoop, as well as in other configurations. Spark utilizes Resilient Distributed Datasets (RDDs) as a way of storing data distributed across a cluster. RDDs keep track of the operations performed on the data, such as mapping, filtering, and grouping. Eventually, the contents of an RDD are persisted to a storage system, such as the Hadoop Distributed File System (HDFS). The key concept is that this lineage of transformations, along with the original data, persists and is available to any node performing processing. This design greatly improves iterative processing capability. Through the use of micro-batches, Spark can also be utilized to process data from a stream: the RDD remains in memory (for performance reasons) and is processed in short batches.
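A minimal PySpark sketch of the benefit follows (assuming a local Spark installation; the data here is synthetic): the RDD is cached once, and each pass of an iterative gradient descent reuses it from memory rather than re-reading it from storage.

    from pyspark import SparkContext

    sc = SparkContext("local[*]", "IterativeRDDSketch")

    # Synthetic (x, y) points for y = 2x + 1; real data would come from HDFS.
    points = sc.parallelize(
        [(float(x), 2.0 * x + 1.0) for x in range(10)]).cache()
    n = points.count()

    # Iterative gradient descent for y ≈ w*x + b. Each pass reuses the
    # cached RDD from memory instead of re-reading its input from storage.
    w, b = 0.0, 0.0
    lr = 0.02
    for _ in range(200):
        grad_w, grad_b = points.map(
            lambda p: ((w * p[0] + b - p[1]) * p[0], w * p[0] + b - p[1])
        ).reduce(lambda u, v: (u[0] + v[0], u[1] + v[1]))
        w -= lr * grad_w / n
        b -= lr * grad_b / n

    print(w, b)  # moves toward (2.0, 1.0) as iterations increase
    sc.stop()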
A third mitigation strategy was to extend MapReduce itself to handle unconventional workloads, such as iterative programs. The motivation for this effort was that organizations had invested heavily in Hadoop and stored large amounts of company data in it (Vavilapalli et al., 2013). Developers attempted to utilize Hadoop in interesting ways; for example, they would submit "map only" jobs to the cluster simply to claim its computational resources. To address the limitations of the original Hadoop implementation, YARN was developed and released. Since its release, it has served as the foundation for Giraph, Spark, Storm, and a host of other applications.
Conclusion
Selecting the appropriate tool for a problem is a critical skill. If a carpenter attempted to build a house using only a handsaw, most observers would not consider him an innovative carpenter. Big Data architects must likewise select the appropriate tools for a given problem. MapReduce is an excellent way to process large amounts of data in a single pass; if iteration is required, it may not be the best choice.
References
Malewicz, G., Austern, M. H., Bik, A. J. C., Dehnert, J. C., Horn, I., Leiser, N., & Czajkowski, G. (2010). Pregel: A system for large-scale graph processing. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, Indianapolis, Indiana, USA.
Sakr, S., & Gaber, M. (2014). Large scale and big data: Processing and management. CRC Press.
Vavilapalli, V. K., Murthy, A. C., Douglas, C., Agarwal, S., Konar, M., Evans, R., . . . Baldeschwieler, E. (2013). Apache Hadoop YARN: Yet another resource negotiator. In Proceedings of the 4th Annual Symposium on Cloud Computing, Santa Clara, California.
Zaharia, M., Chowdhury, M., Das, T., Dave, A., Ma, J., McCauley, M., . . . Stoica, I. (2012). Fast and interactive analytics over Hadoop data with Spark. USENIX ;login:, 37(4), 45-51.