Understanding MapReduce: A Quick Primer
Nathaniel Payne, PhD (裴内森)
CEO @ CLD | CTO & Managing Partner @ Dygital9 | Managing Partner @ NOQii & Contivos | Associate Partner @ Btriples | Global Connector | Entrepreneur | PhD (AI)
Over the last few months, I have had numerous conversations with individuals covering distributed computing, distributed storage, cluster computing, and of course, the cloud. In my work at Cardinal Path, for example, distributed computing is going to be a big focus as we continue to scale our infrastructure to tackle hard applied problems in business. It also happens to be a key focus in my concluding Advanced Operating Systems coursework at Georgia Tech. One term that surfaces frequently in these discussions is MapReduce. Many individuals I talk to have a conceptual understanding of what software frameworks like Apache Hadoop, Apache Cassandra, or even Apache Spark offer. That said, there seems to be a continued gap in understanding of the fundamental paradigms that drive these types of software frameworks, MapReduce being one of them. Thus, in an effort to clear up some ambiguity, this post discusses key issues relating to the MapReduce framework, including definitions, benefits, and uses. The post also provides a tactical example to illustrate this valuable framework in action.
While preparing to write this post, I reviewed a fantastic paper written in 2004 by Jeffrey Dean and Sanjay Ghemawat at Google. This paper offers tremendous insight into the concept of MapReduce. While it is a decade old, their work, entitled "MapReduce: Simplified Data Processing on Large Clusters", was in fact shepherded by Dr. Eric Brewer, an individual whom I previously mentioned in my post discussing Giant-Scale architecture.
Setting The Stage: What is MapReduce?
As noted by Jeffrey and Sanjay, the two individuals to whom credit for this content is owed, MapReduce is a programming model and an associated implementation for processing and generating large data sets. In this paradigm, users specify a Map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a Reduce function that merges all intermediate values associated with the same intermediate key.
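To make that contract concrete, the paper expresses the types as map (k1, v1) -> list(k2, v2) and reduce (k2, list(v2)) -> list(v2). The short Python type sketch below is my own restatement of that shape, purely for illustration; it is not part of the paper, whose implementation is in C++.

# The conceptual shape of the two user-supplied MapReduce functions,
# written as Python type aliases purely for illustration.
from typing import Callable, Iterable, List, Tuple, TypeVar

K1 = TypeVar("K1")  # input key type
V1 = TypeVar("V1")  # input value type
K2 = TypeVar("K2")  # intermediate key type
V2 = TypeVar("V2")  # intermediate (and output) value type

# map: (k1, v1) -> list(k2, v2)
MapFn = Callable[[K1, V1], Iterable[Tuple[K2, V2]]]

# reduce: (k2, list(v2)) -> list(v2), typically zero or one value
ReduceFn = Callable[[K2, List[V2]], Iterable[V2]]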
Where is MapReduce Used at Google?
One of the common questions that many individuals send my way relates to understanding where MapReduce is used in practice. In particular, with my ongoing work on the Google Cloud Platform and Google Compute Engine, people are often interested in learning where MapReduce is used at places like Google. As noted in the article, MapReduce has been used across a wide range of domains within Google, including:
- Large-scale machine learning applications
- Clustering problems for various Google products
- Extraction of data used to produce reports of popular queries
- Extraction of properties of web pages for new experiments and products (e.g. extraction of geographical locations from a large corpus of web pages for localized search)
- Large-scale graph computations
- Many, many others ...
While the use of MapReduce within a company like Google or Amazon is always evolving, one of the most interesting applications that the authors discuss is the use of MapReduce for "Large Scale Indexing". In fact, at the time of the paper, the authors note that "... one of our most significant uses of MapReduce to date has been a complete rewrite of the production indexing system that produces the data structures used for the Google web search service. The indexing system takes as input a large set of documents that have been retrieved by our crawling system. The raw contents for these documents are more than 20 terabytes of data. The indexing process runs as a sequence of five to ten MapReduce operations."
Benefits Of Using MapReduce
Now that you have a clearer understanding of what MapReduce is and where it might be used, it is important to stop and discuss strategically the benefits of using this type of framework. Distributed computing opens up opportunities for data-centric organizations that are seeking to attack the challenge of enterprise analytics. That said, this work can be costly and, particularly when speed determines success in the marketplace, must be implemented in a time-sensitive manner. Below are some strategic benefits that relate to the use of MapReduce. Of course, these benefits are relative to a system that does not use distributed computation. That said, I have personally experienced these benefits even this past week, while distributing and parallelizing the computation behind some of our most important client-focused analyses.
Benefit 1: Less Code
As the authors note, the indexing code rewritten to use MapReduce at Google was simpler, smaller, and easier to understand than the alternatives. While this is a relative benchmark, it is important because less code means potentially lower costs. Within the MapReduce framework, fault tolerance, distribution, and parallelization are hidden within the MapReduce library. This means that the size of one phase of the computation can drop from approximately 3800 lines of C++ code to approximately 700 lines when expressed using MapReduce.
Benefit 2: Fewer Days
Fewer days spent means lower costs. At Google, the engineering teams found that the performance of the MapReduce library was good enough that they could keep conceptually unrelated computations separate in processes where they might otherwise need to be tightly coupled. This made it easy to change the indexing process, for example, and is in line with good software architecture. In fact, as the authors note, one change that took a few months to make in the old indexing system took only a few days to implement in the new MapReduce-based system. This translates into freeing up engineering resources to work on more projects and provide more value to your organization. Of course, this can only occur if your team has the specialized expertise, which can be more expensive. That said, the net long-term benefit in terms of cost and productivity can be significant.
Benefit 3: Better Performance
As the Google team noted, using a paradigm like MapReduce brought many performance gains. In particular, the indexing process became much easier to operate, because most of the problems caused by machine failures, slow machines, and networking hiccups were dealt with automatically by the MapReduce library without operator intervention. In short, this translated to better overall performance on a variety of tasks, which again frees up organizational time and resources to focus on new and pressing business and operational challenges.
Benefit 4: Faster On-boarding
Depending on the programming experience of your team, using a framework like MapReduce can provide significant benefits from an on-boarding perspective. Without a doubt, the skills needed to develop a distributed computing environment are specialized and, as a result, expensive. That said, using the MapReduce model, the Google teams found it quite easy to on-board programmers with little distributed computing experience. Of course, a base knowledge of systems must be present. Still, the MapReduce framework hides many of the details of parallelization, fault tolerance, locality optimization, and load balancing, leaving new programmers free to focus on problem-specific code rather than system code. Over the last few months, through my work at Georgia Tech, I have personally had to write routines in C that perform distributed computation without using available MapReduce libraries. The challenges in terms of time and on-boarding alone were immense, to say the least!
Scalable In Both Use And Application
As was noted, MapReduce enables scalability, something that many organizations are chasing as they seek to future-proof themselves. It is something that is prioritized when building giant-scale architectures, for example, and it will be addressed in some of my future posts discussing things like Facebook's Haystack, Amazon's Dynamo, and others. At root, MapReduce is synonymous with scaling because it is designed specifically to scale to large clusters comprising thousands of machines. This makes it suitable for many large problems in workplaces everywhere. Moreover, the fact that a large variety of computational problems are easily expressible as MapReduce computations is even more relevant. To that end, while MapReduce is used for the generation of data for Google's production web search service, for sorting, for data mining, for machine learning, and for many other systems, the same functionality could easily be applied to many other problems in other organizations.
So, How Does MapReduce Really Work?
Now that we have cemented our higher-level understanding of MapReduce, it is time to quickly survey the tactics of MapReduce. This section is kept light (please see the paper for more information, as credit for this example goes to Jeffrey and Sanjay). The goal here is to enable more technically interested readers to understand, at a high level and using a text analysis example, how MapReduce really works. Within the MapReduce process, the programmer takes a set of input key/value pairs and produces a set of output key/value pairs. To do this, the programmer using the MapReduce library expresses the computation as two functions: Map and Reduce. Map, written by the user, takes an input pair and produces a set of intermediate key/value pairs. The MapReduce library then groups together all intermediate values associated with the same intermediate key and passes them to the Reduce function. The Reduce function, also written by the programmer, accepts an intermediate key as well as the set of values for that key, and merges those values together to form a possibly smaller set of values. Typically, just zero or one output value is produced per Reduce invocation. The intermediate values are supplied to the user's Reduce function via an iterator, which allows the library to handle lists of values that are too large to fit in memory.
Below is an example that the authors provide, which shows how the Map and Reduce functions can be used to count the number of occurrences of each word in a large collection of text.
map(String key, String value):
    // key: document name
    // value: document contents
    for each word w in value:
        EmitIntermediate(w, "1");

reduce(String key, Iterator values):
    // key: a word
    // values: a list of counts
    int result = 0;
    for each v in values:
        result += ParseInt(v);
    Emit(AsString(result));
In this example, the Map function emits each word plus an associated count of occurrences (just "1" in this simple example). The Reduce function then sums together all counts emitted for a particular word. In addition, the programmer writes code to fill in a MapReduce specification object with the names of the input and output files, as well as optional tuning parameters. The programmer then invokes the MapReduce function, passing it the specification object. The programmer's code is linked together with the MapReduce library (implemented in C++).
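For readers who would like to try the word-count example locally, below is a minimal, self-contained Python sketch of the same logic. The shuffle and grouping work that the real MapReduce library distributes across many machines is simulated here in memory, and the function names are my own, not the paper's C++ API.

from collections import defaultdict

def map_fn(document_name, contents):
    # Emit an intermediate (word, 1) pair for every word in the document.
    for word in contents.split():
        yield word, 1

def reduce_fn(word, counts):
    # Sum all counts emitted for a particular word.
    yield word, sum(counts)

def word_count(documents):
    # Simulate the library: run every map task, group intermediate values
    # by key, then invoke reduce once per intermediate key.
    intermediate = defaultdict(list)
    for name, contents in documents.items():
        for key, value in map_fn(name, contents):
            intermediate[key].append(value)
    output = {}
    for key in sorted(intermediate):
        for word, total in reduce_fn(key, intermediate[key]):
            output[word] = total
    return output

docs = {"doc1": "the quick brown fox", "doc2": "the lazy dog jumps over the fox"}
print(word_count(docs))  # e.g. {'brown': 1, 'dog': 1, 'fox': 2, ...}

In the real system, of course, the intermediate pairs would be partitioned across many reduce workers and written to distributed storage rather than collected in a single in-memory dictionary.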
How Did The Authors Specifically Execute MapReduce At Google?
Below are a few bullets that outline how the authors set up and executed MapReduce at Google. I felt it important to include these because they help those thinking of setting up similar processes on a smaller scale understand the general approach and framework that will lead towards success:
- Machines involved are typically dual-processor x86 processors running Linux, with 2-4 GB of memory per machine.
- Commodity networking hardware is used, typically either 100 megabits/second or 1 gigabit/second at the machine level, but averaging considerably less in overall bisection bandwidth.
- A cluster consists of hundreds or thousands of machines, and machine failures are therefore common. Of course, one could use fewer machines in principle; that said, this is Google's use case.
- Storage is provided by inexpensive IDE disks attached directly to individual machines. A distributed file system developed in-house at Google is used to manage the data stored on these disks. The file system uses replication to provide availability and reliability on top of unreliable hardware.
- Users submit jobs to a scheduling system. Each job consists of a set of tasks, and is mapped by the scheduler to a set of available machines within a cluster.
In this architecture, the Map invocations are distributed across multiple machines by automatically partitioning the input data into a set of M splits. The input splits can be processed in parallel by different machines. Reduce invocations are distributed by partitioning the intermediate key space into R pieces using a partitioning function (e.g., hash(key) mod R). The number of partitions (R) and the partitioning function are specified by the user.
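As a small illustration of that partitioning step, the Python sketch below (my own illustration, not the library's code) shows how each intermediate key can be assigned to one of R reduce partitions with a deterministic hash:

# Illustrative only: assign intermediate keys to one of R reduce partitions
# using hash(key) mod R. A deterministic hash (CRC32 here) is used because
# Python's built-in hash() is randomized between processes.
import zlib

def partition(intermediate_key, R):
    return zlib.crc32(intermediate_key.encode("utf-8")) % R

R = 4  # number of reduce partitions, chosen by the user
for word in ["the", "quick", "brown", "fox", "lazy", "dog"]:
    print(word, "-> reduce partition", partition(word, R))

All intermediate pairs that hash to the same partition end up at the same reduce worker, which is what allows Reduce to see every value for a given key together.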
Performance Evaluation And Final Thoughts
The paper includes so much detail that discussing it, as well as the work that has evolved from it, could spawn multiple additional blog posts. That said, the goal here was simply to articulate, from a higher level, some fundamental insights into MapReduce and to resolve ambiguity. Importantly, the paper closes with the authors running a number of major experiments, as well as one benchmark test that examines machine failure. Reviewing these experiments is very useful as a validation step, particularly if one wishes to understand how MapReduce jobs perform comparatively. At a high level, the authors' experiments demonstrate MapReduce's ability to search through large amounts of data while looking for a particular pattern, as well as its ability to sort large amounts of data. Finally, in the third test, the authors show an execution of the sort program where they intentionally kill 200 out of 1746 worker processes several minutes into the computation. The underlying cluster scheduler immediately started new worker processes on these machines (since only the processes were killed, the machines were still functioning properly), thus demonstrating the failure tolerance of MapReduce.
Without a doubt, MapReduce and other related distributed computing frameworks and paradigms continue to evolve at a fast pace. Moreover, distributed computing is a hot topic right now that will continue to gain steam. In fact, this year, Google Code Jam "announced the introduction of a brand new track within the Code Jam competition called Distributed Code Jam. This section is designed to test participants' knowledge of distributed coding, latency reduction abilities, and of course, algorithmic coding skills. Winners will have a chance to attend the finals in Seattle, Washington and potentially win both Code Jam and Distributed Code Jam" (which would be an incredible feat!). If you are ever looking to chat about these topics, share ideas, or talk about your own implementation experiences, please don't hesitate to reach out. 2015 will be the year I focus deeply on distributed computing in the cloud, and I look forward to working with many peers in industry who are each trying to tackle tough engineering and data science / machine learning problems whose solutions have a chance to change the face of nearly every industry imaginable!