Intuitive Explanation of "MapReduce"
How many unique words are there in this sentence which you are reading? The answer is 12 (note: the word ‘are’ appears twice), which was rather easy to work out: you simply counted the words in the given sentence. Taking this to the next level, if I ask you to count all the words and their frequencies in this entire post, you’ll have to spend more time and effort doing the same work as before, which might look something like this:
- For each new word, record a frequency of 1 (the Map part)
- For each word that has been seen before, add 1 to its existing value (the Map part)
- At the end of the post, sum up the values for each word and give the final result (the Reduce part)
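The steps above can be sketched in a few lines of Python. The sample text here is just a stand-in, but the two phases match the list: the map phase emits a 1 for every word, and the reduce phase sums those 1s per word.

```python
# A minimal sketch of the map-then-reduce counting described above.
from collections import defaultdict

text = "how many unique words are there in this sentence which you are reading"

# Map: emit a count of 1 for every word we see.
mapped = [(word, 1) for word in text.split()]

# Reduce: sum the 1s for each word to get its frequency.
counts = defaultdict(int)
for word, one in mapped:
    counts[word] += one

print(counts["are"])  # 'are' appears twice -> 2
```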
While this sounds tedious, it is still achievable within a reasonable time frame.
Taking this to the next level, if I ask you to count all the words and their respective frequencies on the website www.analyticsbot.ml, and you give me the correct answer within 30 minutes, I will give you a gift. :) You’ll be tempted at first, but there is no way you can do that manually, all by yourself, within the given time frame.
If you are one of the clever folks, you might follow the approach below and hand back the result well within the time limit.
- Get the total number of pages on this website and their URLs. Let’s say there are ‘n’ pages.
- You ask ‘n’ of your friends or relatives to each take up one page and report the result to you, upon which you report the result to me.
- One thing that can make the process fast is giving all occurrences of the same keyword to the same person, so each word is totalled in one place.
- Now each of those ‘n’ friends does the same counting work we discussed before and reports to you. You, being the manager, report the final calculations back to me.
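The friends-as-workers scheme can be simulated in a few lines. The page contents and the number of workers below are made up for illustration; each `Counter` plays the role of one friend counting one page, and the merge at the end is you, the manager.

```python
# A toy simulation of the distributed counting scheme above.
from collections import Counter

# Made-up page contents standing in for the 'n' pages of the website.
pages = [
    "big data is big",
    "map reduce splits big work",
    "workers report counts back",
]

# Each "friend" counts one page independently (the parallel part).
partial_counts = [Counter(page.split()) for page in pages]

# The "manager" merges the partial results into the final answer.
total = Counter()
for partial in partial_counts:
    total += partial

print(total["big"])  # counted on two different pages -> 3
```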
Sound interesting?
You must have observed that the process followed in both cases was the same. Now if I ask you to count words and their frequencies across multiple websites, you can easily follow the same approach. Well, almost: the only change you might have to make is to increase or decrease the number of friends/relatives depending on the scale of data you are dealing with. The only problem is that humans don’t want to do this boring, mundane work. Instead we rely on machines to do it for us. Enter Apache Hadoop MapReduce, the savior.
Let’s take an example of how the mappers and reducers work in the word count setting we discussed above. Suppose we want to count the frequency of words in "Deer Bear River Car Car River Deer Car Bear". The first step is to split the data across different nodes, on which mappers emit <word, 1> for every word. The next step is to shuffle pairs with the same key to the same reducer, which sums up the counts and sends back the final results.
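The map, shuffle, and reduce steps for that exact line can be simulated in plain Python. This is only a single-process sketch: real Hadoop runs the mappers and reducers on separate nodes and performs the shuffle for you, but the data flow is the same.

```python
# A sketch of the map -> shuffle -> reduce pipeline from the example.
from itertools import groupby

line = "Deer Bear River Car Car River Deer Car Bear"

# Map: emit <word, 1> pairs for every word.
pairs = [(word, 1) for word in line.split()]

# Shuffle: sort so all pairs with the same key sit together,
# which is what routes identical words to the same reducer.
pairs.sort(key=lambda kv: kv[0])

# Reduce: each reducer sums the counts for one key.
counts = {word: sum(n for _, n in group)
          for word, group in groupby(pairs, key=lambda kv: kv[0])}

print(counts)  # {'Bear': 2, 'Car': 3, 'Deer': 2, 'River': 2}
```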
Note: This is a part of my earlier post on LinkedIn (some users appreciated the explanation). Also published on my blog analyticsbot.ml