Intuitive Explanation of "MapReduce"
How many unique words are there in this sentence which you are reading? The answer is 12 (note: the word ‘are’ appears twice), which was rather easy to work out: you simply counted the words in the given sentence. Taking this to the next level, if I ask you to count all the words and their frequencies in this entire post, you’ll have to spend more time and effort doing the same work as before, which might look something like this:
- For each new word, record a frequency of 1 (the Map part)
- For each word that has been seen before, add 1 to its existing value (the Map part)
- At the end of the post, sum up the values for each word and give the final result (the Reduce part)
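The steps above can be sketched in a few lines of Python. The sample text here is just a stand-in, but the two phases match the list: the map phase emits a 1 for every word, and the reduce phase sums those 1s per word.

```python
# A minimal sketch of the map-then-reduce counting described above.
from collections import defaultdict

text = "how many unique words are there in this sentence which you are reading"

# Map: emit a count of 1 for every word we see.
mapped = [(word, 1) for word in text.split()]

# Reduce: sum the 1s for each word to get its frequency.
counts = defaultdict(int)
for word, one in mapped:
    counts[word] += one

print(counts["are"])  # 'are' appears twice -> 2
```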
While this sounds tedious, it is still achievable within a reasonable time frame.
Taking this to the next level, if I ask you to count all the words and their respective frequencies on the website www.analyticsbot.ml, and you give me the correct answer within 30 minutes, I will give you a gift. :) You’ll be tempted at first, but there is no way you can do that manually, all by yourself, within the given time frame.
If you are one of the clever folks, you might follow the approach below and hand back the result well within the time limit.
- Get the total number of pages on this website and their URLs. Let’s say there are ‘n’ pages.
- You ask ‘n’ of your friends or relatives to each take up one page and report the result to you, upon which you report the result to me.
- One thing that can make the process fast is giving all occurrences of the same keyword to the same person, so each word is totalled in one place.
- Now each of those ‘n’ friends does the same counting work we discussed before and reports to you. You, being the manager, report the final calculations back to me.
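The friends-as-workers scheme can be simulated in a few lines. The page contents and the number of workers below are made up for illustration; each `Counter` plays the role of one friend counting one page, and the merge at the end is you, the manager.

```python
# A toy simulation of the distributed counting scheme above.
from collections import Counter

# Made-up page contents standing in for the 'n' pages of the website.
pages = [
    "big data is big",
    "map reduce splits big work",
    "workers report counts back",
]

# Each "friend" counts one page independently (the parallel part).
partial_counts = [Counter(page.split()) for page in pages]

# The "manager" merges the partial results into the final answer.
total = Counter()
for partial in partial_counts:
    total += partial

print(total["big"])  # counted on two different pages -> 3
```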
Sound interesting?
You must have observed that the process followed in both cases was the same. Now if I ask you to count words and their frequencies across multiple websites, you can easily follow the same approach. Well, almost: the only change you might have to make is to increase or decrease the number of friends/relatives depending on the scale of data you are dealing with. The only problem is that humans don’t want to do this boring, mundane work. Instead we rely on machines to do it for us. Enter Apache Hadoop MapReduce, the savior.
Let’s take an example of how the mappers and reducers work in the word count setting we discussed above. Suppose we want to count the frequency of words in "Deer Bear River Car Car River Deer Car Bear". The first step is to split the data across different nodes, on which mappers emit <word, 1> for every word. The next step is to shuffle pairs with the same key to the same reducer, which sums up the counts and sends back the final results.
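The map, shuffle, and reduce steps for that exact line can be simulated in plain Python. This is only a single-process sketch: real Hadoop runs the mappers and reducers on separate nodes and performs the shuffle for you, but the data flow is the same.

```python
# A sketch of the map -> shuffle -> reduce pipeline from the example.
from itertools import groupby

line = "Deer Bear River Car Car River Deer Car Bear"

# Map: emit <word, 1> pairs for every word.
pairs = [(word, 1) for word in line.split()]

# Shuffle: sort so all pairs with the same key sit together,
# which is what routes identical words to the same reducer.
pairs.sort(key=lambda kv: kv[0])

# Reduce: each reducer sums the counts for one key.
counts = {word: sum(n for _, n in group)
          for word, group in groupby(pairs, key=lambda kv: kv[0])}

print(counts)  # {'Bear': 2, 'Car': 3, 'Deer': 2, 'River': 2}
```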
Note: This is a part of my earlier post on LinkedIn (some users appreciated the explanation). Also published on my blog analyticsbot.ml