Big Data and MapReduce
Ajay Taneja
Senior Data Engineer | Generative AI Engineer at Jaguar Land Rover | Ex - Rolls-Royce | Data Engineering, Data Science, Finite Element Methods Development, Stress Analysis, Fatigue and Fracture Mechanics
What is “Big Data” and what is “Manageable Data”?
When you’re dealing with data that fits easily on a single machine, can be loaded into memory, and can be analysed in a serial fashion, then that data is “manageable”. Now imagine you’re dealing with some very large data sets: we’re not talking of hundreds of gigabytes here, but terabytes, or maybe even petabytes. At this point, one might think of the “MapReduce” programming model, which allows us to take the data, distribute it across many different machines, and run the computations in parallel.
When is the MapReduce programming model appropriate, and when isn’t it?
It's generally safe to say that your data is “big” if it's too large to fit onto one disk. For example, suppose you have a huge database comprising the technical reports and research papers of several journals, and you want to find all the reports on a particular topic, or which words appear most often across them. It might be impossible to load the text of all the journals in the world onto a single disk; the MapReduce programming model can be handy here. MapReduce splits a large job up into several smaller chunks that each fit onto one machine, and the calculations on each machine can occur simultaneously. The machines do not communicate with each other whilst performing their calculations.
Examples where MapReduce has been used:
Classic examples include building web search indexes (the problem MapReduce was originally designed for at Google), analysing large volumes of server logs, and the word-frequency counting described above.
How does MapReduce work?
MapReduce consists of two distinct tasks: Map and Reduce. In the map phase, each machine processes its own chunk of the input and emits intermediate key-value pairs; those pairs are then grouped by key (the “shuffle”), so that in the reduce phase each reducer can aggregate all the values belonging to one key. As the name MapReduce suggests, the reducer phase takes place after the mapper phase has been completed.
Figure: How does MapReduce work?
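To make the two phases concrete, here is a minimal single-machine sketch in plain Python (my own illustration, not production Hadoop code). It counts word frequencies across a few documents by running a map step, a shuffle that groups the intermediate pairs by key, and a reduce step, exactly the sequence described above, just without the cluster.

```python
from collections import defaultdict

def mapper(document):
    """Map phase: emit a (word, 1) pair for every word in the document."""
    for word in document.lower().split():
        yield word, 1

def shuffle(mapped_pairs):
    """Shuffle phase: group values by key, so each reducer receives
    every intermediate count for one word."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups.items()

def reducer(word, counts):
    """Reduce phase: aggregate the per-word counts into a total."""
    return word, sum(counts)

documents = [
    "the quick brown fox",
    "the lazy dog",
    "the quick dog",
]

# Run map over every document, shuffle, then reduce each group.
mapped = (pair for doc in documents for pair in mapper(doc))
results = dict(reducer(word, counts) for word, counts in shuffle(mapped))
print(results)  # e.g. {'the': 3, 'quick': 2, 'brown': 1, ...}
```

On a real cluster, each mapper and each reducer would run on a different machine, and the shuffle would move data across the network; the logic, however, is the same.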
Open-source implementation of the MapReduce programming model:
A very common open-source implementation of the MapReduce programming model is Hadoop. Hadoop couples the MapReduce programming model with a distributed file system (HDFS). To allow programmers to complete complicated tasks more easily using the processing power of Hadoop, many frameworks have been built on top of it.
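If you want to see what an actual Hadoop job can look like, below is a sketch of the same word count written with the third-party mrjob library. mrjob is my choice for illustration (the article itself does not prescribe it); the same job could equally be written in Java against the raw Hadoop API.

```python
# word_count.py: a word-count MapReduce job sketched with the
# third-party mrjob library (an illustrative assumption, not something
# the article prescribes). Run locally with
#   python word_count.py input.txt
# or against a Hadoop cluster with
#   python word_count.py -r hadoop hdfs:///path/to/input
from mrjob.job import MRJob


class MRWordCount(MRJob):

    def mapper(self, _, line):
        # Map phase: emit one (word, 1) pair per word in the input line.
        for word in line.split():
            yield word.lower(), 1

    def reducer(self, word, counts):
        # Reduce phase: by the time this runs, the framework has already
        # shuffled the pairs so all counts for one word arrive together.
        yield word, sum(counts)


if __name__ == "__main__":
    MRWordCount.run()
```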
Two of the most common are Hive and Pig. Hive was initially developed by Facebook, and one of its biggest selling points is that it allows you to run MapReduce jobs through a SQL-like querying language called the Hive Query Language (HiveQL).
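As an illustration of what that looks like from Python, here is a sketch using the PyHive client; the host name and the reports table are hypothetical, and the article does not mention PyHive itself. The HiveQL query is the interesting part: it reads like SQL, but Hive executes it as MapReduce jobs on the cluster.

```python
# A sketch of querying Hive from Python with the PyHive client.
# The host name and the `reports` table are assumptions for
# illustration only.
from pyhive import hive

conn = hive.connect(host="hive-server.example.com", port=10000)
cursor = conn.cursor()

# Find the most frequent words in a (hypothetical) table of reports:
# split each report body into words, then group and count.
cursor.execute(
    """
    SELECT word, COUNT(*) AS occurrences
    FROM reports LATERAL VIEW explode(split(body, ' ')) t AS word
    GROUP BY word
    ORDER BY occurrences DESC
    LIMIT 10
    """
)
for word, occurrences in cursor.fetchall():
    print(word, occurrences)
```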
Pig was originally developed at Yahoo! and excels in some areas where Hive does not. Pig jobs are written in a procedural language called Pig Latin. Hive and Pig are two of the most common Hadoop-based products, but there are plenty of others: for example, Mahout for machine learning, Giraph for graph analysis, and Cassandra, a hybrid of a key-value store and a column-oriented database.
Hive vs Pig:
In short: Hive gives you a declarative, SQL-like language (HiveQL), which suits analysts who already think in queries, while Pig gives you a procedural language (Pig Latin), which suits engineers expressing multi-step data pipelines. Which one fits depends on whether your job is more naturally a query or a pipeline.
Companies using Hadoop
As noted above, Facebook and Yahoo! are two well-known examples, having built Hive and Pig respectively on top of it.