Big Data and MapReduce
Ajay Taneja
Senior Data Engineer | Generative AI Engineer at Jaguar Land Rover | Ex - Rolls-Royce | Data Engineering, Data Science, Finite Element Methods Development, Stress Analysis, Fatigue and Fracture Mechanics
What is “Big Data” and what is “Manageable Data”?
When you’re dealing with data that fits easily on a single machine, can be loaded into memory, and can be analysed in a serial fashion, then that data is “manageable”. Now imagine you’re dealing with some very large data sets: we’re not talking of hundreds of gigabytes here, but terabytes, or maybe even petabytes. At this point, one might think of the “MapReduce” programming model, which allows us to take the data, distribute it across many different machines, and run the computations in parallel.
When is the MapReduce programming model appropriate, and when isn’t it?
It's generally safe to say that your data is “big” if it's too large to fit onto one disk. For example, suppose you have a huge database comprising the technical reports and research papers of several journals, and you want to find all the reports on a particular topic, or which words appear most often across them. It might be impossible to load the text of all the journals in the world onto a single disk; the MapReduce programming model can be handy here. MapReduce splits a large job up into several smaller chunks that each fit onto one machine, and the calculations on each machine can occur simultaneously. The machines do not communicate with each other whilst performing their calculations.
Examples where MapReduce has been used:
Classic examples include building web search indexes (the problem MapReduce was originally designed for at Google), analysing large volumes of server logs, and the word-frequency counting described above.
How does MapReduce work?
MapReduce consists of two distinct tasks: Map and Reduce. In the map phase, each machine processes its own chunk of the input and emits intermediate key-value pairs; those pairs are then grouped by key (the “shuffle”), so that in the reduce phase each reducer can aggregate all the values belonging to one key. As the name MapReduce suggests, the reducer phase takes place after the mapper phase has been completed.
Figure: How does MapReduce work?
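To make the two phases concrete, here is a minimal single-machine sketch in plain Python (my own illustration, not production Hadoop code). It counts word frequencies across a few documents by running a map step, a shuffle that groups the intermediate pairs by key, and a reduce step, exactly the sequence described above, just without the cluster.

```python
from collections import defaultdict

def mapper(document):
    """Map phase: emit a (word, 1) pair for every word in the document."""
    for word in document.lower().split():
        yield word, 1

def shuffle(mapped_pairs):
    """Shuffle phase: group values by key, so each reducer receives
    every intermediate count for one word."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups.items()

def reducer(word, counts):
    """Reduce phase: aggregate the per-word counts into a total."""
    return word, sum(counts)

documents = [
    "the quick brown fox",
    "the lazy dog",
    "the quick dog",
]

# Run map over every document, shuffle, then reduce each group.
mapped = (pair for doc in documents for pair in mapper(doc))
results = dict(reducer(word, counts) for word, counts in shuffle(mapped))
print(results)  # e.g. {'the': 3, 'quick': 2, 'brown': 1, ...}
```

On a real cluster, each mapper and each reducer would run on a different machine, and the shuffle would move data across the network; the logic, however, is the same.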
Open-source implementation of the MapReduce programming model:
A very common open-source implementation of the MapReduce programming model is Hadoop. Hadoop couples the MapReduce programming model with a distributed file system (HDFS). To allow programmers to complete complicated tasks more easily using the processing power of Hadoop, many frameworks have been built on top of it.
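If you want to see what an actual Hadoop job can look like, below is a sketch of the same word count written with the third-party mrjob library. mrjob is my choice for illustration (the article itself does not prescribe it); the same job could equally be written in Java against the raw Hadoop API.

```python
# word_count.py: a word-count MapReduce job sketched with the
# third-party mrjob library (an illustrative assumption, not something
# the article prescribes). Run locally with
#   python word_count.py input.txt
# or against a Hadoop cluster with
#   python word_count.py -r hadoop hdfs:///path/to/input
from mrjob.job import MRJob


class MRWordCount(MRJob):

    def mapper(self, _, line):
        # Map phase: emit one (word, 1) pair per word in the input line.
        for word in line.split():
            yield word.lower(), 1

    def reducer(self, word, counts):
        # Reduce phase: by the time this runs, the framework has already
        # shuffled the pairs so all counts for one word arrive together.
        yield word, sum(counts)


if __name__ == "__main__":
    MRWordCount.run()
```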
Two of the most common are Hive and Pig. Hive was initially developed by Facebook, and one of its biggest selling points is that it allows you to run MapReduce jobs through a SQL-like querying language called the Hive Query Language (HiveQL).
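As an illustration of what that looks like from Python, here is a sketch using the PyHive client; the host name and the reports table are hypothetical, and the article does not mention PyHive itself. The HiveQL query is the interesting part: it reads like SQL, but Hive executes it as MapReduce jobs on the cluster.

```python
# A sketch of querying Hive from Python with the PyHive client.
# The host name and the `reports` table are assumptions for
# illustration only.
from pyhive import hive

conn = hive.connect(host="hive-server.example.com", port=10000)
cursor = conn.cursor()

# Find the most frequent words in a (hypothetical) table of reports:
# split each report body into words, then group and count.
cursor.execute(
    """
    SELECT word, COUNT(*) AS occurrences
    FROM reports LATERAL VIEW explode(split(body, ' ')) t AS word
    GROUP BY word
    ORDER BY occurrences DESC
    LIMIT 10
    """
)
for word, occurrences in cursor.fetchall():
    print(word, occurrences)
```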
Pig was originally developed at Yahoo! and excels in some areas where Hive does not. Pig jobs are written in a procedural language called Pig Latin. Hive and Pig are two of the most common Hadoop-based products, but there are plenty of others: for example, Mahout for machine learning, Giraph for graph analysis, and Cassandra, a hybrid of a key-value store and a column-oriented database.
Hive vs Pig:
In short: Hive gives you a declarative, SQL-like language (HiveQL), which suits analysts who already think in queries, while Pig gives you a procedural language (Pig Latin), which suits engineers expressing multi-step data pipelines. Which one fits depends on whether your job is more naturally a query or a pipeline.
Companies using Hadoop
As noted above, Facebook and Yahoo! are two well-known examples, having built Hive and Pig respectively on top of it.