Using the Aggregation Framework and Creating Mapper and Reducer Programs
Vivek Sharma
RedHat Certified in Containers and Kubernetes (EX180) || Arth || Aspiring DevOps Engineer || Python || Docker || Ansible || Jenkins || Kubernetes || OpenShift || Terraform || GitHub || AWS || AZURE
ARTH - Task 34
Task Description --- Use the Aggregation Framework of MongoDB and create Mapper and Reducer programs.
MongoDB is the first database that comes to mind when we have to work with unstructured data and manipulate the shape of that data quickly and efficiently. For this, MongoDB comes with a powerful framework called the Aggregation Framework, which lets us manipulate data effectively on the server itself.
What is the Aggregation Framework?
The Aggregation Framework is just a way to query documents in a MongoDB collection. This framework exists because when you start working with and manipulating data, you often need to crunch collections together, modify them, pluck out fields, rename fields, concatenate them, group documents by a field, explode arrays of fields into different documents, and so on.
The simple query methods in MongoDB only allow you to retrieve full or partial individual documents. They don't really allow you to manipulate the documents on the server and then return them to your application. This is where MongoDB's Aggregation Framework comes in. It's nothing external; aggregation comes baked into MongoDB.
What is a Pipeline in the Aggregation Framework?
The Aggregation Framework relies on the concept of a pipeline. The pipeline consists of stages in which certain operators modify the documents in the collection using various techniques. Finally, the output is returned to the application calling the query.
In MongoDB, the pipeline is an array of operators, each of which takes in a bunch of documents and spits out modified documents according to the rules specified by the programmer. The next operator takes in the documents spat out by the previous one, which is why it's called a pipeline. Some examples of operators are $group, $match, $project, $sort, etc.
Compare this with a query like find(), which works in most cases but is not handy when we want to modify and retrieve data at the same time.
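As a quick illustration (just a sketch, assuming the same contacts collection used in the demonstrations below), a pipeline that reshapes each document with $project and then orders the results with $sort might look like this:
> db.contacts.aggregate([
... {$project : {_id : 0, gender : 1, state : "$location.state", age : "$dob.age"}},
... {$sort : {age : -1}}
... ])
Each document coming out of $project carries only the reshaped fields, and $sort then orders those documents by age in descending order.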
Some Practical Demonstrations of Using Aggregation Pipelines
Here I have a collection of dummy data about the users of a company, with complete details like name, age, gender, state, etc. I will be using this collection throughout this article.
Let's create a pipeline to determine the number of males in each state.
> db.contacts.aggregate([
... {$match : {gender : "male"}},
... {$group : {_id: {state : "$location.state"} , total_males : {$sum : 1} }} ])
The $match operator collects all the documents whose "gender" field is "male" and passes them to the next operator. The next operator is $group, which groups the documents by their state and also keeps a count of the total males in each state.
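To take this one step further (again just a sketch, still assuming the same collection), we could append a $sort stage so that the states with the most males come first:
> db.contacts.aggregate([
... {$match : {gender : "male"}},
... {$group : {_id: {state : "$location.state"} , total_males : {$sum : 1} }},
... {$sort : {total_males : -1}} ])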
What is MapReduce?
MapReduce is a programming paradigm that works on big data over a distributed system. It analyses the data and produces aggregated results. Key/value pairs are emitted in the map function, and these values are used to accumulate data. Later, the reduce function takes the data accumulated by the map function and converts it into the aggregated results. In MongoDB, the Mapper and Reducer programs are written in JavaScript.
Let's create a Map-Reduce program to calculate the average age of all males and females -
> var mapfun = function() {
... emit(this.gender , this.dob.age);
... };
The above code is the map function applied to the input documents. The function "mapfun" maps the gender of each person to the age of that person. The 'this' keyword refers to the document currently being processed.
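Conceptually (the ages below are just made-up illustrative values), each call to emit() produces a key/value pair, and MongoDB groups all the values by key before handing them to the reducer:
// For each document the mapper emits a (key, value) pair, for example:
//   emit("male", 34)   emit("female", 29)   emit("male", 57)
// MongoDB then groups the emitted values by key before calling the reducer:
//   "male"   -> [34, 57, ...]
//   "female" -> [29, ...]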
> var reducefun = function(gender , age) {
... return Array.avg(age);
... };
This is the code of the Reducer function that processes the incoming data. The function "reducefun" has two arguments: the 'age' argument is an array of the 'dob.age' values emitted by the Mapper function, grouped by the 'gender' argument. The function returns the average age of all the males and of all the females in the collection.
The next step is to plug these functions into the mapReduce function.
> db.contacts.mapReduce(
... mapfun,
... reducefun,
... {out : "map_reduce11"}
... )
The function 'mapfun' takes input documents from the 'contacts' collection and passes them to the Reducer function 'reducefun', which then calculates the average male and female ages. Next, the 'out' option receives the data from 'reducefun' and saves it to the "map_reduce11" collection. If no collection with the given name exists, a new collection is created and the data is stored in it. If the collection already exists with some other data, the older data is overwritten.
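As a side note (a sketch; exact behaviour can vary by MongoDB version), if you only want to inspect the results without writing them to a collection, mapReduce also accepts an inline output option:
> db.contacts.mapReduce(
... mapfun,
... reducefun,
... {out : {inline : 1}}
... )
With inline output, the results are returned directly in the command's result document instead of being stored in a new collection.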
We can check whether the new collection has been created using the (>show collections ) command and then view the data in that collection using the (>db.map_reduce11.find() ) command.
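In the shell, that looks like this:
> show collections
> db.map_reduce11.find()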
Let's do the same task using the Aggregation framework -
> db.contacts.aggregate([
... {$group : {_id : "$gender" , value: {$avg : "$dob.age"} }},
... {$out : "aggregation11"}
... ])
In the above code, I have first grouped all the documents by gender and then found the average of their ages. In the next step, I have saved the results to the "aggregation11" collection.
We can view the data in the collection using the "db.aggregation11.find()" command.
Example 2: Here I am going to create a Map-Reduce program to calculate the number of males and females in each state.
> var mapfun22 = function() {
emit(this.location.state , this.gender);
};
This Mapper function emits the state of each document as the key and the document's gender as the value.
> var reducerfun22 = function(state ,gender){
var result = {male : 0 , female : 0};
for (var idx =0 ; idx < gender.length ; ++idx){
if(gender[idx] == "male") result.male++;
else result.female++;
}
return result;
};
The above Reducer function counts the number of males and females in each state and returns the result.
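One caveat worth mentioning: MongoDB may invoke the reduce function more than once for the same key, passing previously reduced results back in as values. The reducer above assumes every value is a raw gender string, so a re-reduce-safe variant (just a sketch; 'reducerfun22_safe' is a name used here only for illustration) could check the type of each value:
> var reducerfun22_safe = function(state , values) {
var result = {male : 0 , female : 0};
values.forEach(function(v) {
if (typeof v === "string") {
// value emitted directly by the mapper ("male" or "female")
if (v == "male") result.male++;
else result.female++;
} else {
// value is a partial count object from an earlier reduce pass
result.male += v.male;
result.female += v.female;
}
});
return result;
};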
> db.contacts.mapReduce(
mapfun22,
reducerfun22,
{out : "map_reduce22"}
)
Here we have clubbed the Mapper and Reducer functions together and saved the final output to the "map_reduce22" collection.
To view the results of the program, we use the (>db.map_reduce22.find() ) command --
Here _id is equivalent to the "location.state" field of the original dataset.
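For comparison (a sketch only; the output collection name "aggregation22" is just an example), the same per-state counts can also be produced with the Aggregation Framework using conditional sums:
> db.contacts.aggregate([
... {$group : {_id : "$location.state",
... male : {$sum : {$cond : [{$eq : ["$gender", "male"]}, 1, 0]}},
... female : {$sum : {$cond : [{$eq : ["$gender", "female"]}, 1, 0]}} }},
... {$out : "aggregation22"}
... ])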
THANK YOU FOR READING.....