HBase MapReduce Integration
Malini Shukla
What is MapReduce?
MapReduce was designed to solve the problem of processing terabytes of data and more in a scalable way. The challenge is to build a system whose performance increases linearly with the number of physical machines added; providing a framework for exactly that is the main purpose of MapReduce.
Let’s revise HBase Architecture
MapReduce follows a divide-and-conquer approach: it splits the data located on a distributed file system into chunks, so that the available servers can each access their chunks and process them as fast as they can. With this approach, the partial results have to be consolidated at the end, and MapReduce has that consolidation built right into it as well.
Classes
The MapReduce process figure above shows all the classes involved in the Hadoop implementation of MapReduce. Let's learn about them in detail:
i. InputFormat
First, the InputFormat splits the input data and returns a RecordReader instance, which defines the classes of the key and value objects. The RecordReader also makes it possible to iterate over each input record, with the help of its next() method.
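As a minimal sketch, here is how a job can be pointed at TableInputFormat, the HBase implementation of InputFormat, which splits a table by region and supplies a record reader over its rows. The table name "metrics" is a hypothetical placeholder:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.mapreduce.TableInputFormat;
import org.apache.hadoop.mapreduce.Job;

public class InputFormatSetup {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    // TableInputFormat reads its source table name from the configuration.
    conf.set(TableInputFormat.INPUT_TABLE, "metrics"); // hypothetical table name
    Job job = Job.getInstance(conf, "scan-metrics");
    // The input format splits the table by region; each split is read by a
    // record reader that iterates rows as (ImmutableBytesWritable, Result) pairs.
    job.setInputFormatClass(TableInputFormat.class);
  }
}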
ii. Mapper
In this step, each record read by the RecordReader is processed using the map() method of the Mapper class.
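For an HBase source, the mapper extends TableMapper, so each call to map() receives one row of the scanned table as a Result. The following sketch counts clicks per user; the click-counting use case and the info:user column layout are assumptions for illustration:

import java.io.IOException;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;

// Emits (user, 1) for every scanned row that has an info:user cell.
public class ClickMapper extends TableMapper<Text, IntWritable> {
  private static final byte[] CF = Bytes.toBytes("info");  // hypothetical family
  private static final byte[] COL = Bytes.toBytes("user"); // hypothetical qualifier
  private static final IntWritable ONE = new IntWritable(1);

  @Override
  protected void map(ImmutableBytesWritable rowKey, Result row, Context ctx)
      throws IOException, InterruptedException {
    byte[] user = row.getValue(CF, COL);
    if (user != null) {
      ctx.write(new Text(Bytes.toString(user)), ONE);
    }
  }
}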
iii. Reducer
This stage works much like the Mapper stage, except that here we process the output of the Mapper class after the data has been shuffled and sorted.
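Continuing the same hypothetical click-counting example, a reducer that targets HBase extends TableReducer: it sums the shuffled counts for each user and emits a Put for the output table. The stats:clicks column is an assumption:

import java.io.IOException;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableReducer;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;

// Sums the shuffled counts per user and writes one Put per user.
public class ClickReducer
    extends TableReducer<Text, IntWritable, ImmutableBytesWritable> {
  @Override
  protected void reduce(Text user, Iterable<IntWritable> counts, Context ctx)
      throws IOException, InterruptedException {
    int total = 0;
    for (IntWritable c : counts) {
      total += c.get();
    }
    Put put = new Put(Bytes.toBytes(user.toString()));
    // stats:clicks is a hypothetical output column for this sketch.
    put.addColumn(Bytes.toBytes("stats"), Bytes.toBytes("clicks"),
        Bytes.toBytes(total));
    ctx.write(new ImmutableBytesWritable(put.getRow()), put);
  }
}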
iv. OutputFormat
Finally, the OutputFormat class persists the data in various locations. There are specific implementations that allow output to files or, in the case of the TableOutputFormat class, to HBase tables. To write the data into the specific HBase output table, it uses a TableRecordWriter.
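As a brief sketch, TableOutputFormat can be configured as a job's output like this; the target table name "click_counts" is hypothetical. The TableRecordWriter it creates turns each emitted Put or Delete into a write against that table:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.mapreduce.TableOutputFormat;
import org.apache.hadoop.mapreduce.Job;

public class OutputFormatSetup {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    // TableOutputFormat reads its target table name from the configuration.
    conf.set(TableOutputFormat.OUTPUT_TABLE, "click_counts"); // hypothetical table
    Job job = Job.getInstance(conf, "write-click-counts");
    job.setOutputFormatClass(TableOutputFormat.class);
  }
}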
Supporting Classes in MapReduce Integration
The MapReduce support comes with the TableMapReduceUtil class, which helps in setting up MapReduce jobs over HBase. It provides static methods that configure a job so we can run it with HBase as the source and/or the target.
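Putting the pieces together, here is a minimal driver sketch that uses TableMapReduceUtil's initTableMapperJob() and initTableReducerJob() to run the hypothetical click-count job with HBase as both source and target; the table names "clicks" and "click_counts" are placeholders:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;

public class ClickCountJob {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = Job.getInstance(conf, "click-count");
    job.setJarByClass(ClickCountJob.class);

    Scan scan = new Scan();
    scan.setCaching(500);       // fetch more rows per RPC for scan-heavy jobs
    scan.setCacheBlocks(false); // block caching is usually disabled for MR scans

    // HBase as the source: wires up TableInputFormat, the scan, and the mapper.
    TableMapReduceUtil.initTableMapperJob(
        "clicks",               // hypothetical source table
        scan, ClickMapper.class, Text.class, IntWritable.class, job);

    // HBase as the target: wires up TableOutputFormat and the reducer.
    TableMapReduceUtil.initTableReducerJob(
        "click_counts",         // hypothetical target table
        ClickReducer.class, job);

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}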
Let’s revise HBase Use Cases and Real-time Applications