HBase Tutorial
Have you ever wondered how emails are stored and processed? Before email became widespread, relational database management systems (RDBMS) were the standard way to store data. However, with the rise of massive amounts of semi-structured data such as emails, RDBMS struggled to store and process it, and HBase took up the task. The Hadoop ecosystem consists of various components dedicated to different roles, one of which is HBase.
Introduction to HBase
A few decades ago, before the internet was widely available, far less data was generated, and most of it was structured. Structured data has a definite structure and a standard order. This data was stored in relational databases (RDBMS) without any hassle.
With the evolution of the internet came terms such as big data, as huge volumes of structured and semi-structured data started being generated. Semi-structured data includes emails, JSON, XML, and .csv files, to name a few. Loads of semi-structured data were created across the globe, and storing and processing this data became a major challenge. One solution is Apache HBase. Let’s now have a look at the history of HBase.
HBase History
Back in November 2006, Google released its paper on Bigtable. In February 2007, the HBase prototype was created as a Hadoop contribution. In October 2007, the first usable HBase was released along with Hadoop 0.15.0, and HBase became a subproject of Hadoop in January 2008. HBase 0.18.1, 0.19.0, and 0.20.0 were released between October 2008 and September 2009. Finally, in May 2010, HBase became an Apache top-level project.
What is HBase?
HBase is modeled after Google's Bigtable, which is a distributed storage system for structured data. Just as Bigtable leverages the distributed data storage provided by the Google File System, Apache HBase provides Bigtable-like capabilities on top of Hadoop and HDFS.
Some of the companies that have used HBase are Facebook, Netflix, Yahoo, Adobe, and Twitter. The goal of HBase is to host large tables with billions of rows and millions of columns on top of clusters of commodity hardware.
Why HBase?
We know that HDFS stores and manages large amounts of data efficiently.
However, it is built for batch processing, where data is accessed sequentially. This means one has to scan the entire dataset for even the simplest of jobs. Hence, a solution was required for random access: reading or writing any piece of data at any time, regardless of where it sits in the dataset.
HBase Real Life Connect - Example
You may be aware that Facebook introduced a Social Inbox integrating email, IM, SMS, text messages, and on-site Facebook messages. It needed to store over 135 billion messages a month.
Facebook chose HBase because it needed a system that could handle two types of data patterns:
Characteristics of HBase
HBase is a type of NoSQL database and is classified as a key-value store. Some characteristics of HBase are:
HBase tables have no fixed column schema; only column families, not individual columns, are defined at the time of table creation.
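To make the distinction between column families and columns concrete, here is a minimal Python sketch. It is a toy in-memory model, not the HBase client API: the class SketchTable and its methods are hypothetical names introduced only for illustration. Column families are fixed when the table is created, while column qualifiers inside a family can be introduced by any write.

```python
class SketchTable:
    """Toy model of an HBase table: families are fixed, qualifiers are free-form."""

    def __init__(self, name, families):
        self.name = name
        self.families = set(families)   # declared once, at table creation
        self.rows = {}                  # row key -> {"family:qualifier": value}

    def put(self, row_key, column, value):
        family, _, _qualifier = column.partition(":")
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        # Any qualifier is accepted; no per-column schema is declared up front.
        self.rows.setdefault(row_key, {})[column] = value


table = SketchTable("newtbl", ["knowledge"])
table.put("r1", "knowledge:sports", "cricket")    # column created on first write
table.put("r2", "knowledge:music", "pop music")   # a different column in another row

try:
    table.put("r1", "test_info:note", "x")        # family was never declared
    rejected = False
except KeyError:
    rejected = True
```

In this sketch, rows r1 and r2 end up with different columns under the same family, and the write to the undeclared family is rejected.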
Applications of HBase
HBase is used across various industries, from healthcare to e-commerce to sports. For instance:
HBase vs RDBMS
Do HBase and RDBMS sound similar? Here are some of the primary differences between them.
HBase
RDBMS
Features of HBase
HBase has a number of features like:
HBase Architecture
The architecture of HBase is as shown below:
Apache ZooKeeper monitors the system, and the HBase Master assigns regions and handles load balancing. Region Servers serve data for reads and writes; they run on the different machines in the Hadoop cluster. Each Region Server consists of Regions, an HLog, Stores, MemStores, and different files. The files are stored in HDFS. Let’s now take an in-depth look at each of these architectural components and see how they work together.
HBase Architectural Components: Regions
As seen in the below diagram, HBase tables are divided horizontally by row key range into “Regions”. Regions are assigned to the nodes in the cluster, called “Region Servers”. These servers serve data for reading and writing.
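The row-key-range partitioning described above can be sketched in a few lines of Python. This is only an illustration under assumptions: the start keys and the server names rs1 to rs3 are made up, and a real cluster tracks regions in its catalog table rather than in a list.

```python
import bisect

# Hypothetical layout: each region starts at a row key and runs up to the next start key.
REGION_START_KEYS = ["", "g", "n", "t"]        # sorted; "" means "from the beginning"
REGION_SERVERS = ["rs1", "rs2", "rs3", "rs1"]  # assumed server hosting each region


def find_region(row_key):
    """Return (region index, region server) for the region whose range holds row_key."""
    # The owning region is the last one whose start key is <= row_key.
    index = bisect.bisect_right(REGION_START_KEYS, row_key) - 1
    return index, REGION_SERVERS[index]


apple = find_region("apple")   # falls in the first region
grape = find_region("grape")   # falls in the region starting at "g"
zebra = find_region("zebra")   # falls in the last region
```

Because the start keys are sorted, finding the owning region is a binary search rather than a scan over all regions.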
HBase Architectural Components: ZooKeeper
HBase runs in a distributed environment where the HMaster cannot manage everything on its own, and this is where ZooKeeper comes into play. ZooKeeper is a distributed coordination service that maintains server state in the cluster. It tracks which servers are alive and available, and provides server failure notification. Here’s how ZooKeeper operates:
1. Active HMaster sends a heartbeat signal to ZooKeeper indicating that it’s active.
2. Region servers send their status to ZooKeeper indicating they are ready for read and write operations.
3. The inactive HMaster acts as a backup. If the active HMaster fails, it will come to the rescue.
Now let’s see how these components work together. The active HMaster and the Region Servers each connect to ZooKeeper with a session.
ZooKeeper maintains ephemeral nodes for active sessions via heartbeats to indicate that region servers are up and running.
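The idea of ephemeral nodes kept alive by heartbeats can be sketched as follows. This is a toy stand-in, not the ZooKeeper API: SessionTracker, the 30-unit timeout, and the server names are all assumptions made for illustration. A server counts as alive only while its last heartbeat is within the session timeout.

```python
class SessionTracker:
    """Toy stand-in for ZooKeeper session tracking; not the real ZooKeeper API."""

    def __init__(self, session_timeout):
        self.session_timeout = session_timeout
        self.last_heartbeat = {}   # server name -> time of last heartbeat

    def heartbeat(self, server, now):
        self.last_heartbeat[server] = now

    def live_servers(self, now):
        # A server's ephemeral node "exists" only while its session has not timed out.
        return sorted(
            server
            for server, seen in self.last_heartbeat.items()
            if now - seen <= self.session_timeout
        )


tracker = SessionTracker(session_timeout=30)
tracker.heartbeat("regionserver-1", now=0)
tracker.heartbeat("regionserver-2", now=0)
alive_at_20 = tracker.live_servers(now=20)    # both sessions still valid

tracker.heartbeat("regionserver-1", now=25)   # regionserver-2 goes silent
alive_at_40 = tracker.live_servers(now=40)    # regionserver-2 has timed out
```

In the sketch, the silent server simply drops out of the live list once its session expires, which is the point at which a failure notification would be sent.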
Now let’s move on to our next topic and see how HBase handles reads and writes.
HBase Read or Write
There is a special HBase Catalog table called the META table, which holds the location of the regions in the cluster. When a client reads or writes data to HBase, the following takes place:
The client gets the Region Server that hosts the META table from ZooKeeper. The client then queries the META table to get the region server corresponding to the row key it wants to access. The client caches this information along with the META table location.
It then gets the row from the corresponding Region Server.
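The lookup-and-cache flow described above can be sketched as a toy client in Python. Everything here is hypothetical: the META list, the SketchClient class, and the server names stand in for the real catalog table and client library. The point is only that the client consults the catalog on a cache miss and answers later requests for the same region from its cache.

```python
# (start key inclusive, end key exclusive or None for "no upper bound", region server)
META = [
    ("", "m", "rs1"),
    ("m", None, "rs2"),
]


class SketchClient:
    """Toy client that caches region locations after the first catalog lookup."""

    def __init__(self, meta):
        self.meta = meta              # stands in for the META table
        self.cached_regions = []      # regions this client has already located
        self.meta_lookups = 0

    @staticmethod
    def _contains(region, row_key):
        start, end, _server = region
        return start <= row_key and (end is None or row_key < end)

    def locate(self, row_key):
        for region in self.cached_regions:
            if self._contains(region, row_key):
                return region[2]                  # cache hit: no catalog query
        self.meta_lookups += 1                    # cache miss: query the catalog
        region = next(r for r in self.meta if self._contains(r, row_key))
        self.cached_regions.append(region)
        return region[2]


client = SketchClient(META)
first = client.locate("apple")    # miss: looked up in META
second = client.locate("banana")  # hit: same region as "apple"
third = client.locate("pear")     # miss: a different region
```

Three reads trigger only two catalog lookups in this sketch, because the second row key falls in a region the client has already located.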
HBase Meta Table
In HBase, the META table is used to find the Region for a given Table key. It is a special HBase catalog table that maintains a list of all the regions in the HBase storage system, along with the Region Server that hosts each one.
HBase Write Mechanism
The mechanism works in four steps, and here’s how:
1. Write Ahead Log (WAL) is a file used to store new data that is yet to be put on permanent storage. It is used for recovery in the case of failure. When a client issues a put request, it will write the data to the write-ahead log (WAL).
2. MemStore is the write cache that stores new data that has not yet been written to disk. There is one MemStore per column family per region. Once data is written to the WAL, it is then copied to the MemStore.
3. Once the data is placed in MemStore, the client then receives the acknowledgment.
4. HFiles store the rows of data as sorted KeyValues on disk. When the MemStore reaches its threshold, it flushes the data into a new HFile.
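The four steps above can be sketched as a small in-memory model. This is an illustration under stated assumptions, not HBase code: SketchRegionStore is a hypothetical class, the flush threshold counts cells rather than bytes, the flush is triggered by an explicit call instead of a background thread, and the WAL is never trimmed.

```python
class SketchRegionStore:
    """Toy model of the write path: WAL append, MemStore update, ack, flush."""

    def __init__(self, flush_threshold):
        self.flush_threshold = flush_threshold  # assumed: number of cells, not bytes
        self.wal = []        # append-only log, kept for recovery
        self.memstore = {}   # in-memory write cache: (row, column) -> value
        self.hfiles = []     # each flush produces one immutable, sorted "file"

    def put(self, row_key, column, value):
        self.wal.append((row_key, column, value))   # step 1: write to the WAL
        self.memstore[(row_key, column)] = value    # step 2: copy to the MemStore
        return "ack"                                 # step 3: acknowledge the client

    def flush_if_full(self):
        # Step 4: once the threshold is reached, persist the MemStore sorted by key.
        if len(self.memstore) >= self.flush_threshold:
            self.hfiles.append(sorted(self.memstore.items()))
            self.memstore = {}


store = SketchRegionStore(flush_threshold=2)
ack = store.put("r2", "knowledge:music", "pop music")
store.flush_if_full()   # below the threshold, nothing is flushed yet
store.put("r1", "knowledge:sports", "cricket")
store.flush_if_full()   # threshold reached: one sorted HFile is written
```

Note that the rows arrive as r2 then r1 but come out of the flush in sorted key order, which is what lets later reads search an HFile efficiently.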
Now that we have understood the theory part of HBase, you can learn how HBase works through a demo.
Demo
Before starting off with the demo, you can navigate to hbase.apache.org to gain some information on HBase, and you can also go through the HBase reference guide. In this demo, we will be working on Oracle VirtualBox with the Cloudera QuickStart VM installed on it.
You can start off by selecting the HBase Master as shown below from the Hue interface.
Once you click on the Master, you will get an overview of the region servers, tables, tasks, the ZooKeeper version, and various other software attributes. Then open up a terminal window to start off with the demo. You can zoom in for better visibility while typing inside the terminal window. First, to start off, you should open up the HBase shell and for that you need to type:
hbase shell # Opens the HBase shell
After a couple of seconds, you’ll be inside the HBase shell where you can type the HBase commands. You can start off by typing the following commands:
list # Lists all the tables present in HBase
create 'newtbl', 'knowledge' # Creates a new table with the column family 'knowledge'
describe 'newtbl' # Describes the table, confirming that it was created
status 'summary' # Checks the status of HBase
Now that we have created a new table, let’s put some data into it.
put 'newtbl', 'r1', 'knowledge:sports', 'cricket'
put 'newtbl', 'r1', 'knowledge:science', 'chemistry'
put 'newtbl', 'r1', 'knowledge:science', 'physics'
put 'newtbl', 'r2', 'knowledge:economics', 'macro economics'
put 'newtbl', 'r2', 'knowledge:music', 'pop music'
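Two of the puts above write to the same cell (row r1, column knowledge:science). The following Python sketch shows, in simplified form, how timestamped versions make the most recent write the one that is read back. It is a toy model: the put and get_latest functions and the counter used as a clock are assumptions for illustration, and how many older versions HBase actually retains depends on the column family's VERSIONS setting.

```python
from collections import defaultdict
from itertools import count

_clock = count(1)             # stand-in for write timestamps
cells = defaultdict(list)     # (row, column) -> list of (timestamp, value)


def put(row, column, value):
    """Record a new version of the cell instead of overwriting in place."""
    cells[(row, column)].append((next(_clock), value))


def get_latest(row, column):
    """Return the value with the highest timestamp for this cell."""
    return max(cells[(row, column)])[1]


put("r1", "knowledge:science", "chemistry")
put("r1", "knowledge:science", "physics")

latest = get_latest("r1", "knowledge:science")
history = [value for _, value in sorted(cells[("r1", "knowledge:science")])]
```

Here the default read returns only the newest value, while the earlier one remains in the version history of the sketch.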
Let’s now list the contents of the table by typing:
scan 'newtbl'
The output will be as seen below:
As you can see from the above image, “chemistry” does not appear: by default, scan displays only the latest version of each cell, which in this case is “physics”. Now we can type the following commands:
is_enabled 'newtbl' # Checks if the table is enabled
disable 'newtbl' # Disables the table
scan 'newtbl' # Lists the contents of the table. Note that this will throw an error as the table is disabled.
Now, before we enable the table again, let’s alter it.
alter 'newtbl', 'test_info' # Adds the column family 'test_info' to the table
enable 'newtbl' # Enables the table
describe 'newtbl' # Checks the column families after updating
The output will be as seen below:
Then extract values for one particular row and also see how to add new information to a row by using the following commands:
get 'newtbl', 'r1' # Extracts the values for r1 in the table
put 'newtbl', 'r1', 'knowledge:economics', 'market economics' # Adds a new economics column to r1. Note that this updates the row without overwriting its existing columns
get 'newtbl', 'r1' # Displays the results for r1
The output will be as seen below:
You can go back to the Cloudera HBase Master status page and see that there is now one user table. You can click on Details to view the data we fed in. This brings us to the end of this quick demo on HBase.
Conclusion
I hope this tutorial on HBase has helped you gain a better understanding of how HBase works. You learned what HBase is, looked at an HBase use case, and saw various applications of HBase. You also saw the differences between HBase and RDBMS, and learned about HBase storage and its architectural components. Finally, you saw how HBase works through a short demo.