HBase Compaction and Data Locality With Hadoop
Malini Shukla
What is HBase Compaction?
HBase is a distributed data store optimized for read performance. Optimal read performance requires one file per column family, but during heavy writes it is not always possible to maintain that. To reduce the maximum number of disk seeks needed for a read, HBase therefore tries to combine HFiles into a single large HFile. This process is called compaction.
In other words, compaction is the process by which HBase cleans itself up. There are two types: minor compaction and major compaction.
a. HBase Minor Compaction
Minor compaction is the process of combining a configurable number of smaller HFiles into one larger HFile. It is important because, without it, reading a particular row can require many disk reads, which reduces overall performance.
Minor compaction in HBase involves the following:
- It combines smaller HFiles into a bigger HFile.
- The new HFile still stores the deleted data along with the rest of the data.
- It frees up space so that more data can be stored.
- It uses merge sort.
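The merge step can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not HBase's actual implementation: each cell is modelled as a hypothetical `(row_key, timestamp, value)` tuple, a `None` value stands for a delete marker, and the function name `minor_compact` is made up for the example.

```python
import heapq


def minor_compact(hfiles):
    """Merge several already-sorted HFiles into one sorted HFile.

    Each HFile is a list of (row_key, timestamp, value) tuples sorted by
    row key ascending and timestamp descending (newest version first).
    Delete markers (value is None) are carried over unchanged.
    """
    # heapq.merge does a k-way merge of inputs that are already sorted,
    # which is the merge phase of a merge sort.
    order = lambda cell: (cell[0], -cell[1])
    return list(heapq.merge(*hfiles, key=order))


# Two small, already-sorted example files (made-up data).
hfile_a = [("row1", 5, "a"), ("row3", 2, "c")]
hfile_b = [("row1", 9, None), ("row2", 4, "b")]

merged = minor_compact([hfile_a, hfile_b])
for cell in merged:
    print(cell)
```

Because every input file is already sorted, the merge never needs to re-sort the data; it only interleaves the files. The delete marker for `row1` is still present in the result.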
b. HBase Major Compaction
Major compaction is the process of combining all the StoreFiles of a column family in a region into a single StoreFile. It also removes deleted and expired versions. It runs on a periodic schedule (every 24 hours by default in older HBase releases; recent releases default to 7 days). If the resulting StoreFile is larger than a configured size (hbase.hregion.max.filesize), the region is split into new regions after the compaction.
The main characteristics of major compaction in HBase are:
- The data of each column family in a region is consolidated into one HFile.
- All deleted and expired cells are permanently removed during this process.
- It improves read performance, since reads are served from the newly created HFile.
- It consumes a lot of I/O.
- It can cause traffic congestion.
- Because it rewrites all the data, it is a source of write amplification.
- It should be scheduled for times when network I/O demand is at its lowest.
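The difference from minor compaction can be illustrated with a short Python sketch. It rests on simplifying assumptions and is not HBase's real logic: cells are hypothetical `(row_key, timestamp, value)` tuples, a `None` value is a delete marker that hides every version of that row at or before its timestamp, and `ttl` is a single time-to-live applied to all cells.

```python
import heapq


def major_compact(store_files, now, ttl):
    """Merge all StoreFiles into one, dropping deleted and expired cells.

    Each StoreFile is a list of (row_key, timestamp, value) tuples sorted
    by row key ascending and timestamp descending.
    """
    order = lambda cell: (cell[0], -cell[1])
    result = []
    deleted_at = {}  # row_key -> timestamp of the newest delete marker
    for row, ts, value in heapq.merge(*store_files, key=order):
        if value is None:
            # Newest marker per row comes first because of the sort order.
            deleted_at.setdefault(row, ts)
            continue  # the marker itself is purged as well
        if row in deleted_at and ts <= deleted_at[row]:
            continue  # version hidden by a delete marker
        if now - ts > ttl:
            continue  # version older than the time-to-live
        result.append((row, ts, value))
    return result


# Made-up example data: three sorted StoreFiles.
files = [
    [("row1", 5, "a"), ("row3", 2, "c")],
    [("row1", 9, None), ("row2", 4, "b")],
    [("row1", 12, "a2")],
]
compacted = major_compact(files, now=14, ttl=10)
print(compacted)
```

In this example the old version of `row1` and its delete marker disappear, the expired `row3` cell is dropped, and only the live cells remain in the single output file.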
HBase Compaction Tuning
a. Short Description of HBase Compaction
To improve the performance and stability of an HBase cluster, we can use some lesser-known HBase compaction configuration options, described below.
b. Disabling Automatic Major Compactions in HBase
HBase users often ask for full control over major compaction events. The way to get it is to set hbase.hregion.majorcompaction to 0, which disables periodic automatic major compactions in HBase.
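HBase settings like this one normally live in hbase-site.xml as `<property>` entries. As a minimal sketch, the following Python helper renders such an entry using only the standard library; the helper name `render_property` is made up for this example, and the output still has to be placed inside the `<configuration>` element of the real file by hand.

```python
import xml.etree.ElementTree as ET


def render_property(name, value):
    """Render one hbase-site.xml <property> entry as a string."""
    prop = ET.Element("property")
    ET.SubElement(prop, "name").text = name
    ET.SubElement(prop, "value").text = str(value)
    return ET.tostring(prop, encoding="unicode")


# A value of 0 disables periodic automatic major compactions.
fragment = render_property("hbase.hregion.majorcompaction", 0)
print(fragment)
```

With periodic major compactions disabled, they have to be triggered some other way, for example manually during a quiet period.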
However, this does not give 100% control over major compactions, because HBase can sometimes automatically promote a minor compaction to a major one. Fortunately, another configuration option helps in this case.
c. Maximum HBase Compaction Selection Size
Another option for controlling the compaction process in HBase is the maximum selection size. StoreFiles larger than this value are excluded from compaction selection:
hbase.hstore.compaction.max.size (default: Long.MAX_VALUE)
HBase 1.2+ also provides:
hbase.hstore.compaction.max.size.offpeak
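The effect of the size limit can be sketched as a simple filter. This is an illustration only: the function name `eligible_for_compaction` and its parameters are made up for the example, and the real selection logic in HBase considers more factors than file size.

```python
LONG_MAX_VALUE = 2**63 - 1  # same value as Java's Long.MAX_VALUE


def eligible_for_compaction(file_sizes, max_size=LONG_MAX_VALUE,
                            max_size_offpeak=None, off_peak=False):
    """Return the file sizes that are small enough to be selected.

    max_size mirrors hbase.hstore.compaction.max.size and
    max_size_offpeak mirrors hbase.hstore.compaction.max.size.offpeak.
    """
    limit = max_size
    if off_peak and max_size_offpeak is not None:
        limit = max_size_offpeak
    return [size for size in file_sizes if size <= limit]


sizes = [10, 200, 5000]  # made-up sizes in MB
print(eligible_for_compaction(sizes))                 # default: no limit
print(eligible_for_compaction(sizes, max_size=1000))  # large file skipped
```

With the default limit every file qualifies. Lowering the limit keeps the largest files out of the selection.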
d. Off-peak Compactions in HBase
Further, if our deployment has off-peak hours, we can use the off-peak configuration settings.
To enable off-peak compaction, the following HBase compaction configuration options must be set:
hbase.offpeak.start.hour= 0..23
hbase.offpeak.end.hour= 0..23
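One detail worth illustrating is that an off-peak window can wrap past midnight, for example from 22 to 6. The following Python sketch shows one way to test whether a given hour is off-peak. It assumes the window includes the start hour and excludes the end hour; the exact boundary handling in HBase may differ, and the function name is made up for the example.

```python
def is_off_peak(hour, start_hour, end_hour):
    """Return True if hour (0..23) falls inside the off-peak window.

    The window is treated as [start_hour, end_hour) and may wrap
    past midnight when start_hour is greater than end_hour.
    """
    if start_hour <= end_hour:
        return start_hour <= hour < end_hour
    # Window wraps around midnight, e.g. 22..6.
    return hour >= start_hour or hour < end_hour


# Off-peak from 22:00 to 06:00 (made-up values).
for hour in (23, 3, 12):
    print(hour, is_off_peak(hour, start_hour=22, end_hour=6))
```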
The compaction file ratio defaults to 5.0 for off-peak hours and 1.2 for peak hours.
Both can be changed:
hbase.hstore.compaction.ratio
hbase.hstore.compaction.ratio.offpeak
The higher the file ratio, the more aggressive (frequent) compaction becomes. For the majority of deployments, the default values are fine.
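To see why a higher ratio makes compaction more aggressive, consider a simplified ratio-based selection rule, sketched in Python below. It assumes that a file is included once its size is no larger than the ratio multiplied by the combined size of all newer files. The real HBase policies also apply minimum and maximum file counts and size thresholds, which are left out here, and the function name is made up for the example.

```python
PEAK_RATIO = 1.2      # default of hbase.hstore.compaction.ratio
OFF_PEAK_RATIO = 5.0  # default of hbase.hstore.compaction.ratio.offpeak


def select_by_ratio(file_sizes, off_peak=False):
    """Select files for compaction using a simplified ratio rule.

    file_sizes is ordered from oldest to newest. Older files are
    skipped while they are too large relative to the newer files.
    """
    ratio = OFF_PEAK_RATIO if off_peak else PEAK_RATIO
    start = 0
    while start < len(file_sizes) - 1:
        newer_total = sum(file_sizes[start + 1:])
        if file_sizes[start] <= ratio * newer_total:
            break
        start += 1
    selected = file_sizes[start:]
    # Compacting a single file on its own would achieve nothing here.
    return selected if len(selected) > 1 else []


sizes = [500, 100, 50, 40]  # made-up sizes in MB, oldest first
print(select_by_ratio(sizes))                 # peak hours
print(select_by_ratio(sizes, off_peak=True))  # off-peak hours
```

In this example the 500 MB file is left alone during peak hours but is included off-peak, where the larger ratio allows bigger files into the selection.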