HBase Performance Tuning

Garbage Collection Tuning

Garbage collection parameters are among the lower-level settings we need to adjust for the region server processes. The master is not a problem here, since data does not pass through it and it does not handle any heavy load either. So, for HBase performance tuning, we only need to add these garbage collection parameters to the HBase Region Servers.
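
As a minimal sketch only, such parameters are usually exported for the region server processes in conf/hbase-env.sh; the heap size, GC flags, and log path below are assumptions that need to be tuned for your own JDK, heap size, and workload:

# conf/hbase-env.sh -- illustrative GC settings, applied to the region servers only
export HBASE_REGIONSERVER_OPTS="$HBASE_REGIONSERVER_OPTS \
  -Xms8g -Xmx8g \
  -XX:+UseParNewGC \
  -XX:+UseConcMarkSweepGC \
  -XX:CMSInitiatingOccupancyFraction=70 \
  -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps \
  -Xloggc:/var/log/hbase/gc-regionserver.log"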

Memstore-Local Allocation Buffer

In order to mitigate the issue of heap fragmentation caused by the high churn on the memstore instances of an HBase Region Server, version 0.90 of HBase introduced an advanced mechanism, the Memstore-Local Allocation Buffers (MSLAB).

Basically, these MSLABs are buffers of a fixed size that hold KeyValue instances of varying sizes. Whenever a buffer cannot completely fit a newly added KeyValue, it is considered full, and a new buffer of the same fixed size is created.
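
A minimal sketch of the related configuration, assuming the standard MSLAB properties (the chunk size shown is the commonly documented 2 MB default; verify both names and values against your HBase version's documentation):

<!-- conf/hbase-site.xml: illustrative MSLAB settings -->
<property>
  <name>hbase.hregion.memstore.mslab.enabled</name>
  <value>true</value>
</property>
<property>
  <name>hbase.hregion.memstore.mslab.chunksize</name>
  <value>2097152</value> <!-- the fixed buffer size, in bytes -->
</property>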

HBase Compression 

There is one more useful feature of HBase: it supports a number of compression algorithms, which can be enabled at the column family level.

In most cases, compression yields better performance, because the CPU overhead of compressing and decompressing the data is lower than the cost of reading more data from disk.

i. Available HBase Codecs

There is a fixed list of supported compression algorithms in HBase that we can select from. However, they differ in compression ratio as well as in CPU and installation requirements.

ii. Verifying Installation

As soon as we have installed a supported HBase compression algorithm, it is highly recommended to check whether the installation was successful. HBase offers several mechanisms to do that.

  • HBase Compression test tool

HBase ships with a tool to test whether compression is set up properly. Running it without any parameters prints the usage information:

$ ./bin/hbase org.apache.hadoop.hbase.util.CompressionTest
Usage: CompressionTest <path> none|gz|lzo|snappy

For example:

$ ./bin/hbase org.apache.hadoop.hbase.util.CompressionTest file:///tmp/testfile gz

iii. Enabling Compression

Enabling compression requires the JNI and native compression libraries to be installed first. Once they are available, we can set the compression algorithm when creating a table, for example:

hbase(main):001:0> create 'testtable', { NAME => 'colfam1', COMPRESSION => 'GZ' }
0 row(s) in 1.1920 seconds
hbase(main):012:0> describe 'testtable'
DESCRIPTION ENABLED
{NAME => 'testtable', FAMILIES => [{NAME => 'colfam1', true
BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '0', VERSIONS
=> '3', COMPRESSION => 'GZ', TTL => '2147483647', BLOCKSIZE
=> '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}
1 row(s) in 0.0400 seconds

To read back the schema of the newly created table, we use the describe HBase shell command, and we can see that the compression is set to GZIP. For existing tables, we use the alter command to enable, change, or disable the compression algorithm.

To disable compression for a given column family, change its compression format back to NONE.
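
As a sketch of the alter workflow, reusing the table and column family from the example above (older HBase versions require the table to be disabled while its schema is altered):

hbase(main):001:0> disable 'testtable'
hbase(main):002:0> alter 'testtable', { NAME => 'colfam1', COMPRESSION => 'NONE' }
hbase(main):003:0> enable 'testtable'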

  • Load Balancing

There is one built-in feature in the master, called the balancer. By default, the balancer runs every five minutes, and this is configured by the hbase.balancer.period property.

As soon as it starts, it strives to even out the number of assigned regions per region server, so that each server is within one region of the average number per server. The call first determines a new assignment plan, which describes which regions should be moved where. Then it starts moving the regions by iteratively calling the unassign() method of the administrative API.

There is also an upper limit on how long the balancer is allowed to run. It is configured with the hbase.balancer.max.balancing property and defaults to half of the balancer period value, that is, two and a half minutes.
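
As a sketch, both settings live in hbase-site.xml; the values below are simply the defaults spelled out in milliseconds, so treat them as an illustration rather than a recommendation:

<!-- conf/hbase-site.xml: illustrative balancer settings -->
<property>
  <name>hbase.balancer.period</name>
  <value>300000</value> <!-- run every five minutes -->
</property>
<property>
  <name>hbase.balancer.max.balancing</name>
  <value>150000</value> <!-- stop after two and a half minutes -->
</property>

The balancer can also be triggered manually, or switched on and off, from the HBase shell with the balancer and balance_switch commands.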

  • Merging Regions

While it is much more common for regions to split automatically over time as we add data to a table, sometimes we may need to merge regions, for example, to reduce the number of regions hosted by each server after we have removed a large amount of data. HBase provides a tool that permits us to merge two adjacent regions as long as the cluster is not online. Running the command-line tool without arguments prints the usage details:

$ ./bin/hbase org.apache.hadoop.hbase.util.Merge
Usage: bin/hbase merge <table-name> <region-1> <region-2>

  • Client API: Best Practices

There are a handful of optimizations we should consider to gain the best performance while reading or writing data from a client using the API. 

  • Disable auto-flush

While performing a lot of put operations, set the auto-flush feature of HTable to false by using the setAutoFlush(false) method, so that the puts are collected in the client-side write buffer and sent to the region servers in batches.
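
A minimal sketch, assuming the pre-1.0, HTable-based client API that this article refers to (the table name, column family, qualifier, and write buffer size are placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class AutoFlushExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "testtable");
    table.setAutoFlush(false);                 // buffer puts on the client side
    table.setWriteBufferSize(4 * 1024 * 1024); // illustrative 4 MB write buffer

    for (int i = 0; i < 100000; i++) {
      Put put = new Put(Bytes.toBytes("row-" + i));
      put.add(Bytes.toBytes("colfam1"), Bytes.toBytes("qual1"), Bytes.toBytes("val-" + i));
      table.put(put); // queued locally and sent to the region servers in batches
    }

    table.flushCommits(); // flush any puts still sitting in the write buffer
    table.close();
  }
}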

  • Limit scan scope

When we use Scan to process large numbers of rows, be aware of which attributes we are selecting; requesting only the column families and columns that are actually needed avoids shipping unnecessary data to the client.
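
For instance, a minimal sketch of a narrow scan (the family and qualifier names are placeholders, and setCaching is an additional optional tweak to reduce round trips):

import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class NarrowScanExample {
  // Build a scan that only requests the single column we actually process.
  public static Scan buildNarrowScan() {
    Scan scan = new Scan();
    scan.addColumn(Bytes.toBytes("colfam1"), Bytes.toBytes("qual1"));
    scan.setCaching(100); // optional: fetch 100 rows per RPC instead of the default
    return scan;
  }
}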

  • Close ResultScanners

This is not so much about improving performance as about avoiding performance problems: a ResultScanner that is never closed keeps resources tied up on the region servers.
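
A minimal sketch of the pattern, again assuming the HTable-based API ('testtable' is a placeholder):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;

public class CloseScannerExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "testtable");
    ResultScanner scanner = table.getScanner(new Scan());
    try {
      for (Result result : scanner) {
        // process each row here
      }
    } finally {
      scanner.close(); // releases the scanner lease held on the region server
      table.close();
    }
  }
}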

  • Block cache usage

Furthermore, by the setCacheBlocks() method, we can control whether Scan instances use the block cache in the region server. For large scans, such as MapReduce jobs over a table, this should be set to false so that the scan does not evict more useful data from the cache.
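
A minimal sketch (the column family name is a placeholder):

import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class FullScanNoCacheExample {
  // Build a scan for a large, one-off read (e.g. a MapReduce job) that should not
  // evict hot data from the region server block cache.
  public static Scan buildFullTableScan() {
    Scan scan = new Scan();
    scan.addFamily(Bytes.toBytes("colfam1"));
    scan.setCacheBlocks(false);
    return scan;
  }
}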

  1. Optimal loading of row keys
  2. Turn off WAL on Puts
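
Both of these last two points can be sketched together, again assuming the HTable-based client API (all names are placeholders). Optimal loading of row keys means asking the servers to return only the keys, which a combination of FirstKeyOnlyFilter and KeyOnlyFilter achieves; turning off the WAL on Puts trades durability for write throughput, so it is usually only acceptable for re-runnable bulk loads:

import java.util.Arrays;

import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.Filter;
import org.apache.hadoop.hbase.filter.FilterList;
import org.apache.hadoop.hbase.filter.FirstKeyOnlyFilter;
import org.apache.hadoop.hbase.filter.KeyOnlyFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class RowKeysAndWalExample {
  // 1. Optimal loading of row keys: return only the first cell of each row, stripped of its value.
  public static Scan rowKeyOnlyScan() {
    Scan scan = new Scan();
    scan.setFilter(new FilterList(Arrays.<Filter>asList(
        new FirstKeyOnlyFilter(), new KeyOnlyFilter())));
    return scan;
  }

  // 2. Turn off WAL on Puts: skip the write-ahead log for this mutation.
  public static Put putWithoutWal(byte[] rowKey) {
    Put put = new Put(rowKey);
    put.add(Bytes.toBytes("colfam1"), Bytes.toBytes("qual1"), Bytes.toBytes("value"));
    put.setWriteToWAL(false); // pre-1.0 API; newer clients use setDurability(Durability.SKIP_WAL)
    return put;
  }
}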
