Tuning Greenplum Database
In this post we will learn how to tune Greenplum. Learning how to tune your database is important, because database access is a core feature of most applications. For at least 80% of the applications we see, simple database performance tuning can speed up the application significantly.
Fortunately, there isn’t a lot of rocket science involved until you get really deep under the hood of database tuning. I’ve been a developer and a DBA using Oracle, SQL Server, and Pivotal Greenplum. By far, “tuning” Greenplum is easier than the other databases I've worked on, and the main reason is its architecture: Greenplum is an MPP, shared nothing system, rather than a traditional OLTP database.
With those other databases, DBAs typically tell users that they write bad SQL, create more and more indexes to handle the variety of queries, and finally ask the application team to redesign the dimensional model to match how users really want to look at the data.
With Greenplum Database (GPDB), there are a few important steps users should follow to design a schema and structure an application for high performance. If you understand and get good at managing these steps, you can run a well-performing Greenplum system (most of them are decided when you create the table).
Step 1 - “Table Distribution”
Greenplum is a shared nothing database where no two nodes share the same data. Data is spread across multiple servers based on a “distribution key” defined on each table. A good key is typically a unique identifier in the table, and it can be a single column or multiple columns (ideally, use a single column that will distribute data evenly across all segments). If you pick a good key, each segment will have roughly the same number of rows; an uneven spread is what Greenplum calls “skew”.
In an MPP shared nothing environment, the overall response time for a query is measured by the completion time of all the segments. The system is only as fast as the slowest segment. If the data is skewed, segments with more data will take more time to complete, so every segment should have an approximately equal number of rows and perform approximately the same amount of processing (good distribution is the key to parallelism).
If you come across poor performance and out-of-memory conditions, it is often because one database segment has significantly more data to process than the others.
Identifying Distribution Keys (example: migrating from Oracle to Greenplum):
- If a table has a primary key in Oracle, consider it as the distribution key in Greenplum.
- If a table in Oracle has no primary key but has a unique key, consider the unique key with the fewest columns as the distribution key in Greenplum.
- Choose random distribution when a table has no candidate column for a distribution key. This is acceptable for small tables, such as dimension tables, but not for large tables such as fact tables (see the DDL sketch after this list).
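A minimal DDL sketch of the two options above; table and column names are illustrative, not from the original post:

-- Explicit single-column distribution key (for example, the former Oracle primary key).
CREATE TABLE customer (
    customer_id bigint,
    name        text,
    city        text
)
DISTRIBUTED BY (customer_id);

-- Random distribution for a small dimension table with no good candidate key.
CREATE TABLE country_dim (
    country_code char(2),
    country_name text
)
DISTRIBUTED RANDOMLY;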
Rules of Distribution:
- Explicitly define a column or random distribution for all tables using the optional clauses DISTRIBUTED BY and DISTRIBUTED RANDOMLY (these specify the distribution policy for a table).
- Do not use DISTRIBUTED RANDOMLY just because it is the easy choice (it forces a redistribute or broadcast motion for any query that joins to the randomly distributed table).
- Do not rely on the default by omitting the distribution clause. The default is a hash distribution policy that uses either the PRIMARY KEY (if the table has one) or the first eligible column of the table as the distribution key. If the table has no eligible column, a random distribution is used.
- Do not distribute on columns that will be used in the WHERE clause of a query, and do not distribute on dates or timestamps.
- Never distribute and partition tables on the same column.
- Achieve local joins, which significantly improve performance, by distributing large tables that are commonly joined together on the same column.
- Validate that data is evenly distributed after the initial load and after incremental loads (a minimal skew check is sketched after this list).
- Finally, ensure that there is no data skew.
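A minimal way to check for skew after a load is to count rows per segment using the hidden gp_segment_id column; the counts should be roughly equal. The table name here is illustrative:

SELECT gp_segment_id, count(*) AS rows_per_segment
FROM   customer
GROUP  BY gp_segment_id
ORDER  BY rows_per_segment DESC;

Recent Greenplum versions also ship skew-related views in the gp_toolkit schema (for example gp_toolkit.gp_skew_coefficients) that report skew per table.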
Step 2 - “Table Partitioning”
A good partitioning strategy reduces the amount of data to be scanned by reading only the data needed to satisfy a query (storing data by partition makes it easy to get good performance from a query that only cares about a subset of the partitions). Each partition is a separate physical file on every segment, so partition only your large tables to reduce disk I/O. Greenplum supports list, range, and multi-level partitioning.
Partitioning is not a substitute for distribution (a good distribution strategy and partitioning together achieve partition elimination and unlock the performance magic). The Greenplum optimizer (ORCA) eliminates partitions based on the SQL that you execute.
Distributions and Partitioning
The picture above shows the combination of distribution and partitioning: the sales table is DISTRIBUTED BY store_id (so the sales rows are divided evenly across all the segments based on the store_id hash) and partitioned by sale_date. Even though the data is already distributed evenly across all the segments, if your query only reads 7 days of data, distribution alone will not give you good performance. You also need to partition the table on a partition key (sale_date); partitioning eliminates reading the unwanted blocks of data and reads only those 7 days.
These two things together, partitioning and distribution across an MPP system, let you divide a big problem into small problems and make your queries run faster.
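A sketch of DDL along those lines, using the classic Greenplum partition syntax; column names and date ranges are illustrative:

CREATE TABLE sales (
    sale_id   bigint,
    store_id  int,
    sale_date date,
    amount    numeric(12,2)
)
DISTRIBUTED BY (store_id)
PARTITION BY RANGE (sale_date)
(
    START (date '2024-01-01') INCLUSIVE
    END   (date '2025-01-01') EXCLUSIVE
    EVERY (INTERVAL '1 month')
);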
Rules of Partitioning:
- Do not partition small tables. Partitioning is for your larger tables and won’t help much for smaller tables.
- Do not use default partitions.
- Prefer range partitioning over list partitioning.
- Avoid multi-level partitioning. Create fewer partitions, with more data in each partition.
- Never partition and distribute tables on the same column.
- Make sure your queries selectively scan the partitioned tables (you can check this by examining the query’s EXPLAIN plan, as sketched after this list).
- To achieve partition elimination, your query predicate (WHERE clause) must match the partitioning criteria.
- Do not create too many partitions with column-oriented storage, because it increases the number of physical files on every segment. The math works like this: Number of Physical Files = Number of Segments * Columns * Partitions.
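To put the file-count math in perspective with made-up numbers: 64 segments, a 100-column column-oriented table, and 36 monthly partitions would yield roughly 64 * 100 * 36 = 230,400 physical files for that one table. To verify partition elimination, run EXPLAIN with a predicate on the partition key and confirm that only the matching partitions appear in the plan. A hypothetical example against the sales table sketched above:

EXPLAIN
SELECT store_id, sum(amount)
FROM   sales
WHERE  sale_date >= date '2024-06-01'
AND    sale_date <  date '2024-06-08'
GROUP  BY store_id;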
If you want to know more about table storage models in Greenplum, please read my other post titled “Storage Comes At a Price”.
Step 3 - “Table Compression and Column Orientation”
Greenplum provides a hybrid storage layer, where you can store data in row-oriented or column-oriented tables (and partitions), or a combination of both. For your larger tables, and especially those with many columns, column orientation is a good idea. It is also a good idea to compress these column-oriented tables to reduce disk I/O (essentially another elimination strategy).
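A minimal sketch of a compressed, column-oriented, append-optimized table; the storage options and zlib level are illustrative, so check your Greenplum version's documentation for the supported settings:

CREATE TABLE sales_history (
    sale_id   bigint,
    store_id  int,
    sale_date date,
    amount    numeric(12,2)
)
WITH (appendonly=true, orientation=column, compresstype=zlib, compresslevel=5)
DISTRIBUTED BY (store_id);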
Rules of Compression:
- Use compression on large append-optimized (AO) and partitioned tables to improve disk I/O across the system. It won’t help much for smaller tables.
- The best practice is to set the column compression settings at the level where the data resides (a column-level sketch follows this list).
- New partitions added to a partitioned table do not automatically inherit compression defined at the table level; you must explicitly define compression when you add new partitions.
- Never compress data that is stored on a compressed file system.
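A sketch of setting compression at the column level with the ENCODING clause on a column-oriented table; column names and compression levels are illustrative:

CREATE TABLE clickstream (
    event_time timestamp ENCODING (compresstype=zlib, compresslevel=1),
    user_id    bigint    ENCODING (compresstype=zlib, compresslevel=5),
    payload    text      ENCODING (compresstype=zlib, compresslevel=9)
)
WITH (appendonly=true, orientation=column)
DISTRIBUTED BY (user_id);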
If you want to know more about table compression and column orientation techniques in Greenplum, please read my other post titled “Storage Comes At a Price”.
Step 4 - “Indexes”
Last but not least are indexes. The general recommendation with Greenplum (and shared nothing architectures) is to start with no indexes at all. Put it like this: data distribution in Greenplum is itself a rough form of indexing, since it places specific data on specific segments. So I tell users to use indexes rarely, except to enforce a primary key.
Greenplum supports B-tree, bitmap, and GiST indexes (minimal examples of the first two follow this list).
- B-tree indexes can be unique or non-unique.
- Bitmap indexes are useful when a column has roughly 100 to 100,000 distinct values (bitmap indexes are better suited to columns that are queried rather than updated).
- GiST (Generalized Search Tree) indexes are used to support PostGIS.
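Minimal examples; table and column names are illustrative, and the bitmap example assumes a low-cardinality status column:

CREATE UNIQUE INDEX customer_pk_idx ON customer (customer_id);            -- B-tree (the default)
CREATE INDEX sales_status_bmidx ON sales_history USING bitmap (status);   -- bitmap index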
Rules of Indexes:
- Ideally, indexes are not needed in Greenplum Database.
- Consider an index on a single column of a columnar table for drill-through purposes, and on high-cardinality columns that are required by selective queries.
- Columns that are frequently referenced in WHERE clauses are good candidates for indexes.
- Do not index columns that are frequently updated.
- Create selective B-tree indexes, and do not use bitmap indexes for transactional workloads.
- If you do end up creating an index on a table, drop the index before loading data into the table and re-create it after the load completes (see the load sketch at the end of this section).
- Try not to create indexes on partitioned tables. If indexes are needed, the index columns must be different from the partition columns.
As a general rule of thumb, first test and compare query performance with and without an index, and add the index only if query performance improves.
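A sketch of the drop-before-load pattern from the rules above; index, table, and file names are illustrative, and for large volumes a gpfdist external table is usually preferred over COPY:

DROP INDEX IF EXISTS sales_status_bmidx;

COPY sales_history FROM '/data/sales_2024.csv' CSV;

CREATE INDEX sales_status_bmidx ON sales_history USING bitmap (status);
ANALYZE sales_history;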
“Conclusion”
Tuning Greenplum Database is not about poring over lots of queries. Instead, first check that each table has a proper “distribution key”. Then, for your larger tables (probably 10% or less of your database tables), partition, compress, and make them column oriented. If that is not enough or not appropriate for your environment, evaluate adding a few “indexes”, including bitmap indexes, maybe for one or two tables.
Gone are the days of hunting for bad SQL and creating indexes for everything under the hood. Just load the data and follow these four tuning steps.
References
- Pivotal Greenplum documentation.
- Inspired by Pivotal Open Source Hub.