Tuning Greenplum Database
In this post we will learn how to tune Greenplum. Learning how to tune your database is important, because database access is a core feature of most applications. For at least 80% of the applications we see, simple database performance tuning can speed up the application significantly.
Fortunately, there isn’t a lot of rocket science involved until you get really deep under the hood of database tuning. I’ve been a developer and a DBA using Oracle, SQL Server, and Pivotal Greenplum. By far, “tuning” Greenplum is easier than the other databases I've worked on, and the main reason is its architecture: Greenplum is an MPP, shared nothing system, rather than a traditional OLTP database.
With those other databases, DBAs typically tell users that they write bad SQL, create more and more indexes to handle the variety of queries, and finally ask the application team to redesign the dimensional model to match how users really want to look at the data.
With Greenplum Database (GPDB), there are a few important steps users should follow to design a schema and structure an application for high performance. If you understand and get good at managing these steps, you can run a well-performing Greenplum system (most of them are decided when you create the table).
Step 1 - “Table Distribution”
Greenplum is a shared nothing database where no two nodes share the same data. Data is spread across multiple servers based on a “distribution key” defined on each table. A good key is typically a unique identifier in the table, and it can be a single column or multiple columns (ideally, use a single column that will distribute data evenly across all segments). If you pick a good key, each segment will have roughly the same number of rows; an uneven spread is what Greenplum calls “skew”.
In an MPP shared nothing environment, the overall response time for a query is measured by the completion time of all the segments. The system is only as fast as the slowest segment. If the data is skewed, segments with more data will take more time to complete, so every segment should have an approximately equal number of rows and perform approximately the same amount of processing (good distribution is the key to parallelism).
If you come across poor performance and out-of-memory conditions, it is often because one database segment has significantly more data to process than the others.
Identifying Distribution Keys (example: migrating from Oracle to Greenplum):
- If a table has a primary key in Oracle, consider it as the distribution key in Greenplum.
- If a table in Oracle has no primary key but has a unique key, consider the unique key with the fewest columns as the distribution key in Greenplum.
- Choose random distribution when a table has no candidate column for a distribution key. This is acceptable for small tables, such as dimension tables, but not for large tables such as fact tables (see the DDL sketch after this list).
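A minimal DDL sketch of the two options above; table and column names are illustrative, not from the original post:

-- Explicit single-column distribution key (for example, the former Oracle primary key).
CREATE TABLE customer (
    customer_id bigint,
    name        text,
    city        text
)
DISTRIBUTED BY (customer_id);

-- Random distribution for a small dimension table with no good candidate key.
CREATE TABLE country_dim (
    country_code char(2),
    country_name text
)
DISTRIBUTED RANDOMLY;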
Rules of Distribution:
- Explicitly define a column or random distribution for all tables using the optional clauses DISTRIBUTED BY and DISTRIBUTED RANDOMLY (these specify the distribution policy for a table).
- Do not use DISTRIBUTED RANDOMLY just because it is the easy choice (it forces a redistribute or broadcast motion for any query that joins to the randomly distributed table).
- Do not rely on the default by omitting the distribution clause. The default is a hash distribution policy that uses either the PRIMARY KEY (if the table has one) or the first eligible column of the table as the distribution key. If the table has no eligible column, a random distribution is used.
- Do not distribute on columns that will be used in the WHERE clause of a query, and do not distribute on dates or timestamps.
- Never distribute and partition tables on the same column.
- Achieve local joins, which significantly improve performance, by distributing large tables that are commonly joined together on the same column.
- Validate that data is evenly distributed after the initial load and after incremental loads (a minimal skew check is sketched after this list).
- Finally, ensure that there is no data skew.
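A minimal way to check for skew after a load is to count rows per segment using the hidden gp_segment_id column; the counts should be roughly equal. The table name here is illustrative:

SELECT gp_segment_id, count(*) AS rows_per_segment
FROM   customer
GROUP  BY gp_segment_id
ORDER  BY rows_per_segment DESC;

Recent Greenplum versions also ship skew-related views in the gp_toolkit schema (for example gp_toolkit.gp_skew_coefficients) that report skew per table.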
Step 2 - “Table Partitioning”
A good partitioning strategy reduces the amount of data to be scanned by reading only the data needed to satisfy a query (storing data by partition makes it easy to get good performance from a query that only cares about a subset of the partitions). Each partition is a separate physical file on every segment, so partition only your large tables to reduce disk I/O. Greenplum supports list, range, and multi-level partitioning.
Partitioning is not a substitute for distribution (a good distribution strategy and partitioning together achieve partition elimination and unlock the performance magic). The Greenplum optimizer (ORCA) eliminates partitions based on the SQL that you execute.
Distributions and Partitioning
The picture above shows the combination of distribution and partitioning: the sales table is DISTRIBUTED BY store_id (so the sales rows are divided evenly across all the segments based on the store_id hash) and partitioned by sale_date. Even though the data is already distributed evenly across all the segments, if your query only reads 7 days of data, distribution alone will not give you good performance. You also need to partition the table on a partition key (sale_date); partitioning eliminates reading the unwanted blocks of data and reads only those 7 days.
These two things together, partitioning and distribution across an MPP system, let you divide a big problem into small problems and make your queries run faster.
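A sketch of DDL along those lines, using the classic Greenplum partition syntax; column names and date ranges are illustrative:

CREATE TABLE sales (
    sale_id   bigint,
    store_id  int,
    sale_date date,
    amount    numeric(12,2)
)
DISTRIBUTED BY (store_id)
PARTITION BY RANGE (sale_date)
(
    START (date '2024-01-01') INCLUSIVE
    END   (date '2025-01-01') EXCLUSIVE
    EVERY (INTERVAL '1 month')
);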
Rules of Partitioning:
- Do not partition small tables. Partitioning is for your larger tables and won’t help much for smaller tables.
- Do not use default partitions.
- Prefer range partitioning over list partitioning.
- Avoid multi-level partitioning. Create fewer partitions, with more data in each partition.
- Never partition and distribute tables on the same column.
- Make sure your queries selectively scan the partitioned tables (you can check this by examining the query’s EXPLAIN plan, as sketched after this list).
- To achieve partition elimination, your query predicate (WHERE clause) must match the partitioning criteria.
- Do not create too many partitions with column-oriented storage, because it increases the number of physical files on every segment. The math works like this: Number of Physical Files = Number of Segments * Columns * Partitions.
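To put the file-count math in perspective with made-up numbers: 64 segments, a 100-column column-oriented table, and 36 monthly partitions would yield roughly 64 * 100 * 36 = 230,400 physical files for that one table. To verify partition elimination, run EXPLAIN with a predicate on the partition key and confirm that only the matching partitions appear in the plan. A hypothetical example against the sales table sketched above:

EXPLAIN
SELECT store_id, sum(amount)
FROM   sales
WHERE  sale_date >= date '2024-06-01'
AND    sale_date <  date '2024-06-08'
GROUP  BY store_id;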
If you want to know more about table storage models in Greenplum, please read my other post titled “Storage Comes At a Price”.
Step 3 - “Table Compression and Column Orientation”
Greenplum provides a hybrid storage layer, where you can store data in row-oriented or column-oriented tables (and partitions), or a combination of both. For your larger tables, and especially those with many columns, column orientation is a good idea. It is also a good idea to compress these column-oriented tables to reduce disk I/O (essentially another elimination strategy).
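A minimal sketch of a compressed, column-oriented, append-optimized table; the storage options and zlib level are illustrative, so check your Greenplum version's documentation for the supported settings:

CREATE TABLE sales_history (
    sale_id   bigint,
    store_id  int,
    sale_date date,
    amount    numeric(12,2)
)
WITH (appendonly=true, orientation=column, compresstype=zlib, compresslevel=5)
DISTRIBUTED BY (store_id);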
Rules of Compression:
- Use compression on large append-optimized (AO) and partitioned tables to improve disk I/O across the system. It won’t help much for smaller tables.
- The best practice is to set the column compression settings at the level where the data resides (a column-level sketch follows this list).
- New partitions added to a partitioned table do not automatically inherit compression defined at the table level; you must explicitly define compression when you add new partitions.
- Never compress data that is stored on a compressed file system.
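A sketch of setting compression at the column level with the ENCODING clause on a column-oriented table; column names and compression levels are illustrative:

CREATE TABLE clickstream (
    event_time timestamp ENCODING (compresstype=zlib, compresslevel=1),
    user_id    bigint    ENCODING (compresstype=zlib, compresslevel=5),
    payload    text      ENCODING (compresstype=zlib, compresslevel=9)
)
WITH (appendonly=true, orientation=column)
DISTRIBUTED BY (user_id);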
If you want to know more about table compression and column orientation techniques in Greenplum, please read my other post titled “Storage Comes At a Price”.
Step 4 - “Indexes”
Last but not least are indexes. The general recommendation with Greenplum (and shared nothing architectures) is to start with no indexes at all. Put it like this: data distribution in Greenplum is itself a rough form of indexing, since it places specific data on specific segments. So I tell users to use indexes rarely, except to enforce a primary key.
Greenplum supports B-tree, bitmap, and GiST indexes (minimal examples of the first two follow this list).
- B-tree indexes can be unique or non-unique.
- Bitmap indexes are useful when a column has roughly 100 to 100,000 distinct values (bitmap indexes are better suited to columns that are queried rather than updated).
- GiST (Generalized Search Tree) indexes are used to support PostGIS.
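Minimal examples; table and column names are illustrative, and the bitmap example assumes a low-cardinality status column:

CREATE UNIQUE INDEX customer_pk_idx ON customer (customer_id);            -- B-tree (the default)
CREATE INDEX sales_status_bmidx ON sales_history USING bitmap (status);   -- bitmap index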
Rules of Indexes:
- Ideally, indexes are not needed in Greenplum Database.
- Consider an index on a single column of a columnar table for drill-through purposes, and on high-cardinality columns that are required by selective queries.
- Columns that are frequently referenced in WHERE clauses are good candidates for indexes.
- Do not index columns that are frequently updated.
- Create selective B-tree indexes, and do not use bitmap indexes for transactional workloads.
- If you do end up creating an index on a table, drop the index before loading data into the table and re-create it after the load completes (see the load sketch at the end of this section).
- Try not to create indexes on partitioned tables. If indexes are needed, the index columns must be different from the partition columns.
As a general rule of thumb, first test and compare query performance with and without an index, and add the index only if query performance improves.
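A sketch of the drop-before-load pattern from the rules above; index, table, and file names are illustrative, and for large volumes a gpfdist external table is usually preferred over COPY:

DROP INDEX IF EXISTS sales_status_bmidx;

COPY sales_history FROM '/data/sales_2024.csv' CSV;

CREATE INDEX sales_status_bmidx ON sales_history USING bitmap (status);
ANALYZE sales_history;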
“Conclusion”
Tuning Greenplum Database is not about poring over lots of queries. Instead, first check that each table has a proper “distribution key”. Then, for your larger tables (probably 10% or less of your database tables), partition, compress, and make them column oriented. If that is not enough or not appropriate for your environment, evaluate adding a few “indexes”, including bitmap indexes, maybe for one or two tables.
Gone are the days of hunting for bad SQL and creating indexes for everything under the hood. Just load the data and follow these four tuning steps.
References
- Pivotal Greenplum documentation.
- Inspired by Pivotal Open Source Hub.