Optimizing Database Performance with Statistics Updates
To enhance database performance, a range of techniques can be employed, including utilizing indexes, partitioning and sharding data, defining optimal clustering keys, and managing data distribution. Efficient management practices, such as archiving or purging outdated data, optimizing memory and cache, and preloading frequently accessed data into cache through query warming, can all contribute to reducing table sizes and improving response times. Additionally, hardware scaling, whether vertical or horizontal, can address performance at a system level. This article explores one high-impact strategy for immediate performance gains: keeping database statistics up to date.
"Updating statistics" in a database refers to the process of refreshing or recalculating the statistical data that the query optimizer relies on to make efficient query execution plans.
Statistics typically include information on data distribution, number of rows, number of distinct values, and frequency of values within tables and indexes.
By keeping these statistics up-to-date, the database can make more accurate decisions about how to access and join tables, whether to use indexes, and other factors impacting query performance.
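Before looking at individual engines, the core idea can be seen end to end with an embedded database. The sketch below uses SQLite, whose ANALYZE command writes per-index statistics (row count and average rows per distinct key value) into the sqlite_stat1 catalog table; the table and column names are purely illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, country TEXT)")
cur.execute("CREATE INDEX idx_country ON users (country)")

# 90 US rows and 10 CA rows: a skewed distribution the optimizer should know about
cur.executemany("INSERT INTO users (country) VALUES (?)",
                [("US",)] * 90 + [("CA",)] * 10)
conn.commit()

# Refresh optimizer statistics; results land in the sqlite_stat1 catalog table
cur.execute("ANALYZE")

# Each row is (table, index, "total_rows avg_rows_per_distinct_key")
rows = cur.execute("SELECT tbl, idx, stat FROM sqlite_stat1").fetchall()
for tbl, idx, stat in rows:
    print(tbl, idx, stat)
```

With 100 rows across 2 distinct country values, the recorded stat reflects 100 total rows; the optimizer consults this to estimate how selective an index lookup on country will be.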
Here’s a closer look at how updating statistics improves database performance, and how to manage the process well.
Gathering database statistics at the right intervals is critical for maintaining performance without overburdening system resources. Updating statistics too frequently, especially in large and active environments, can consume significant processing power and I/O, and can introduce query plan instability, potentially degrading rather than improving performance. Best practice is to schedule statistics updates after significant data loads or large-scale deletions, so that the optimizer's view accurately reflects substantial changes.
Updating statistics for a table is vital for query optimization, especially with sample data. Automatic updates work well with evenly distributed data, allowing for accurate execution plans. However, for columns with low cardinality or uneven distribution, manual updates can enhance performance.
Below are key insights regarding updating statistics for databases, based on my experience. I will not cover the full syntax for updating statistics, as it varies across database systems and involves numerous options and input parameters. For the specifics, I recommend consulting the official documentation or using AI resources for guidance.
1. PostgreSQL
In PostgreSQL, the ANALYZE command collects statistics about the contents of tables or columns and stores the results in the pg_statistic system catalog. By default, the autovacuum feature handles the automatic analysis of tables.
For large tables, ANALYZE takes a random sample of the data rather than scanning every row, enabling timely analysis even for substantial datasets.
It is essential to note that the statistics gathered are approximate and may vary slightly with each execution of ANALYZE, even if the underlying table data remains unchanged.
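As a brief sketch of a typical PostgreSQL workflow (table and column names are hypothetical), the SET STATISTICS step raises the per-column sampling target, which is useful for columns with skewed distributions:

```sql
-- Analyze one table, then just a single column of it
ANALYZE sales;
ANALYZE sales (customer_id);

-- Raise the per-column sampling target (default 100) before re-analyzing
ALTER TABLE sales ALTER COLUMN customer_id SET STATISTICS 500;
ANALYZE sales (customer_id);
```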
2. Amazon Redshift
Amazon Redshift is a data warehouse solution based on PostgreSQL that allows you to specify the scope of the ANALYZE command in several ways: the entire database, a single table, specific columns of a table, or only the predicate columns used in filters, joins, and aggregations.
When querying a new table that hasn't been analyzed after a data load, Redshift issues a warning. The system continuously monitors performance and automatically performs ANALYZE operations during low workload periods to minimize impact. Automatic analysis is enabled by default, but you can disable it by setting the auto_analyze parameter to false in your cluster's parameter group.
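Those scope options might look like the following sketch (table and column names are hypothetical):

```sql
ANALYZE;                              -- the entire database
ANALYZE sales;                        -- a single table
ANALYZE sales (sold_date, quantity);  -- specific columns
ANALYZE sales PREDICATE COLUMNS;      -- only columns used in filters, joins, or group-bys
```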
3. Greenplum
Greenplum is a columnar, massively parallel processing (MPP) database built on PostgreSQL. Similar to PostgreSQL, ANALYZE in Greenplum collects statistics about tables, columns, and partitions, storing these results in the pg_statistic system table.
For partitioned tables, ANALYZE gathers additional statistics known as HyperLogLog (HLL) statistics on the leaf partitions. These statistics help derive the number of distinct values (NDV) for queries involving partitioned tables.
The analyzedb utility in Greenplum performs incremental and concurrent ANALYZE operations on schemas, tables, columns, and partitions.
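A minimal sketch of partition-aware analysis in Greenplum (the table and the leaf-partition name, which follows Greenplum's generated naming convention, are hypothetical):

```sql
-- Analyze one leaf partition, then refresh only the root-level statistics
ANALYZE sales_1_prt_p2024_01;
ANALYZE ROOTPARTITION sales;
```

At the command line, `analyzedb -d mydb -s public` incrementally analyzes all tables in a schema, skipping tables whose data has not changed since the last run (database and schema names here are assumptions).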
4. MySQL
The ANALYZE TABLE command in MySQL gathers statistics about tables and columns, offering options to analyze with or without histograms and to drop histograms for specific columns, each with unique benefits.
Without Histograms: Analyzing without histograms performs key distribution analysis, which helps MySQL optimize the order of table joins and select appropriate indexes.
With Histograms: The UPDATE HISTOGRAM clause generates and stores histogram statistics for specified columns in the data dictionary, enhancing query optimization.
Dropping Histograms: The DROP HISTOGRAM clause removes histogram statistics from the data dictionary for specific columns.
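The three modes above each map to a variant of ANALYZE TABLE (histograms require MySQL 8.0+; table and column names are hypothetical):

```sql
ANALYZE TABLE orders;                                           -- key distribution analysis
ANALYZE TABLE orders UPDATE HISTOGRAM ON status, region
    WITH 32 BUCKETS;                                            -- build histograms
ANALYZE TABLE orders DROP HISTOGRAM ON status;                  -- remove a histogram
```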
While MySQL automatically updates statistics, using the ANALYZE TABLE command allows for greater control over index statistics updates.
5. MariaDB
MariaDB, a fork of MySQL, employs the same ANALYZE TABLE functionality to analyze and store key distribution statistics for tables. MariaDB also utilizes this stored information to optimize join operations on non-constant conditions. In MariaDB, the ANALYZE TABLE command can be executed on tables, columns, or indexes, similar to MySQL.
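In addition to the MySQL-style form, MariaDB can collect its engine-independent statistics explicitly; a sketch with hypothetical table, column, and index names:

```sql
ANALYZE TABLE orders;                                                       -- key distribution, as in MySQL
ANALYZE TABLE orders PERSISTENT FOR COLUMNS (status) INDEXES (idx_status);  -- engine-independent statistics
```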
6. Microsoft SQL Server
SQL Server's query optimizer automatically updates statistics to maintain optimal query performance. However, there are scenarios where manually updating statistics using the UPDATE STATISTICS command or the sp_updatestats stored procedure can further enhance query performance.
It is important to note that updating statistics can trigger automatic recompilation of query plans. Therefore, frequent updates may not be advisable, as there is a performance trade-off between optimizing query plans and the overhead associated with recompilation.
In SQL Server, statistics updates can be performed on tables, indexes, partitions, and columns. When incremental statistics are enabled on a partitioned table, statistics can be refreshed for individual partitions rather than rescanning the entire table.
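A sketch of the common forms (table, index, and partition identifiers are hypothetical; the partition variant assumes incremental statistics are enabled):

```sql
UPDATE STATISTICS dbo.Orders;                                    -- all statistics on the table
UPDATE STATISTICS dbo.Orders IX_Orders_OrderDate WITH FULLSCAN;  -- one index, full scan
UPDATE STATISTICS dbo.Orders WITH RESAMPLE ON PARTITIONS (3);    -- refresh a single partition
EXEC sp_updatestats;                                             -- sweep the whole database
```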
7. Azure Synapse Analytics
Similar to SQL Server, Azure Synapse Analytics provides mechanisms to update statistics. However, caution is warranted due to the nature of big data environments. Updating statistics using sampled data can be more efficient for certain columns, such as country or city, where data changes infrequently. In most cases, there may be no need to perform updates unless new data for a country or city is introduced. For tables with frequently changing data, such as those containing date columns that are updated daily, regular updates to statistics may be necessary to ensure query performance remains optimal.
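A sketch of this sampled-versus-full trade-off in a Synapse dedicated SQL pool (table, column, and statistics names are hypothetical):

```sql
-- Cheap sampled statistics for a slowly changing column
CREATE STATISTICS stats_city ON dbo.Customers (City) WITH SAMPLE 5 PERCENT;

-- Full refresh for a frequently changing date column
UPDATE STATISTICS dbo.Sales (stats_sold_date) WITH FULLSCAN;
```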
8. SingleStore (formerly MemSQL)
SingleStore uses an automated system, known as Autostats, to gather the necessary statistics for the query optimizer. This system runs automatically, so most users will not need to manually run statistics-gathering commands. However, if you prefer to manage statistics manually, the ANALYZE command is available to gather statistics for tables or columns, or to remove existing statistics as needed.
Autostats operates incrementally or periodically for columnstore tables to keep statistics up-to-date. For rowstore tables, it only performs periodic updates to maintain the statistics, ensuring optimal query performance.
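For the manual path, a minimal sketch (table and column names are hypothetical):

```sql
ANALYZE TABLE orders;                        -- gather table statistics manually
ANALYZE TABLE orders COLUMNS status ENABLE;  -- enable histograms on a column
ANALYZE TABLE orders DROP;                   -- remove collected statistics
```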
9. Snowflake
Snowflake handles all aspects of data storage management automatically, including data organization, file sizing, structure, compression, metadata, and statistics. Unlike traditional databases, Snowflake does not require or provide commands to manually gather or update statistics; all optimization and maintenance tasks are managed by Snowflake's automated systems. This ensures efficient storage and query performance without manual intervention.
10. Sybase
In Sybase, the UPDATE STATISTICS command can be run on tables, columns, indexes, and partitions to keep distribution data on key values accurate for the query optimizer. When a nonclustered index is created on an existing table with data, Sybase automatically updates statistics for the new index. For clustered indexes, statistics are updated for all indexes on the table. It’s recommended to run UPDATE STATISTICS on system tables with many rows to avoid performance issues in system stored procedures. This ensures more efficient query planning by the optimizer, resulting in better performance.
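A sketch of the common Sybase ASE forms (table, column, and index names are hypothetical):

```sql
update statistics orders                 -- leading columns of the table's indexes
update statistics orders (order_date)    -- a specific column
update index statistics orders           -- all columns of all indexes on the table
```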
11. MongoDB
In MongoDB, gathering and updating statistics is integral to the query optimization process, allowing the query planner to make informed decisions on index usage, sorting methods, and data retrieval paths. Unlike traditional relational databases, MongoDB does not have a dedicated “update statistics” command; instead, it continuously collects and updates query execution stats and index usage information as queries run. This automatic collection aids the optimizer in adjusting plans based on workload patterns.
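Although there is no statistics command to run, you can observe what the server has collected; a mongosh sketch with a hypothetical collection name:

```javascript
// Per-index usage counters maintained automatically by the server
db.orders.aggregate([{ $indexStats: {} }])

// Inspect the plan the optimizer actually selected for a query
db.orders.find({ status: "shipped" }).explain("executionStats")
```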
12. Oracle
Oracle optimizes query execution by gathering statistics at the database, schema, table, or index level using procedures like DBMS_STATS.GATHER_DATABASE_STATS, GATHER_SCHEMA_STATS, and GATHER_TABLE_STATS. The CASCADE option extends this to dependent indexes for comprehensive data insight. Users can specify sampling rates, like estimate_percent, to balance accuracy with performance.
For dynamic environments, Oracle supports Automatic Optimizer Statistics Collection, automatically updating stats on tables with frequent data changes. Histograms, such as frequency and height-balanced types, capture data distribution details to help the optimizer select the best execution path, especially for skewed data. With the INCREMENTAL setting, Oracle also gathers partition statistics incrementally, updating only modified partitions while reusing existing stats for others, ensuring efficient statistics management across partitioned tables.
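A sketch tying these pieces together (the schema and table names are hypothetical):

```sql
BEGIN
  DBMS_STATS.GATHER_TABLE_STATS(
    ownname          => 'SALES_APP',
    tabname          => 'ORDERS',
    estimate_percent => DBMS_STATS.AUTO_SAMPLE_SIZE,  -- let Oracle choose the sample
    cascade          => TRUE);                        -- include dependent indexes
END;
/

-- Let subsequent gathers update only modified partitions
EXEC DBMS_STATS.SET_TABLE_PREFS('SALES_APP', 'ORDERS', 'INCREMENTAL', 'TRUE');
```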
13. IBM Db2
In IBM Db2, the RUNSTATS utility collects and updates statistics on tables, indexes, or columns, aiding in query optimization by providing the optimizer with data on table size, distribution, and index selectivity. Db2 allows for statistics gathering at various levels, such as tables, schemas, and individual columns, with options to specify frequency values (NUM_FREQVALUES) and distribution quantiles (NUM_QUANTILES) for better data distribution insights. Db2 also supports automatic statistics collection, which updates stats for frequently modified tables to maintain performance. Additionally, Db2 offers flexibility for partitioned tables, allowing statistics to be gathered on individual partitions, making it efficient to update only modified sections of large tables.
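A sketch combining these options (schema, table, and column names are hypothetical):

```sql
RUNSTATS ON TABLE sales.orders
  WITH DISTRIBUTION ON COLUMNS (status NUM_FREQVALUES 20 NUM_QUANTILES 50)
  AND DETAILED INDEXES ALL;
```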
In summary, updating database statistics is a critical strategy for enhancing database performance and optimizing query execution. By ensuring that the statistics used by the query optimizer are current, databases can generate more effective execution plans, leading to improved index usage, memory and I/O management, and overall response times.
The importance of keeping statistics up-to-date cannot be overstated, particularly in dynamic environments where data changes frequently. Regularly scheduled updates, especially after significant data modifications, are essential for maintaining optimal performance without overwhelming system resources. Different database systems offer various mechanisms for managing statistics, each with its own best practices and approaches.
By employing effective statistics management techniques and understanding the specific features of your database system, you can significantly enhance query optimization and overall database efficiency, leading to a more responsive and scalable data environment.