EXPLAIN ANALYZE or EXPLAIN (ANALYZE, BUFFERS)
shyam kumar
Infra Ops specialist at Cognizant | EDB certified | PostgreSQL | Azure cloud administration certified | Google cloud certified
The BUFFERS option helps us see how much I/O work Postgres did when executing each node in the query execution plan. For database systems, which mostly perform I/O-intensive operations, dealing with too many data pages (or "buffers", "blocks" – depending on the context) is a common reason for poor performance.
Below, we will walk through several examples of EXPLAIN (ANALYZE) plans and discuss why it is essential to use the BUFFERS option when troubleshooting the performance of a particular query.
Let's consider a simple example involving two tables, each with 2 bigint columns: id (sequential) and num (random). Both tables have exactly the same content. The second table has an index on the num column, while the first table does not.
We will create these tables and compare the performance of a query with and without the index using the BUFFERS option in the EXPLAIN (ANALYZE) statement.
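For reference, here is a minimal setup sketch. The names t1, t2, and i_t2_num, the bigint columns, and the ten-million-row count all come from the text below; the exact distribution of num is an assumption, chosen so that num > 10000 matches most rows, as the plans later suggest:

-- t1: no index on num; t2: identical content plus an index on num
create table t1 as
  select id::bigint, (random() * 100000)::bigint as num  -- distribution is an assumption
  from generate_series(1, 10000000) as id;

create table t2 as select * from t1;

create index i_t2_num on t2 using btree(num);

vacuum analyze t1, t2;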
Now let's imagine we need to get 1000 rows with num > 10000, ordered by num. Let's compare plans for both tables:
(Here and below examples show the execution plans with warmed-up caches – in other words, 2nd or subsequent execution; we'll discuss the topic of cache state at the end of the article.)
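For reference, the two statements being compared look like this:

explain analyze select * from t1 where num > 10000 order by num limit 1000;

explain analyze select * from t2 where num > 10000 order by num limit 1000;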
Finding the target rows in t1 is ~150x slower – 296.194 ms vs. 1.935 ms – because it doesn't have a proper index. What would the BUFFERS option tell us? Let's check using EXPLAIN (ANALYZE, BUFFERS):
(Note that execution time is noticeably higher when we use BUFFERS – getting both buffer and timing numbers doesn't come for free, but a discussion of that overhead is beyond the scope of this article.)
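The same queries, now with the BUFFERS option:

explain (analyze, buffers) select * from t1 where num > 10000 order by num limit 1000;

explain (analyze, buffers) select * from t2 where num > 10000 order by num limit 1000;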
For t1, we had 54173 buffers hit in the buffer pool. Each buffer is normally 8 KiB, so this gives us 8 * 54173 / 1024 = ~423 MiB of data. Quite a lot for reading just 1000 rows from a two-column table.
For t2, we have 1003 hits, or ~7.8 MiB.
One might say that the structure of the plans and the access methods used (Parallel Seq Scan vs. Index Scan), as well as Rows Removed by Filter: 333161 in the first plan (a strong signal of inefficiency!), are good enough to understand the difference and make proper decisions. For such trivial cases, yes, I agree. Below, we'll explore more sophisticated examples where the BUFFERS option shows its strength. For now, let me just note that knowing the buffer numbers is also helpful because it lets us quantify the difference between, say, sequential and index scans.
Let's do some math. Our tables have two 8-byte columns, plus a 23-byte header for each tuple (padded to 24 bytes) – this gives us 40 bytes for each tuple (in other words, for each row version). Ten million rows should take 40 * 10000000 / 1024 / 1024 = ~381 MiB if we ignore extra data such as page headers, per-row item pointers (4 bytes each), and the visibility map. Indeed, the table size is ~422 MB:
test=# \dt+ t2
                                  List of relations
 Schema | Name | Type  | Owner | Persistence | Access method |  Size  | Description
--------+------+-------+-------+-------------+---------------+--------+-------------
 nik    | t2   | table | nik   | permanent   | heap          | 422 MB |
(1 row)
As we previously saw, to get our target 1000 rows from t2, we need 1003 buffer hits (~7.8 MiB). Can we do better? Of course. For example, we could include id in the index, tune autovacuum to process the table much more frequently than it would with default settings (to keep the visibility map well-maintained), and benefit from Index Only Scans:
create index i_t2_num_id on t2 using btree(num, id);
vacuum t2;
This would be a great speedup:
test=# explain (analyze, buffers) select * from t2 where num > 10000 order by num limit 1000;
                                                                 QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.43..21.79 rows=1000 width=16) (actual time=0.046..0.377 rows=1000 loops=1)
   Buffers: shared hit=29
   ->  Index Only Scan using i_t2_num_id on t2  (cost=0.43..192896.16 rows=9034613 width=16) (actual time=0.044..0.254 rows=1000 loops=1)
         Index Cond: (num > 10000)
         Heap Fetches: 0
         Buffers: shared hit=29
 Planning Time: 0.211 ms
 Execution Time: 0.479 ms
(8 rows)
– as few as 29 buffer hits, or just 232 KiB of data!
Again, without buffer numbers, we could see the difference anyway: Index Only Scan would explain why execution time went below 1 ms, and Heap Fetches: 0 would be a good signal that we have an up-to-date visibility map, so Postgres didn't need to touch the heap at all.
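A quick way to sanity-check the visibility map state (which determines whether an Index Only Scan can skip heap fetches) is to compare relallvisible with relpages in pg_class – a minimal sketch:

select relpages, relallvisible
from pg_class
where relname = 't2';
-- if relallvisible is close to relpages, most pages are marked all-visible,
-- and Index Only Scans can avoid heap fetches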
Can we do even better with a 1-column index? Yes, but with some physical reorganization of the table. The num values were generated using random(), so when the executor finds index entries during an Index Scan, it then needs to visit many different heap pages. In other words, tuples with the num values we need are stored sparsely in the table. We can reorganize the table using CLUSTER (in production, we would use some non-blocking method – for example, pg_repack could help):
drop index i_t2_num_id; -- not needed anymore: we are now studying Index Scan behavior
cluster t2 using i_t2_num;
Checking the plan once again:
test=# explain (analyze, buffers) select * from t2 where num > 10000 order by num limit 1000;
                                                             QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.43..24.79 rows=1000 width=16) (actual time=0.071..0.395 rows=1000 loops=1)
   Buffers: shared hit=11
   ->  Index Scan using i_t2_num on t2  (cost=0.43..219998.90 rows=9034771 width=16) (actual time=0.068..0.273 rows=1000 loops=1)
         Index Cond: (num > 10000)
         Buffers: shared hit=11
 Planning Time: 0.183 ms
 Execution Time: 0.491 ms
(7 rows)
Just 11 buffer hits, or 88 KiB, to read 1000 rows – and sub-millisecond timing again, now with a regular Index Scan. Would we understand the difference without BUFFERS? Here are both execution plans, before and after CLUSTER, so you can compare them yourself and see whether the cause of the difference is visible without BUFFERS:
Before CLUSTER:

test=# explain analyze select * from t2 where num > 10000 order by num limit 1000;
                                                            QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.43..24.79 rows=1000 width=16) (actual time=0.033..1.867 rows=1000 loops=1)
   ->  Index Scan using i_t2_num on t2  (cost=0.43..219996.68 rows=9034644 width=16) (actual time=0.031..1.787 rows=1000 loops=1)
         Index Cond: (num > 10000)
 Planning Time: 0.114 ms
 Execution Time: 1.935 ms
(5 rows)
After CLUSTER:

test=# explain analyze select * from t2 where num > 10000 order by num limit 1000;
                                                            QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.43..24.79 rows=1000 width=16) (actual time=0.074..0.394 rows=1000 loops=1)
   ->  Index Scan using i_t2_num on t2  (cost=0.43..219998.90 rows=9034771 width=16) (actual time=0.072..0.287 rows=1000 loops=1)
         Index Cond: (num > 10000)
 Planning Time: 0.198 ms
 Execution Time: 0.471 ms
(5 rows)
The ~4x difference in timing (1.935 ms vs. 0.471 ms) is hard to explain here without using BUFFERS or without careful inspection of the database metadata (\d+ t2 would show that the table is clustered using i_t2_num).
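For the record, the same fact can be read from the catalog directly – the indisclustered flag in pg_index marks the index last used by CLUSTER:

select indexrelid::regclass as clustered_index
from pg_index
where indrelid = 't2'::regclass
  and indisclustered;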
There are more cases where using BUFFERS greatly helps us understand what's happening inside, so that we can make good query optimization decisions. Here I want to show another important one – a situation that is not uncommon on production OLTP systems with high rates of data changes.
In one psql session, start a transaction (with a real XID assigned) and leave it open:
test=# begin;
BEGIN
test=*# select txid_current();
 txid_current
--------------
    142719647
(1 row)
In another psql session, let's delete some rows in t2 and run VACUUM to clean up the dead tuples:
test=# delete from t2 where num > 10000 and num < 90000;
DELETE 7998779
test=# vacuum t2;
VACUUM
Now let's see what happened to the execution plan that we just made very fast using CLUSTER (remember, we needed as few as 11 buffer hits, or 88 KiB, to get our 1000 target rows, and execution took less than 1 ms):
test=# explain (analyze, buffers) select * from t2 where num > 10000 order by num limit 1000;
                                                              QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.43..52.28 rows=1000 width=16) (actual time=345.347..345.431 rows=1000 loops=1)
   Buffers: shared hit=50155
   ->  Index Scan using i_t2_num on t2  (cost=0.43..93372.27 rows=1800808 width=16) (actual time=345.345..345.393 rows=1000 loops=1)
         Index Cond: (num > 10000)
         Buffers: shared hit=50155
 Planning Time: 0.222 ms
 Execution Time: 345.481 ms
(7 rows)
It is now ~700x slower (345.481 ms vs. 0.491 ms), and we have 50155 buffer hits, or 50155 * 8 / 1024 = ~392 MiB. And VACUUM doesn't help! To understand why, let's rerun VACUUM, this time with VERBOSE:
test=# vacuum verbose t2;
INFO:  vacuuming "nik.t2"
INFO:  launched 1 parallel vacuum worker for index cleanup (planned: 1)
INFO:  table "t2": found 0 removable, 7999040 nonremovable row versions in 43239 out of 54055 pages
DETAIL:  7998779 dead row versions cannot be removed yet, oldest xmin: 142719647
Skipped 0 pages due to buffer pins, 10816 frozen pages.
CPU: user: 0.25 s, system: 0.01 s, elapsed: 0.27 s.
VACUUM
Here, "7998779 dead row versions cannot be removed yet" means that VACUUM cannot clean up the dead tuples. This is because we have a long-running transaction (remember, we left it open in the other psql session). And this explains why the SELECT needs to read ~392 MiB from the buffer pool and why query execution time exceeds 300 ms.
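Once that transaction ends, VACUUM can finally do its job – a sketch (either COMMIT or ROLLBACK in the first session unblocks the cleanup, after which the buffer numbers for our query return to sane levels):

-- in the first session:
rollback;

-- in the second session:
vacuum t2;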
Again, the BUFFERS option helps us see the problem – without it, we would only know that execution time is very poor for reading 1000 rows via an Index Scan. Only the high buffer-hit count tells us that we are dealing with large data volumes, not with some hardware problem, Postgres bug, or locking issue.
As already mentioned, Postgres query optimization is not a trivial area of engineering. One of the most important components of a good optimization workflow is the environment used for experimenting, execution plan analysis, and verification of optimization ideas.
First of all, a good environment for experiments must have realistic data sets. Postgres settings and table row counts matter (both determine which plan the planner chooses for a query), but the data also has to be up to date, and it should be easy to iterate. For example, refreshing data from production periodically is a good move, but how long do you wait for a refresh, and if you changed the data significantly, how long does it take to run another experiment (in other words, to start from scratch)? If you conduct many experiments, and especially if you are not the only one on your team working on SQL optimization tasks, it is very important to be able to reset the environment quickly and to run experiments independently of one another.
How can we achieve this? The key is the approach to query optimization itself. Many people consider timing the main metric in execution plans (those with execution data – obtained by running EXPLAIN ANALYZE). This is natural – the overall timing is what we aim to reduce when optimizing queries, and it is what EXPLAIN ANALYZE provides by default.
However, timing depends on many factors: the state of the caches (the Postgres buffer pool and the OS page cache), the hardware and its current load, concurrent activity and locking, and the physical layout of the data.
In the past, when performing SQL optimization, I also paid the most attention to the timing numbers. Until I realized:
Timing is volatile. Data volumes are stable.
Execution plans show data volumes in the form of planned rows and actual rows – however, these numbers are "too high-level": they hide the information about how much real IO work Postgres had to do to read or write those rows. The BUFFERS option, in contrast, shows exactly how much IO was done!
Execution may take 1 millisecond on production to select 1000 rows and 1 second in an experimental environment (or vice versa), and we can spend hours trying to understand where the difference comes from.
With BUFFERS, however, if we deal with same- or similar-size databases in both environments, we always work with the same or similar data volumes, giving us the same (or similar) buffer numbers. If we are lucky enough to work with production clones (keeping the same physical layout of data, the same bloat, and so on) when troubleshooting or optimizing SQL queries, we will have the same plan structures and the same (or very close) buffer numbers in both environments – even when timing differs a lot.
Use EXPLAIN (ANALYZE, BUFFERS) always, not just EXPLAIN ANALYZE – so you can see the actual IO work done by Postgres when executing queries.
This gives a better understanding of the data volumes involved. Even better, start translating buffer numbers to bytes – just multiply them by the block size (8 KiB in most cases).
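For example, the block size can be checked per server, and the translation is plain multiplication:

show block_size;  -- typically 8192 (bytes)
-- the 54173 buffer hits from the first example translate to ~423 MiB:
select pg_size_pretty(54173 * current_setting('block_size')::bigint);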
Don't focus on the timing numbers while you're inside the optimization process – it may feel counter-intuitive, but this is what allows you to forget about differences between environments. And this is what allows working with thin clones – look at Database Lab Engine and what other companies do with it.
Finally, when optimizing a query, if you managed to reduce the BUFFERS numbers, it means that to execute this query, Postgres will need fewer buffers in the buffer pool, reducing IO, minimizing contention risks, and leaving more space in the buffer pool for other data. Following this approach may eventually provide a global positive effect on the overall performance of your database.