[Postgres] How to speed up bulk load

I post a new PostgreSQL "howto" article every day. Join me in this journey – subscribe here or on X, provide feedback, share!

If you need to load a lot of data, here are some tips to help you do it faster.

1) COPY

Use COPY to load data; it's optimized for bulk loading.
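
For example, a minimal invocation loading a CSV file (table and file names here are hypothetical):

    COPY orders (id, created_at, amount)
    FROM '/data/orders.csv'
    WITH (FORMAT csv, HEADER true);

    -- or client-side via psql, reading the file from the client machine:
    -- \copy orders FROM 'orders.csv' WITH (FORMAT csv, HEADER true)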

2) Less frequent checkpoints

Consider increasing max_wal_size and checkpoint_timeout temporarily.

Changing them does not require a restart.

Increased values lead to longer recovery time in case of a failure, but the benefit is that checkpoints occur less often, and therefore:

  1. there is less stress on disk,
  2. less WAL data is written, thanks to a decreased number of full-page writes of the same pages (when the load happens with existing indexes).
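
A sketch of a temporary adjustment (the values are illustrative – pick them based on your workload and disk space):

    -- both can be changed without a restart; requires superuser
    ALTER SYSTEM SET max_wal_size = '100GB';
    ALTER SYSTEM SET checkpoint_timeout = '30min';
    SELECT pg_reload_conf();

    -- after the load, revert:
    ALTER SYSTEM RESET max_wal_size;
    ALTER SYSTEM RESET checkpoint_timeout;
    SELECT pg_reload_conf();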

3) Larger buffer pool

Increase shared_buffers, if you can.
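
Note that, unlike the settings above, changing shared_buffers requires a restart. A sketch (the value is illustrative; ~25% of RAM is a common starting point, not a universal rule):

    ALTER SYSTEM SET shared_buffers = '8GB';
    -- then restart Postgres, e.g.:
    --   pg_ctl restart -D $PGDATA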

4) No (or fewer) indexes

If the load happens into a new table, create indexes after the data load. When loading into an existing table, avoid over-indexing.

Every additional index will significantly slow down the load.
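
For example, for a new table (names are hypothetical):

    CREATE TABLE events (id bigint, payload jsonb);
    COPY events FROM '/data/events.csv' WITH (FORMAT csv);
    -- build indexes only after the data is in:
    CREATE INDEX ON events (id);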

5) No (or fewer) FKs and triggers

Similarly to indexes, foreign key constraints and triggers may significantly slow down data load – consider (re)creating them after the bulk load.

Triggers can be disabled via ALTER TABLE … DISABLE TRIGGER ALL – however, if triggers support some consistency checks, you need to make sure that those checks are not violated (e.g., run additional checks after the data load). FKs are implemented via implicit (system) triggers, and ALTER TABLE … DISABLE TRIGGER ALL disables them too – loading data in this state should be done with care.
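
A sketch of the disable/re-enable dance (table name hypothetical; note that disabling system triggers, which implement FKs, requires superuser, and FK checks are skipped for rows loaded meanwhile):

    ALTER TABLE orders DISABLE TRIGGER ALL;
    COPY orders FROM '/data/orders.csv' WITH (FORMAT csv);
    ALTER TABLE orders ENABLE TRIGGER ALL;
    -- re-enabling does not re-check existing rows, so verify consistency manually,
    -- e.g., with an anti-join against the referenced table.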

6) Avoiding WAL writes

If this is a new table, consider completely avoiding WAL writes during the data load. Two options (both have limitations and require understanding that data can be lost if a crash happens):

  • Use an unlogged table: CREATE UNLOGGED TABLE …. Unlogged tables are not archived, not replicated, and not persistent (though they survive normal restarts). However, converting an unlogged table to a normal one takes time (likely a lot – worth testing), because the data needs to be written to WAL. More about unlogged tables in this Crunchy Data post; also, see this StackOverflow discussion.
  • Use COPY with wal_level = 'minimal'. COPY has to be executed inside the transaction that created the table. In this case, due to wal_level = 'minimal', COPY writes won't go to WAL (as of PG16, this is so only if the table is unpartitioned). Additionally, consider using COPY … WITH (FREEZE) – this approach also provides a benefit: all tuples are frozen right after the data load. Setting wal_level = 'minimal', unfortunately, requires a restart and additional changes (archive_mode = 'off', max_wal_senders = 0). Of course, this method doesn't work well in most production cases, but it can be good for single-server setups. Details of the wal_level = 'minimal' + COPY … WITH (FREEZE) recipe are in this Cybertec post. A sketch of both options follows this list.
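
A minimal sketch of both options (table and file names hypothetical):

    -- Option A: unlogged table
    CREATE UNLOGGED TABLE staging_events (id bigint, payload jsonb);
    COPY staging_events FROM '/data/events.csv' WITH (FORMAT csv);
    -- later, if persistence is needed (this writes the whole table to WAL – can be slow):
    ALTER TABLE staging_events SET LOGGED;

    -- Option B: with wal_level = 'minimal' (plus archive_mode = 'off' and
    -- max_wal_senders = 0, set in postgresql.conf; requires a restart),
    -- create the table and load it in the same transaction:
    BEGIN;
    CREATE TABLE events (id bigint, payload jsonb);
    COPY events FROM '/data/events.csv' WITH (FORMAT csv, FREEZE);
    COMMIT;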

7) Parallelization

Consider parallelization. This may or may not speed up the process, depending on where the single-threaded process is bottlenecked (e.g., if a single-threaded load already saturates disk IO, parallelization won't help). Two options:

  • Partitioned tables and loading into multiple partitions using multiple workers (see Day 20: pg_restore tips).
  • An unpartitioned table and loading in big chunks. Such chunks need to be prepared in advance – e.g., a CSV file split into pieces, or ranges of table data exported using multiple synchronized REPEATABLE READ transactions working with the same snapshot via SET TRANSACTION SNAPSHOT (see Day 8: How to speed up pg_dump). A rough sketch of the chunked approach follows this list.
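
A rough sketch of the chunked approach with GNU split and psql (file and table names hypothetical; assumes the CSV has no header row):

    # split the CSV into 4 roughly equal pieces without breaking lines
    split -n l/4 data.csv chunk_

    # load the chunks in parallel, one psql session per chunk
    for f in chunk_*; do
      psql -c "\copy events FROM '$f' WITH (FORMAT csv)" &
    done
    wait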

If you use TimescaleDB, consider timescaledb-parallel-copy.

Last but not least: after a massive data load, don't forget to run ANALYZE.
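
For example (table name hypothetical):

    -- collect fresh statistics so the planner knows about the new data
    ANALYZE events;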


Marathon progress: ▓░░░░░░░░░░░░░░░░░░░ 8.77%

This series is also available in Markdown format: https://gitlab.com/postgres-ai/postgresql-consulting/postgres-howtos

