How we upgraded the PostgreSQL major version for multiple mission-critical applications

Upgrading Postgres across major versions has always been complex. Recently we upgraded PostgreSQL from version 9.6 to 11.8 for one of our production clusters, which hosts 70+ mission-critical databases.

We faced numerous issues after the upgrade, which left us with an overwhelming but enriching experience and an urge to share it with a wider audience.

We completed our upgrade activities and validated that the cluster looked fine, with all the environment variables set and connectivity verified. We were even well ahead of time, completing the upgrade in an hour, including the post-checks. However, just as we were about to close the maintenance window, we noticed CPU usage and load average spiking.


We were quite surprised, as we had previously upgraded two production clusters seamlessly following the same strategy. We also hadn't noticed such behavior in any other application.

We ran the pg_top command as the postgres user and identified the application responsible (let's call it app1). At first, we thought this was caused by old connections that hadn't been terminated, so we stopped the application services, killed the idle connections to the database, and restarted the application services.

#number of connections for each database
$ : select datname, count(*) from pg_stat_activity group by datname;


#find and close the idle connections
$ : select datname, state, count(*) from pg_stat_activity group by datname, state;
$ : select count(*) from pg_stat_activity where datname = '<database_name>' and state = 'idle';
$ : select pg_terminate_backend(pid) from pg_stat_activity where datname = '<database_name>' and state = 'idle';
After the restart of the application services, the issue still persisted. 

We were running out of time, and the application's downtime kept increasing. To provide temporary relief, we increased the CPUs from 14 to 24. CPU usage dropped for a while but soon climbed back to 100%.

Earlier, when we had stopped the application services, the load average and CPU had returned to normal, which indicated that app1 was the only affected application. We decided to move the application to a non-upgraded cluster.

But then there was another challenge:

Restoring the database from a higher version to a lower version.

Luckily, our upgrade strategy is to run pg_upgrade on the primary node and then take a base backup on the other replicas to bring them up. So we still had the old-version data directory on one of our replica nodes, which eased our way.
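For context, here is a minimal sketch of that strategy; the binary paths, hostnames, the replication user, and $NEW_PGDATA_DIR below are placeholders rather than our actual values:

#on the primary: run pg_upgrade using the new (11.x) binaries, pointing it at the old and new binary and data directories
$ : /usr/pgsql-11/bin/pg_upgrade -b /usr/pgsql-9.6/bin -B /usr/pgsql-11/bin -d $OLD_PGDATA_DIR -D $NEW_PGDATA_DIR

#on each replica: rebuild from the upgraded primary with a fresh base backup
$ : pg_basebackup -h <primary-hostname> -U <replication_user> -D $NEW_PGDATA_DIR -P -R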

#starting postgres using the old data directory and 9.6 version pg_ctl
$ : pg_ctl start -D $OLD_PGDATA_DIR


#take dump using the old pg_dump instance in the replica node and 9.6 version pg_dump
$ : pg_dump -h <slave-node-hostname> -U postgres app1 > app1_dump.sql


#restoring on non-upgraded cluster
#create database 
$ : create database app1
#create necessary roles
$ : create role app1_owner
#restore the dump using 9.6 version psql
$ : psql -h <cluster-hostname> -U postgres -d app1 -f app1_dump.sql

The app1 application came back up, and the load average and CPU returned to normal.

Since we didn't observe this behavior in any other application, we thought the issue was application-specific.

However, 

After 3 days, another application (let's call it app2) reported slowness in its queries. Queries that used to complete in seconds were now taking 3-4 minutes. The application service wasn't down, but users were facing intermittent timeout errors.

We started seeing load average and CPU spikes again.

We realized we had encountered the same issue again. In parallel, we reached out to the other tenants on the cluster to check the performance of their queries. We wanted an efficient solution this time, so we decided to triage the issue rather than take the safe bet of moving app2 to the non-upgraded cluster as well. Besides, the app2 database was 200 GB in size, and dumping and restoring it to another cluster didn't seem a favorable option.
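For reference, here is a quick way to confirm a database's on-disk size before committing to a dump and restore; this is a sketch using the app2 name from this post:

#check the on-disk size of the app2 database
$ : select pg_size_pretty(pg_database_size('app2'));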

The first approach we followed was identifying the table with degraded performance.

  • The application team helped identify the table (let's call it table1) that their slow queries were accessing.
#We ran a vacuum on the table to reclaim storage occupied by dead tuples and make it available for re-use.
postgres=# VACUUM ANALYZE table1;
  • The size of the table reduced from 60 GB to 22 GB, but the issue still persisted.
  • Since we saw a significant drop in the size of the table, we decided to run the same for the whole database.
#running a full vacuum on the app2 database
$ : vacuumdb -f -d app2
  • The size of the database came down from 200 GB to 48 GB, but the issue still persisted.
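As a side note, a couple of catalog queries are handy for this kind of triage; they are a sketch rather than part of our original runbook, with table1 standing in for the actual table name:

#check dead vs. live tuples and the last (auto)vacuum time for the table
$ : select relname, n_live_tup, n_dead_tup, last_vacuum, last_autovacuum from pg_stat_user_tables where relname = 'table1';

#check the on-disk size of the table, including its indexes and TOAST data
$ : select pg_size_pretty(pg_total_relation_size('table1'));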

The second approach we followed was capturing the exact query to get to the root of the issue.

#We monitored pg_stat_activity for the app2 database while accessing the UI, to capture the offending query
$ : select * from pg_stat_activity where datname='app2';
  • We captured a select query that took around 3 minutes to complete, because of which the application UI was timing out.
  • Next, we analyzed the query plan:
$ : explain analyze <query>;
  • This showed that the query was doing a sequential scan where it should ideally have been using an index.
  • We tried forcing index scans for our session by disabling sequential scans:
postgres=# set enable_seqscan=off;
SET
postgres=# show enable_seqscan;
 enable_seqscan
----------------
 off
  • We could not force this parameter for the whole cluster/database, as it would hurt queries that genuinely need a sequential scan.
This approach helped identify the root cause (queries were falling back to sequential scans), but we still needed a proper solution.
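What this pointed to was missing planner statistics: as noted in the summary below, pg_upgrade does not carry optimizer statistics over to the new cluster, so the planner has little to cost an index scan against. A quick way to confirm that (the table name is a placeholder) is to look at pg_stats and the analyze timestamps:

#if this returns no rows, the planner has no column statistics for the table yet
$ : select tablename, attname, n_distinct, null_frac from pg_stats where tablename = 'table1';

#check whether the table has been analyzed at all on the new cluster
$ : select relname, last_analyze, last_autoanalyze from pg_stat_user_tables where relname = 'table1';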

Meanwhile, another database (let's call it app3) also started consuming 99% CPU.

The third approach we followed was debugging the issue at the cluster level.

We started scrutinizing the different configuration files at the cluster level, but everything seemed fine.

Since we had noticed that queries were doing sequential scans and that vacuuming the databases was reclaiming space, we felt we were close to resolving the issue. To improve query planning and clean up the space occupied by dead tuples at the cluster level, we ran the following command:

$ : vacuumdb --all --analyze-in-stages

Performing this step will generate minimal optimizer statistics rapidly so your system is usable, and then gather statistics twice more with increasing accuracy. When it is done, your system will have the default level of optimizer statistics.
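Under the hood, based on our reading of vacuumdb (treat this as an approximate sketch rather than the exact commands it issues), each stage is roughly equivalent to running ANALYZE in every database with a progressively higher default_statistics_target:

#stage 1: minimal statistics so the planner becomes usable quickly
$ : SET default_statistics_target = 1; ANALYZE;
#stage 2: medium-accuracy statistics
$ : SET default_statistics_target = 10; ANALYZE;
#stage 3: full, default-level statistics
$ : RESET default_statistics_target; ANALYZE;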

The command ran for 4 hours, and the end result:

$ : top 
load average: 1.57, 1.41, 1.55
The load average dropped, the queries were optimized to use index scans again, performance improved, and the issue was resolved.

Summary

Every table must be analyzed after a major upgrade. Whether you upgrade via pg_dump, pg_upgrade, or some other means, the pg_statistic table is not copied over and the database starts as a clean slate. Running ANALYZE is thus the first important post-upgrade step.
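A simple follow-up check (a sketch, not something from our original checklist) is to look for tables that still have no statistics on the new cluster:

#tables with no manual or automatic ANALYZE recorded since the upgrade
$ : select schemaname, relname from pg_stat_user_tables where last_analyze is null and last_autoanalyze is null;

As far as we know, pg_upgrade for these versions also generates an analyze_new_cluster.sh helper script that wraps vacuumdb --all --analyze-in-stages, which makes it easy to run this as the very first post-upgrade step.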

For more info: importance of analyze

Star this step in your upgrade strategies for a seamless and efficient upgrade :)


The journey to solve this issue will always remain a memorable one. At every step, it seemed difficult to get out of a situation that was getting worse with each passing hour. Times like these call for perseverance, well-thought-out decisions, and a strong team. Our project team came together while troubleshooting, which helped us incorporate diverse approaches in less time.

Moving the other databases to a non-upgraded cluster would have been the safe bet: fast, but not efficient. Instead, we faced the problem and solved it, which made all the difference.

In the end, we came away with a reliable and efficient upgrade plan for the upgrades that followed.


