PGSQL Phriday #010: My Go-To Solution for Postgres Issues - pgBadger
Welcome to another installment of the 'PGSQL Phriday' challenge! Today, we'll be diving deep into the fantastic world of pgBadger, a powerful tool that has transformed my approach to Postgres troubleshooting.
My introduction to pgBadger happened quite casually at my very first IT job, which was at a bank in Poland. Every time one of our Postgres instances threw a tantrum (which, by the way, you could totally tell by the flurry of nervous activity in our massive open office), we'd whip up a pgBadger report using the logs from the troubled instance. This handy tool quickly became one of my go-to problem-solvers, and I've carried it with me through every job and company since.
This was particularly true when I embarked on my consulting journey, working with businesses using both on-prem servers and operating in cloud environments. Among the initial things I'd request were:
1. Statistics snapshots, e.g. psql -Atc "SELECT now(), * FROM pg_stat_bgwriter" >> pg_stat_bgwriter.csv, and the same for pg_stat_user_tables, pg_stat_activity, etc. (see the sketch after this list).
2. OS-level metrics, such as CPU, memory, IOPS usage, and more.
3. Basic instance data: version, SKU (in cloud-speak, a simple count of CPUs, memory, and storage), database and object sizes, and parameter settings (I typically use the output of SHOW ALL, since I'm too lazy to recalculate units).
4. And of course, the pgBadger report itself.
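For the statistics snapshots in point 1, a simple loop run from cron or a screen session is usually enough. Here's a minimal sketch, assuming default connection settings; the interval, file names, and list of views are placeholders you'd adapt:

# snapshot the cumulative statistics views once a minute
while true; do
  psql -Atc "SELECT now(), * FROM pg_stat_bgwriter" >> pg_stat_bgwriter.csv
  psql -Atc "SELECT now(), * FROM pg_stat_user_tables" >> pg_stat_user_tables.csv
  psql -Atc "SELECT now(), * FROM pg_stat_activity" >> pg_stat_activity.csv
  sleep 60
done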
Armed with these four components, I can efficiently troubleshoot 90% of the cases that come my way.
Diving into pgBadger
This blog post is specifically about pgBadger, so let's dive into the sections of the report that have proven most helpful in a majority of my cases, starting with the ones that pique my interest the most.
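If you've never generated a report before, the basic invocation is refreshingly simple. A minimal example, assuming your log files live under a Debian-style path and use a format pgBadger can auto-detect (both are placeholders here):

# parse the Postgres logs and produce an HTML report, using 4 parallel jobs
pgbadger -j 4 -o pgbadger_report.html /var/log/postgresql/postgresql-*.log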
VACUUMS Tab
As the name indicates, this tab contains information about autovacuum actions, both ANALYZE and VACUUM. (Manually executed VACUUMs can be found in the "Top" tab.)
My usual approach is to zoom straight into the "Vacuums per table" chart and spot any unusual behavior. If a single table dominates the pie chart (as shown in the picture below), I proceed to the "Tuples removed per table" section and switch to the table view.
From the figures in the table, I can quickly calculate the average number of tuples (rows) removed in each round. For instance, in the case of the nightly.public.gpu table, it would be 33452 divided by 67, equating to approximately 500 tuples per run. Not too shabby, right?
However, it's unfortunate to often see instances where a single table has undergone thousands of VACUUM operations per day, with only a few rows being removed each time, while other tables are left untouched.
In such situations, it's essential to inspect the size and the transactions performed on this heavily vacuumed table. More often than not, you'll find it's a relatively small table that sees frequent updates, leading autovacuum to prioritize it. But given its small size - typically just a few MBs - there's no real necessity for such frequent vacuuming.
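For that kind of inspection, pg_stat_user_tables plus a size function usually tells the whole story. A quick query along these lines (gpu simply stands in for whatever table dominates your report):

-- how big is the table, how much write traffic does it see, and how often is it vacuumed?
SELECT relname,
       pg_size_pretty(pg_total_relation_size(relid)) AS total_size,
       n_tup_upd, n_tup_del, n_dead_tup,
       vacuum_count, autovacuum_count
FROM pg_stat_user_tables
WHERE relname = 'gpu';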
At this point, you can tweak the autovacuum VACUUM settings specifically for this table as follows:
ALTER TABLE gpu SET (autovacuum_vacuum_scale_factor = 0);
ALTER TABLE gpu SET (autovacuum_vacuum_threshold = 10000);
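If you want to double-check that the overrides took effect, the per-table settings end up in pg_class:

-- per-table autovacuum overrides are stored as storage options
SELECT relname, reloptions FROM pg_class WHERE relname = 'gpu';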
With the scale factor at 0 and the threshold at 10000, autovacuum will now kick in only after roughly 10000 dead tuples have accumulated, regardless of the table's size. Having made these changes, you then take a step back for about a week and revisit the charts to see if the table is getting the break it deserves. Also, be on the lookout for any other tables that might have risen to the top of the list.
Temp Files Tab
Just as before, we'd scroll down to the "Queries generating the most temporary files (N)" list, paying particular attention to the average sizes at the top. Suppose you come across something like this in a sample report:
If most of the queries are generating temporary files around 12MB in size, it's a no-brainer to adjust your work_mem to approximately 20MB (just enough to be slightly larger than the temp file size). I should caution you though: remember to consider your total memory and the number of queries that can simultaneously utilize work_mem. That said, in this day and age, 20MB per operation shouldn't pose much of an issue, right? (Of course, when setting work_mem to higher values, a bit more caution is advised).
There's a quick way to verify whether 20MB will do the trick. Simply set the value in your session first:
SET work_mem = '20MB';
Then, run the top query with EXPLAIN ANALYZE. If the plan reports "Sort Method: quicksort" with a memory figure (rather than "external merge" writing to disk), you're good to go!
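Once the plan confirms the sort fits in memory, you can persist the setting. On a self-managed instance, something along these lines does it (on managed cloud services you'd change the parameter through the provider's console instead):

-- make the new default visible to all future sessions and reload the configuration
ALTER SYSTEM SET work_mem = '20MB';
SELECT pg_reload_conf();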
"Top" Tab
The "Top" tab, of course! Simply by glancing at the "Time consuming queries (N)" table, you can resolve a good deal of customer issues. I won't dive into the nitty-gritty of query tuning here, as this blog post is already long enough, and there are plenty of excellent resources available on the topic. However, this section more or less serves as a visual counterpart to the pg_stat_statements extension, wouldn't you agree?
"Events" Tab
This is where you want to be vigilant about any errors, fatal messages, or warnings that your instance might be producing. Here, you can quickly view these issues in a cleanly aggregated format, saving you from trawling through often massive text files. This tab also gives you visibility into locks, application issues, and even nefarious activities like attempted breaches.
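Keep in mind that this tab (like the rest of the report) can only show what the instance actually logs. Before collecting logs for pgBadger, I usually make sure a few parameters are enabled; a sketch, with values you'd tune to your own log volume:

# postgresql.conf settings that feed the vacuum, temp file, lock, and error sections
log_min_duration_statement = 1000   # log statements running longer than 1 second
log_autovacuum_min_duration = 0     # log every autovacuum run
log_temp_files = 0                  # log every temporary file created
log_lock_waits = on                 # log waits longer than deadlock_timeout
log_checkpoints = on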
And with that, I'd like to extend a heartfelt thanks to Gilles Darold for creating such an extraordinary tool. pgBadger has been instrumental in helping both me and my clients resolve numerous outages and fine-tune our databases.