登录查看更多内容

Decoding Postgres: A Deep Dive into Database Internals for Performance Mastery! ??

??Prasenjit Sutradhar

Software Engineer (Backend & DevOps) ?? | Java, Spring Boot, Node.js, Kafka, Microservices, AWS ??? | Building Early-Stage Startups & SaaS Products ??

发布日期: 2024年1月8日

I am really excited about this section because we're going to start to dive into some really advanced stuff and this is where we really get into the world of Postgres as opposed to just release by itself. So in this article, we're going to be talking about performance.

So a couple of quick notes on performance in general. I could give you some quick tips on performance. I could tell you, Oh yeah, do X, Y, Z, do this, do that, give you like a little checklist to go through and that might work to some degree. But at the end of the day, if you want to really have some good performance inside your database, you have to understand what Postgres is actually doing.

Therefore,

you have to understand how information is being stored on a hard disk, how it gets accessed, and different tools that we have and how those tools precisely work to really understand the performance of your database.
We're going to take a look at something called indexes. We're going to understand the internals of indexes how they are used, and how they improve data access performance.
then we're going to start to put those two things together to really understand how queries are interpreted by the database and how they actually get executed.

So this section is not really about performance so much as it is about the internals of Postgres itself.

We are going to be extremely deep in detail. We're going to go as deep as looking at how information is stored by the database on the hard drive in terms of binary code. So like the actual ones and zeros. So we're going to go to a very, very deep, very low level.

??Where Does Postgres Store Data?

We're going to take a look at where Postgres stores information on your hard disk. To do so, we're going to first open up our query tool in pgAdmin and we're going to run a very small command here.

We make use of the show keyword anytime that we want to pull one individual configuration option out of our Postgres database. I'm going to run that and then I'll see the result is a path.

So this is the folder or the directory on my personal computer where Postgres is currently installed and running.

Now opening up this directory inside of a Window Explorer and taking a look at all the different files and folders that make up the Postgres database. So here's that directory that was just printed out inside of PgAdmin.

Once you get here, you're going to see a lot of different kind of mysterious-looking files and folders. Now if you open up that base folder and then inside here we'll see a collection of different folders, each with a number assigned to them.

Each of these different folders contains all the data for a different database running on our local machine. We can understand or figure out what each of these different identifiers or numbers relates to by running another query back inside of Pgadmin.

That will then list out the names of all the different databases inside of our Postgres installation and an internal identifier that is being used for each one.

If you carefully pay attention, then you’ll notice that the OIDs are the same as the filename of those folders in the base directory.

For example, the oid: 16424 is pointing to a database named Instagram, so let’s open it and see what it has inside.

Inside of here, I'm going to immediately see a ton of different files. So all the files you see right here, these files represent the raw data inside of our database. And when we look at this, it kind of looks like there are too many files for them to represent the individual tables that we have. And it looks like there are too few for it to be like maybe one file for every row and every table.

??So what do these files actually mean? What is stored inside of each one?

Well, we can print up what each of these different files is being used for by running another query back inside of Pgadmin.

then we're going to see a really big listing come up here.

So each of these different rows contains information about one of those different files that we're taking a look at. Each file, in turn, represents one individual object inside our database. The objects we have inside of our database are not just tables. We also have objects to refer to things that are called indexes or sequences, primary keys, and a couple of other things as well.

Right now, let's just try to find out where how our "users" table is actually stored. So I can take a look at the relname column. I'm going to find a row called Users.

The internal identifier that is being used for the “users” table is 16426. So we can go back over to that directory we were just taking a look at and find a file inside of here with an ID of 16426

So that file, in theory, contains all of the data for my entire “users” table.

Similarly, we can also take a look at another table say, “posts”.

Looks like that one has an identifier of 16438. By the way, these identifiers could be different on your system, so don’t worry.

That file contains all the data for our “posts” table.

Okay. So at this point in time, we've established that inside of your hard disk somewhere, Postgres is storing all the information tied to one individual database inside of one folder. So that's the folder that we're looking at right now and inside of that directory are a ton of different files. each of these different files represents one different thing inside the database.

An example of an individual thing is a table. so we identified two files that store all the data related to our “users” table and our “posts” table.

领英推荐

MySQL vs PostgreSQL: Replication

Vivek Bansal 6 个月前

Database Internals - Write Ahead Log File Structure

Vivek Bansal 1 年前

Mongo VS Postgres

Robert Yackobian 1 年前

???Understand Heaps, Blocks, and Tuples

I have now identified file 16426 as being the file that contains all the information about the “users” table. So we're now going to start to investigate how information is actually stored inside that file and get a better idea of how it eventually gets accessed as well.

To start to investigate that file, We do have to understand a couple of different pieces of terminology.

So our first piece of terminology is heap or heap file.

?What is a heap file?

This is a file that contains all the data related to one particular table. So in our case, or I should say in my case in particular, (your ID right here might be slightly different), file 16426 is a heap file. It contains all the information related to my “users” table. Similarly, file 16438 is also a heap file. It contains all the information related to my “posts” table.

Now, if you've ever studied data structures before, you might be familiar with a data structure called a heap. A heap data structure is very different than a heap file, so we don't really want to be thinking about a heap data structure at all when we're thinking about these heap file things. it's not really relevant in this case.

Our next term is tuple or item.

?What is a tuple?

So tuple, item, they are all referring to the same thing. These refer to individual rows from one particular table.

Then finally we've got a block or a page.

?What is a Block or Page?

Again, block or page referring to the exact same thing. A single heap file is divided into many different blocks or pages and then inside each block or page, we have many different tuples/items (i.e. rows).

Now because these different terms have a lot of different kinds of like alternate ways of referring to them. I just want to show you a quick little diagram to help you understand how these different terms are really related together.

All right. So here is a diagram of file: 16426 (i.e. file represents the “users” table). The entire file itself is referred to as the heap or heap file. The file is subdivided into many different blocks or pages, and then each block or page stores some number of tuples (.i.e. rows of that table).

Each block or page can store no tuple or can store many different items. By default, each of these blocks or pages is 8KB in size.

So we might have one block right here at first. Regardless of how many items/tuples are inside of it, it is always 8KB large. The next one, regardless of how many items are stored inside of it, is always 8KB large.

At this point in time, we kind of have an idea of what a heap file is. It stores some information. It is divided into a block or a page, and each block or page is 8KB large. But why? Why is anything like this?

??Why is each block 8KB large regardless of how many items are inside of it?

??Why do these blocks exist at all?

??Why don't we just directly store some number of users inside of here?

Well, these are some really important questions, and they're going to help us start to narrow down to one of the first big lessons about Postgres performance that we really need to understand. So let’s start answering those questions.

??Block Data Layout

We've now established that inside of a heap file, there is a collection of different blocks. Every block is 8KB large and inside of each of these blocks are our actual “users” table’s rows i.e. the actual data that we're trying to store inside of our table.

Now we're going to go one step deeper and really understand what is going on inside of these blocks. We're going to go to a very fine level of detail. We're really going to eventually understand how this block is physically stored on your hard drive.

So we're going to take a look at another diagram to help you understand a little bit more about that particular block.

So on the left-hand side, we are still looking at file16426. Inside that file, there is a collection of different blocks. Each block is assigned a number starting from 0. So the first block is called Block 0. Then Block 1, Block 2, and so on. As we mentioned just a moment ago, each block is 8KB large. We're going to take a look at block 1 in particular. The block 1 has this kind of structure on the right.

Now, when I say this kind of structure and you are looking at these kinds of boxes on here, you might be a little bit curious, like what do these boxes actually indicate?

Well, when we are thinking about block number one, we are really talking about some information that is being stored on the hard disk of your computer. Your hard disk stores information as binary. So zeros and ones. So as we are looking at this structure right here, we're essentially talking about what different zeros and ones inside this block are actually being used to store what piece of information.

So when you are looking at this diagram right here, we can kind of imagine that behind the scenes, we are looking at some collection of zeros and ones.

Compare this is the binary representation of block 1.

So the starting 1’s & 0’s (blue colored) are dedicated to storing some very particular type of information. for example, information about the block.
Maybe the next couple of 1’s & 0’s (yellow colored) and ones are being dedicated to storing some other kind of information.
Then eventually maybe the 1’s & 0’s at the very bottom are dedicated to storing yet another kind of information.

So right now, as you're looking at this diagram, this is kind of a representation of the actual physical data being stored on your hard drive and the purpose of every little 0 and 1 in there.

Okay, so inside the block we have :

Starting: Some collections of 1’s and 0’s (blue colored) at the very start of the block are made to represent information about the block itself.
Middle (actual data): The next section of 1’s and 0’s (green colored) contains information about the actual data that is stored inside this block. So information about the actual rows. This does not actually contain information for the rows themselves. So in other words, there's not necessarily any IDs or user names or anything like that inside of here. These are just little pieces of information that say where you can find the different rows that are stored inside the block.
Free Space: After that, we've got this big gray area in the middle that is dedicated to free space. By saying free space, I just mean to say there are some collections of 1’s & 0’s that are not actually being used by the block right now. So that is information or free space that can eventually be used to assign some kind of actual user data or whatever else we're trying to store inside of our table.
Bottom: Then finally at the very end is the actual data itself. So at the very end, we've got some collection of zeros and ones down here for information about tuple two perhaps, and then maybe some information about tuple no 1, perhaps.

So again, when we're looking at this diagram, we're really thinking about the raw zeros and ones and the goal of each section or segment of 0's and 1's.

Alex Bardeen

Software Engineer

1 年

??Prasenjit Sutradhar Fascinating insights! What do you think about the impact of database diversity on performance? Your expertise would greatly enrich the discussion. Share your thoughts! ??

1 次回应

Bivek Sutradhar

Service Consultant at Royal Enfield | Resolving Customer Issues and Enhancing Riding Experiences

1 年

Thanks for sharing

1 次回应

Sumit Kumar

STUDENT AT Dr. MGR Educational and Research Institute

1 年

Insightful??????

1 次回应

Aditya Raj

Aspiring Software Developer || GSSOC’24 || MERN || DSA || BOSSCODER

1 年

Great share on Postgres internals ??

1 次回应

Apaar Raj Sharma

1 年

Thanks for sharing

1 次回应

查看更多评论

要查看或添加评论，请登录

??Prasenjit Sutradhar的更多文章

Demystifying Instagram's Database Design

2023年12月14日

Demystifying Instagram's Database Design

???Here's the Plan In the following sections, we embark on a journey to reconstruct the database schema that powers…

8 条评论
SSL, TLS & HTTPS - Explained with Hands-On example with OpenSSL

2023年7月31日

SSL, TLS & HTTPS - Explained with Hands-On example with OpenSSL

??Encrypted Connections with SSL and TLS Before we add any security or authentication, one of the things that we'll be…
Ultimate Guide to NodeJS Performance feat. Clustering, PM2, Worker Threads

2023年7月27日

Ultimate Guide to NodeJS Performance feat. Clustering, PM2, Worker Threads

Clustering is a technique used to horizontally scale a Node.js server on a single machine by spawning child processes…

6 条评论
A Comprehensive guide to Concurrency, Threading, and Event-driven Programming In Node.JS

2023年6月12日

A Comprehensive guide to Concurrency, Threading, and Event-driven Programming In Node.JS

??Comparing Node With PHP and Python As you're learning about the internals of Node, it's worth taking a step back and…

2 条评论
Unraveling NodeJS: Understanding Non-Blocking I/O, Event Loop, and Thread Model in Depth

2023年5月18日

Unraveling NodeJS: Understanding Non-Blocking I/O, Event Loop, and Thread Model in Depth

NodeJS is a popular server-side platform that has gained widespread adoption among developers due to its efficient and…

5 条评论
Uncovering the Depths of NodeJS Internals: From JavaScript to Low-Level C & C++ Implementations

2023年5月15日

Uncovering the Depths of NodeJS Internals: From JavaScript to Low-Level C & C++ Implementations

You may already be familiar with the JavaScript engine and its role in runtimes such as NodeJS and web browsers…

4 条评论
Exploring Node.js: Release Cycles, Versions, REPL, and more

2023年5月8日

Exploring Node.js: Release Cycles, Versions, REPL, and more

Welcome to my Node.js blog series! Today, we'll be exploring some key topics that every Node.
JavaScript Runtime Demystified!

2023年5月4日

JavaScript Runtime Demystified!

Before we dive into understanding runtime, we need to talk about how NodeJS was born. But in order to talk about how…
Module Federation & Micro-Frontends in Webpack 5

2023年3月13日

Module Federation & Micro-Frontends in Webpack 5

In order to understand module federation, you first need to have two standalone applications that can be run…

1 条评论
Code Splitting in Vanilla JS applications

2023年3月9日

Code Splitting in Vanilla JS applications

??Why do we need code splitting? As you might know that Webpack is mostly used in order to take all our modules and put…

See all articles

Decoding Postgres: A Deep Dive into Database Internals for Performance Mastery! ??

??Prasenjit Sutradhar

Software Engineer (Backend & DevOps) ?? | Java, Spring Boot, Node.js, Kafka, Microservices, AWS ??? | Building Early-Stage Startups & SaaS Products ??

??Where Does Postgres Store Data?

领英推荐

???Understand Heaps, Blocks, and Tuples

?What is a heap file?

?What is a tuple?

?What is a Block or Page?

??Block Data Layout

??Prasenjit Sutradhar的更多文章

社区洞察

其他会员也浏览了

Upgrading ReportPortal with backup/restore of Postgres and binary data

From Planning to Performance: Your Ultimate Guide to PostgreSQL Migration

High-Performance PostgreSQL: A Dive Into the Internals

Migrating an On-Premises SQL Server Database to PostgreSQL on Azure: My High-Level Game Plan

A Step-by-Step Guide to Installing Trino for Data Migration

What is going on during optimization in PostgreSQL?

Navigating PostgreSQL and SQL Server: A Journey Through Queries and Error Messages

Distributed PostgreSQL without Sharding constraint for SQL joins

Step-by-Step Guide to Setting Up a PostgreSQL Database and User for Application Development

Understanding the Difference Between SQL and PostgreSQL: Which One Should You Choose for Your Project?

??Where Does Postgres Store Data?

领英推荐

???Understand Heaps, Blocks, and Tuples

?What is a heap file?

?What is a tuple?

?What is a Block or Page?

??Block Data Layout

??Prasenjit Sutradhar的更多文章

Demystifying Instagram's Database Design

SSL, TLS & HTTPS - Explained with Hands-On example with OpenSSL

Ultimate Guide to NodeJS Performance feat. Clustering, PM2, Worker Threads

A Comprehensive guide to Concurrency, Threading, and Event-driven Programming In Node.JS

Unraveling NodeJS: Understanding Non-Blocking I/O, Event Loop, and Thread Model in Depth

Uncovering the Depths of NodeJS Internals: From JavaScript to Low-Level C & C++ Implementations

Exploring Node.js: Release Cycles, Versions, REPL, and more

JavaScript Runtime Demystified!

Module Federation & Micro-Frontends in Webpack 5

Code Splitting in Vanilla JS applications

社区洞察

其他会员也浏览了

Upgrading ReportPortal with backup/restore of Postgres and binary data

From Planning to Performance: Your Ultimate Guide to PostgreSQL Migration

High-Performance PostgreSQL: A Dive Into the Internals

Migrating an On-Premises SQL Server Database to PostgreSQL on Azure: My High-Level Game Plan

A Step-by-Step Guide to Installing Trino for Data Migration

What is going on during optimization in PostgreSQL?

Navigating PostgreSQL and SQL Server: A Journey Through Queries and Error Messages

Distributed PostgreSQL without Sharding constraint for SQL joins

Step-by-Step Guide to Setting Up a PostgreSQL Database and User for Application Development

Understanding the Difference Between SQL and PostgreSQL: Which One Should You Choose for Your Project?