Decoding Postgres: A Deep Dive into Database Internals for Performance Mastery! ??
??Prasenjit Sutradhar
Software Engineer (Backend & DevOps) ?? | Java, Spring Boot, Node.js, Kafka, Microservices, AWS ??? | Building Early-Stage Startups & SaaS Products ??
I am really excited about this section because we're going to start to dive into some really advanced stuff and this is where we really get into the world of Postgres as opposed to just release by itself. So in this article, we're going to be talking about performance.
So a couple of quick notes on performance in general. I could give you some quick tips on performance. I could tell you, Oh yeah, do X, Y, Z, do this, do that, give you like a little checklist to go through and that might work to some degree. But at the end of the day, if you want to really have some good performance inside your database, you have to understand what Postgres is actually doing.
Therefore,
So this section is not really about performance so much as it is about the internals of Postgres itself.
We are going to be extremely deep in detail. We're going to go as deep as looking at how information is stored by the database on the hard drive in terms of binary code. So like the actual ones and zeros. So we're going to go to a very, very deep, very low level.
??Where Does Postgres Store Data?
We're going to take a look at where Postgres stores information on your hard disk. To do so, we're going to first open up our query tool in pgAdmin and we're going to run a very small command here.
We make use of the show keyword anytime that we want to pull one individual configuration option out of our Postgres database. I'm going to run that and then I'll see the result is a path.
So this is the folder or the directory on my personal computer where Postgres is currently installed and running.
Now opening up this directory inside of a Window Explorer and taking a look at all the different files and folders that make up the Postgres database. So here's that directory that was just printed out inside of PgAdmin.
Once you get here, you're going to see a lot of different kind of mysterious-looking files and folders. Now if you open up that base folder and then inside here we'll see a collection of different folders, each with a number assigned to them.
Each of these different folders contains all the data for a different database running on our local machine. We can understand or figure out what each of these different identifiers or numbers relates to by running another query back inside of Pgadmin.
That will then list out the names of all the different databases inside of our Postgres installation and an internal identifier that is being used for each one.
If you carefully pay attention, then you’ll notice that the OIDs are the same as the filename of those folders in the base directory.
For example, the oid: 16424 is pointing to a database named Instagram, so let’s open it and see what it has inside.
Inside of here, I'm going to immediately see a ton of different files. So all the files you see right here, these files represent the raw data inside of our database. And when we look at this, it kind of looks like there are too many files for them to represent the individual tables that we have. And it looks like there are too few for it to be like maybe one file for every row and every table.
??So what do these files actually mean? What is stored inside of each one?
Well, we can print up what each of these different files is being used for by running another query back inside of Pgadmin.
then we're going to see a really big listing come up here.
So each of these different rows contains information about one of those different files that we're taking a look at. Each file, in turn, represents one individual object inside our database. The objects we have inside of our database are not just tables. We also have objects to refer to things that are called indexes or sequences, primary keys, and a couple of other things as well.
Right now, let's just try to find out where how our "users" table is actually stored. So I can take a look at the relname column. I'm going to find a row called Users.
The internal identifier that is being used for the “users” table is 16426. So we can go back over to that directory we were just taking a look at and find a file inside of here with an ID of 16426
So that file, in theory, contains all of the data for my entire “users” table.
Similarly, we can also take a look at another table say, “posts”.
Looks like that one has an identifier of 16438. By the way, these identifiers could be different on your system, so don’t worry.
That file contains all the data for our “posts” table.
Okay. So at this point in time, we've established that inside of your hard disk somewhere, Postgres is storing all the information tied to one individual database inside of one folder. So that's the folder that we're looking at right now and inside of that directory are a ton of different files. each of these different files represents one different thing inside the database.
An example of an individual thing is a table. so we identified two files that store all the data related to our “users” table and our “posts” table.
领英推荐
???Understand Heaps, Blocks, and Tuples
I have now identified file 16426 as being the file that contains all the information about the “users” table. So we're now going to start to investigate how information is actually stored inside that file and get a better idea of how it eventually gets accessed as well.
To start to investigate that file, We do have to understand a couple of different pieces of terminology.
So our first piece of terminology is heap or heap file.
?What is a heap file?
This is a file that contains all the data related to one particular table. So in our case, or I should say in my case in particular, (your ID right here might be slightly different), file 16426 is a heap file. It contains all the information related to my “users” table. Similarly, file 16438 is also a heap file. It contains all the information related to my “posts” table.
Now, if you've ever studied data structures before, you might be familiar with a data structure called a heap. A heap data structure is very different than a heap file, so we don't really want to be thinking about a heap data structure at all when we're thinking about these heap file things. it's not really relevant in this case.
Our next term is tuple or item.
?What is a tuple?
So tuple, item, they are all referring to the same thing. These refer to individual rows from one particular table.
Then finally we've got a block or a page.
?What is a Block or Page?
Again, block or page referring to the exact same thing. A single heap file is divided into many different blocks or pages and then inside each block or page, we have many different tuples/items (i.e. rows).
Now because these different terms have a lot of different kinds of like alternate ways of referring to them. I just want to show you a quick little diagram to help you understand how these different terms are really related together.
All right. So here is a diagram of file: 16426 (i.e. file represents the “users” table). The entire file itself is referred to as the heap or heap file. The file is subdivided into many different blocks or pages, and then each block or page stores some number of tuples (.i.e. rows of that table).
Each block or page can store no tuple or can store many different items. By default, each of these blocks or pages is 8KB in size.
So we might have one block right here at first. Regardless of how many items/tuples are inside of it, it is always 8KB large. The next one, regardless of how many items are stored inside of it, is always 8KB large.
At this point in time, we kind of have an idea of what a heap file is. It stores some information. It is divided into a block or a page, and each block or page is 8KB large. But why? Why is anything like this?
??Why is each block 8KB large regardless of how many items are inside of it?
??Why do these blocks exist at all?
??Why don't we just directly store some number of users inside of here?
Well, these are some really important questions, and they're going to help us start to narrow down to one of the first big lessons about Postgres performance that we really need to understand. So let’s start answering those questions.
??Block Data Layout
We've now established that inside of a heap file, there is a collection of different blocks. Every block is 8KB large and inside of each of these blocks are our actual “users” table’s rows i.e. the actual data that we're trying to store inside of our table.
Now we're going to go one step deeper and really understand what is going on inside of these blocks. We're going to go to a very fine level of detail. We're really going to eventually understand how this block is physically stored on your hard drive.
So we're going to take a look at another diagram to help you understand a little bit more about that particular block.
So on the left-hand side, we are still looking at file16426. Inside that file, there is a collection of different blocks. Each block is assigned a number starting from 0. So the first block is called Block 0. Then Block 1, Block 2, and so on. As we mentioned just a moment ago, each block is 8KB large. We're going to take a look at block 1 in particular. The block 1 has this kind of structure on the right.
Now, when I say this kind of structure and you are looking at these kinds of boxes on here, you might be a little bit curious, like what do these boxes actually indicate?
Well, when we are thinking about block number one, we are really talking about some information that is being stored on the hard disk of your computer. Your hard disk stores information as binary. So zeros and ones. So as we are looking at this structure right here, we're essentially talking about what different zeros and ones inside this block are actually being used to store what piece of information.
So when you are looking at this diagram right here, we can kind of imagine that behind the scenes, we are looking at some collection of zeros and ones.
Compare this is the binary representation of block 1.
So right now, as you're looking at this diagram, this is kind of a representation of the actual physical data being stored on your hard drive and the purpose of every little 0 and 1 in there.
Okay, so inside the block we have :
So again, when we're looking at this diagram, we're really thinking about the raw zeros and ones and the goal of each section or segment of 0's and 1's.
Software Engineer
1 年??Prasenjit Sutradhar Fascinating insights! What do you think about the impact of database diversity on performance? Your expertise would greatly enrich the discussion. Share your thoughts! ??
Service Consultant at Royal Enfield | Resolving Customer Issues and Enhancing Riding Experiences
1 年Thanks for sharing
STUDENT AT Dr. MGR Educational and Research Institute
1 年Insightful??????
Aspiring Software Developer || GSSOC’24 || MERN || DSA || BOSSCODER
1 年Great share on Postgres internals ??
SDET @IBM | Java | Selenium | DSA | System Design | Dedicated to driving innovation and delivering robust solutions.
1 年Thanks for sharing