How does Postgres persist to disk? What is WAL all about?
Introduction
Hello folks! in this quick article i’m going to be talking about how a database like Postgres actually persists to disk and what happens behind the scenes. What is WAL all about and what does it even stand for? Let’s dive in.
When it comes to I/O operations (in our case writing to disk). Optimizations become a must because I/O operations are very expensive especially writing to disk due to a lot of factors. Postgres tackles this problem in a clever manner.
When a write request is made to the database, You’d think that it would persist it to disk right away, but that’s not what actually happens behind the scenes. Introducing Postgres Shared Buffers
Shared Buffers
When a write is made it isn’t instantly flushed to disk, Postgres actually loads the page getting impacted by the write to memory and adjusts it in memory.
Postgres saves rows in pages and whenever we update a certain row we find the page it’s in, load it to memory and update the page with the new record respectively. Same for writes whether its a new page or an existing one that’s not yet full.
This is called a dirty page where it needs to be written back to disk later. Later on Postgres relies on background processes like the background writer or checkpointsto write dirty pages from shared buffers to disk asynchronously. (We’ll get into that)
But the thing is this approach hugely minimizes I/O by batching the flush operation to disk and not have it going on for each and every write.
On the other hand, when a read request is made Postgres first checks the shared buffers for the pages being requested (whether dirty or not). If found it then proceeds to serve it from there, if not it loads the page from disk and adds it to the shared buffers cache. If the shared buffer is full a victim dirty page will be written to disk and the newly read page will be replaced by it.
领英推荐
Now the question that arises is what happens if you lose power half-way through writing to the data files? Let’s say some write was made to memory and during flushing it to disk a power cut happened causing that data to be inconsistent and weird. The client still thinks he made that write when In reality it never persisted to disk completely. This can cause a lot of data corruption and integrity loss. Hence introducing WAL
WAL (Write Ahead Log)
WAL or Write Ahead Log is a mechanism to ensure the consistency and safety of data. It is a technique where every change to the database is logged before it is applied to the actual data files. This log acts as a journal that records all modifications. It is an append only log that logs everything a user writes to the database while simultaneously updating the data pages in shared buffer as mentioned before. Any newly written data can remain in the shared buffer as long as we have a log that tracked the change. In case anything goes wrong we can reconstruct the state from the log.
The WAL is much smaller than the actual data files, and so we can flush them relatively faster. They are also sequential unlike data pages which are random. Disks love sequential writes, which is another benefit of WAL.
Every WAL entry is first written to a WAL buffer in memory. Then when a certain trigger comes this buffer gets flushed to disk. The trigger can either be when the buffer reaches a certain size or when a certain period passes. These are all configurable from Postgres’s side. Once flushed to disk these entries can be announced committed.
The database can crash after writing WAL entries (before flushing shared buffer to disk), that is fine, as long we know the transaction state belonging to each WAL entry we can discard or omit uncommitted WAL entries upon recovery (to ensure data consistency).
For example if you are in the middle of a transaction and the database crashed, we consider the transaction rolled-back by default. I will do another article explaining how WAL actually writes transactions
When all the data files have been flushed and updated to reflect the information on the WAL This is something called checkpointing. Once this happens a checkpoint recordin the Write-Ahead Log (WAL) is recorded, marking the point up to which all changes have been flushed to disk.
In the event of a crash, the crash recovery procedure looks at the latest checkpoint record to determine the point in the WAL (known as the redo record) from which it should start the REDO operation.