How PostgreSQL stores data in files, called forks
Thank you so much for reading this edition of the newsletter. If you found it interesting, you will also love my courses.
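PostgreSQL stores each table and index as a set of physical files in the PGDATA directory. These files are called forks, and PostgreSQL splits the data across multiple forks to manage and optimize different aspects of data storage and retrieval. The three types of forks are the main fork, which holds the actual data; the free space map (FSM) fork, which tracks the free space available in pages; and the visibility map (VM) fork, which tracks pages whose rows are visible to all transactions.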
Each fork starts as a single file. The file grows over time, and when its size reaches 1 GB, another file of the same fork (called a segment) is created, with a sequence number appended to its filename. The 1 GB limit can be changed when building PostgreSQL.
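To make the segment naming concrete, here is a minimal sketch that maps a block number to the file holding it. It assumes the default 8 KB page size and 1 GB segment size; the helper name segment_file and the filenode value 16384 are made up for illustration.

```python
# Assumed defaults: 8 KB pages and 1 GB segments (both are build-time options).
BLOCK_SIZE = 8192
SEGMENT_SIZE = 1024 ** 3
BLOCKS_PER_SEGMENT = SEGMENT_SIZE // BLOCK_SIZE  # 131072 pages per segment

# Suffix PostgreSQL adds to the file name for each fork.
FORK_SUFFIX = {"main": "", "fsm": "_fsm", "vm": "_vm"}


def segment_file(filenode: int, fork: str, block_number: int) -> str:
    """Hypothetical helper: name of the file that holds the given block."""
    segment = block_number // BLOCKS_PER_SEGMENT
    name = f"{filenode}{FORK_SUFFIX[fork]}"
    # The first segment has no numeric suffix; later ones get .1, .2, ...
    return name if segment == 0 else f"{name}.{segment}"


print(segment_file(16384, "main", 0))        # 16384
print(segment_file(16384, "main", 131072))   # 16384.1
print(segment_file(16384, "fsm", 10))        # 16384_fsm
```

So the first page past the 1 GB mark of the main fork lands in a second file with .1 appended, while the FSM and VM forks live in their own files next to it.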
Each row is stored in a data page (8 KB by default, configurable at build time), and the pages of the main fork together make up the complete table. When inserting new data, PostgreSQL first consults the FSM fork to find a page with enough free space. It then writes the new row into that page in the main fork and updates the FSM.
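The insert flow above can be sketched with a toy model. This is only an illustration, not how PostgreSQL implements it: the class ToyTable and its fields are hypothetical, page headers and row overhead are ignored, and the real FSM keeps a more compact, approximate record of free space per page.

```python
PAGE_SIZE = 8192  # assumed default page size in bytes


class ToyTable:
    """Toy model: a 'main fork' of pages plus an 'FSM' of free bytes per page."""

    def __init__(self):
        self.pages = []       # main fork: each page is a list of rows
        self.free_space = []  # FSM: free bytes remaining in each page

    def insert(self, row: bytes) -> int:
        # 1. Consult the FSM for a page that can hold the row.
        for page_no, free in enumerate(self.free_space):
            if free >= len(row):
                break
        else:
            # No existing page fits, so extend the table with an empty page.
            self.pages.append([])
            self.free_space.append(PAGE_SIZE)
            page_no = len(self.pages) - 1
        # 2. Write the row into the chosen page of the main fork.
        self.pages[page_no].append(row)
        # 3. Update the FSM to reflect the space just used.
        self.free_space[page_no] -= len(row)
        return page_no


table = ToyTable()
print(table.insert(b"x" * 5000))  # 0 (first page is created)
print(table.insert(b"y" * 5000))  # 1 (page 0 has only 3192 bytes left)
print(table.insert(b"z" * 3000))  # 0 (fits in the space left on page 0)
```

The third insert shows why the FSM matters: the row goes back into page 0 instead of extending the table.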
Note: In PostgreSQL, the physical order of rows on disk can differ from the logical order defined by the primary key. To physically arrange rows on disk according to the order of an index (such as the primary key), PostgreSQL offers the CLUSTER command. CLUSTER is a one-time operation, so rows changed afterwards are not kept in index order.
Updates are treated as a combination of insert and delete operations. PostgreSQL inserts the new version of the row into the main fork and marks the old version as obsolete. The FSM and VM forks are updated to reflect these changes.
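Here is a rough sketch of an update as "insert the new version, mark the old one obsolete". The names xmin and xmax mirror PostgreSQL's system columns, but the RowVersion class, the update helper, and the transaction IDs are simplified assumptions for illustration only.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RowVersion:
    data: dict
    xmin: int                   # transaction that created this version
    xmax: Optional[int] = None  # transaction that made it obsolete, if any


def update(heap: list, index: int, changes: dict, txid: int) -> int:
    """Hypothetical helper: update a row by appending a new version."""
    old = heap[index]
    old.xmax = txid  # mark the old version as obsolete (the "delete" half)
    # Append the new version to the heap (the "insert" half).
    heap.append(RowVersion({**old.data, **changes}, xmin=txid))
    return len(heap) - 1


heap = [RowVersion({"id": 1, "name": "a"}, xmin=100)]
new_index = update(heap, 0, {"name": "b"}, txid=101)

print(heap[0])          # old version, now with xmax=101
print(heap[new_index])  # new version, created by transaction 101
```

Both versions sit in the heap after the update, which is why the space held by obsolete versions has to be reclaimed later and the FSM and VM kept in sync.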
By the way, being hands-on is the best way for you to learn. Practice interesting programming challenges like building your own BitTorrent client, Redis, DNS server, and even SQLite from scratch on CodeCrafters.
Video I posted this week
This week I posted "How LinkedIn improved their latency by 60%".
LinkedIn recently published a blog post explaining how they reduced the latency of their inter-service communication by 60%. I dissected it and compiled my learnings in a quick video.
Paper I read this week
This week I spent time reading "Serverless Runtime / Database Co-Design With Asynchronous I/O".
It is a research paper that shows a 100x reduction in tail latencies by keeping database I/O asynchronous.
The traditional approach, like using SQLite, leverages synchronous IO which blocks the runtime during database interactions, hurting concurrency and scalability – not ideal for serverless with its multi-tenant nature.
The paper talks about rearchitecting SQLite to be asynchronous, i.e., database interactions no longer block the runtime, which frees it to handle other tasks. As per the paper, this improvement enables low-latency access for latency-sensitive workloads running serverless or on the edge.
You can download this and other papers I recommend from my papershelf.
Redis is written in C, but its test cases are written in Tcl
While going through Redis internals, I looked at the test cases to understand the flow. I was surprised to see that they were not written in C but in Tcl, which makes the entire suite highly readable and extremely simple.
Digging deeper, I found out that Tcl is pretty popular as a language to test network applications; even SQLite uses it in its test suite. Pretty interesting use case for a language created way back in 1988 :)
Interesting articles I read this week
I read a few engineering blogs almost every single day, and here are the three articles I would recommend you read.