What happens when databases crash?

This can make an interesting interview question.

Databases store tables and indexes in files on disk and cache them in memory in what is known as the buffer pool. As you create rows, the database writes them to data pages in memory first; those pages are eventually written back to the data files on disk.

Reading a page always checks the buffer pool first. If the page is not there, the database pulls it from disk and places it in the buffer pool.
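The read path above can be sketched in a few lines. This is a minimal illustration, not any real engine's API; the names (`BufferPool`, `read_page`) are made up for the example.

```python
class BufferPool:
    """Toy buffer pool: check memory first, fall back to 'disk' on a miss."""

    def __init__(self, disk):
        self.disk = disk          # page_id -> bytes, stands in for the data files
        self.pages = {}           # in-memory cache of pages
        self.hits = 0
        self.misses = 0

    def read_page(self, page_id):
        if page_id in self.pages:             # hit: serve from memory
            self.hits += 1
        else:                                 # miss: pull from disk, cache it
            self.misses += 1
            self.pages[page_id] = self.disk[page_id]
        return self.pages[page_id]

disk = {1: b"row data"}
pool = BufferPool(disk)
pool.read_page(1)   # miss: loads the page from disk into the pool
pool.read_page(1)   # hit: served straight from memory
```

Real buffer pools add eviction (LRU or clock sweep) and pin counts, but the hit/miss flow is the same.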

There is a problem though: what happens if you lose power halfway through writing the data files? Some pages will have been written while others were not. When the database starts back up, we end up with data loss, or worse, corruption.

Meet the WAL

Database folks quickly realized they needed something to survive crashes and power loss, and that is the WAL (write-ahead log), also known as the redo log.

What if, as we write to table and index data pages, we also create a log entry in the WAL describing those changes? We write the WAL to its own files while updating the data pages in memory. It is OK not to write the actual table and index data files to disk right away; those can stay in memory, because as long as we have the log we can always reconstruct the tables.

The WAL is much smaller than the actual data files, so we can flush it relatively quickly. WAL writes are also sequential, unlike data page writes, which are random. Disks love sequential writes, which is another benefit of the WAL.
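The append-only log described above can be sketched like this. All names and record shapes here (`WAL`, `log_update`, tuples of `("update", txid, page_id, value)`) are illustrative assumptions, not a real engine's format.

```python
class WAL:
    """Toy write-ahead log: an append-only list of records."""

    def __init__(self):
        self.records = []     # append-only, i.e. sequential writes on disk
        self.flushed = 0      # how many records are durably on disk

    def log_update(self, txid, page_id, new_value):
        # describe the change; the data page itself stays dirty in memory
        self.records.append(("update", txid, page_id, new_value))

    def log_commit(self, txid):
        # the commit record is what makes the transaction durable
        self.records.append(("commit", txid))

    def flush(self):
        # stand-in for write() + fsync() of the tail of the log file
        self.flushed = len(self.records)

wal = WAL()
wal.log_update(txid=1, page_id=7, new_value=b"v2")
wal.log_commit(txid=1)
wal.flush()   # two small sequential records hit disk, no data pages touched
```

Because only small log records are appended, the flush is cheap compared to writing the large, randomly scattered data pages.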

The WAL can also be used for all sorts of things, such as replication, backups, and, yes, crash recovery.

What if we crash while writing the WAL?

WAL entries are also written to memory first and flushed later, based on the transaction's commit. That is why the transaction state is critical.

The database can crash after writing WAL entries, and that is fine; as long as we know the transaction state belonging to each WAL entry, we can discard uncommitted WAL entries upon recovery.

For example, if you are in the middle of a transaction and the database crashes, the transaction is considered rolled back by default. WAL entries flushed by this uncommitted transaction are discarded upon recovery.

But if you were able to issue a COMMIT, the WAL commit record made it to disk, and the client got a success, then that transaction is considered committed even if we crash right after. The database remains consistent in that case.
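The rule above boils down to one check during recovery: a transaction counts as committed only if its commit record survived the crash. A minimal sketch, using the same illustrative record tuples as before:

```python
def committed_transactions(flushed_wal):
    """Return the set of txids whose commit record made it to disk."""
    return {rec[1] for rec in flushed_wal if rec[0] == "commit"}

# What actually reached disk before the crash:
flushed_wal = [
    ("update", 1, 7, b"v2"),
    ("commit", 1),               # tx 1 committed: its changes survive
    ("update", 2, 9, b"x"),      # tx 2 never committed: discarded on recovery
]
assert committed_transactions(flushed_wal) == {1}
```

Transaction 2's update record is on disk, but without a commit record it is simply ignored when the database comes back up.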

Redoing the WAL

So we have established that the WAL is the source of truth. We still write the changes to data pages in memory, of course (so ongoing transactions see the latest data), but we delay flushing those data pages to disk because the WAL has already made it to disk. The WAL is ahead of the data pages, hence the name write-ahead log.

Now the data files on disk are out of sync with what is in memory, which is absolutely fine. As long as the database is running, we only read from memory, which has the latest data. The problem is if the database crashes.

As the database starts back up, the data files are out of date; we can't just pull them into memory and let clients read them. The WAL is the source of truth, so we need to REDO the changes in the WAL onto the data files, and during that process nothing is allowed to read (otherwise we get corruption). The more out of sync the data files are with the WAL (many writes happened but the data files were not updated for a long time), the slower the database startup.
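The REDO pass can be sketched as replaying every committed update from the WAL onto the stale on-disk pages before serving any reads. As before, the record shapes and names are illustrative, not a real recovery protocol.

```python
def redo(flushed_wal, disk_pages):
    """Replay committed WAL updates onto the (stale) on-disk pages."""
    committed = {rec[1] for rec in flushed_wal if rec[0] == "commit"}
    for rec in flushed_wal:
        if rec[0] == "update" and rec[1] in committed:
            _, txid, page_id, new_value = rec
            disk_pages[page_id] = new_value   # reapply the change
    return disk_pages

disk_pages = {7: b"v1", 9: b"old"}            # stale data files after the crash
flushed_wal = [
    ("update", 1, 7, b"v2"), ("commit", 1),   # committed: must be redone
    ("update", 2, 9, b"new"),                 # uncommitted: skipped
]
redo(flushed_wal, disk_pages)
# page 7 gets the committed change; page 9 keeps its old value
```

The longer the log of un-flushed changes, the longer this loop runs, which is exactly why startup slows down when data files lag far behind the WAL.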

Checkpointing

The question here is: can I control how often data pages are flushed to disk? The answer is yes, and this is called checkpointing. A new WAL record, the checkpoint record, is written right after all the dirty data pages in memory are flushed to disk.
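A rough sketch of a checkpoint, continuing the same toy structures (everything here is illustrative):

```python
def checkpoint(dirty_pages, disk_pages, wal_records):
    """Flush all dirty pages, then mark the WAL with a checkpoint record."""
    disk_pages.update(dirty_pages)          # heavy, random I/O in a real engine
    dirty_pages.clear()                     # nothing left to lose in memory
    wal_records.append(("checkpoint",))     # recovery can replay from here

dirty_pages = {7: b"v2"}                    # newest version, memory only
disk_pages = {7: b"v1"}                     # stale on-disk copy
wal_records = [("update", 1, 7, b"v2"), ("commit", 1)]
checkpoint(dirty_pages, disk_pages, wal_records)
```

After the checkpoint record, recovery only needs to replay WAL entries written after it, which is what keeps startup fast.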

Checkpointing, however, is an I/O-heavy operation; the data pages can be very large and their writes very random.

Frequent checkpointing does make startup faster, but it takes a toll on your disk I/O bandwidth, which may steal precious time from critical reads and writes.

Some database systems also have undo logs, but that is for another post.

How often should we flush the WAL?

The WAL contains changes from several transactions. When one of those transactions commits, we have to flush the WAL to disk if we want to maintain durability. Of course, this also flushes data from other transactions that may not have committed yet, which, as we discussed, is fine; the transaction state saves us here.

But some databases expose configuration to relax how often the WAL is flushed. One setting controls fsync, which forces the contents of the WAL file out of the OS page cache onto disk. Another is a commit delay, which waits for sibling transactions to commit so that several commits share a single flush.
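The group-commit idea can be sketched with real OS calls: batch several pending records, then pay for one `write` and one `fsync`. The batching policy and names (`flush_wal`, `pending`) are assumptions for the example, not any engine's defaults.

```python
import os
import tempfile

def flush_wal(fd, pending_records):
    """One write + one fsync covers every record batched so far."""
    os.write(fd, b"".join(pending_records))
    os.fsync(fd)              # force the data out of the OS page cache
    pending_records.clear()

fd, path = tempfile.mkstemp()                  # stands in for the WAL file
pending = [b"update tx1\n", b"commit tx1\n", b"commit tx2\n"]
flush_wal(fd, pending)                         # three records, a single fsync
os.close(fd)
```

Since `fsync` is the expensive part, amortizing it across sibling commits trades a little commit latency for much higher throughput.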

Indeed, database engineering is a fascinating world in itself. If you want to learn the fundamentals of database engineering, check out my course at https://databases.win


Hussein Nasser