WAL - Distributed Design Patterns

There are many distributed system patterns that are leveraged widely across different distributed systems. In this series, I'll go over some of these design patterns to help you develop your own distributed systems, or better understand the ones you already use.

Strong durability is a property we assume of all our data stores by default. But have you ever wondered how this durability is achieved alongside high performance/throughput requirements, even in the face of server crashes (power failures, OS failures, hardware failures)? The answer is the WAL.

WAL - Write Ahead Log

WAL isn't strictly a distributed design pattern, but it is a widely used pattern prevalent in most distributed systems that guarantee durability.

WAL has been a common theme across traditional RDBMS systems, where it helps provide the atomicity and durability guarantees (the A and D of ACID). All mutations to a table are first written to the WAL (also called the Transaction Log or Binlog) and then applied to the actual data files of the tables asynchronously.

Sample WAL and WALEntry structure:

type WAL struct {
    dir      string   // directory under which WAL files are stored
    file     *os.File // reference to the current WAL file
    metadata []byte   // metadata recorded at the head of each WAL

    decoder *decoder // decoder to decode records
    encoder *encoder // encoder to encode records

    mutex     sync.Mutex // ensures a single writer updates the WAL at a time
    lastIndex uint64     // index of the last entry saved to the WAL
}

type WALEntry struct {
    lsn       uint64 // unique, monotonically increasing identifier for each log entry
    data      []byte // actual WAL entry in bytes
    crc       uint32 // checksum for data integrity validation
    entryType uint32 // type of WAL record ("type" itself is a reserved word in Go)
}


Life Without WAL

You must be wondering: why do we really need a WAL? Why not flush the changes directly to the actual data files instead?

There are two aspects to it:

  1. A write to disk is never really flushed directly. The data goes through various buffers (RAM, buffer cache, disk cache) before truly being flushed to disk sectors. These caches reduce the number of disk writes, which are expensive, and hence improve performance. The downside is that on restarts or crashes, the data in these intermediate caches is lost, impacting the durability of our data. If we bypass the caches, we'd be flushing to disk on every write, which would hurt the performance and throughput of the system.
  2. As mentioned above, disk writes are slow. Among disk writes, sequential writes are much faster than random writes (this applies to SSDs as well). If we are writing records to multiple tables/entities in our data store, chances are we'll be doing more random writes. Also, any write to a data table might require updates to on-disk structures (e.g. records kept sorted by a clustered index), on-disk indexes, and other auxiliary structures as well, directly impacting throughput and performance.

WAL to the Rescue

The WAL is an append-only log that stores each state change to the data store as a command. Now, instead of flushing the data mutations to disk for the different tables/entities, we just flush the command/operation to the WAL: a single sequential disk operation.

A separate asynchronous process can read the operations from the WAL and then apply the mutations to the actual data files on disk, following the normal flow through the different caches. This greatly improves the write throughput of the data store.

Also, in case of failures, there may be changes that were never applied to the data files. However, since the operations are present in the WAL, we can replay them from the WAL to bring the data store back to a consistent state. Thus, the WAL ensures data integrity and reliability while still allowing high write throughput for our data stores.

Real Life Implementation Considerations

1. Flushing WAL Operations to Disk -

As mentioned before, writes to disk might not be flushed directly. Most file-handling libraries offer ways to force the OS to flush changes all the way to the disk sectors (e.g. fsync). While flushing on each write gives you strong durability guarantees, it can cause performance issues in a write-heavy system. This is where a tradeoff must be made: you can use a flush frequency, micro-batches, or both to flush the changes to disk and help performance. Note that there is a risk of losing the unflushed entries here.

2. Corruption Detection -

We need to ensure that any operation flushed to disk is not corrupted (for example, by a torn write during a crash). For this, each WAL record also contains a CRC value, which is validated when the record is read from the WAL to ensure there is no corruption.

3. Operation Duplication -

Since the WAL is an append-only file, client retries after a communication failure can result in duplicate operations being written to the WAL. Hence, whenever we read the WAL to apply changes to the actual data store, we need to ensure that either the duplicates are ignored, or the application of operations to the data store is idempotent.
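
One common way to ignore duplicates is to track the highest LSN applied so far, since LSNs are monotonically increasing. A minimal sketch, with the `applier` type being my own illustrative construction:

```go
package main

import "fmt"

// applier skips any entry whose lsn is not greater than the last one applied,
// so replaying a WAL that contains client retries (duplicate records with the
// same lsn) leaves the data store unchanged.
type applier struct {
	lastApplied uint64
	state       map[string]string
}

// Apply returns true if the entry was applied, false if it was a duplicate.
func (a *applier) Apply(lsn uint64, key, value string) bool {
	if lsn <= a.lastApplied {
		return false // duplicate or already-applied entry: ignore
	}
	a.state[key] = value
	a.lastApplied = lsn
	return true
}

func main() {
	a := &applier{state: map[string]string{}}
	fmt.Println(a.Apply(1, "A", "1")) // true
	fmt.Println(a.Apply(1, "A", "1")) // false: retried write is skipped
	fmt.Println(a.Apply(2, "A", "2")) // true
	fmt.Println(a.state)              // map[A:2]
}
```

The alternative is to make each operation idempotent by construction (e.g. "set A to 2" rather than "increment A"), in which case applying a duplicate is harmless and no LSN bookkeeping is needed.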

Real Life Usage

1) Most databases, including NoSQL databases like Cassandra (where it is called the commit log), use a WAL to guarantee durability.

2) Kafka's storage layer is itself structured as an append-only commit log, essentially the same idea as a WAL.

3) KV stores like RocksDB and LevelDB, and distributed caches like Apache Ignite, also use a WAL.

Summary

To sum it up, a WAL gives us:

1) Better performance and throughput, as we avoid flushing data to multiple files for every change.

2) Recoverability after restarts, as operations can be replayed from the WAL onto the actual data store.

3) The ability to restore to a point-in-time snapshot, since all operations are present in the WAL.

Thank you for reading! I'll be posting more such content on distributed systems & patterns, so please like, share and subscribe to this newsletter for notifications of new posts.

Please comment on the post with your feedback; it will really help me improve! :)

Until next time, keep asking questions & keep learning!
