Understanding Inconsistency with SUDs
This article shows why inconsistency and latency are fundamental when building distributed systems and how PACELC and the CAP theorem help guide us.
PACELC [2] builds on the CAP theorem to consider the non-partitioned communication case in a distributed system. This is referred to as normal operation in the quote below.
"All in all PACELC Theorem gives a designer two modes of operation, with and without partitioning. With partitioning, one must choose between consistency and availability; otherwise, in the case of normal operation, one must choose between latency and consistency." [3]
A Distributed Storage System
We want to provide a data store that is both available 99.999% of the time and can service consistent data reads for up to 100,000 simultaneous clients.
A service that is available 99.999% of the time is not available for 5.26 minutes per year [4].
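As a quick check of that figure, the short calculation below reproduces it. This is just the availability arithmetic, nothing specific to the storage system described later.

```python
# Back-of-the-envelope check: downtime allowed by "five nines" availability.
MINUTES_PER_YEAR = 365.25 * 24 * 60   # roughly 525,960 minutes in an average year

def downtime_minutes(availability: float) -> float:
    """Minutes per year a service may be unavailable at the given availability."""
    return (1.0 - availability) * MINUTES_PER_YEAR

print(f"{downtime_minutes(0.99999):.2f} minutes/year")  # ~5.26
```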
One way to achieve this level of up-time is to buy a single machine that meets our availability requirement with enough resources to service all the clients.
However, imagine our service became popular, growing to 1,000,000 simultaneous clients. We would have to buy a new machine that was ten times as capable, which would be a significant expense and could take a long time to deliver and bring into service.
Such a machine is also a single point of failure. If something adverse happens, there is a risk that the whole machine will not be available. Even though the machine can sustain 99.999% availability, the environment it is running in is susceptible to failure (air conditioning and networking infrastructure fail).
An alternative approach is to build a distributed system from a collection of standard components: CPUs, networks, memory and storage. The unit of failure is smaller so that the entire system is not brought down if a single CPU or disk fails. We then require an algorithm to provide a storage solution across such a distributed system, one that makes the separate storage nodes function as a single system.
The Distributed Storage System
Given the above requirements, we put this design into place:
Each storage node (S1, S2, S3, and so on up to S60) contains a single integer. A writer node updates the integer in all storage nodes. How this works is described below, but first, some important assumptions.
It is assumed that errors do not occur. This is so we can see that the distributed algorithm works. Error handling makes distributed algorithms complex, mostly because a significant number of failure cases can obscure the goal of the code. Building an algorithm to run over a number of networked components assuming that there are no errors is not realistic, as the one thing that characterises a distributed system is partial failure [6].
It is assumed that storage nodes always retain their integer in memory, that such nodes have access to persistent storage, and that a node can return the integer to a storage-read client at any time. These clients are not shown.
It is assumed that the single writer updates infrequently enough to reach consistency before another update is sent. This is so we do not have to distinguish between updates.
All messages are assumed to be sent synchronously, delivered reliably, and in order from node to node.
It is assumed that messages are delivered in a "timely" manner. This means that they are delivered quickly enough that the algorithm does not break.
A mechanism is assumed to be available so the writer (and other nodes) can detect available storage nodes.
It is assumed that 60 storage nodes are always running, so we do not have to handle the case of nodes coming and going.
Each storage node is assumed to handle at least 1,667 simultaneous data read requests and that all other infrastructure can handle this level of traffic.
In a real-world deployment, the error-free assumption cannot be made: processes, operating systems, VMs, and hardware all develop faults. Therefore, any distributed algorithm must be written for its environment. The assumptions above define the environment for SUDs.
The SUDs Algorithm
The writer sends a value to S1, which starts an update at that storage node. The current value is replaced with this new value in memory and local persistent storage. The storage node returns success to the writer.
The writer successfully sends the value to all other storage nodes.
The writer then sends "done" to all the storage nodes, receiving success from each one.
When a new value is received at a storage node, that node temporarily stops serving integer read requests until the node gets the "done" message.
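The sketch below is one way these steps could be written down, assuming the error-free environment described earlier. The names (StorageNode, Writer, ReadNotReady) are illustrative, not part of any real SUDs implementation, and plain in-process method calls stand in for the reliable, ordered, synchronous messages the algorithm assumes.

```python
# A minimal, error-free sketch of the SUDs update protocol described above.
# Illustrative names only; method calls stand in for network messages.

class ReadNotReady(Exception):
    """Raised when a node has started an update but has not yet seen "done"."""

class StorageNode:
    def __init__(self, name: str, value: int = 0):
        self.name = name
        self.value = value          # the single integer held by this node
        self.updating = False       # True between receiving a new value and "done"

    def update(self, new_value: int) -> bool:
        # Replace the value in "memory" and stop serving reads until "done".
        self.value = new_value
        self.updating = True
        return True                 # success back to the writer

    def done(self) -> bool:
        # "done" received: it is now safe to serve the new value.
        self.updating = False
        return True

    def read(self) -> int:
        if self.updating:
            raise ReadNotReady(f"{self.name} is mid-update")
        return self.value

class Writer:
    def __init__(self, nodes):
        self.nodes = nodes

    def write(self, value: int) -> None:
        for node in self.nodes:     # send the value to S1 up to S60, one at a time
            assert node.update(value)
        for node in self.nodes:     # then send "done" to every node
            assert node.done()

if __name__ == "__main__":
    nodes = [StorageNode(f"S{i}") for i in range(1, 61)]
    Writer(nodes).write(1)
    assert all(node.read() == 1 for node in nodes)   # all sixty nodes agree
```

Because the writer only sends "done" after every node has acknowledged the new value, a node that resumes serving reads can be sure its peers will eventually serve the same value.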
SUDs Inconsistency
The storage system becomes inconsistent when the writer sends the first updated value to S1. Let's assume this is a '1'.
If a client reads from S1, it will receive a 1. If that client reads from any other node, it will receive a 0. And there is no way for the client to tell which of these two values is the most recent.
We require that the sixty machines act as if they were one single system. However, data replication makes it possible to receive an inconsistent result.
This is why the storage node temporarily suspends responding to clients that request the integer. An update has started, so the value currently in memory is no longer the latest and cannot be returned.
The Window of Inconsistency
The writer updates S1, S2, and S3 up to S60. Each update takes time: to propagate the message from the writer, perform the update at the storage node, and return success to the writer.
The overall storage system (the set of nodes, S1 to S60) is inconsistent during this time. There is a mixture of old values (0s) and new values (1s) that any client could read.
Closing the Window
The SUDs algorithm starts a change at an S node, the update is performed, and "done" is sent.
As there are no errors in this system, when S1 receives its "done" message, we can be confident that all sixty nodes have the updated value in memory.
After receiving "done", each storage node may start serving the updated value, safe in the knowledge that only this updated value will eventually be served from all other nodes when they receive their "done" message. This is safe as there are no errors in this system. Thus, the sixty nodes will all serve the same value without inconsistency.
Performance Considerations
In this algorithm, storage nodes temporarily stop serving integers after they have been updated until they receive "done", as illustrated below.
If a storage-read client contacts S1 during an update, the client must wait until the first "done" is received in the second window. A client contacting S60 will have to wait for as long as both windows combined.
On average, a storage-read client will have to wait for the window of inconsistency, plus about half of the system delay window until an appropriate "done" message is propagated.
PACELC Latency Tradeoff
The width of both windows is an example of the tradeoff discussed by the ELC part of PACELC, which states that there is a balance to be struck between latency and consistency in normal operation. In the treatment in this article, as there are no errors and, therefore, no network partition, the PAC part does not apply.
While SUDs is updating all storage nodes, a storage read-client must wait until the "done" message is received at that storage node. This is an example of the PACELC latency referred to in [3]. This waiting is latency introduced by the algorithm. There is also network latency due to message propagation.
If there were only three storage nodes, both windows would be smaller, and the latency for a storage-read client would be less. However, availability would be reduced as only three nodes, not sixty, are now available. In addition, each of the three nodes would need to sustain up to 33,334 simultaneous read operations.
If there are fewer storage nodes, the inconsistency is resolved more quickly, but there is less availability and, therefore, reduced tolerance to inevitable storage-node errors.
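The arithmetic behind these per-node figures and window widths can be sketched as below. The 50ms per-node update time is the figure assumed in the worked example later in the article, and the function names are illustrative.

```python
import math

TOTAL_CLIENTS = 100_000          # simultaneous read clients from the requirements
PER_NODE_UPDATE_MS = 50          # assumed writer-to-node round trip (see later example)

def per_node_load(node_count: int) -> int:
    # Read requests each node must sustain if clients are spread evenly.
    return math.ceil(TOTAL_CLIENTS / node_count)

def update_window_ms(node_count: int) -> int:
    # Updates are sent synchronously, one node after another.
    return node_count * PER_NODE_UPDATE_MS

for n in (3, 60):
    print(f"{n} nodes: {per_node_load(n):,} reads/node, "
          f"{update_window_ms(n):,} ms to update all nodes")
# 3 nodes: 33,334 reads/node, 150 ms to update all nodes
# 60 nodes: 1,667 reads/node, 3,000 ms to update all nodes
```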
CAP-based Consistency and Availability Tradeoff
In this unrealistic, error-free version of the world, we can have both eventual consistency and availability. However, imagine if a network partition occurred, placing ten storage nodes in one network and fifty in another.
The writer node sits in one of those two partitions, say the one with ten storage nodes, and will update all 0s to 1s for those ten nodes.
However, we have two separate sub-systems. One has ten storage nodes that all contain 1s. The other sub-system has fifty nodes, all containing 0s, and these fifty values will only change if SUDs can handle network partitioning to safely update all of them when the network partition is resolved.
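A small, purely illustrative simulation of that partitioned state, assuming the writer sits with the ten-node side:

```python
# Illustrative only: the data left behind by the partition described above.
# The writer can only reach the ten nodes on its side of the partition.
all_nodes = {f"S{i}": 0 for i in range(1, 61)}
reachable = [f"S{i}" for i in range(1, 11)]   # the writer's side of the partition

for name in reachable:                        # the update succeeds only here
    all_nodes[name] = 1

ones = sum(1 for v in all_nodes.values() if v == 1)
zeros = sum(1 for v in all_nodes.values() if v == 0)
print(ones, zeros)   # 10 50 -- two sub-systems that now disagree
```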
Living with Inconsistency
SUDs ensures consistency by not serving data during an update, but it is an eventual consistency algorithm under the covers as updated data is gradually propagated across the distributed system.
Not serving data during an update causes a storage-read client to wait.
If there are sixty nodes to keep consistent and the average update round-trip time from the writer to a storage node is 50ms, updating all 60 nodes under our synchrony assumption will take a minimum of 3,000ms (3 seconds).
The total average delay for a storage-read client will be 3,000ms plus about half the time associated with the system delay window, as stated above. Assuming that window is another 3,000ms, this comes to about 4,500ms (4.5s). In computing, that is a long time. That is enough time to say out loud, "Do not go where the path may lead; go instead where there is no path and leave a trail.", Ralph Waldo Emerson [5].
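The same estimate as a short calculation, using the 50ms round trip and the assumed 3,000ms width of the second window stated above:

```python
# Reproducing the delay estimate above.
NODE_COUNT = 60
UPDATE_ROUND_TRIP_MS = 50                      # writer -> node -> writer
DONE_WINDOW_MS = 3_000                         # assumed width of the "done" window

inconsistency_window_ms = NODE_COUNT * UPDATE_ROUND_TRIP_MS      # 3,000 ms
average_wait_ms = inconsistency_window_ms + DONE_WINDOW_MS / 2   # 4,500 ms
print(f"average storage-read client wait: {average_wait_ms / 1000:.1f} s")  # 4.5 s
```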
SUDs could be made more efficient by doing away with the "done" message and focussing on single updates. However, using "done" enables a number of updates to be performed before the node is informed that no more updates will be forthcoming in that group.
The storage-read clients rely on SUDs to provide data for their purposes. Depending on the nature of what the storage-read nodes are doing, they may be able to tolerate a temporary inconsistency. If they can, the 4.5s delay does not have to be paid, but the storage-read client must be able to tolerate being served an out-of-date integer.
If the storage-read clients used SUDs to serve the current time of day, inconsistency is unlikely to be tolerable, especially as time might appear to go backwards. But if SUDs served a personnel record where some fields may be incomplete, the storage-read client might be more able to tolerate this inconsistency. In short, inconsistent data can be tolerated if the application consuming that data can live with it. Fundamentally, this is an application-level issue.
Time might appear to go backwards because a storage-read client reads S1 and is served 12:00, the current value across all sixty nodes. Sixty seconds later, the storage-read client reads S1 and is served 12:01, and 4s later, reads S60 and is served 12:00. This is because the new time of 12:01 has yet to reach S60. In short, you would use something other than SUDs, as described in this article, to serve the time.
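Written out as a trivial sketch, with the node values hard-coded to the scenario above:

```python
# Illustrative only: the stale read described above, with values hard-coded.
s1 = "12:01"    # S1 has already received the new time
s60 = "12:00"   # S60 has not seen the update yet

for node, value in (("S1", s1), ("S60", s60)):
    print(node, value)   # 12:01 followed by 12:00 -- time appears to go backwards
```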
Inconsistency is Central to Distributed Systems
We have achieved high availability in the architecture above by providing multiple storage nodes. Because we have multiple nodes, SUDs must replicate data, and the algorithm does so in a way that prevents data inconsistency from being seen at a storage-read client. Such a client will only see the latest data and may have to wait for consistency across all storage nodes.
The more storage nodes you have, the better your availability. Put another way, a high availability requirement implies that data must be replicated across the multiple nodes that provide that degree of availability.
As soon as a distributed system replicates data, a tradeoff between consistency and latency arises.
The consistency problem is fundamental because high availability demands data replication. Latency is also fundamental; it cannot be reduced to zero because message propagation takes time.
If the system you are using lets your code observe inconsistent data rather than making your code wait, you might have a hard-to-find bug. If you can test with inconsistency configured to 'off', that might be a good place to start.
Resources
[6] Distributed computing, Wikipedia. Partial failure is referred to as independent failure.