Understanding Inconsistency with SUDs
This article shows why inconsistency and latency are fundamental when building distributed systems and how PACELC and the CAP theorem help guide us.
PACELC [2] builds on the CAP theorem to consider the non-partitioned communication case in a distributed system. This is referred to as normal operation in the quote below.
"All in all PACELC Theorem gives a designer two modes of operation, with and without partitioning. With partitioning, one must choose between consistency and availability; otherwise, in the case of normal operation, one must choose between latency and consistency." [3]
A Distributed Storage System
We want to provide a data store that is both available 99.999% of the time and can service consistent data reads for up to 100,000 simultaneous clients.
A service that is available 99.999% of the time is not available for 5.26 minutes per year [4].
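As a quick check of that figure, the short calculation below reproduces it. This is just the availability arithmetic, nothing specific to the storage system described later.

```python
# Back-of-the-envelope check: downtime allowed by "five nines" availability.
MINUTES_PER_YEAR = 365.25 * 24 * 60   # roughly 525,960 minutes in an average year

def downtime_minutes(availability: float) -> float:
    """Minutes per year a service may be unavailable at the given availability."""
    return (1.0 - availability) * MINUTES_PER_YEAR

print(f"{downtime_minutes(0.99999):.2f} minutes/year")  # ~5.26
```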
One way to achieve this level of up-time is to buy a single machine that meets our availability requirement with enough resources to service all the clients.
However, imagine our service became popular, growing to 1,000,000 simultaneous clients. We would have to buy a new machine that was ten times as capable, which would be a significant expense and could take a long time to deliver and bring into service.
Such a machine is also a single point of failure. If something adverse happens, there is a risk that the whole machine will not be available. Even though the machine can sustain 99.999% availability, the environment it is running in is susceptible to failure (air conditioning and networking infrastructure fail).
An alternative approach is to build a distributed system from a collection of standard components: CPUs, networks, memory and storage. The unit of failure is smaller so that the entire system is not brought down if a single CPU or disk fails. We then require an algorithm to provide a storage solution across such a distributed system, one that makes the separate storage nodes function as a single system.
The Distributed Storage System
Given the above requirements, we put this design into place:
Each storage node (S1, S2, S3, and so on up to S60) contains a single integer. A writer node updates the integer in all storage nodes. How this works is described below, but first, some important assumptions.
It is assumed that errors do not occur. This is so we can see that the distributed algorithm works. Error handling makes distributed algorithms complex, mostly because a significant number of failure cases can obscure the goal of the code. Building an algorithm to run over a number of networked components assuming that there are no errors is not realistic, as the one thing that characterises a distributed system is partial failure [6].
It is assumed that storage nodes always retain their integer in memory, that such nodes have access to persistent storage, and that a node can return the integer to a storage-read client at any time. These clients are not shown.
It is assumed that the single writer updates infrequently enough to reach consistency before another update is sent. This is so we do not have to distinguish between updates.
All messages are assumed to be sent synchronously, delivered reliably, and in order from node to node.
It is assumed that messages are delivered in a "timely" manner. This means that they are delivered quickly enough that the algorithm does not break.
A mechanism is assumed to be available so the writer (and other nodes) can detect available storage nodes.
It is assumed that 60 storage nodes are always running, so we do not have to handle the case of nodes coming and going.
Each storage node is assumed to handle at least 1,667 simultaneous data read requests and that all other infrastructure can handle this level of traffic.
In a real-world deployment, the error-free assumption cannot be made: processes, operating systems, VMs, and hardware all develop faults. Therefore, any distributed algorithm must be written for its environment. The assumptions above define the environment for SUDs.
The SUDs Algorithm
The writer sends a value to S1, which starts an update at that storage node. The current value is replaced with this new value in memory and local persistent storage. The storage node returns success to the writer.
The writer successfully sends the value to all other storage nodes.
The writer then sends "done" to all the storage nodes, receiving success from each one.
When a new value is received at a storage node, that node temporarily stops serving integer read requests until the node gets the "done" message.
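The sketch below is one way these steps could be written down, assuming the error-free environment described earlier. The names (StorageNode, Writer, ReadNotReady) are illustrative, not part of any real SUDs implementation, and plain in-process method calls stand in for the reliable, ordered, synchronous messages the algorithm assumes.

```python
# A minimal, error-free sketch of the SUDs update protocol described above.
# Illustrative names only; method calls stand in for network messages.

class ReadNotReady(Exception):
    """Raised when a node has started an update but has not yet seen "done"."""

class StorageNode:
    def __init__(self, name: str, value: int = 0):
        self.name = name
        self.value = value          # the single integer held by this node
        self.updating = False       # True between receiving a new value and "done"

    def update(self, new_value: int) -> bool:
        # Replace the value in "memory" and stop serving reads until "done".
        self.value = new_value
        self.updating = True
        return True                 # success back to the writer

    def done(self) -> bool:
        # "done" received: it is now safe to serve the new value.
        self.updating = False
        return True

    def read(self) -> int:
        if self.updating:
            raise ReadNotReady(f"{self.name} is mid-update")
        return self.value

class Writer:
    def __init__(self, nodes):
        self.nodes = nodes

    def write(self, value: int) -> None:
        for node in self.nodes:     # send the value to S1 up to S60, one at a time
            assert node.update(value)
        for node in self.nodes:     # then send "done" to every node
            assert node.done()

if __name__ == "__main__":
    nodes = [StorageNode(f"S{i}") for i in range(1, 61)]
    Writer(nodes).write(1)
    assert all(node.read() == 1 for node in nodes)   # all sixty nodes agree
```

Because the writer only sends "done" after every node has acknowledged the new value, a node that resumes serving reads can be sure its peers will eventually serve the same value.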
SUDs Inconsistency
The storage system becomes inconsistent when the writer sends the first updated value to S1. Let's assume this is a '1'.
If a client reads from S1, it will receive a 1. If that client reads from any other node, it will receive a 0. And there is no way for the client to tell which of these two values is the most recent.
We require that the sixty machines act as if they were one single system. However, data replication makes it possible to receive an inconsistent result.
This is why the storage node temporarily suspends responding to clients that request the integer. An update has started, so the value currently in memory is no longer the latest and cannot be returned.
The Window of Inconsistency
The writer updates S1, S2, and S3 up to S60. Each update takes time: to propagate the message from the writer, perform the update at the storage node, and return success to the writer.
The overall storage system (the set of nodes, S1 to S60) is inconsistent during this time. There is a mixture of old values (0s) and new values (1s) that any client could read.
Closing the Window
The SUDs algorithm starts a change at an S node, the update is performed, and "done" is sent.
As there are no errors in this system, when S1 receives its "done" message, we can be confident that all sixty nodes have the updated value in memory.
After receiving "done", each storage node may start serving the updated value, safe in the knowledge that only this updated value will eventually be served from all other nodes when they receive their "done" message. This is safe as there are no errors in this system. Thus, the sixty nodes will all serve the same value without inconsistency.
Performance Considerations
In this algorithm, storage nodes temporarily stop serving integers after they have been updated until they receive "done", as illustrated below.
If a storage-read client contacts S1 during an update, the client must wait until the first "done" is received in the second window. A client contacting S60 will have to wait for as long as both windows combined.
On average, a storage-read client will have to wait for the window of inconsistency, plus about half of the system delay window until an appropriate "done" message is propagated.
PACELC Latency Tradeoff
The width of both windows is an example of the tradeoff discussed by the ELC part of PACELC, which states that there is a balance to be struck between latency and consistency in normal operation. In the treatment in this article, as there are no errors and, therefore, no network partition, the PAC part does not apply.
While SUDs is updating all storage nodes, a storage read-client must wait until the "done" message is received at that storage node. This is an example of the PACELC latency referred to in [3]. This waiting is latency introduced by the algorithm. There is also network latency due to message propagation.
If there were only three storage nodes, both windows would be smaller, and the latency for a storage-read client would be less. However, availability would be reduced as only three nodes, not sixty, are now available. In addition, each of the three nodes would need to sustain up to 33,334 simultaneous read operations.
If there are fewer storage nodes, the inconsistency is resolved more quickly, but there is less availability and, therefore, reduced tolerance to inevitable storage-node errors.
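The arithmetic behind these per-node figures and window widths can be sketched as below. The 50ms per-node update time is the figure assumed in the worked example later in the article, and the function names are illustrative.

```python
import math

TOTAL_CLIENTS = 100_000          # simultaneous read clients from the requirements
PER_NODE_UPDATE_MS = 50          # assumed writer-to-node round trip (see later example)

def per_node_load(node_count: int) -> int:
    # Read requests each node must sustain if clients are spread evenly.
    return math.ceil(TOTAL_CLIENTS / node_count)

def update_window_ms(node_count: int) -> int:
    # Updates are sent synchronously, one node after another.
    return node_count * PER_NODE_UPDATE_MS

for n in (3, 60):
    print(f"{n} nodes: {per_node_load(n):,} reads/node, "
          f"{update_window_ms(n):,} ms to update all nodes")
# 3 nodes: 33,334 reads/node, 150 ms to update all nodes
# 60 nodes: 1,667 reads/node, 3,000 ms to update all nodes
```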
CAP-based Consistency and Availability Tradeoff
In this unrealistic, error-free version of the world, we can have both eventual consistency and availability. However, imagine if a network partition occurred, placing ten storage nodes in one network and fifty in another.
The writer node sits in one of those two partitions, say the one with ten storage nodes, and will update all 0s to 1s for those ten nodes.
However, we have two separate sub-systems. One has ten storage nodes that all contain 1s. The other sub-system has fifty nodes, all containing 0s, and these fifty values will only change if SUDs can handle network partitioning to safely update all of them when the network partition is resolved.
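A small, purely illustrative simulation of that partitioned state, assuming the writer sits with the ten-node side:

```python
# Illustrative only: the data left behind by the partition described above.
# The writer can only reach the ten nodes on its side of the partition.
all_nodes = {f"S{i}": 0 for i in range(1, 61)}
reachable = [f"S{i}" for i in range(1, 11)]   # the writer's side of the partition

for name in reachable:                        # the update succeeds only here
    all_nodes[name] = 1

ones = sum(1 for v in all_nodes.values() if v == 1)
zeros = sum(1 for v in all_nodes.values() if v == 0)
print(ones, zeros)   # 10 50 -- two sub-systems that now disagree
```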
Living with Inconsistency
SUDs ensures consistency by not serving data during an update, but it is an eventual consistency algorithm under the covers as updated data is gradually propagated across the distributed system.
Not serving data during an update causes a storage-read client to wait.
If there are sixty nodes to keep consistent and the average update round-trip time from the writer to a storage node is 50ms, updating all 60 nodes under our synchrony assumption will take a minimum of 3,000ms (3 seconds).
The total average delay for a storage-read client will be 3,000ms plus about half the time associated with the system delay window, as stated above. Assuming that window is another 3,000ms, this comes to about 4,500ms (4.5s). In computing, that is a long time. That is enough time to say out loud, "Do not go where the path may lead; go instead where there is no path and leave a trail.", Ralph Waldo Emerson [5].
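The same estimate as a short calculation, using the 50ms round trip and the assumed 3,000ms width of the second window stated above:

```python
# Reproducing the delay estimate above.
NODE_COUNT = 60
UPDATE_ROUND_TRIP_MS = 50                      # writer -> node -> writer
DONE_WINDOW_MS = 3_000                         # assumed width of the "done" window

inconsistency_window_ms = NODE_COUNT * UPDATE_ROUND_TRIP_MS      # 3,000 ms
average_wait_ms = inconsistency_window_ms + DONE_WINDOW_MS / 2   # 4,500 ms
print(f"average storage-read client wait: {average_wait_ms / 1000:.1f} s")  # 4.5 s
```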
SUDs could be made more efficient by doing away with the "done" message and focussing on single updates. However, using "done" enables a number of updates to be performed before the node is informed that no more updates will be forthcoming in that group.
The storage-read clients rely on SUDs to provide data for their purposes. Depending on the nature of what the storage-read nodes are doing, they may be able to tolerate a temporary inconsistency. If they can, the 4.5s delay does not have to be paid, but the storage-read client must be able to tolerate being served an out-of-date integer.
If the storage-read clients used SUDs to serve the current time of day, inconsistency is unlikely to be tolerable, especially as time might appear to go backwards. But if SUDs served a personnel record where some fields may be incomplete, the storage-read client might be more able to tolerate this inconsistency. In short, inconsistent data can be tolerated if the application consuming that data can live with it. Fundamentally, this is an application-level issue.
Time might appear to go backwards because a storage-read client reads S1 and is served 12:00, the current value across all sixty nodes. Sixty seconds later, the storage-read client reads S1 and is served 12:01, and 4s later, reads S60 and is served 12:00. This is because the new time of 12:01 has yet to reach S60. In short, you would use something other than SUDs, as described in this article, to serve the time.
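Written out as a trivial sketch, with the node values hard-coded to the scenario above:

```python
# Illustrative only: the stale read described above, with values hard-coded.
s1 = "12:01"    # S1 has already received the new time
s60 = "12:00"   # S60 has not seen the update yet

for node, value in (("S1", s1), ("S60", s60)):
    print(node, value)   # 12:01 followed by 12:00 -- time appears to go backwards
```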
Inconsistency is Central to Distributed Systems
We have achieved high availability in the architecture above by providing multiple storage nodes. Because we have multiple nodes, SUDs must replicate data, and the algorithm does so in a way that prevents data inconsistency from being seen at a storage-read client. Such a client will only see the latest data and may have to wait for consistency across all storage nodes.
The more storage nodes you have, the better your availability. Put another way, a high availability requirement implies that data must be replicated across the multiple nodes that provide that degree of availability.
As soon as a distributed system replicates data, a tradeoff between consistency and latency arises.
The consistency problem is fundamental because high availability demands data replication. Latency is also fundamental; it cannot be reduced to zero because message propagation takes time.
If the system you are using lets your code observe inconsistent data rather than making your code wait, you might have a hard-to-find bug. If you can test with inconsistency configured to 'off', that might be a good place to start.
Resources
[6] Distributed computing, Wikipedia. Partial failure is referred to as independent failure.