CAP Theorem – Demystified
Taher Borsadwala
Blockchain & Digital Assets Platform Products at BNY | FinTech Solutions
My better half decides to have a serious conversation with me. She calls me on cell.
Cell reception at my end is sporadic.
Scenario 1:
I receive her call.
Owing to bad reception though, I am unable to communicate clearly with her.
I move around to get better reception. I probably switch to VOIP calling. I communicate through SMS.
I choose not to respond until I have clearly understood the ask.
Else I let her know that I am unable to hear her.
I respond only once the communication is clear.
I engage. I understand the ask. My responses are active and current.
That’s consistency.
Scenario 2:
I receive her call.
Owing to bad reception though, I am unable to communicate clearly with her.
I think back on prior conversations. I go back to the last ask & assume context.
I give her occasional hmm, huh, etc. I progress the conversation but without current context.
I engage. My responses though are “stale”.
That’s availability.
If we would have been in the same room physically then above challenges would have been void. And that’s the important factor – we are in different locations.
We are “distributed”.
We are dependent on the cell service. We are dependent on the network. We need to tolerate network problems.
That’s partition tolerance (network failure).
Bad cellular network being the constant, I can either be consistent or I can simply be available for the conversation. But not both.
And that’s CAP theorem for you. It states, courtesy Wikipedia: “In theoretical computer science, the CAP theorem, also named Brewer's theorem after computer scientist Eric Brewer, states that it is impossible for a distributed data store to simultaneously provide more than two out of the following three guarantees: [1] consistency, [2] availability, [3] partition tolerance.”.
Above definition is misleading though.
CAP theorem is not about 2 out of 3 but rather 1 out of 2 wherein 3rd is the constant.
Let’s elaborate.
Assume we have a single data store with no replication. All reads and writes happen to that same data store. But then that single data store can potentially become a single point of failure. Also, it does not support scalability.
And so, to increase the fault tolerance, that single data store is now replicated. In addition to replication, sharding is implemented ie. the data is now truly distributed.
Major challenge of such distributed data stores is affecting a write onto all data stores. In case of network failure (partition), the latest write does not make its way to all the data stores. If a client reads from such data-stores that are missing the latest writes/updates then retrieval function has 2 options:
- Ensure that the data is in eventual-consistent state. If unable to ensure then return an error. Return value only if sure. CONSISTENCY.
- Don’t care about data’s latest state. Simply return the last known value. AVAILABILITY.
Network failure in above is the CONSTANT.
This is what CAP theorem means: Distributed systems have to be partition tolerant ie. respect and accommodate network outages – whether such tolerance leads to consistency or availability is use-case dependent.
I used a term above called “eventual consistency”.
Though “Eventual” and “Consistent” sound like an oxymoron, it’s a by-product of distributed systems.
It’s a fact, like real-time or near-real-time. Eventual consistency means that the distributed data stores will eventually become consistent and that too in a sequential manner ie. the sequence of the incoming events will be respected – ordering!
Example:
- All data-stores have the fact “I love thin crust pizza”.
- I update my pizza preference as “I love deep dish pizza”
- Network partition occurs: Data-store-1 is now updated to convey “I love deep dish pizza”
- Network partition occurs: Data-store-2 still believes that I make healthy eating choices “I love thin crust pizza”
- Maybe this is on a social media site, people with access to data-store-2 consider me healthy while the reality is otherwise
- And this is where eventual-consistency plays its part and ensures that data-store-2 is also made aware at some point that my eating habits have gone for a toss
There are structures like conflict-free replicated data types (CRDTs) that facilitate ordering in the eventually consistent state requirement. Above example is an over-simplification. Assume that different data-stores somehow got different values for the same object but were unable to interact with each other at those points. Once eventual consistent state is gained, the ordering becomes important to ensure that the final value of the object is indeed the final intended value by the client.
Back to CAP Theorem. With data intensive and distributed applications on the rise, fueled by growing cloud provisioning, data persistence across multiple storage systems has become a norm.
Because the data stores are distributed, they are victims of network failure and hence need to plan for their partition tolerance strategy: Consistency OR Availability.
Preference is scenario/requirement based. For transactions of financial nature, consistency is key. For social media data sets, availability is key. Again, no clear-cut approach. It’s all dependent on the use-case.
For instance, let’s assume that banks favored availability over consistency.
- I transfer money from my account and log off. Amount debited from my account is persisted to data-store-1
- I log in again and probably hit a different data store this time, say data-store-2
- Owing to network issues, the debit that got affected to data-store-1 never reached data-store-2
- But hey, the bank favors availability and so who are we to argue!?!
- Since availability is the preference, I get to see “stale” account balance, in this case, I see my balance from before the debit on step 1
- I am happy – I run to the ATM to withdraw the tech-induced bonus
- Now whether the ATM connects to data-store-1 or to data-store-2 or data-store-n is up to you – feel free to move this story forward per your preference
Above scenario is just to highlight the difference between consistency and availability. Both are equally important, one more than the other for some…
From an implementation perspective, HBase, MongoDB, Redis support consistency while Cassandra is a highly available database.
Before concluding, let’s review whether CAP theorem highlights a limitation or not.
In a way, yes, it does. But this limitation is not about a choice between consistency or availability. It’s simply a choice between speed/performance or the lack of it. It’s all about improving the response times. Coz eventual-consistency as-it-is guarantees ordered consistency along with availability.
Ordered eventual consistency is not easy to implement – it’s a combination of multiple techniques:
- distributed databases
- replication models
- distributed locking
- consensus algorithms
- uniqueness constraints
I am targeting to cover some of these techniques in my upcoming blogs.
Hope you find this useful. Open to review/feedback.
Thank you!