Using the CAP Theorem to Analyze Microservices
Navjot Bansal
Building Computer Vision Systems @Oracle | Software Architecture | System Design | ICPC Regionalist
We engineers climb multiple learning curves and take on growing ownership while building software and backend systems. As that ownership expands, we get involved in designing the architecture of our software and its development lifecycle. The architecture we design dictates the customer experience, and small flaws in our design lead to bad experiences and failed products.
The CAP theorem, although a widely discussed and fundamental result in system design, is somehow rarely reckoned with when we design our software. This article aims to change that.
In distributed systems, software is containerized and modularized into separate microservices, decoupled databases, and multiple replica sets that communicate over networks. Networks and backends, just like humans, can break down at any time and need time to recover.
CAP in a nutshell
The CAP theorem, also known as Brewer's theorem, states that "a distributed system can provide only two of three properties simultaneously: consistency, availability, and partition tolerance."
In short, there will always be tradeoffs to make while designing software, prioritizing some functionality over others.
Understanding CAP
In the figure above, we have the following architecture.
Situation: the network link between the master and slave databases breaks right after a PUT operation, and a user then issues a GET for the same key.
We have two options
Option 1 keeps the system up (available) and responding to queries even though it is not in sync (consistent) with the master database, whereas Option 2 responds only when the backend is in sync (consistent), compromising on uptime (availability).
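The two options can be sketched in code. This is a minimal, hypothetical simulation (all class and method names are my own, not from any real database): a primary/replica key-value store whose replication link breaks, with an "AP" read that always answers and a "CP" read that refuses to answer while out of sync.

```python
# Hypothetical sketch: a primary/replica key-value store under a
# network partition, illustrating the AP-vs-CP choice on a read.
class ReplicatedStore:
    def __init__(self):
        self.primary = {}
        self.replica = {}
        self.partitioned = False  # True = replication link is down

    def put(self, key, value):
        self.primary[key] = value
        if not self.partitioned:      # replicate only while the link is up
            self.replica[key] = value

    def get(self, key, mode="AP"):
        if mode == "AP":
            # Option 1: always answer, possibly with stale data
            return self.replica.get(key)
        # Option 2 (CP): refuse to answer unless replicas are in sync
        if self.partitioned:
            raise RuntimeError("unavailable: replica out of sync")
        return self.replica.get(key)

store = ReplicatedStore()
store.put("x", 1)
store.partitioned = True   # the network breaks
store.put("x", 2)          # the replica never sees this write

print(store.get("x", mode="AP"))   # prints 1: available, but stale
try:
    store.get("x", mode="CP")
except RuntimeError as e:
    print(e)                       # consistent, but unavailable
```

Both reads are "correct" by their own definition; the tradeoff is which failure the customer sees.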
Choosing uptime versus consistency makes a major difference to the customer experience. To make the choice, you must understand your customers' primary focus. If you plan to onboard customers with different priorities, i.e. some who choose uptime while others choose consistency, you will have to settle for eventual consistency, and GOOD LUCK with that :)
Consistency, availability, and partition tolerance explained
If you are not convinced that the CAP theorem allows at most 2 of the 3 properties, you should read this.
What to choose
It's a fair assumption to always treat P (partition tolerance) as essential, since microservices rely heavily on networks for day-to-day communication.
Of A (Availability) or C (Consistency), what should we choose?
I will walk through some examples from the tech giant Google: what they abide by, and why they prefer one property over the other.
Google Search
Gmail
Communication tools like Gmail and Facebook typically enforce only eventual consistency. Messages may take a while to travel between servers, so different servers may temporarily have inconsistent views of the world depending on which messages they have seen so far. However, all servers will eventually converge to a consistent state, since the correct state is defined by interpreting the messages in the order of their global timestamps, rather than the order in which they arrived.
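That convergence rule is easy to demonstrate. Below is a minimal sketch (my own toy example, not Gmail's actual mechanism): two servers receive the same messages in different arrival orders, but because the canonical order is defined by timestamps, both converge to the same state.

```python
# Hypothetical sketch of eventual consistency via global timestamps:
# two servers see the same messages in different arrival orders, but
# sorting by timestamp converges them to one canonical conversation.
def converge(messages):
    # Canonical state is defined by timestamp order, not arrival order
    return [m["text"] for m in sorted(messages, key=lambda m: m["ts"])]

server_a = [  # arrival order on server A
    {"ts": 2, "text": "sure, 3pm?"},
    {"ts": 1, "text": "lunch today?"},
]
server_b = [  # same messages, different arrival order on server B
    {"ts": 1, "text": "lunch today?"},
    {"ts": 2, "text": "sure, 3pm?"},
]

# Temporarily inconsistent views, but identical once converged
assert converge(server_a) == converge(server_b)
print(converge(server_a))  # prints ['lunch today?', 'sure, 3pm?']
```

Real systems must also handle identical timestamps and clock drift (usually with tie-breakers like sender IDs or logical clocks), but the principle is the same.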
Google Suite (Docs, Sheets), etc.
I believe Google Docs used to enforce consistency at the cost of availability (if your master was down then your doc wasn't available), but has now also moved to an eventual consistency model that allows concurrent editing and eventually resolves conflicts.
For Devs!
Even multi-billion-dollar companies fall back on the basic experience customers expect when choosing and prioritizing architectural designs.
Distributed systems always have to contend with the many fallacies assumed while designing architectures.
Now that you understand the CAP theorem, I highly suggest you sketch a small diagram of your backend and evaluate whether all parameters of CAP are respected and whether they are in line with end-user expectations.
This exercise will help you uncover loopholes and design flaws in the backend. Small improvements like internal caching, semaphores, and leader election across replica sets might make the service better and reduce the impact network communication has on the overall service.
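One such improvement is quorum-based reads over a replica set, a common way to soften the availability/consistency tradeoff. The sketch below is a simplified illustration under my own assumptions (function and variable names are hypothetical): each replica stores a versioned value, and a read succeeds only if enough replicas answer, returning the highest-versioned value seen.

```python
# Hypothetical sketch: a majority-quorum read over a replica set.
# With N replicas, choosing read size R and write size W so that
# W + R > N guarantees a read overlaps the latest successful write.
def read_quorum(replicas, key, r):
    # Collect (version, value) pairs from replicas that have the key
    responses = [rep[key] for rep in replicas if key in rep][:r]
    if len(responses) < r:
        raise RuntimeError("quorum not reached: too few replicas answered")
    return max(responses)[1]   # the highest version wins

replicas = [
    {"x": (2, "new")},   # up-to-date replica
    {"x": (1, "old")},   # lagging replica
    {"x": (2, "new")},
]
print(read_quorum(replicas, "x", r=2))  # prints "new"
```

With N=3 and R=W=2, one replica can be partitioned away and reads still return the latest value, which is exactly the kind of impact-reduction the paragraph above describes.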
Happy Engineering!
Senior Vibe Coder | AI Therapist | DevOps Engineer
10 months ago
For anyone who doubts, there is a "Beating the CAP Theorem Checklist". Here is why your idea will not work:
- you are assuming that software/network/hardware failures will not happen
- you pushed the actual problem to another layer of the system
- your solution is equivalent to an existing one that doesn't beat CAP
- you're actually building an AP system
- you're actually building a CP system
- you are not, in fact, designing a distributed system
Specifically, your plan fails to account for:
- latency is a thing that exists
- high latency is indistinguishable from splits or unavailability
- network topology changes over time
- there might be more than 1 partition at the same time
- split nodes can vanish forever
- a split node cannot be differentiated from a crashed one by its peers
- clients are also part of the distributed system
- stable storage may become corrupt
- network failures will actually happen
- hardware failures will actually happen
- operator errors will actually happen
- deleted items will come back after synchronization with other nodes
- clocks drift across multiple parts of the system, forward and backwards in time
Source with the complete list here: https://ferd.ca/beating-the-cap-theorem-checklist.html