Using the CAP Theorem to Analyze Microservices
CAP Theorem: A Venn diagram where all 3 parties are never happy.


As engineers, we climb several learning curves and take on more ownership as we build software and backend systems. As that ownership expands, we become involved in designing the architecture of the software and its development lifecycle. The architecture we design dictates the customer experience, and small flaws in our design lead to bad experiences and failed products.

The CAP theorem, although widely discussed and fundamental to system design, is often not taken into account when we design our software.

This article aims to touch upon:

  1. What the CAP theorem is
  2. How CAP can accelerate software success

In distributed systems, software is containerized and modularized into separate microservices, decoupled databases, and multiple replica sets that communicate over networks. Networks and backends, just like humans, can break down at any time and need time to recover.

CAP in a nutshell

The CAP theorem, also known as Brewer's theorem, states that a distributed system can provide only two of three properties simultaneously: consistency, availability, and partition tolerance.

In short, designing software always involves tradeoffs: some properties have to be prioritized over others.

Understanding CAP

Understanding CAP theorem through daily Backend Transactions

In the figure above we have the following architecture:

  1. Backend - PUTs values to and GETs values from the database.
  2. Distributed Database - Implemented in a master-slave architecture. If you are not familiar with the master-slave pattern, here is the gist: the master is the main database that handles write (PUT) operations and sends updated values to the slave replica sets, while multiple slave databases handle read (GET) operations.
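
The master-slave pattern described above can be sketched in a few lines of Python. This is a minimal in-memory illustration, not a real database: the class names `Master` and `Replica`, the `apply` method, and the key `cart:42` are all hypothetical, and replication is a plain method call where a real system would make a network call.

```python
class Replica:
    """A read-only copy of the data (a "slave" in the article's terms)."""

    def __init__(self):
        self.data = {}

    def apply(self, key, value):
        # Called by the master to replicate a write.
        self.data[key] = value

    def get(self, key):
        # Reads (GETs) are served from the replica's local copy.
        return self.data.get(key)


class Master:
    """Accepts every write (PUT) and pushes it to the replicas."""

    def __init__(self, replicas):
        self.data = {}
        self.replicas = replicas

    def put(self, key, value):
        self.data[key] = value
        for replica in self.replicas:
            # Assumption: in a real system this is a network call that can fail.
            replica.apply(key, value)


replicas = [Replica(), Replica()]
master = Master(replicas)
master.put("cart:42", "3 items")
values = [replica.get("cart:42") for replica in replicas]
print(values)  # ['3 items', '3 items']
```

As long as the replication call succeeds, every replica returns what the master last wrote. The interesting cases start when that call cannot get through.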

Situation: The network communication between the master and slave databases breaks after a PUT operation has happened, and the user then does a GET for the same key.

We have two options:

  • Return to the user the outdated value currently residing in the slave replica sets.

2xx OK with stale data

  • Return a 500 response stating that the DB operation failed and that the user should retry after a given time.

5xx Internal Error for inconsistent state

Option 1 keeps the system up (Available) and responding to queries, but the response may not be in sync (Consistent) with the database. Option 2 ensures we respond only when the backend is in sync (Consistent), at the cost of uptime (Availability).
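
A rough sketch of the two options, assuming a replica that knows whether its link to the master is healthy. The `in_sync` flag, the function names `read_ap` and `read_cp`, and the `balance` key are hypothetical names chosen for illustration; real systems detect partitions with heartbeats or timeouts rather than a simple flag.

```python
class Replica:
    def __init__(self):
        self.data = {}
        self.in_sync = True  # assumption: set to False once the link to the master is cut


def read_ap(replica, key):
    """Option 1: always answer, even though the value may be stale."""
    return 200, replica.data.get(key)


def read_cp(replica, key):
    """Option 2: refuse to answer unless the replica is known to be in sync."""
    if not replica.in_sync:
        return 500, None
    return 200, replica.data.get(key)


replica = Replica()
replica.data["balance"] = 100  # last value replicated before the partition
replica.in_sync = False        # partition: newer writes on the master cannot reach us

ap_response = read_ap(replica, "balance")
cp_response = read_cp(replica, "balance")
print(ap_response)  # (200, 100)
print(cp_response)  # (500, None)
```

The data and the failure are identical in both calls. Only the policy differs, which is why this is a product decision as much as a technical one.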

Choosing between uptime and consistency makes a major difference to the customer experience. To make the choice, you must understand what your customers' primary focus is. If you plan to onboard customers with different focuses, i.e. some prefer uptime while others prefer consistency, you will have to settle for eventual consistency, but GOOD LUCK with that :)

Consistency, availability, and partition tolerance explained

Kepler's Critiques on CAP theorem

  • Consistency (C) - Consistency ensures that any replica that receives a given query returns the same output every time. If a database update has happened, the backend should return the updated data to the user and never the stale value. Consistency doesn't guarantee that all requests will be served, but it ensures that the requests that are served receive correct values.

Venn Diagrams Dammit

  • Availability (A) - Availability ensures that the system is reachable and responding close to 100% of the time. Every request, regardless of internal network failures or component outages, should receive a successful (HTTP 200 OK) response. Availability doesn't guarantee that the output is up to date with the backend DB.
  • Partition Tolerance (P) - A system should not halt if there is a communication break between services. Ensuring partition tolerance means that even if a network partition (a "breakage") occurs between services, the backend does not fail abruptly. To keep a network partition from halting the service, we replicate all backends and their state data across multiple network subnets, so that if one subnet fails we can use the others.
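
The subnet replication idea in the partition tolerance definition can be sketched as a simple failover loop. This is an in-memory illustration under stated assumptions: `Backend`, `Unreachable`, `get_with_failover`, and the subnet names are hypothetical, and a boolean `reachable` flag stands in for a real network failure.

```python
class Unreachable(Exception):
    """Raised when a backend cannot be reached over the network."""


class Backend:
    def __init__(self, subnet, reachable=True):
        self.subnet = subnet
        self.reachable = reachable
        # Every backend holds a replicated copy of the same state.
        self.state = {"greeting": "hello"}

    def get(self, key):
        if not self.reachable:
            raise Unreachable(self.subnet)
        return self.state[key]


def get_with_failover(backends, key):
    """Try each subnet's backend in order and return the first answer."""
    for backend in backends:
        try:
            return backend.subnet, backend.get(key)
        except Unreachable:
            continue  # this subnet is partitioned away, try the next one
    raise Unreachable("all subnets")


backends = [Backend("subnet-a", reachable=False), Backend("subnet-b")]
result = get_with_failover(backends, "greeting")
print(result)  # ('subnet-b', 'hello')
```

The request succeeds even though one subnet is down. Whether the copy in the surviving subnet is up to date is exactly the consistency versus availability question.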

If you are not convinced about why the CAP theorem states that at most 2 of 3 properties are ensured you should read this.

What to choose

Thinking on what to choose?

It's a fair assumption to always treat P (Partition tolerance) as essential, since microservices rely heavily on networks for day-to-day communication.

Between A (Availability) and C (Consistency), which should we choose?

I will walk through some examples from the tech giant Google: what each product abides by, and why it prefers one property over the other.

Google Search

  • Google Search receives billions of queries every day.
  • Only a tiny fraction of users (assume <0.001%) might be impacted if the latest articles are not shown in the search results.
  • Thus, Availability has to be the primary focus, as halting the system to update indices would drastically impact customer experience and product adoption.

Gmail

  • Gmail is a communication tool, and users rely heavily on the right data reaching its destination.
  • A user would rather wait than send a broken email with garbled text to the recipient.

Communication tools like Gmail and Facebook typically enforce only eventual consistency. Messages may take a while to travel between different servers, so different servers may temporarily have inconsistent views of the world depending on what messages they have seen so far. However, all servers will eventually have consistent states, since the correct state is defined by interpreting the messages in the order of their global timestamps, rather than the order in which they arrived.
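
The timestamp ordering described above can be illustrated with a small sketch. It assumes every message carries a unique global timestamp, which is a simplification since real clocks drift. The `Server` class and the sample messages are hypothetical and do not describe how Gmail or Facebook are actually implemented.

```python
class Server:
    def __init__(self):
        self.received = []  # (timestamp, text) pairs in arrival order

    def receive(self, timestamp, text):
        self.received.append((timestamp, text))

    def view(self):
        # The state is defined by timestamp order, not arrival order.
        return [text for _, text in sorted(self.received)]


messages = [(1, "hi"), (2, "how are you?"), (3, "fine, thanks")]

server_a, server_b = Server(), Server()

# Mid-delivery: each server has seen a different subset so far.
server_a.receive(*messages[0])
server_b.receive(*messages[2])
print(server_a.view() == server_b.view())  # False: temporarily inconsistent

# The remaining messages arrive, in a different order on each server.
server_a.receive(*messages[2])
server_a.receive(*messages[1])
server_b.receive(*messages[0])
server_b.receive(*messages[1])
print(server_a.view() == server_b.view())  # True: the views have converged
```

Once both servers have received every message, sorting by timestamp gives them the same view no matter how delivery was interleaved.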

Google Suite (Docs, Sheets, etc.)

I believe Google Docs used to enforce consistency at the cost of availability (if your master was down then your doc wasn't available), but has now also moved to an eventual consistency model that allows concurrent editing and eventually resolves conflicts.


For Devs!

Even multi-billion dollar companies fall back on the basic experience customers expect when choosing and prioritizing architectural designs.

Distributed systems always have to deal with the fallacies that are assumed while designing architectures.

Now that you understand the CAP theorem, I highly suggest you sketch a small diagram of your backend and evaluate how it handles each CAP property and whether those choices are in line with end-user expectations.

This exercise will help you find loopholes and design flaws in the backend. Small improvements like internal caching, semaphores, and leader election across replica sets might make the service better and reduce the impact network communication has on the overall service.

Happy Engineering!






Roman Siewko

Senior Vibe Coder | AI Therapist | DevOps Engineer

10 months ago

For anyone who doubts, there is a "Beating the CAP Theorem Checklist".

Here is why your idea will not work:

  • you are assuming that software/network/hardware failures will not happen
  • you pushed the actual problem to another layer of the system
  • your solution is equivalent to an existing one that doesn't beat CAP
  • you're actually building an AP system
  • you're actually building a CP system
  • you are not, in fact, designing a distributed system

Specifically, your plan fails to account for:

  • latency is a thing that exists
  • high latency is indistinguishable from splits or unavailability
  • network topology changes over time
  • there might be more than 1 partition at the same time
  • split nodes can vanish forever
  • a split node cannot be differentiated from a crashed one by its peers
  • clients are also part of the distributed system
  • stable storage may become corrupt
  • network failures will actually happen
  • hardware failures will actually happen
  • operator errors will actually happen
  • deleted items will come back after synchronization with other nodes
  • clocks drift across multiple parts of the system, forward and backwards in time

Source with complete list here: https://ferd.ca/beating-the-cap-theorem-checklist.html
